MATH/COSC 3570 Introduction to Data Science
mutate: create new columns from the existing ones
filter: pick rows matching criteria
slice: pick rows using index(es)
distinct: filter for unique rows
select: pick columns by name
summarise: reduce variables to values
group_by: for grouped operations
arrange: reorder rows

First argument is always a data frame.
Subsequent arguments say what to do with that data frame.
Always return a data frame.
Don't modify in place.
# A tibble: 51 × 5
state abb region population total
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
# ℹ 45 more rows
mutate()
dplyr::mutate() takes arguments of the form name = values.
# A tibble: 51 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
5 California CA West 37253956 1257 3.37
6 Colorado CO West 5029196 65 1.29
# ℹ 45 more rows
total and population inside the function are not defined in our R environment.
dplyr functions know to look for variables in the data frame provided in the 1st argument.
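To make the variable lookup concrete, here is a minimal sketch. It assumes dplyr is installed and uses a hypothetical six-row excerpt named murders_small (values copied from the printed table) rather than the full murders data.

```r
library(dplyr)

# Hypothetical six-row excerpt of the murders data (not the full 51 rows)
murders_small <- tribble(
  ~state,       ~abb, ~region, ~population, ~total,
  "Alabama",    "AL", "South",     4779736,    135,
  "Alaska",     "AK", "West",       710231,     19,
  "Arizona",    "AZ", "West",      6392017,    232,
  "Arkansas",   "AR", "South",     2915918,     93,
  "California", "CA", "West",     37253956,   1257,
  "Colorado",   "CO", "West",      5029196,     65
)

# total and population are found inside murders_small,
# even though no objects with those names were created here
murders_rate <- murders_small |>
  mutate(rate = total / population * 100000)

murders_rate

# mutate() returned a new data frame; the input still has 5 columns
names(murders_small)
```

Because nothing is modified in place, the result has to be assigned (here to murders_rate) if the new column is needed later.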
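filter()
dplyr::filter() takes a data frame as the 1st argument and a logical condition as the 2nd, and keeps the rows for which the condition is TRUE.
# filter the data table to only show the entries for which
# the murder rate is lower than 0.7
murders |>
  filter(rate < 0.7)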
# A tibble: 5 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Hawaii HI West 1360301 7 0.515
2 Iowa IA North Central 3046355 21 0.689
3 New Hampshire NH Northeast 1316470 5 0.380
4 North Dakota ND North Central 672591 4 0.595
5 Vermont VT Northeast 625741 2 0.320
filter() for Many Conditions at Once
operator | definition | operator | definition |
---|---|---|---|
< | less than | x \| y | x OR y |
<= | less than or equal to | is.na(x) | if x is NA |
> | greater than | !is.na(x) | if x is not NA |
>= | greater than or equal to | x %in% y | if x is in y |
== | exactly equal to | !(x %in% y) | if x is not in y |
!= | not equal to | !x | not x |
x & y | x AND y | | |
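The operators in the table can be combined inside a single filter() call. The sketch below is illustrative only: murders_small is a hypothetical six-row excerpt, and the thresholds are chosen for the example rather than taken from the slides.

```r
library(dplyr)

# Hypothetical six-row excerpt of murders, with rate added as in the mutate() step
murders_small <- tribble(
  ~state,       ~abb, ~region, ~population, ~total,
  "Alabama",    "AL", "South",     4779736,    135,
  "Alaska",     "AK", "West",       710231,     19,
  "Arizona",    "AZ", "West",      6392017,    232,
  "Arkansas",   "AR", "South",     2915918,     93,
  "California", "CA", "West",     37253956,   1257,
  "Colorado",   "CO", "West",      5029196,     65
) |>
  mutate(rate = total / population * 100000)

# AND: both conditions must hold
west_low <- murders_small |>
  filter(region == "West" & rate < 3)

# OR: at least one condition must hold
south_or_low <- murders_small |>
  filter(region == "South" | rate < 2)

# %in% with negation; conditions separated by a comma are combined with AND
not_ca_az <- murders_small |>
  filter(!(abb %in% c("CA", "AZ")), !is.na(rate))
```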
slice() for Certain Rows using Indexes
# A tibble: 4 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Arizona AZ West 6392017 232 3.63
2 Arkansas AR South 2915918 93 3.19
3 California CA West 37253956 1257 3.37
4 Colorado CO West 5029196 65 1.29
How do we subset rows using matrix indexing?
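One possible answer, sketched under the assumption that the four rows printed above are rows 3 to 6 of the data: slice() and bracket indexing with an empty column position pick the same rows. murders_small is a hypothetical six-row excerpt.

```r
library(dplyr)

# Hypothetical six-row excerpt of the murders data
murders_small <- tribble(
  ~state,       ~abb, ~region, ~population, ~total,
  "Alabama",    "AL", "South",     4779736,    135,
  "Alaska",     "AK", "West",       710231,     19,
  "Arizona",    "AZ", "West",      6392017,    232,
  "Arkansas",   "AR", "South",     2915918,     93,
  "California", "CA", "West",     37253956,   1257,
  "Colorado",   "CO", "West",      5029196,     65
)

# dplyr: pick rows by position
rows_slice <- murders_small |> slice(3:6)

# Matrix-style indexing: rows 3 to 6, all columns (nothing after the comma)
rows_index <- murders_small[3:6, ]
```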
distinct() to Filter for Unique Rows
# A tibble: 4 × 1
region
<chr>
1 South
2 West
3 Northeast
4 North Central
# A tibble: 4 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Connecticut CT Northeast 3574097 97 2.71
4 Illinois IL North Central 12830632 364 2.84
distinct() Grabs First Row of The Unique Value
# A tibble: 4 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Connecticut CT Northeast 3574097 97 2.71
4 Illinois IL North Central 12830632 364 2.84
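The two kinds of distinct() output above can be sketched as follows. The five-row murders_small tibble is a hypothetical excerpt, and .keep_all = TRUE is the dplyr argument that keeps the remaining columns of the first row found for each unique value.

```r
library(dplyr)

# Hypothetical five-row excerpt covering all four regions
murders_small <- tribble(
  ~state,        ~abb, ~region,         ~population, ~total,
  "Alabama",     "AL", "South",             4779736,    135,
  "Alaska",      "AK", "West",               710231,     19,
  "Arizona",     "AZ", "West",              6392017,    232,
  "Connecticut", "CT", "Northeast",         3574097,     97,
  "Illinois",    "IL", "North Central",    12830632,    364
)

# Unique values of region only: a one-column result
regions <- murders_small |> distinct(region)

# Keep every column, using the first row that appears for each region
first_rows <- murders_small |> distinct(region, .keep_all = TRUE)
```

Arizona is dropped from first_rows because Alaska is the first West row in the data.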
select()
In dplyr::select(), the 1st argument is a data frame, followed by variable names being selected in the data.
select() to Exclude Variables
# A tibble: 51 × 5
state abb region total rate
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 135 2.82
2 Alaska AK West 19 2.68
3 Arizona AZ West 232 3.63
4 Arkansas AR South 93 3.19
5 California CA West 1257 3.37
6 Colorado CO West 65 1.29
# ℹ 45 more rows
select() a Range of Variables
[1] "state" "abb" "region" "population" "total"
[6] "rate"
# A tibble: 51 × 4
region population total rate
<chr> <dbl> <dbl> <dbl>
1 South 4779736 135 2.82
2 West 710231 19 2.68
3 West 6392017 232 3.63
4 South 2915918 93 3.19
5 West 37253956 1257 3.37
6 West 5029196 65 1.29
# ℹ 45 more rows
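A sketch of calls that would produce column sets like the two outputs above: a minus sign excludes a variable, and a colon selects a contiguous range. murders_small is a hypothetical three-row excerpt with rate already added.

```r
library(dplyr)

# Hypothetical three-row excerpt of murders, with rate computed as before
murders_small <- tribble(
  ~state,    ~abb, ~region, ~population, ~total,
  "Alabama", "AL", "South",     4779736,    135,
  "Alaska",  "AK", "West",       710231,     19,
  "Arizona", "AZ", "West",      6392017,    232
) |>
  mutate(rate = total / population * 100000)

# Exclude one variable
no_pop <- murders_small |> select(-population)

# Column order matters for ranges, so check it first
names(murders_small)

# Select every column from region through rate
region_to_rate <- murders_small |> select(region:rate)
```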
select() Variables with Certain Characteristics
starts_with() and ends_with() are tidy-select helper functions.

starts_with(): Starts with a prefix
ends_with(): Ends with a suffix
contains(): Contains a literal string
num_range(): Matches a numerical range like x01, x02, x03
one_of(): Matches variable names in a character vector
everything(): Matches all variables
last_col(): Select last variable, possibly with an offset
matches(): Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

How do we show three variables (state, region, rate) for states that have murder rates below 0.7?
new_table
data > select() > data after selecting > filter() > data after selecting and filtering
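Both routes to the answer can be sketched as below: one stores the intermediate result in new_table, the other chains the steps with the pipe. murders_small is a hypothetical four-row excerpt, so only the states included here can appear in the result.

```r
library(dplyr)

# Hypothetical four-row excerpt of murders, with rate computed as before
murders_small <- tribble(
  ~state,       ~abb, ~region,     ~population, ~total,
  "Alabama",    "AL", "South",         4779736,    135,
  "California", "CA", "West",         37253956,   1257,
  "Hawaii",     "HI", "West",          1360301,      7,
  "Vermont",    "VT", "Northeast",      625741,      2
) |>
  mutate(rate = total / population * 100000)

# Two steps with an intermediate object
new_table <- select(murders_small, state, region, rate)
low_rate_two_steps <- filter(new_table, rate < 0.7)

# One pipeline: the output of select() becomes the 1st argument of filter()
low_rate_piped <- murders_small |>
  select(state, region, rate) |>
  filter(rate < 0.7)
```

Selecting before filtering works here only because rate is among the selected columns; if the filtering variable were dropped by select(), filter() would have to come first.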
summarize()
summarize() provides a data frame that summarizes the statistics we compute.
(s <- heights |>
    filter(sex == "Female") |>
    summarize(avg = mean(height),
              stdev = sd(height),
              median = median(height),
              minimum = min(height)))
# A tibble: 1 × 4
avg stdev median minimum
<dbl> <dbl> <dbl> <dbl>
1 64.9 3.76 65.0 51
[1] 64.9
[1] 51
summarise() produces a new data frame that is not any variant of the original data frame.
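A small sketch of this point, using a hypothetical six-row excerpt named heights_small instead of the full heights data. It summarizes the male rows because the excerpt contains only one female row; the result is a one-row data frame whose columns are the statistics, and individual values can be pulled out with $.

```r
library(dplyr)

# Hypothetical six-row excerpt of the heights data
heights_small <- tribble(
  ~sex,     ~height,
  "Male",        75,
  "Male",        70,
  "Male",        68,
  "Male",        74,
  "Male",        61,
  "Female",      65
)

s <- heights_small |>
  filter(sex == "Male") |>
  summarize(avg = mean(height),
            stdev = sd(height),
            median = median(height),
            minimum = min(height))

s            # a 1 x 4 data frame, with none of the original columns
s$avg        # extract a single summary as a plain number
s$minimum
```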
summarize()
summarize() can create a variable, such as quans, that has 3 values. The output is then a 3 by 1 data frame.
group_by()
# A tibble: 1,050 × 2
# Groups: sex [2]
sex height
<chr> <dbl>
1 Male 75
2 Male 70
3 Male 68
4 Male 74
5 Male 61
6 Female 65
# ℹ 1,044 more rows
heights_group is a grouped data frame.
Tibbles are similar, but see Groups: sex [2] after grouping data by sex.
summarize() behaves differently when acting on grouped_df.
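The difference can be seen in a minimal sketch, again with a hypothetical six-row excerpt named heights_small. The grouped object carries the class grouped_df, and summarize() then returns one row per group instead of a single row.

```r
library(dplyr)

# Hypothetical six-row excerpt of the heights data
heights_small <- tribble(
  ~sex,     ~height,
  "Male",        75,
  "Male",        70,
  "Male",        68,
  "Male",        74,
  "Male",        61,
  "Female",      65
)

# Grouping changes no values; it only records the grouping variable
heights_group <- heights_small |> group_by(sex)
class(heights_group)

# One summary row per group
by_sex <- heights_group |>
  summarize(avg = mean(height), n = n())

by_sex
```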
group_by() + summarize()
summarize() applies the summarization to each group separately.
arrange()
arrange() orders entire data tables.
# A tibble: 51 × 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Wyoming WY West 563626 5 0.887
2 District of Columbia DC South 601723 99 16.5
3 Vermont VT Northeast 625741 2 0.320
4 North Dakota ND North Central 672591 4 0.595
5 Alaska AK West 710231 19 2.68
6 South Dakota SD North Central 814180 8 0.983
# ℹ 45 more rows
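The output above is sorted by population in ascending order. A sketch of that call, plus the descending variant with desc(), on a hypothetical six-row excerpt named murders_small:

```r
library(dplyr)

# Hypothetical six-row excerpt of murders, with rate computed as before
murders_small <- tribble(
  ~state,       ~abb, ~region, ~population, ~total,
  "Alabama",    "AL", "South",     4779736,    135,
  "Alaska",     "AK", "West",       710231,     19,
  "Arizona",    "AZ", "West",      6392017,    232,
  "Arkansas",   "AR", "South",     2915918,     93,
  "California", "CA", "West",     37253956,   1257,
  "Colorado",   "CO", "West",      5029196,     65
) |>
  mutate(rate = total / population * 100000)

# Ascending order is the default; whole rows move together
by_pop <- murders_small |> arrange(population)

# Wrap a variable in desc() for descending order
by_rate_desc <- murders_small |> arrange(desc(rate))
```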
15-dplyr
In lab.qmd ## Lab 15 section, import the murders.csv data and

1. Add (mutate) the variable rate = total / population * 100000 to murders data (as I did).
2. Filter states that are in region Northeast or West and their murder rate is less than 1.
3. Select variables state, region, rate.
4. Print the output table after you do 1. to 3., and save it as object my_states.
5. Group my_states by region. Then summarize data by creating variables avg and stdev that compute the mean and standard deviation of rate.
6. Arrange the summarized table by avg.
state region rate
1 Hawaii West 0.515
2 Idaho West 0.766
3 Maine Northeast 0.828
4 New Hampshire Northeast 0.380
5 Oregon West 0.940
6 Utah West 0.796
7 Vermont Northeast 0.320
8 Wyoming West 0.887
# A tibble: 2 × 3
region avg std_dev
<fct> <dbl> <dbl>
1 Northeast 0.509 0.278
2 West 0.781 0.164
.assign
Have to use murders.total and murders.population instead of total and population.
.query
Conditions must be a string to be evaluated!
Cannot write murders.rate, and should use rate.
.filter
Column names have to be strings.
region rate state
0 South 2.82 Alabama
1 West 2.68 Alaska
2 West 3.63 Arizona
3 South 3.19 Arkansas
4 West 3.37 California
5 West 1.29 Colorado
6 Northeast 2.71 Connecticut
7 South 4.23 Delaware
8 South 16.45 District of Columbia
9 South 3.40 Florida
10 South 3.79 Georgia
11 West 0.51 Hawaii
12 West 0.77 Idaho
13 North Central 2.84 Illinois
14 North Central 2.19 Indiana
15 North Central 0.69 Iowa
16 North Central 2.21 Kansas
17 South 2.67 Kentucky
18 South 7.74 Louisiana
19 Northeast 0.83 Maine
20 South 5.07 Maryland
21 Northeast 1.80 Massachusetts
22 North Central 4.18 Michigan
23 North Central 1.00 Minnesota
24 South 4.04 Mississippi
25 North Central 5.36 Missouri
26 West 1.21 Montana
27 North Central 1.75 Nebraska
28 West 3.11 Nevada
29 Northeast 0.38 New Hampshire
30 Northeast 2.80 New Jersey
31 West 3.25 New Mexico
32 Northeast 2.67 New York
33 South 3.00 North Carolina
34 North Central 0.59 North Dakota
35 North Central 2.69 Ohio
36 South 2.96 Oklahoma
37 West 0.94 Oregon
38 Northeast 3.60 Pennsylvania
39 Northeast 1.52 Rhode Island
40 South 4.48 South Carolina
41 North Central 0.98 South Dakota
42 South 3.45 Tennessee
43 South 3.20 Texas
44 West 0.80 Utah
45 Northeast 0.32 Vermont
46 South 3.12 Virginia
47 West 1.38 Washington
48 South 1.46 West Virginia
49 North Central 1.71 Wisconsin
50 West 0.89 Wyoming
.groupby + .agg
dplyr::group_by() + dplyr::summarize()
.sort_values
state abb region population total rate
50 Wyoming WY West 563626 5 0.89
8 District of Columbia DC South 601723 99 16.45
45 Vermont VT Northeast 625741 2 0.32
34 North Dakota ND North Central 672591 4 0.59
1 Alaska AK West 710231 19 2.68
dplyr::arrange(desc())
state abb region population total rate
8 District of Columbia DC South 601723 99 16.45
18 Louisiana LA South 4533372 351 7.74
25 Missouri MO North Central 5988927 321 5.36
20 Maryland MD South 5773552 293 5.07
40 South Carolina SC South 4625364 207 4.48