Data Wrangling - one data frame 🛠

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Grammar of Data Manipulation

Starting this week, we are going to talk about Grammar of Data Manipulation or Grammar of Data Wrangling.
In particular, we are going to use the tidyverse package dplyr to manipulate or transform our data.
If you become a data scientist after college, you will find that you spend most of the time cleaning or wrangling your data because the raw data usually look ugly, big and unorganized, and you have to grab what you really need for your analysis. Once the data set is ready, fitting a machine learning model or run an algorithm is just couple lines of code. Very easy because nowadays, many models and algorithms have been wrapped in a function. You just need to call the function.
You may feel a little bored the coming two weeks because I think data wrangling is not as fun as data visualization. But we have no choice. It is a must-have skill to be a data scientist, and I believe you will not learn this in other courses, but in this course only, or maybe in my course only.
All right. Let’s begin.

Grammar of Data Wrangling: dplyr 📦

based on the concepts of functions as verbs that manipulate data frames

mutate: create new columns from the existing¹
filter: pick rows matching criteria
slice: pick rows using index(es)
distinct: filter for unique rows
select: pick columns by name
summarise: reduce variables to values
group_by: for grouped operations
arrange: reorder rows
… (many more)

Rules of dplyr Functions

First argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Don’t modify in place

Data: US gun murders by state for 2010

(murders <- read_csv("./data/murders.csv"))

# A tibble: 51 × 5
  state      abb   region population total
  <chr>      <chr> <chr>       <dbl> <dbl>
1 Alabama    AL    South     4779736   135
2 Alaska     AK    West       710231    19
3 Arizona    AZ    West      6392017   232
4 Arkansas   AR    South     2915918    93
5 California CA    West     37253956  1257
6 Colorado   CO    West      5029196    65
# ℹ 45 more rows

Adding a New Variable (Column) with `mutate()`

dplyr::mutate() takes
- a data frame as the 1st argument
- the name and values of the variable as the 2nd argument using format name = values.

(murders <- murders |>  
     mutate(rate = total / population * 100000)) #<<

# A tibble: 51 × 6
  state      abb   region population total  rate
  <chr>      <chr> <chr>       <dbl> <dbl> <dbl>
1 Alabama    AL    South     4779736   135  2.82
2 Alaska     AK    West       710231    19  2.68
3 Arizona    AZ    West      6392017   232  3.63
4 Arkansas   AR    South     2915918    93  3.19
5 California CA    West     37253956  1257  3.37
6 Colorado   CO    West      5029196    65  1.29
# ℹ 45 more rows

total and population inside the function are not defined in our R environment.
dplyr functions know to look for variables in the data frame provided in the 1st argument.

OK now let’s begin playing the the data set.
First, we can add a New Variable into the existing data set with the function mutate().
Remember, variables are stored by columns, and so adding new variables means add more columns to the data set.
The function mutate() takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values.
Here, I compute the murder rate as the total number of murders divided by population and times 100,000. So the rate means the incidence rate per 100,000 people
total and population inside the function are not defined in our R environment, but we didn’t get an error.
Functions in dplyr know to look for variables in the data frame provided in the first argument. total will have the values in murders_tbl$total.
Now you can see the new data set has a new column variable rate shown in the last column.
Use relocate() to change column positions, using the same syntax as select() to make it easy to move blocks of columns at once.

mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL). mutate( .data, …, .by = NULL, .keep = c(“all”, “used”, “unused”, “none”), .before = NULL, .after = NULL )

Filtering Observations (Rows) with `filter()`

dplyr::filter() takes a
- data frame as the 1st argument
- conditional statement as the 2nd. (pick rows matching criteria)

# filter the data table to only show the entries for which 
# the murder rate is lower than 0.7
murders |> 
    filter(rate < 0.7) #<<

# A tibble: 5 × 6
  state         abb   region        population total  rate
  <chr>         <chr> <chr>              <dbl> <dbl> <dbl>
1 Hawaii        HI    West             1360301     7 0.515
2 Iowa          IA    North Central    3046355    21 0.689
3 New Hampshire NH    Northeast        1316470     5 0.380
4 North Dakota  ND    North Central     672591     4 0.595
5 Vermont       VT    Northeast         625741     2 0.320

`filter()` for Many Conditions at Once

murders |>  
    filter(rate > 0.1 & rate < 0.7,  #<<
           region == "Northeast")  #<<

# A tibble: 2 × 6
  state         abb   region    population total  rate
  <chr>         <chr> <chr>          <dbl> <dbl> <dbl>
1 New Hampshire NH    Northeast    1316470     5 0.380
2 Vermont       VT    Northeast     625741     2 0.320

Logical Operators

operator	definition	operator	definition
`<`	less than	`x` \| `y`	`x` OR `y`
`<=`	less than or equal to	`is.na(x)`	if `x` is `NA`
`>`	greater than	`!is.na(x)`	if `x` is not `NA`
`>=`	greater than or equal to	`x %in% y`	if `x` is in `y`
`==`	exactly equal to	`!(x %in% y)`	if `x` is not in `y`
`!=`	not equal to	`!x`	not `x`
`x & y`	`x` AND `y`

`slice()` for Certain Rows using Indexes

# 3rd to 6th row
murders |> 
    slice(3:6)

# A tibble: 4 × 6
  state      abb   region population total  rate
  <chr>      <chr> <chr>       <dbl> <dbl> <dbl>
1 Arizona    AZ    West      6392017   232  3.63
2 Arkansas   AR    South     2915918    93  3.19
3 California CA    West     37253956  1257  3.37
4 Colorado   CO    West      5029196    65  1.29

How do we subset rows using matrix indexing?

murders[3:6, ]

# A tibble: 4 × 6
  state      abb   region population total  rate
  <chr>      <chr> <chr>       <dbl> <dbl> <dbl>
1 Arizona    AZ    West      6392017   232  3.63
2 Arkansas   AR    South     2915918    93  3.19
3 California CA    West     37253956  1257  3.37
4 Colorado   CO    West      5029196    65  1.29

`distinct()` to Filter for Unique Rows

# Select only unique/distinct rows from a data frame
murders |> distinct(region)  ## default

# A tibble: 4 × 1
  region       
  <chr>        
1 South        
2 West         
3 Northeast    
4 North Central

murders |> distinct(region, .keep_all = TRUE) ## keep all other variables

# A tibble: 4 × 6
  state       abb   region        population total  rate
  <chr>       <chr> <chr>              <dbl> <dbl> <dbl>
1 Alabama     AL    South            4779736   135  2.82
2 Alaska      AK    West              710231    19  2.68
3 Connecticut CT    Northeast        3574097    97  2.71
4 Illinois    IL    North Central   12830632   364  2.84

We can use distinct() function to select only unique/distinct rows from a data frame.
For example here, we use variable region to decide the unique rows. By default, the function only gives us the unique values of the variable region.
Do you wonder why we have the order South, West, Northeast, and North Central?
Actually, these values are the values of some rows.
If we keep all variables, we see that the south is the south for Alabama, west for alaska, northeast for connecticut, and North central for illinois.
distinct() functions grabs rows that first has each of the unique values of region variable.
So Alabama is the first observation that has value South, Alaska is the first observation that has value West, Connecticut is the first observation that has value Northeast, and Illinois is the first observation that has value North Central.
microbenchmark::microbenchmark(distinct(murders_tbl, region), as_tibble(unique(murders_tbl$region)))

`distinct()` Grabs First Row of The Unique Value

murders |> distinct(region, .keep_all = TRUE)

# A tibble: 4 × 6
  state       abb   region        population total  rate
  <chr>       <chr> <chr>              <dbl> <dbl> <dbl>
1 Alabama     AL    South            4779736   135  2.82
2 Alaska      AK    West              710231    19  2.68
3 Connecticut CT    Northeast        3574097    97  2.71
4 Illinois    IL    North Central   12830632   364  2.84

murders |> slice(1:5)

# A tibble: 5 × 6
  state      abb   region population total  rate
  <chr>      <chr> <chr>       <dbl> <dbl> <dbl>
1 Alabama    AL    South     4779736   135  2.82
2 Alaska     AK    West       710231    19  2.68
3 Arizona    AZ    West      6392017   232  3.63
4 Arkansas   AR    South     2915918    93  3.19
5 California CA    West     37253956  1257  3.37

Selecting Columns with `select()`

In dplyr::select(), the 1st argument is a data frame, followed by variable names being selected in the data.
The order of variable names matters!

names(murders)

[1] "state"      "abb"        "region"     "population" "total"     
[6] "rate"

# select three columns, assign this to a new object
murders |> select(region, rate, state)

# A tibble: 51 × 3
  region  rate state     
  <chr>  <dbl> <chr>     
1 South   2.82 Alabama   
2 West    2.68 Alaska    
3 West    3.63 Arizona   
4 South   3.19 Arkansas  
5 West    3.37 California
6 West    1.29 Colorado  
# ℹ 45 more rows

`select()` to Exclude Variables

## exclude variable population
murders |> select(-population)

# A tibble: 51 × 5
  state      abb   region total  rate
  <chr>      <chr> <chr>  <dbl> <dbl>
1 Alabama    AL    South    135  2.82
2 Alaska     AK    West      19  2.68
3 Arizona    AZ    West     232  3.63
4 Arkansas   AR    South     93  3.19
5 California CA    West    1257  3.37
6 Colorado   CO    West      65  1.29
# ℹ 45 more rows

`select()` a Range of Variables

names(murders)

[1] "state"      "abb"        "region"     "population" "total"     
[6] "rate"

## from region to rate
murders |> select(region:rate)

# A tibble: 51 × 4
  region population total  rate
  <chr>       <dbl> <dbl> <dbl>
1 South     4779736   135  2.82
2 West       710231    19  2.68
3 West      6392017   232  3.63
4 South     2915918    93  3.19
5 West     37253956  1257  3.37
6 West      5029196    65  1.29
# ℹ 45 more rows

`select()` Variables with Certain Characteristics

starts_with() is a tidy-select helper function.

murders |> select(starts_with("r"))

# A tibble: 51 × 2
  region  rate
  <chr>  <dbl>
1 South   2.82
2 West    2.68
3 West    3.63
4 South   3.19
5 West    3.37
6 West    1.29
# ℹ 45 more rows

`select()` Variables with Certain Characteristics

ends_with() is a tidy-select helper function.

murders |> select(ends_with("ion"))

# A tibble: 51 × 2
  region population
  <chr>       <dbl>
1 South     4779736
2 West       710231
3 West      6392017
4 South     2915918
5 West     37253956
6 West      5029196
# ℹ 45 more rows

tidy-select Helpers¹

starts_with(): Starts with a prefix
ends_with(): Ends with a suffix
contains(): Contains a literal string
num_range(): Matches a numerical range like x01, x02, x03
one_of(): Matches variable names in a character vector
everything(): Matches all variables
last_col(): Select last variable, possibly with an offset
matches(): Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

Rationale for Pipe Operator

How do we show three variables (state, region, rate) for states that have murder rates below 0.7?

Method 1: Define the intermediate object new_table

new_table <- select(murders, state, region, rate) 
filter(new_table, rate < 0.7)

# A tibble: 5 × 3
  state         region         rate
  <chr>         <chr>         <dbl>
1 Hawaii        West          0.515
2 Iowa          North Central 0.689
3 New Hampshire Northeast     0.380
4 North Dakota  North Central 0.595
5 Vermont       Northeast     0.320

Remember we talked about the pipe operator, right? But we haven’t really used it often.
But it kinda make much sense to use the pipe operator for data manipulation using dplyr functions.
Because the first argument of dplyr function is always a data frame and its function output is also a data frame, which can be the input of another dplyr function.
So if we wanna manipulate our data set via several different actions using the dplyr functions step by step, pipe operator can be very useful. Let’s see why.
How do we show three variables (state, region, rate) for states that have murder rate below 0.71?
We can first select three variables state, region, rate, and save the resulting output to an object new_table, and then apply filter() function on the new_table to get the observations having murder rate less than 0.71
In fact, the object new_table is unnecessary. The table is not what we want, and any object created in the R environment occupies some memory space.

Rationale for Pipe Operator

How do we show three variables (state, region, rate) for states that have murder rates below 0.7?

Method 2: Apply one function onto the other with no intermediate object

## not so easy to read and understand
filter(select(murders, state, region, rate), rate < 0.7)

# A tibble: 5 × 3
  state         region         rate
  <chr>         <chr>         <dbl>
1 Hawaii        West          0.515
2 Iowa          North Central 0.689
3 New Hampshire Northeast     0.380
4 North Dakota  North Central 0.595
5 Vermont       Northeast     0.320

Rationale for Pipe Operator

The code that looks like a verbal description of what we want to do without intermediate objects:

data > select() > data after selecting > filter() > data after selecting and filtering

murders |> 
    select(state, region, rate) |>  
    filter(rate < 0.7)

# A tibble: 5 × 3
  state         region         rate
  <chr>         <chr>         <dbl>
1 Hawaii        West          0.515
2 Iowa          North Central 0.689
3 New Hampshire Northeast     0.380
4 North Dakota  North Central 0.595
5 Vermont       Northeast     0.320

Summarizing Data – `summarize()`

summarize() provides a data frame that summarizes the statistics we compute.

heights <- read_csv("./data/heights.csv")
head(heights)

# A tibble: 6 × 2
  sex    height
  <chr>   <dbl>
1 Male       75
2 Male       70
3 Male       68
4 Male       74
5 Male       61
6 Female     65

Summarizing Data – `summarize()`

(s <- heights |> 
    filter(sex == "Female") |> 
    summarize(avg = mean(height),  
              stdev = sd(height),
              median = median(height), 
              minimum = min(height)))

# A tibble: 1 × 4
    avg stdev median minimum
  <dbl> <dbl>  <dbl>   <dbl>
1  64.9  3.76   65.0      51

s$avg; s$minimum

[1] 64.9

[1] 51

summarise() produces a new data frame that is not any variant of the original data frame.

Suppose we want to find the mean and standard deviation of female height and store the two values as a data frame, we can start with the data set heights, then filter or grab the observations that have sex Female, and then with the filtered data, use the summarize function to create a data frame of two variables “average” and “standard_deviation” computed from the variable height.
You see that the resulting object s is a data frame with two variables.
Notice that summarise() changes the data frame entirely, it collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation. Basically, the resulting df is not any variant of the original df anymore.

summarise() changes the data frame entirely, it collapses rows down to a single summary statistic, and removes all columns that are irrelevant to the calculation.

Summarizing Data – `summarize()`

One variable quans that has 3 values. The output is a 3 by 1 data frame.

(s2 <- heights |>  
    filter(sex == "Female") |> 
    summarize(quans = quantile(height, c(0.1, 0.5, 0.9))))

# A tibble: 3 × 1
  quans
  <dbl>
1  61  
2  65.0
3  69

Grouping – `group_by()`

(heights_group <- heights |> 
     group_by(sex))  #<<

# A tibble: 1,050 × 2
# Groups:   sex [2]
  sex    height
  <chr>   <dbl>
1 Male       75
2 Male       70
3 Male       68
4 Male       74
5 Male       61
6 Female     65
# ℹ 1,044 more rows

class(heights_group)

[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

heights_group is a grouped data frame.
Tibbles are similar, but see Groups: sex [2] after grouping data by sex.
summarize() behaves differently when acting on grouped_df.

Group and Summarize: `group_by()` + `summarize()`

summarize() applies the summarization to each group separately.

heights |> 
    group_by(sex) |> 
    summarize(avg = mean(height), stdev = sd(height), 
              median = median(height), minimum = min(height))

# A tibble: 2 × 5
  sex      avg stdev median minimum
  <chr>  <dbl> <dbl>  <dbl>   <dbl>
1 Female  64.9  3.76   65.0      51
2 Male    69.3  3.61   69        50

murders |>  
    group_by(region) |> 
    summarize(median_rate = median(rate))

# A tibble: 4 × 2
  region        median_rate
  <chr>               <dbl>
1 North Central        1.97
2 Northeast            1.80
3 South                3.40
4 West                 1.29

Sorting Rows in Data Frames – `arrange()`

arrange() orders entire data tables.

## order the states by population size
murders |>  
    arrange(population)

# A tibble: 51 × 6
  state                abb   region        population total   rate
  <chr>                <chr> <chr>              <dbl> <dbl>  <dbl>
1 Wyoming              WY    West              563626     5  0.887
2 District of Columbia DC    South             601723    99 16.5  
3 Vermont              VT    Northeast         625741     2  0.320
4 North Dakota         ND    North Central     672591     4  0.595
5 Alaska               AK    West              710231    19  2.68 
6 South Dakota         SD    North Central     814180     8  0.983
# ℹ 45 more rows

15-dplyr

In lab.qmd ## Lab 15 section, import the murders.csv data and

Add (mutate) the variable rate = total / population * 100000 to murders data (as I did).
Filter states that are in region Northeast or West and their murder rate is less than 1.
Select variables state, region, rate.

Print the output table after you do 1. to 3., and save it as object my_states.
Group my_states by region. Then summarize data by creating variables avg and stdev that compute the mean and standard deviation of rate.
Arrange the summarized table by avg.

_______ <- _______ |> 
    mutate(_______) |> 
    filter(_______) |> 
    select(_______)

_______ |>  
    group_by(______) |> 
    summarize(______) |> 
    arrange(_______)

          state    region  rate
1        Hawaii      West 0.515
2         Idaho      West 0.766
3         Maine Northeast 0.828
4 New Hampshire Northeast 0.380
5        Oregon      West 0.940
6          Utah      West 0.796
7       Vermont Northeast 0.320
8       Wyoming      West 0.887

# A tibble: 2 × 3
  region      avg std_dev
  <fct>     <dbl>   <dbl>
1 Northeast 0.509   0.278
2 West      0.781   0.164

Data Manipulation

import numpy as np
import pandas as pd

murders = pd.read_csv('./data/murders.csv')
murders.head(5)

        state abb region  population  total
0     Alabama  AL  South     4779736    135
1      Alaska  AK   West      710231     19
2     Arizona  AZ   West     6392017    232
3    Arkansas  AR  South     2915918     93
4  California  CA   West    37253956   1257

New Variables `.assign`

dplyr::mutate()
Have to use murders.total and murders.population instead of total and popution.

murders = murders.assign(
    rate = round(murders.total / murders.population * 100000, 2))

murders.head(5)

        state abb region  population  total  rate
0     Alabama  AL  South     4779736    135  2.82
1      Alaska  AK   West      710231     19  2.68
2     Arizona  AZ   West     6392017    232  3.63
3    Arkansas  AR  South     2915918     93  3.19
4  California  CA   West    37253956   1257  3.37

Filter Rows `.query`

dplyr::filter()
Conditions must be a string to be evaluated!
Cannot write murders.rate, and should use rate.

murders.query("rate < 0.7")

            state abb         region  population  total  rate
11         Hawaii  HI           West     1360301      7  0.51
15           Iowa  IA  North Central     3046355     21  0.69
29  New Hampshire  NH      Northeast     1316470      5  0.38
34   North Dakota  ND  North Central      672591      4  0.59
45        Vermont  VT      Northeast      625741      2  0.32

Select Columns `.filter`

dplyr::select()
Have to be strings

murders.filter(items = ['region', 'rate', 'state'])

           region   rate                 state
0           South   2.82               Alabama
1            West   2.68                Alaska
2            West   3.63               Arizona
3           South   3.19              Arkansas
4            West   3.37            California
5            West   1.29              Colorado
6       Northeast   2.71           Connecticut
7           South   4.23              Delaware
8           South  16.45  District of Columbia
9           South   3.40               Florida
10          South   3.79               Georgia
11           West   0.51                Hawaii
12           West   0.77                 Idaho
13  North Central   2.84              Illinois
14  North Central   2.19               Indiana
15  North Central   0.69                  Iowa
16  North Central   2.21                Kansas
17          South   2.67              Kentucky
18          South   7.74             Louisiana
19      Northeast   0.83                 Maine
20          South   5.07              Maryland
21      Northeast   1.80         Massachusetts
22  North Central   4.18              Michigan
23  North Central   1.00             Minnesota
24          South   4.04           Mississippi
25  North Central   5.36              Missouri
26           West   1.21               Montana
27  North Central   1.75              Nebraska
28           West   3.11                Nevada
29      Northeast   0.38         New Hampshire
30      Northeast   2.80            New Jersey
31           West   3.25            New Mexico
32      Northeast   2.67              New York
33          South   3.00        North Carolina
34  North Central   0.59          North Dakota
35  North Central   2.69                  Ohio
36          South   2.96              Oklahoma
37           West   0.94                Oregon
38      Northeast   3.60          Pennsylvania
39      Northeast   1.52          Rhode Island
40          South   4.48        South Carolina
41  North Central   0.98          South Dakota
42          South   3.45             Tennessee
43          South   3.20                 Texas
44           West   0.80                  Utah
45      Northeast   0.32               Vermont
46          South   3.12              Virginia
47           West   1.38            Washington
48          South   1.46         West Virginia
49  North Central   1.71             Wisconsin
50           West   0.89               Wyoming

Grouping `.groupby + .agg`

dplyr::group_by() + dplyr::summarize()

heights = pd.read_csv('./data/heights.csv')

heights.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x145164fd0>

## a data frame
heights.groupby('sex').agg(['mean', 'std', 'median', 'min'])

           height                           
             mean       std     median   min
sex                                         
Female  64.939424  3.760656  64.980315  51.0
Male    69.314755  3.611024  69.000000  50.0

Sorting `.sort_values`

dplyr::arrange()

murders.sort_values('population').head(5)

                   state abb         region  population  total   rate
50               Wyoming  WY           West      563626      5   0.89
8   District of Columbia  DC          South      601723     99  16.45
45               Vermont  VT      Northeast      625741      2   0.32
34          North Dakota  ND  North Central      672591      4   0.59
1                 Alaska  AK           West      710231     19   2.68

dplyr::arrange(desc())

murders.sort_values('rate', ascending = False).head(5)

                   state abb         region  population  total   rate
8   District of Columbia  DC          South      601723     99  16.45
18             Louisiana  LA          South     4533372    351   7.74
25              Missouri  MO  North Central     5988927    321   5.36
20              Maryland  MD          South     5773552    293   5.07
40        South Carolina  SC          South     4625364    207   4.48

Data Wrangling - one data frame 🛠

Grammar of Data Manipulation

Grammar of Data Wrangling: dplyr 📦

Rules of dplyr Functions

Data: US gun murders by state for 2010

Adding a New Variable (Column) with mutate()

Filtering Observations (Rows) with filter()

filter() for Many Conditions at Once

Logical Operators

slice() for Certain Rows using Indexes

distinct() to Filter for Unique Rows

distinct() Grabs First Row of The Unique Value

Selecting Columns with select()

select() to Exclude Variables

select() a Range of Variables

select() Variables with Certain Characteristics

select() Variables with Certain Characteristics

tidy-select Helpers1

Rationale for Pipe Operator

Rationale for Pipe Operator

Rationale for Pipe Operator

Summarizing Data – summarize()

Summarizing Data – summarize()

Summarizing Data – summarize()

Grouping – group_by()

Group and Summarize: group_by() + summarize()

Sorting Rows in Data Frames – arrange()

Data Manipulation

New Variables .assign

Filter Rows .query

Select Columns .filter

Grouping .groupby + .agg

Sorting .sort_values

Adding a New Variable (Column) with `mutate()`

Filtering Observations (Rows) with `filter()`

`filter()` for Many Conditions at Once

`slice()` for Certain Rows using Indexes

`distinct()` to Filter for Unique Rows

`distinct()` Grabs First Row of The Unique Value

Selecting Columns with `select()`

`select()` to Exclude Variables

`select()` a Range of Variables

`select()` Variables with Certain Characteristics

`select()` Variables with Certain Characteristics

tidy-select Helpers¹

Summarizing Data – `summarize()`

Summarizing Data – `summarize()`

Summarizing Data – `summarize()`

Grouping – `group_by()`

Group and Summarize: `group_by()` + `summarize()`

Sorting Rows in Data Frames – `arrange()`

New Variables `.assign`

Filter Rows `.query`

Select Columns `.filter`

Grouping `.groupby + .agg`

Sorting `.sort_values`