Tidying Data 🧹

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Grammar of Data Tidying

Have data organised in an unideal way for our analysis
Want to re-organise the data to carry on with our analysis

The goal of tidyr is to help you tidy your data via

pivoting for going between wider and longer data
separating and uniting character columns
clarifying how NAs should be treated
nesting and unnesting columns

Pivoting

To tidy your data,
- first, figure out what the (column) variables and (row) observations are
- second is to resolve one of two common problems:
  - One (column) variable might be spread across multiple columns.
  - One (row) observation might be scattered across multiple rows.

customers <- read_csv("./data/sales/customers.csv")

wider (\(2 \times 4\))

more columns

customers

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

longer (\(6 \times 3\))

more rows by pivot_longer()

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>

`pivot_longer()` and `pivot_wider()`

To fix these problems, we’ll need pivot_longer() and pivot_wider()
Starts with a data set,
- pivot_longer() add more rows and decreases the number of columns.
- pivot_wider() add more columns and decreases the number of rows.

`pivot_longer()` and `pivot_wider()`

One variable spreads across multiple columns

One subject is scattered across multiple rows

`pivot_longer()` and `pivot_wider()`

Data: sales/customers.csv

customers <- read_csv("data/sales/customers.csv")

wider (\(2 \times 4\))

more columns

customers

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

longer (\(6 \times 3\))

more rows by pivot_longer()

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>

`pivot_longer()`

data: data frame

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

`pivot_longer()`

data: data frame
cols: columns to pivot into longer format

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

`pivot_longer()`

data: data frame
cols: columns to pivot into longer format
names_to: name of the column where column names of pivoted variables go (character string)

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

`pivot_longer()`

data: data frame
cols: columns to pivot into longer format
names_to: name of the column where column names of pivoted variables go (character string)
values_to: name of the column where data values in pivoted variables go (character string)

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

customers \(\rightarrow\) purchases

customers

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

purchases <- customers |> 
  pivot_longer(
    # variables item_1 to item_3 
    # to be pivoted into longer format 
    cols = item_1:item_3,
    
    # col name of the names 
    # item_1 item_2 and item_3 
    names_to = "item_no",
    
    # col name of the values in the cols
    # item_1 item_2 and item_3 
    values_to = "item"
    )

purchases

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>

In customers data,

Names item_1, item_2, item_3 are values of variable item_no in purchases
Values bread, milk, etc are values of variable item in purchases

Let’s see how the function actually works in the customers example.
Suppose we start with the customers data set which is in a wider format.
And we wanna pivot it into a longer format.
the columns I wanna pivot are the ones that have information about the items that customers bought, which are item1, item 2, and item3.
So I write item1:item3 (we can also use any helper function to select columns)
OK Now, the column names item_1 item_2, item_3 should be in a new column in the transformed longer data set whose column name is called item_no.
ANd the values in those three columns (the cells in the original data) should go into another new column in the in the transformed longer data set whose column name is called item that basically shows what the customers had purchased.
So again the column names of the original wider data go into a new column called item_no
and the values of the column names of the original wider data go into a new column called item.
And I end up with a longer data set called purchases with 6 rows and 3 columns

Why Pivot?

The next step of your analysis needs it.
The new purchases data set and the prices data can now be joined together with the common key variable item.

prices <- read_csv("./data/sales/prices.csv")
prices

# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3

purchases |> 
    left_join(prices) #<<

# A tibble: 6 × 4
  customer_id item_no item         price
        <dbl> <chr>   <chr>        <dbl>
1           1 item_1  bread         1   
2           1 item_2  milk          0.8 
3           1 item_3  banana        0.15
4           2 item_1  milk          0.8 
5           2 item_2  toilet paper  3   
6           2 item_3  <NA>         NA

purchases \(\rightarrow\) customers

data: data frame
names_from: which column variable in the long format contains the what should be column names in the wide format
values_from: which column variable in the long format contains the what should be (cell) values in the new columns in the wide format

purchases

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>

purchases |> 
    pivot_wider(              
        names_from = item_no, 
        values_from = item    
    )

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

17-tidyr (Present your work!)

In lab.qmd ## Lab 17 section,

Import trump.csv. Call it trump_data as below on the left.
Use pivot_longer() to transform trump_data into the data set trump_longer on the right.

trump_data

# A tibble: 2,702 × 4
  subgroup date       approval disapproval
  <chr>    <date>        <dbl>       <dbl>
1 Voters   2020-10-04     44.7        52.2
2 Adults   2020-10-04     43.2        52.6
3 Adults   2020-10-03     43.2        52.6
4 Voters   2020-10-03     45.0        51.7
5 Adults   2020-10-02     43.3        52.4
6 Voters   2020-10-02     44.5        52.1
# ℹ 2,696 more rows

trump_longer <- ______________
    pivot_longer(
        cols = ____________,
        names_to = _______________,
        values_to = _______________
    )

# A tibble: 5,404 × 4
  subgroup date       rating_type rating_value
  <chr>    <date>     <chr>              <dbl>
1 Voters   2020-10-04 approval            44.7
2 Voters   2020-10-04 disapproval         52.2
3 Adults   2020-10-04 approval            43.2
4 Adults   2020-10-04 disapproval         52.6
5 Adults   2020-10-03 approval            43.2
6 Adults   2020-10-03 disapproval         52.6
# ℹ 5,398 more rows

BONUS 💳: Use trump_longer to generate a plot like the one below.

Pivoting

pd.wide_to_long()

pd.pivot()

pd.wide_to_long()

import numpy as np
import pandas as pd

customers = pd.read_csv('./data/sales/customers.csv')
purchases = pd.wide_to_long(df = customers,
                            stubnames = ['item'], 
                            i = 'customer_id', 
                            j = 'item_no', sep = '_')
purchases

                             item
customer_id item_no              
1           1               bread
2           1                milk
1           2                milk
2           2        toilet paper
1           3              banana
2           3                 NaN

pd.pivot()

purchases = purchases.reset_index()
purchases

   customer_id  item_no          item
0            1        1         bread
1            2        1          milk
2            1        2          milk
3            2        2  toilet paper
4            1        3        banana
5            2        3           NaN

purchases.pivot(index = "customer_id", columns = "item_no", values = "item")

item_no          1             2       3
customer_id                             
1            bread          milk  banana
2             milk  toilet paper     NaN

Tidying Data 🧹

Grammar of Data Tidying

Grammar of Data Tidying

Pivoting

wider (\(2 \times 4\))

longer (\(6 \times 3\))

pivot_longer() and pivot_wider()

pivot_longer() and pivot_wider()

pivot_longer() and pivot_wider()

Data: sales/customers.csv

wider (\(2 \times 4\))

longer (\(6 \times 3\))

pivot_longer()

pivot_longer()

pivot_longer()

pivot_longer()

customers \(\rightarrow\) purchases

Why Pivot?

purchases \(\rightarrow\) customers

Pivoting

pd.wide_to_long()

pd.pivot()

pd.wide_to_long()

pd.pivot()

`pivot_longer()` and `pivot_wider()`

`pivot_longer()` and `pivot_wider()`

`pivot_longer()` and `pivot_wider()`

`pivot_longer()`

`pivot_longer()`

`pivot_longer()`

`pivot_longer()`