Tidying Data 🧹

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Grammar of Data Tidying

Grammar of Data Tidying

  • Have data organised in an unideal way for our analysis
  • Want to re-organise the data to carry on with our analysis


The goal of tidyr is to help you tidy your data via

  • pivoting for going between wider and longer data
  • separating and uniting character columns
  • clarifying how NAs should be treated
  • nesting and unnesting columns

Pivoting

  • To tidy your data,
    • first, figure out what the (column) variables and (row) observations are
    • second is to resolve one of two common problems:
      • One (column) variable might be spread across multiple columns.
      • One (row) observation might be scattered across multiple rows.
customers <- read_csv("./data/sales/customers.csv")

wider (\(2 \times 4\))

more columns

customers
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  

longer (\(6 \times 3\))

more rows by pivot_longer()

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        

pivot_longer() and pivot_wider()

  • To fix these problems, we’ll need pivot_longer() and pivot_wider()

  • Starts with a data set,

    • pivot_longer() add more rows and decreases the number of columns.
    • pivot_wider() add more columns and decreases the number of rows.

pivot_longer() and pivot_wider()

One variable spreads across multiple columns

One subject is scattered across multiple rows

pivot_longer() and pivot_wider()

Data: sales/customers.csv

customers <- read_csv("data/sales/customers.csv")

wider (\(2 \times 4\))

more columns

customers
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  

longer (\(6 \times 3\))

more rows by pivot_longer()

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        

pivot_longer()

  • data: data frame
pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

pivot_longer()

  • data: data frame

  • cols: columns to pivot into longer format

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

pivot_longer()

  • data: data frame

  • cols: columns to pivot into longer format

  • names_to: name of the column where column names of pivoted variables go (character string)

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

pivot_longer()

  • data: data frame

  • cols: columns to pivot into longer format

  • names_to: name of the column where column names of pivoted variables go (character string)

  • values_to: name of the column where data values in pivoted variables go (character string)

pivot_longer(
    data,
    cols, 
    names_to = "name", 
    values_to = "value"
    )

customers \(\rightarrow\) purchases

customers
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
purchases <- customers |> 
  pivot_longer(
    # variables item_1 to item_3 
    # to be pivoted into longer format 
    cols = item_1:item_3,
    
    # col name of the names 
    # item_1 item_2 and item_3 
    names_to = "item_no",
    
    # col name of the values in the cols
    # item_1 item_2 and item_3 
    values_to = "item"
    )
purchases
# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        

In customers data,

  • Names item_1, item_2, item_3 are values of variable item_no in purchases

  • Values bread, milk, etc are values of variable item in purchases

Why Pivot?

  • The next step of your analysis needs it.
  • The new purchases data set and the prices data can now be joined together with the common key variable item.
prices <- read_csv("./data/sales/prices.csv")
prices
# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3   
purchases |> 
    left_join(prices) #<<
# A tibble: 6 × 4
  customer_id item_no item         price
        <dbl> <chr>   <chr>        <dbl>
1           1 item_1  bread         1   
2           1 item_2  milk          0.8 
3           1 item_3  banana        0.15
4           2 item_1  milk          0.8 
5           2 item_2  toilet paper  3   
6           2 item_3  <NA>         NA   

purchases \(\rightarrow\) customers

  • data: data frame
  • names_from: which column variable in the long format contains the what should be column names in the wide format
  • values_from: which column variable in the long format contains the what should be (cell) values in the new columns in the wide format
purchases
# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        
purchases |> 
    pivot_wider(              
        names_from = item_no, 
        values_from = item    
    ) 
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  

Source: FiveThirtyEight

17-tidyr (Present your work!)

In lab.qmd ## Lab 17 section,

  • Import trump.csv. Call it trump_data as below on the left.

  • Use pivot_longer() to transform trump_data into the data set trump_longer on the right.

trump_data
# A tibble: 2,702 × 4
  subgroup date       approval disapproval
  <chr>    <date>        <dbl>       <dbl>
1 Voters   2020-10-04     44.7        52.2
2 Adults   2020-10-04     43.2        52.6
3 Adults   2020-10-03     43.2        52.6
4 Voters   2020-10-03     45.0        51.7
5 Adults   2020-10-02     43.3        52.4
6 Voters   2020-10-02     44.5        52.1
# ℹ 2,696 more rows
trump_longer <- ______________
    pivot_longer(
        cols = ____________,
        names_to = _______________,
        values_to = _______________
    ) 
# A tibble: 5,404 × 4
  subgroup date       rating_type rating_value
  <chr>    <date>     <chr>              <dbl>
1 Voters   2020-10-04 approval            44.7
2 Voters   2020-10-04 disapproval         52.2
3 Adults   2020-10-04 approval            43.2
4 Adults   2020-10-04 disapproval         52.6
5 Adults   2020-10-03 approval            43.2
6 Adults   2020-10-03 disapproval         52.6
# ℹ 5,398 more rows

BONUS 💳: Use trump_longer to generate a plot like the one below.

Pivoting

pd.wide_to_long()

pd.pivot()

pd.wide_to_long()

import numpy as np
import pandas as pd
customers = pd.read_csv('./data/sales/customers.csv')
purchases = pd.wide_to_long(df = customers,
                            stubnames = ['item'], 
                            i = 'customer_id', 
                            j = 'item_no', sep = '_')
purchases
                             item
customer_id item_no              
1           1               bread
2           1                milk
1           2                milk
2           2        toilet paper
1           3              banana
2           3                 NaN

pd.pivot()

purchases = purchases.reset_index()
purchases
   customer_id  item_no          item
0            1        1         bread
1            2        1          milk
2            1        2          milk
3            2        2  toilet paper
4            1        3        banana
5            2        3           NaN
purchases.pivot(index = "customer_id", columns = "item_no", values = "item")
item_no          1             2       3
customer_id                             
1            bread          milk  banana
2             milk  toilet paper     NaN