Data Wrangling - two data frames đź› 

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Joining data frames

Joining data frames

  • Have multiple data frames

  • Want to bring them together

  • SQL-like functions

    • left_join(x, y)
    • right_join(x, y)
    • full_join(x, y)
    • inner_join(x, y)
    • semi_join(x, y)
    • anti_join(x, y)

Setup

Data sets x and y share the same variable id.

x <- tibble(
    id = c(1, 2, 3),
    var_x = c("x1", "x2", "x3")
    )
x
# A tibble: 3 Ă— 2
     id var_x
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
y <- tibble(
    id = c(1, 2, 4),
    var_y = c("y1", "y2", "y4")
    )
y
# A tibble: 3 Ă— 2
     id var_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4   

left_join(x, y): all rows from x

## by = keys
left_join(x, y, by = "id")
# A tibble: 3 Ă— 3
     id var_x var_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
  • NA is added to the id not appearing in y.

left_join() Example

pop_x
       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196
elec_vote_y
        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3
pop_x |> 
    left_join(elec_vote_y) #<<
       state population elec_vote
1    Alabama    4779736         9
2     Alaska     710231         3
3    Arizona    6392017        11
4   Arkansas    2915918        NA
5 California   37253956        55
6   Colorado    5029196        NA
  • Connecticut and Delaware in elec_vote_y will not be shown in the left-joined data because they are not in pop_x.
library(tidyverse)
library(dslabs)
pop_x <- murders |> 
    slice(1:6) |>
    select(state, population)

elec_vote_y <- results_us_election_2016 |> 
    filter(state %in% c("Alabama", "Alaska", "Arizona", 
                        "California", "Connecticut", "Delaware")) |> 
    select(state, electoral_votes) |> 
    rename(elec_vote = electoral_votes)

right_join(x, y): all rows from y

right_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 Ă— 3
     id var_x var_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4   
  • NA is in the column coming from x.

right_join() Example

pop_x
       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196
elec_vote_y
        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3
pop_x |> 
    right_join(elec_vote_y) #<<
        state population elec_vote
1     Alabama    4779736         9
2      Alaska     710231         3
3     Arizona    6392017        11
4  California   37253956        55
5 Connecticut         NA         7
6    Delaware         NA         3
  • Arkansas and Colorado in pop_x will not be shown in the right-joined data because they are not in elec_vote_y.

full_join(x, y): all rows from both x and y

full_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 4 Ă— 3
     id var_x var_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4   
  • Keep all the rows and fill the missing parts with NAs.

full_join() Example

pop_x
       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196
elec_vote_y
        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3
pop_x |> 
    full_join(elec_vote_y) #<<
        state population elec_vote
1     Alabama    4779736         9
2      Alaska     710231         3
3     Arizona    6392017        11
4    Arkansas    2915918        NA
5  California   37253956        55
6    Colorado    5029196        NA
7 Connecticut         NA         7
8    Delaware         NA         3
  • full_join() takes the union of observations of x and y, so it produces the data set with the most rows.

inner_join(x, y): only rows w/ keys in both x and y

inner_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 Ă— 3
     id var_x var_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
  • Keep only the rows that have information in both tables.

inner_join() Example

pop_x
       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196
elec_vote_y
        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3
pop_x |> 
    inner_join(elec_vote_y) #<<
       state population elec_vote
1    Alabama    4779736         9
2     Alaska     710231         3
3    Arizona    6392017        11
4 California   37253956        55

16-Joining tables

In lab.qmd ## Lab 16 section

diamond_color <- readr::read_csv("the url")
  • Use left_join() to combine the data set diamonds in ggplot2 and diamond_color by the key variable color.
  • Select the variables carat, color, Description, Details.
## Variable "color" in diamonds but "Color" in diamond_color

joined_df <- diamonds |>  
    _______(_______, by = c('color' = 'Color')) |>  ## join
    _______(_________________________________________)  ## select
  • Create a bar chart of the variable color.

# A tibble: 53,940 Ă— 4
  carat color Description    Details                  
  <dbl> <chr> <chr>          <chr>                    
1  0.23 E     Colorless      Minute traces of color   
2  0.21 E     Colorless      Minute traces of color   
3  0.23 E     Colorless      Minute traces of color   
4  0.29 I     Near Colorless Slightly detectable color
5  0.31 J     Near Colorless Slightly detectable color
6  0.24 J     Near Colorless Slightly detectable color
# â„ą 53,934 more rows

Joining Data Frames

pd.merge()

pd.merge()

import numpy as np
import pandas as pd
pop_x
        state  population
0     Alabama     4779736
1      Alaska      710231
2     Arizona     6392017
3    Arkansas     2915918
4  California    37253956
5    Colorado     5029196
elec_vote_y
          state  electoral_votes
21      Alabama                9
43       Alaska                3
13      Arizona               11
0    California               55
26  Connecticut                7
44     Delaware                3
## dplyr::left_join()
pd.merge(pop_x, elec_vote_y, how = 'left')
        state  population  electoral_votes
0     Alabama     4779736              9.0
1      Alaska      710231              3.0
2     Arizona     6392017             11.0
3    Arkansas     2915918              NaN
4  California    37253956             55.0
5    Colorado     5029196              NaN
murders = pd.read_csv('./data/murders.csv')
pop_x = murders[0:6][['state','population']]

election = pd.read_csv('./data/results_us_election_2016.csv')
raws1 = ["Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware"]
cols1 = ["state", "electoral_votes"]
df = election[cols1]
pop = []
for i in raws1:
    mask = df["state"] == i
    pos = np.flatnonzero(mask)
    pop.append(pos)

pop = np.array(pop)
pop = np.resize(pop, 6)
elec_vote_y = df.iloc[pop]

pd.merge(pop_x, elec_vote_y, how = 'right') ## dplyr::right_join()
         state  population  electoral_votes
0      Alabama   4779736.0                9
1       Alaska    710231.0                3
2      Arizona   6392017.0               11
3   California  37253956.0               55
4  Connecticut         NaN                7
5     Delaware         NaN                3
pd.merge(pop_x, elec_vote_y, how = 'outer') ## dplyr::full_join()
         state  population  electoral_votes
0      Alabama   4779736.0              9.0
1       Alaska    710231.0              3.0
2      Arizona   6392017.0             11.0
3     Arkansas   2915918.0              NaN
4   California  37253956.0             55.0
5     Colorado   5029196.0              NaN
6  Connecticut         NaN              7.0
7     Delaware         NaN              3.0
pd.merge(pop_x, elec_vote_y, how = 'inner') ## dplyr::inner_join()
        state  population  electoral_votes
0     Alabama     4779736                9
1      Alaska      710231                3
2     Arizona     6392017               11
3  California    37253956               55