MATH/COSC 3570 Introduction to Data Science
Have multiple data frames
Want to bring them together
SQL-like functions
left_join(x, y)
right_join(x, y)
full_join(x, y)
inner_join(x, y)
semi_join(x, y)
anti_join(x, y)
Data sets x
and y
share the same variable id
.
left_join(x, y)
: all rows from xleft_join()
Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
Connecticut
and Delaware
in elec_vote_y
will not be shown in the left-joined data because they are not in pop_x
.library(tidyverse)
library(dslabs)
pop_x <- murders |>
slice(1:6) |>
select(state, population)
elec_vote_y <- results_us_election_2016 |>
filter(state %in% c("Alabama", "Alaska", "Arizona",
"California", "Connecticut", "Delaware")) |>
select(state, electoral_votes) |>
rename(elec_vote = electoral_votes)
right_join(x, y)
: all rows from yright_join()
Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
full_join(x, y)
: all rows from both x and yfull_join()
Example state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
7 Connecticut NA 7
8 Delaware NA 3
full_join()
takes the union of observations of x
and y
, so it produces the data set with the most rows.inner_join(x, y)
: only rows w/ keys in both x and yinner_join()
Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
16-Joining tables
In lab.qmd ## Lab 16
section
diamond_color
.left_join()
to combine the data set diamonds
in ggplot2 and diamond_color
by the key variable color
.carat
, color
, Description
, Details
.color
.# A tibble: 53,940 Ă— 4
carat color Description Details
<dbl> <chr> <chr> <chr>
1 0.23 E Colorless Minute traces of color
2 0.21 E Colorless Minute traces of color
3 0.23 E Colorless Minute traces of color
4 0.29 I Near Colorless Slightly detectable color
5 0.31 J Near Colorless Slightly detectable color
6 0.24 J Near Colorless Slightly detectable color
# â„ą 53,934 more rows
murders = pd.read_csv('./data/murders.csv')
pop_x = murders[0:6][['state','population']]
election = pd.read_csv('./data/results_us_election_2016.csv')
raws1 = ["Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware"]
cols1 = ["state", "electoral_votes"]
df = election[cols1]
pop = []
for i in raws1:
mask = df["state"] == i
pos = np.flatnonzero(mask)
pop.append(pos)
pop = np.array(pop)
pop = np.resize(pop, 6)
elec_vote_y = df.iloc[pop]
state population electoral_votes
0 Alabama 4779736.0 9
1 Alaska 710231.0 3
2 Arizona 6392017.0 11
3 California 37253956.0 55
4 Connecticut NaN 7
5 Delaware NaN 3
state population electoral_votes
0 Alabama 4779736 9
1 Alaska 710231 3
2 Arizona 6392017 11
3 California 37253956 55