MATH/COSC 3570 Introduction to Data Science
Have multiple data frames
Want to bring them together
SQL-like functions
left_join(x, y)right_join(x, y)full_join(x, y)inner_join(x, y)semi_join(x, y)anti_join(x, y)Data sets x and y share the same variable id.
left_join(x, y): all rows from xleft_join() Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
Connecticut and Delaware in elec_vote_y will not be shown in the left-joined data because they are not in pop_x.library(tidyverse)
library(dslabs)
pop_x <- murders |>
slice(1:6) |>
select(state, population)
elec_vote_y <- results_us_election_2016 |>
filter(state %in% c("Alabama", "Alaska", "Arizona",
"California", "Connecticut", "Delaware")) |>
select(state, electoral_votes) |>
rename(elec_vote = electoral_votes)right_join(x, y): all rows from yright_join() Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
full_join(x, y): all rows from both x and yfull_join() Example state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
7 Connecticut NA 7
8 Delaware NA 3
full_join() takes the union of observations of x and y, so it produces the data set with the most rows.inner_join(x, y): only rows w/ keys in both x and yinner_join() Example state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
16-Joining tables
In lab.qmd ## Lab 16 section
diamond_color.left_join() to combine the data set diamonds in ggplot2 and diamond_color by the key variable color.carat, color, Description, Details.color.# A tibble: 53,940 Ă— 4
carat color Description Details
<dbl> <chr> <chr> <chr>
1 0.23 E Colorless Minute traces of color
2 0.21 E Colorless Minute traces of color
3 0.23 E Colorless Minute traces of color
4 0.29 I Near Colorless Slightly detectable color
5 0.31 J Near Colorless Slightly detectable color
6 0.24 J Near Colorless Slightly detectable color
# ℹ 53,934 more rows
murders = pd.read_csv('./data/murders.csv')
pop_x = murders[0:6][['state','population']]
election = pd.read_csv('./data/results_us_election_2016.csv')
raws1 = ["Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware"]
cols1 = ["state", "electoral_votes"]
df = election[cols1]
pop = []
for i in raws1:
mask = df["state"] == i
pos = np.flatnonzero(mask)
pop.append(pos)
pop = np.array(pop)
pop = np.resize(pop, 6)
elec_vote_y = df.iloc[pop] state population electoral_votes
0 Alabama 4779736.0 9
1 Alaska 710231.0 3
2 Arizona 6392017.0 11
3 California 37253956.0 55
4 Connecticut NaN 7
5 Delaware NaN 3
state population electoral_votes
0 Alabama 4779736 9
1 Alaska 710231 3
2 Arizona 6392017 11
3 California 37253956 55