## code
# # A tibble: 8 × 4
# rate rank state total
# <dbl> <dbl> <chr> <dbl>
# 1 0.515 49 Hawaii 7
# 2 0.766 46 Idaho 12
# 3 0.828 44 Maine 11
# 4 0.380 50 New Hampshire 5
# 5 0.940 42 Oregon 36
# 6 0.796 45 Utah 22
# 7 0.320 51 Vermont 2
# 8 0.887 43 Wyoming 5
Homework 2: Data Visualization and Data Wrangling
Spring 2024 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu
1 Data Wrangling and Tidying
1.1 Murders
- Import the data set
murders
. Use the pipe operator|>
and the dplyr functionsmutate()
,filter()
, andselect()
to get the following data output. Call the data setdf
.
The filtering conditions are
region
in “Northeast” or “West”rate = total / population * 100000
is less than 1.
The new variable rank
is based on rate
. The highest rate is ranked 1st. [Hint:] Use the function rank()
.
- Change the type of column
rank
to factor, andtotal
to integer. (You can use the built-inas.factor()
andas.integer()
or theconvert()
function in the hablar package.
## code
- Create a list named
df_lst
of two elements. The first element is a subset ofdf
whose rows havetotal
less than 10, and the second element is a subset ofdf
so that each of its row hastotal
higher than 10. Print it out.
## code
# [[1]]
# # A tibble: 4 × 4
# rate rank state total
# <dbl> <fct> <chr> <dbl>
# 1 0.515 49 Hawaii 7
# 2 0.380 50 New Hampshire 5
# 3 0.320 51 Vermont 2
# 4 0.887 43 Wyoming 5
#
# [[2]]
# # A tibble: 4 × 4
# rate rank state total
# <dbl> <fct> <chr> <dbl>
# 1 0.766 46 Idaho 12
# 2 0.828 44 Maine 11
# 3 0.940 42 Oregon 36
# 4 0.796 45 Utah 22
- Combine the two data sets in
df_lst
usingrbind()
.
## code
- The dplyr provides
dplyr::bind_rows()
anddplyr::bind_cols()
that are analogs torbind()
andcbind()
in the R base. Combine the two data sets indf_lst
usingbind_rows()
. The result should be exactly the same as the previous one.
## code
- Combine the two data frames
data1
anddata2
below usingbind_rows()
andrbind()
. Describe what happened and their difference. (Note that the two data sets have different column names)
<- tibble(x = letters[1:5])
data1 <- tibble(y = 1:3) data2
## code (Error may happen. set eval: false if you'd like to render the document.)
- With
df
, selectstate
andtotal
, and arrangedf
bytotal
in an increasing order.
## code
- With
df
, usecontains()
to select column variables whose name contains the string “at”.
## code
- Back to
murders
. Extract the rows whose has the largest population in its region as the shown output. The population is ranked in a decreasing order.
## code
# # A tibble: 4 × 5
# state abb region population total
# <chr> <chr> <chr> <dbl> <dbl>
# 1 California CA West 37253956 1257
# 2 Texas TX South 25145561 805
# 3 New York NY Northeast 19378102 517
# 4 Illinois IL North Central 12830632 364
1.2 Baseball
- Install and load the Lahman library. This database includes data related to baseball teams. It includes summary statistics about how the players performed on offense and defense for several years. It also includes personal information about the players. The
Batting
data frame contains the offensive statistics for all players for many years:
Rows: 112,184
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonca01…
$ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1…
$ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ teamID <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, CL1, CL1, W…
$ lgID <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, 20, 29, 1,…
$ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 94, …
$ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24, 26, 0, 0…
$ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, 0, …
$ X2B <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, 0, 0, 1, 0…
$ X3B <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0, 0, 1, 0, …
$ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21, 23, 0, 0…
$ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0, 0, 3, 0, …
$ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 1, 0, 0…
$ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, 1, …
$ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, 0, 2, 0, 2…
$ IBB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SH <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SF <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ GIDP <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 0, 3…
Use Batting
data to obtain the top 10 player observations that hit the most home runs (in descending order) in 2022. Call the data set top10
, make it as a tibble and print it out.
## code
- But who are these players? In the
top10
data, we see an ID, but not the names. The player names are in thePeople
data set:
Rows: 20,676
Columns: 26
$ playerID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ birthYear <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877, 1869, 186…
$ birthMonth <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 9, 3, 10, 2, 8, 9, 6, …
$ birthDay <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 20, 16, 22, 16, 17, 1…
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA", "USA", …
$ birthState <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA", "PA", "V…
$ birthCity <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm Beach", "La…
$ deathYear <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1926, NA, 1…
$ deathMonth <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, NA, 2, 6, NA, NA, NA, N…
$ deathDay <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, NA, 13, 11, NA, NA…
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "USA", "USA",…
$ deathState <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "CA", NA, "…
$ deathCity <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberton", "Fort …
$ nameFirst <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fernando", "Jo…
$ nameLast <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "Abad", "Aba…
$ nameGiven <chr> "David Allan", "Henry Louis", "Tommie Lee", "Donald Willi…
$ weight <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 169, 220, 19…
$ height <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 74, 71, 70, 78, 7…
$ bats <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, R, L, R, L, L, …
$ throws <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, L, R, L, R, …
$ debut <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977-07-26", "…
$ finalGame <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990-10-03", "…
$ retroID <chr> "aardd001", "aaroh101", "aarot101", "aased001", "abada001…
$ bbrefID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ deathDate <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-05-17, 1957…
$ birthDate <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08, 1972-08-…
We can see column names nameFirst
and nameLast
. Use the left_join()
function to create a table of the top home run hitters. The data table should have variables playerID, nameFirst, nameLast, and HR. Overwrite the object top10
with this new table, and print it out.
## code
- Use the
Fielding
data frame to add each player’s position to the table you created in (2). Make sure that you filter for the year 2022 first, then useright_join()
. This time show nameFirst, nameLast, teamID, HR, and POS.
## code
1.3 Pivoting
- The R built-in
co2
data set is not tidy. Let’s make it tidy. Run the following code to define theco2_wide
object:
<- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) |>
co2_wide setNames(1:12) |>
mutate(year = as.character(1959:1997))
Use the pivot_longer()
function to make it tidy. Call the column with the CO2 measurements co2
and call the month column month
. Call the resulting object co2_tidy
. Print it out.
## code
## # A tibble: 468 x 3
## year month co2
## <chr> <chr> <dbl>
## 1 1959 1 315.
## 2 1959 2 316.
## 3 1959 3 316.
## 4 1959 4 318.
## 5 1959 5 318.
## 6 1959 6 318
## 7 1959 7 316.
## 8 1959 8 315.
## 9 1959 9 314.
## 10 1959 10 313.
## # … with 458 more rows
1.4 Data Manipulation in Python
import numpy as np
import pandas as pd
- Use Python to do Section 1.1 question 1.
## code
- Use Python to do Section 1.1 question 7.
## code
- Use Python to do Section 1.1 question 9. [Hint:] The method pandas.DataFrame.drop_duplicates() is analogous to
dplyr::distinct()
. Please figure out what we should use in the argument subset and keep.
## code
- Use Python to do Section 1.2 question 1. (Import the data Batting.csv).
## code
- Use Python to do Section 1.2 question 2. (Import the data People.csv).
## code
2 Data Visualization
In this section all the plots should be generated using ggplot2.
2.1 murders
Use murders
to make plots.
- Create a scatter plot of total murders (x-axis) versus population sizes (y-axis) using the pipe operator
|>
that the murders data set is on the left to|>
.
## code
- Generate the plot below using
label
andcolor
aesthetics inaes()
and a geometry layergeom_label()
. Save the ggplot object asp
. Here, we add abbreviation as the label, and make the labels’ color be determined by the state’s region.
## code
- Use the object
p
in (2) and
- Change both axes to be in the \(\log_{10}\) scale using
scale_x_log10()
andscale_y_log10()
- Add a title “Gun murder data”
- Use the wall street journal theme in ggthemes.
## code
2.2 mpg
Use mpg
to make plots.
- What’s gone wrong with this code? Why are the points not blue? Change it so that the points are colored in blue.
|> ggplot(mapping = aes(x = displ, y = hwy, colour = "blue")) +
mpg geom_point()
- Generate the bar chart below.
## code
- Complete the code to generate the boxplot below. Note that
x = class
andy = hwy
, so the coordinates need to be flipped.
## code
- Generate the histogram below with density scale. Map
y
to the internal variable..density..
(after_stat(density)
) to show density values. Put the legend inside the plot atc(0.9, 0.15)
. (check ?theme help page)
## code
- Generate the scatter plot below.
## code