Data Importing

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

DBI, jsonline, xml2, httr

R Data Importing

Rectangular Data

readr 📦 Functions

Function Format Typical suffix
read_table() white space separated values txt
read_csv() comma separated values csv
read_csv2() semicolon separated values csv
read_tsv() tab delimited separated values tsv
read_fwf() fixed width files txt
read_delim() general text file format, must define delimiter txt

Be careful: The suffix usually tells us what type of file it is, but no guarantee that these always match.

readr::read_lines("./data/murders.csv", n_max = 3)  ## there is a header
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"     
[3] "Alaska,AK,West,710231,19"         

Reading Data

read_csv() prints out a column specification giving us delimiter, name and type of each column.

murders_csv <- read_csv(file = "./data/murders.csv")
# ── Column specification ─────────────
# Delimiter: ","
# chr (3): state, abb, region
# dbl (2): population, total
head(murders_csv)
# A tibble: 6 × 5
  state      abb   region population total
  <chr>      <chr> <chr>       <dbl> <dbl>
1 Alabama    AL    South     4779736   135
2 Alaska     AK    West       710231    19
3 Arizona    AZ    West      6392017   232
4 Arkansas   AR    South     2915918    93
5 California CA    West     37253956  1257
6 Colorado   CO    West      5029196    65
## View data in RStudio
view(murders_csv)

Missing Values

Which type is the column vector x? Why?

  • Type coercion1 happens and all column elements are transformed to character type.
read_csv("./data/df-na.csv")
# A tibble: 9 × 3
  x     y              z     
  <chr> <chr>          <chr> 
1 1     a              hi    
2 <NA>  b              hello 
3 3     Not applicable 9999  
4 4     d              ola   
5 5     e              hola  
6 .     f              whatup
7 7     g              wassup
8 8     h              sup   
9 9     i              <NA>  

Solution 1: Explicit NAs

  • By default, read_csv() only recognizes ” “ and NA as a missing value.
  • Specify the values that are used to represent missing values by argument na.
read_csv("./data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))

# A tibble: 9 × 3
      x y     z     
  <dbl> <chr> <chr> 
1     1 a     hi    
2    NA b     hello 
3     3 <NA>  <NA>  
4     4 d     ola   
5     5 e     hola  
6    NA f     whatup
7     7 g     wassup
8     8 h     sup   
9     9 i     <NA>  

Solution 2: Specify Column Types

read_csv("./data/df-na.csv", 
         col_types = 
             cols(col_double(), 
                  col_character(), 
                  col_character()))
# A tibble: 9 × 3
      x y              z     
  <dbl> <chr>          <chr> 
1     1 a              hi    
2    NA b              hello 
3     3 Not applicable 9999  
4     4 d              ola   
5     5 e              hola  
6    NA f              whatup
7     7 g              wassup
8     8 h              sup   
9     9 i              <NA>  
# Warning message:
# One or more parsing issues, 
# call `problems()` 
# on your data frame for details
problems()
# A tibble: 1 × 5
#     row   col expected actual file 
#   <int> <int> <chr>    <chr>  <chr>
# 1     7     1 a double .      "" 

Column Types

type function data type
col_character() character
col_date() date
col_datetime() POSIXct (date-time)
col_double() double (numeric)
col_factor() factor
col_guess() let readr guess (default)
col_integer() integer
col_logical() logical
col_number() numbers mixed with non-number characters
col_numeric() double or integer
col_skip() do not read
col_time() time

Writing Data

## Create tibbles using a row-by-row layout
(df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
))
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    
## same as tibble(x = 1:3, y = c(a, b, c))
## save data to "./data/df.csv"
df |> write_csv(file = "./data/df.csv")

read_rds() and write_rds()

  • We save an R object (usually a data set) in .Rds in the R binary file format. 1
readr::write_rds(cars, 
                 file = "./data/cars.rds") 
# fs::dir_ls(path = "./data") |> head(10) (Check Files using fs)


my_car <- readr::read_rds(file = "./data/cars.rds") 
head(my_car, 3)
  speed dist
1     4    2
2     4   10
3     7    4

10-Import Data

  • If you haven’t, install and load the tidyverse package.

In lab.qmd ## Lab 10 section,

  • Import ssa_male_prob.csv and ssa_female_prob.Rds in the data folder using read_csv() and call them ssa_male and ssa_female, respectively.
ssa_male <- readr::read____(____________)
ssa_female <- readr::read____(____________)
  • Plot Age (x-axis) vs. LifeExp (y-axis) for Female. The type should be “line”, and the line color is red. Add x-label, y-label and title to your plot.
plot(x = _____, y = _____, type = ______, col = ______,
     xlab = ______, ylab = _______, main = ____________)
  • Use lines() to add a line of Age (x-axis) vs. LifeExp (y-axis) for Male to the plot. The color is blue.
lines(x = _____, y = _____, col = ______)

Read Excel Data

readxl 📦 Functions

Function Format Typical suffix
read_excel() auto detect the format xls, xlsx
read_xls() original format xls
read_xlsx() new format xlsx
  • The Microsoft Excel can have more than one sheet in one file.
  • The functions above read the first sheet by default.
  • The excel_sheets() gives us the names of all the sheets in an Excel file.
library(readxl)
excel_sheets("./data/2010_bigfive_regents.xls")
[1] "Sheet1" "Sheet2" "Sheet3"

Sheet Names

  • The sheet names can be passed to the sheet argument to read sheets other than the first.
excel_sheets("./data/2010_bigfive_regents.xls")
[1] "Sheet1" "Sheet2" "Sheet3"
(data_xls <- read_xls(path = "./data/2010_bigfive_regents.xls", 
                      sheet = "Sheet3", skip = 1))
# A tibble: 19 × 6
  Scores `131024` `113804` `104201` `103886` `91756`
   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
1     10       NA       64        8      227      34
2     11        6       83       11      217      58
3     12       23       87        7       28      67
4     13        1       54       16      230      42
5     14        3      145       18      303      57
6     15       58      151       50      192      98
7     16        1      129       13      156     125
8     17       73      214       59      163     115
# ℹ 11 more rows

pd.read_csv

pd.read_excel

pd.DataFrame.to_csv

pd.read_csv

import numpy as np
import pandas as pd

py_df = pd.read_csv('./data/murders.csv')
py_df.head()
        state abb region  population  total
0     Alabama  AL  South     4779736    135
1      Alaska  AK   West      710231     19
2     Arizona  AZ   West     6392017    232
3    Arkansas  AR  South     2915918     93
4  California  CA   West    37253956   1257

pd.DataFrame.to_csv

w = {"x":[1, 2, 3], 
     "y":['a', 'b','c']}
wdf = pd.DataFrame(w)

wdf.to_csv("./data/wdf.csv")
mydf = pd.read_csv('./data/wdf.csv')
mydf.head()
   Unnamed: 0  x  y
0           0  1  a
1           1  2  b
2           2  3  c
## index = False means don't write row names
wdf.to_csv("./data/wdf.csv", index = False)
mydf = pd.read_csv('./data/wdf.csv')
mydf.head()
   x  y
0  1  a
1  2  b
2  3  c