R/Python Data Frames for Data Science

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

R Tidyverse

tidyverse 📦

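  • The tidyverse is a collection of R packages 📦 for data science.
  • All packages share a common design philosophy, grammar, and data structures.
  • The core tidyverse packages include dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr.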

Source: https://github.com/spcanelon/tour-of-the-tidyverse

Workflow of Data Science with R packages

Source: https://oliviergimenez.github.io/intro_tidyverse/#7

Install and Load tidyverse 📦

  • Loading tidyverse with library(tidyverse) attaches all the core packages for us!
── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors

Tidy Data (Data Matrix)

“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell, at the intersection of its observation's row and its variable's column.

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham


Why Tidy Data?

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
  • Advantages of tidy data:
    • If all your data is stored in one consistent, tidy structure, you only need to learn one set of tools that works with that structure.
    • Placing variables in columns allows R’s vectorised nature to shine. That makes transforming tidy data feel natural.
  • Practical instructions:
    • Put each dataset in a data frame.
    • Put each variable in a column.

Data Frames Store Tidy Data

  • Collecting information about the distributions of colors and defects in a bag of M&Ms.

Non-tidy Data

  • If you import data in this format into R/Python, you will be in a mess.

Tidy Data

  • Each row is one M&M, each column is one variable, and each cell holds one value.

  • Don’t code “Red” in one place and “RED” in another. Be consistent!
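A minimal pandas sketch of the consistency point above. It assumes a small made-up sample of M&Ms: the values and the column names color and defect are invented for illustration. Mixed spellings such as "Red" and "RED" are counted as different colors until the labels are normalized.

```python
import pandas as pd

# Hypothetical records, one row per M&M (values invented for illustration).
mm = pd.DataFrame({
    "color": ["Red", "RED", "blue", "Blue", "green"],
    "defect": ["none", "chipped", "none", "none", "cracked"],
})

# Inconsistent labels split one color into several categories.
raw_counts = mm["color"].value_counts()

# Normalize the labels so that each color has exactly one spelling.
mm["color"] = mm["color"].str.capitalize()
clean_counts = mm["color"].value_counts()

print(len(raw_counts))    # number of distinct labels before cleaning
print(len(clean_counts))  # number of distinct labels after cleaning
```

Because the frame is tidy (one M&M per row), the fix is a single vectorised operation on one column.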


Data Frames


  • Tibbles are a modern version of R data frames.
  • Create a new tibble using tibble().
  • It is like base::data.frame(), but with a couple of differences.
df <- data.frame(x = 1:5,
                 y = letters[1:5],
                 z = 5:1)
df
  x y z
1 1 a 5
2 2 b 4
3 3 c 3
4 4 d 2
5 5 e 1
class(df)
[1] "data.frame"
tib <- tibble(x = 1:5,
              y = letters[1:5],
              z = 5:1)
tib
# A tibble: 5 × 3
      x y         z
  <int> <chr> <int>
1     1 a         5
2     2 b         4
3     3 c         3
4     4 d         2
5     5 e         1
class(tib)
[1] "tbl_df"     "tbl"        "data.frame"

Printing of data.frame Class

How can the printing method of data.frame be improved? (Print iris in your R console.)

Tibbles Display Better

  • as_tibble() turns a data frame or matrix into a tibble.
(iris_tbl <- as_tibble(iris))  ## check iris_tbl in your R console
# A tibble: 150 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 
# ℹ 144 more rows
  • Only the first few rows are shown.

  • The data dimensions and each column's type are printed.

Subsets of base::data.frame May Not be Data Frames

  • Sometimes [] returns a data frame and sometimes it just returns a vector.
df <- data.frame(x = 1:3, 
                 y = 3:1, 
                 z = LETTERS[1:3])
df[, 1:2]
  x y
1 1 3
2 2 2
3 3 1
class(df[, 1:2])
[1] "data.frame"
df[, 1]
[1] 1 2 3
class(df[, 1])
[1] "integer"

Treat df as a list: how do we grab the first column while keeping the result a data frame?

df[1]
  x
1 1
2 2
3 3

Subsets of Tibbles Are Tibbles

  • [] always returns another tibble.
df_tbl <- tibble(x = 1:2, y = 2:1,
                 z = LETTERS[1:2])
df_tbl[, 1]
# A tibble: 2 × 1
      x
  <int>
1     1
2     2
df_tbl[1]
# A tibble: 2 × 1
      x
  <int>
1     1
2     2
class(df_tbl[, 1])
[1] "tbl_df"     "tbl"        "data.frame"
class(df_tbl[1])
[1] "tbl_df"     "tbl"        "data.frame"
  • $ and [[]] return a vector.
df_tbl$x
[1] 1 2
class(df_tbl$x)
[1] "integer"

df_tbl[[1]]
[1] 1 2
class(df_tbl[[1]])
[1] "integer"

Tibbles Never Do Partial Matching

  • Data frames do partial matching.

  • With $, the name “a” is treated as “abc”!

(df <- data.frame(abc = 1))
  abc
1   1
df$a
[1] 1
  • Tibbles never do partial matching.

  • The name “a” is not recognized!

(tib <- tibble(abc = 1))
# A tibble: 1 × 1
    abc
  <dbl>
1     1
tib$a
Warning: Unknown or uninitialised column: `a`.

Tibbles Can Have Complex Entries

  • Data frame: a column can’t be defined in terms of other columns created in the same call.
data.frame(x = 1:5, 
           y = 1:5, 
           z = x + 3)
# object 'x' not found
  • Tibble: a column may refer to columns created earlier in the same call.
tibble(x = 1:5, 
       y = 1:5, 
       z = x + 3)
# A tibble: 5 × 3
      x     y     z
  <int> <int> <dbl>
1     1     1     4
2     2     2     5
3     3     3     6
4     4     4     7
5     5     5     8
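
For comparison, a hedged pandas sketch of the same idea. A dictionary passed to pd.DataFrame() cannot refer to its own earlier entries, but DataFrame.assign() accepts a function of the frame built so far. The variable names below are illustrative only.

```python
import pandas as pd

# Build x and y first, then derive z from the existing column x.
df = pd.DataFrame({"x": range(1, 6), "y": range(1, 6)})
df = df.assign(z=lambda d: d["x"] + 3)

print(df)
```

The lambda receives the data frame as it exists at that point, so z can be computed from x without a separate statement.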

Pipe Operator


  • The pipe %>% comes from the magrittr package of tidyverse.
  • R (4.1+) has a native base pipe operator |>. In RStudio, the pipe shortcut can be set to insert |> under Tools > Global Options > Code.


For simple cases |> and %>% behave identically. The base pipe is recommended because we can use |> anywhere, anytime in R, even when we don’t use tidyverse.

What and How to Use Pipe

  • To add the pipe, use keyboard shortcut Ctrl/Cmd + Shift + M

  • The pipe passes the result of its left-hand side as the first argument of the function on its right-hand side.

16 |> sqrt() |> log2()
[1] 2
## We can define other arguments as if the first argument is already defined
16 |> sqrt() |> log(base = 2)
[1] 2
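
Python has no built-in pipe operator, so the sketch below only imitates the left-to-right idea. The helper pipe() is hypothetical (it is not part of Python or pandas) and is written here purely to contrast nested calls with a sequential chain.

```python
import math
from functools import reduce


def pipe(value, *funcs):
    """Hypothetical helper: pass value through funcs from left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)


# Nested form: read from the inside out.
nested = math.log2(math.sqrt(16))

# Sequential form: read from left to right, like 16 |> sqrt() |> log2().
piped = pipe(16, math.sqrt, math.log2)

# Extra arguments are supplied by wrapping the call, like log(base = 2).
with_base = pipe(16, math.sqrt, lambda v: math.log(v, 2))

print(nested, piped, with_base)
```

Each function receives the previous result as its first argument, which is the same contract the R pipe follows.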

Why Pipe Operator?

  • Nested vs. Sequential-piped
  • More natural and easier-to-read structure

Source: https://www.andrewheiss.com/

08-Tibbles and Pipes

In lab.qmd ## Lab 8 section,

  • Compare and contrast the following operations on a data.frame and equivalent tibble. What are the differences? Please comment.
df <- data.frame(abc = 1:2, 
                 xyz = c("a", "b"))
# list method
df[c("abc", "xyz")]
# matrix method
df[, 2]
df[, "xyz"]
df[, c("abc", "xyz")]
tib <- tibble(abc = 1:2, 
              xyz = c("a", "b"))
# list method
tib[c("abc", "xyz")]
# matrix method
tib[, 2]
tib[, "xyz"]
tib[, c("abc", "xyz")]
  • Use |> to first select the last 12 rows of the iris data set using tail(), then provide summary statistics on its columns using summary().



  • Like tidyverse in R, pandas is a Python library that provides data structures, manipulation and analysis tools for data science.
import numpy as np
import pandas as pd

Pandas Data Frame

  • Create a data frame from a dictionary
data = {"math": [99, 65, 87], "stats": [92, 48, 88], "cs": [50, 88, 94]}

df = pd.DataFrame(data)
   math  stats  cs
0    99     92  50
1    65     48  88
2    87     88  94
  • Row and column names
df.index = ["s1", "s2", "s3"]
df.columns = ["Math", "Stat", "CS"]
    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94

Subsetting Columns


  • In Python, [] with a single column name returns a Series, [[]] returns a DataFrame!
  • In R, [] on a tibble returns a tibble, [[]] returns a vector!
## Series
df["Math"]
s1    99
s2    65
s3    87
Name: Math, dtype: int64
type(df["Math"])
<class 'pandas.core.series.Series'>
## DataFrame
df[["Math"]]
    Math
s1    99
s2    65
s3    87
type(df[["Math"]])
<class 'pandas.core.frame.DataFrame'>
df[["Math", "CS"]]
    Math  CS
s1    99  50
s2    65  88
s3    87  94

Subsetting Rows DataFrame.iloc

  • integer-location based indexing for selection by position
    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94
## first row Series
df.iloc[0]
Math    99
Stat    92
CS      50
Name: s1, dtype: int64
## first row DataFrame
df.iloc[[0]]
    Math  Stat  CS
s1    99    92  50
## first 2 rows
df.iloc[[0, 1]]
    Math  Stat  CS
s1    99    92  50
s2    65    48  88
## 1st and 3rd row
df.iloc[[True, False, True]]
    Math  Stat  CS
s1    99    92  50
s3    87    88  94

Subsetting Rows and Columns DataFrame.iloc

    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94
## rows 1 and 3, columns 1 and 3
df.iloc[[0, 2], [0, 2]]
    Math  CS
s1    99  50
s3    87  94
## all rows and 1st col
df.iloc[:, [True, False, False]]
    Math
s1    99
s2    65
s3    87
df.iloc[0:2, 1:3]
    Stat  CS
s1    92  50
s2    48  88

Subsetting Rows and Columns DataFrame.loc

Access a group of rows and columns by label(s)

    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94
df.loc['s1', "CS"]
## all rows and 1st col
df.loc['s1':'s3', [True, False, False]]
    Math
s1    99
s2    65
s3    87
df.loc['s2', ['Math', 'Stat']]
Math    65
Stat    48
Name: s2, dtype: int64
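
One difference between the two indexers is easy to miss: a slice in .iloc excludes its stop position, while a label slice in .loc includes its stop label. A small sketch that rebuilds the same df as above to check this:

```python
import pandas as pd

df = pd.DataFrame(
    {"Math": [99, 65, 87], "Stat": [92, 48, 88], "CS": [50, 88, 94]},
    index=["s1", "s2", "s3"],
)

by_position = df.iloc[0:2]    # positions 0 and 1; stop position 2 is excluded
by_label = df.loc["s1":"s2"]  # labels s1 through s2; stop label is included

print(by_position.equals(by_label))
```

This is why df.loc['s1':'s3', ...] above returns all three rows.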

Obtain a Single Cell Value DataFrame.iat/ DataFrame.at

    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94
df.iat[1, 2]
df.at['s2', 'Stat']
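
A short sketch of what the two calls above return, rebuilding the same df: .iat looks up a cell by integer position, .at by row and column label.

```python
import pandas as pd

df = pd.DataFrame(
    {"Math": [99, 65, 87], "Stat": [92, 48, 88], "CS": [50, 88, 94]},
    index=["s1", "s2", "s3"],
)

by_position = df.iat[1, 2]      # 2nd row, 3rd column: s2's CS score
by_label = df.at["s2", "Stat"]  # row s2, column Stat

print(by_position, by_label)
```

Both return a single scalar rather than a Series or DataFrame.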

New Columns DataFrame.insert and New Rows pd.concat

    Math  Stat  CS
s1    99    92  50
s2    65    48  88
s3    87    88  94
df.insert(loc = 2, 
          column = "Chem", 
          value = [77, 89, 76])
    Math  Stat  Chem  CS
s1    99    92    77  50
s2    65    48    89  88
s3    87    88    76  94
df1 = pd.DataFrame({
    "Math": 88, 
    "Stat": 99, 
    "Chem": 0, 
    "CS": 100
    }, index = ['s4'])
pd.concat(objs = [df, df1])
    Math  Stat  Chem   CS
s1    99    92    77   50
s2    65    48    89   88
s3    87    88    76   94
s4    88    99     0  100
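
The two operations above differ in how they hand back their result: DataFrame.insert() changes df in place and returns None, while pd.concat() leaves its inputs untouched and returns a new DataFrame. A sketch under the same data as above (the name new_row is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Math": [99, 65, 87], "Stat": [92, 48, 88], "CS": [50, 88, 94]},
    index=["s1", "s2", "s3"],
)

# insert() modifies df itself and returns None.
result = df.insert(loc=2, column="Chem", value=[77, 89, 76])

# concat() leaves df untouched and returns a new DataFrame.
new_row = pd.DataFrame({"Math": 88, "Stat": 99, "Chem": 0, "CS": 100}, index=["s4"])
combined = pd.concat(objs=[df, new_row])

print(result)          # None
print(df.shape)        # (3, 4)
print(combined.shape)  # (4, 4)
```

To keep the added row, assign the result of pd.concat() to a name.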


NumPy for arrays/matrices

  • The array object in NumPy is called ndarray.

  • Use np.array() to create an array.

range(0, 5, 1) # a sequence of numbers from 0 to 4 in increments of 1
range(0, 5)
list(range(0, 5, 1))
[0, 1, 2, 3, 4]
import numpy as np
arr = np.array(range(0, 5, 1)) ## One-dim array
arr
array([0, 1, 2, 3, 4])
type(arr)
<class 'numpy.ndarray'>

1D Array (Vector) and 2D Array (Matrix)

  • np.arange: Efficient way to create a one-dim array of sequence of numbers
np.arange(2, 5)
array([2, 3, 4])
np.arange(6, 0, -1)
array([6, 5, 4, 3, 2, 1])
  • 2D array
np.array([[1, 2, 3], [4, 5, 6]])
array([[1, 2, 3],
       [4, 5, 6]])


arr2 = np.arange(8).reshape(2, 4)
arr2
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
arr2.shape
(2, 4)

Stacking Arrays

a = np.array([1, 2, 3, 4]).reshape(2, 2)
b = np.array([5, 6, 7, 8]).reshape(2, 2)

np.vstack((a, b))
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
np.hstack((a, b))
array([[1, 2, 5, 6],
       [3, 4, 7, 8]])
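
For 2D arrays like a and b, the two stacking functions can also be written with np.concatenate() and an explicit axis. The sketch below checks that equivalence; the names rows and cols are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4]).reshape(2, 2)
b = np.array([5, 6, 7, 8]).reshape(2, 2)

rows = np.concatenate((a, b), axis=0)  # stack along rows, like vstack
cols = np.concatenate((a, b), axis=1)  # stack along columns, like hstack

print(np.array_equal(rows, np.vstack((a, b))))
print(np.array_equal(cols, np.hstack((a, b))))
print(rows.shape, cols.shape)
```

Thinking in terms of axis helps when arrays have more than two dimensions.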

09-NumPy and pandas

In lab.qmd ## Lab 9 section, create a Python pandas.DataFrame equivalent to the R tibble

tibble(x = 1:5, y = 5:1, z = LETTERS[1:5])
# A tibble: 5 × 3
      x     y z    
  <int> <int> <chr>
1     1     5 A    
2     2     4 B    
3     3     3 C    
4     4     2 D    
5     5     1 E    
import numpy as np
import pandas as pd
import string as st
dic = {'__': np.arange(__, __), 
       '__': np.arange(__, __, __),
       '__': list(__.ascii_uppercase)[___]}

Lab Bonus!

  • Happy Ralentine’s Day! ❤️
x <- seq(0, 2*pi, by = 0.01)
xhrt <- 16 * sin(x) ^ 3
yhrt <- 13 * cos(x) - 5 * cos(2*x) - 2 * cos(3*x) - cos(4*x)
par(mar = c(0, 0, 0, 0))
plot(xhrt, yhrt, type = "l", axes = FALSE, xlab = "", ylab = "")
polygon(xhrt, yhrt, col = "red", border = NA)
points(c(10,-10, -15, 15), c(-10, -10, 10, 10), pch = 169, font = 5)
text(0, 0, "Happy Valentine's Day!", font = 2, cex = 2, col = "pink")

Lab Bonus!

  • Happy Pylentine’s Day! ❤️
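lines = []
msg = "~Happy Valentine's Day!~"
for y in range(15, -15, -1):
    line = ""
    for x in range(-30, 30):
        # Heart curve: points with f <= 0 lie inside the heart
        f = ((x * 0.05) ** 2 + (y * 0.1) ** 2 - 1) ** 3 - (x * 0.05) ** 2 * (y * 0.1) ** 3
        line += msg[(x - y) % len(msg)] if f <= 0 else " "
    lines.append(line)  # keep the finished row
print("\n".join(lines))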

