Data Visualization – ggplot2 📊

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

ggthemes, ggridges, ggbeeswarm, ggdendro, ggpubr, ggmap, ggradar, ggcorrplot, GGally, and more!

Visualizing Data

Plotting Systems: base, lattice and ggplot2

ggplot2

  • has the most powerful functionality.

  • is more beautiful?

  • has a larger file size, occupies more memory, and takes longer to render.

Elegant Data Visualisation

Using the Grammar of Graphics

The ggplot2 Grammar

  • Three main components:

Grammar element: What it is
Data: The data frame used for plotting.
Geometry: The geometric shape that represents the data, e.g., point, boxplot, histogram.
Aesthetic mapping: The aesthetics of the geometric object, e.g., color, size, shape. How we define the mapping depends on which geometry we are using.

ggplot(data = <DATASET>, mapping = aes(<MAPPINGS>)) + 
       <GEOM_FUNCTION>() +
       other options/layers

mpg Data

ggplot2::mpg
# A tibble: 234 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4           1.8  1999     4 auto(… f        18    29 p     comp…
2 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
3 audi         a4           2    2008     4 manua… f        20    31 p     comp…
4 audi         a4           2    2008     4 auto(… f        21    30 p     comp…
5 audi         a4           2.8  1999     6 auto(… f        16    26 p     comp…
6 audi         a4           2.8  1999     6 manua… f        18    26 p     comp…
7 audi         a4           3.1  2008     6 auto(… f        18    27 p     comp…
8 audi         a4 quattro   1.8  1999     4 manua… 4        18    26 p     comp…
# ℹ 226 more rows

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
    geom_point() +
    labs(title = "Engine Size vs. Fuel Efficiency",
         subtitle = "Dimensions for class",
         x = "Engine displacement (litres)", y = "Highway (mpg)",
         color = "Type of car",
         caption = "Source: http://fueleconomy.gov")

Coding out loud 😃

Start with the mpg data frame

library(ggplot2)
ggplot(data = mpg) #<<

Start with the mpg data frame, map engine displacement to the x-axis

ggplot(data = mpg,
       mapping = aes(x = displ)) #<<
  • displ is the name of a variable (column) in mpg.

  • ggplot2 creates the tick marks and the x-axis label for you.

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis.

ggplot(data = mpg,
       mapping = aes(x = displ,
                     y = hwy)) #<<
  • Specify y = hwy in the same aes() of the mapping argument as x = displ, separated by a comma.

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy)) + 
  geom_point() #<<
  • To define a geometry, add a geom layer.

Don’t forget the + sign at the end of the line!
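
A common slip at this step is where the + goes. The sketch below assumes ggplot2 is installed; the object names p_ok and p_empty are illustrative, and inspecting the layers element is only a convenient way to peek at the plot object. It shows that the + must end a line rather than begin the next one.

```r
library(ggplot2)

# Correct: the line ends with `+`, so R knows the expression continues.
p_ok <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Pitfall: without a trailing `+`, the first line is already a complete
# expression, so the plot is created with no geom layer at all.
p_empty <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
# + geom_point()  # on its own line this is a separate expression and errors

# Count the layers attached to each plot object.
length(p_ok$layers)
length(p_empty$layers)
```

Printing p_ok draws the points, while p_empty shows only blank axes because no geometry was added.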

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point.

ggplot(data = mpg,
       mapping = 
         aes(x = displ, 
             y = hwy, 
             color = class)) + #<<
  geom_point()
  • Add color = class in aes() of the mapping argument, where class is the variable name for type of car.

  • ggplot automatically generates a legend on the right.

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title="Engine Size vs. Fuel Efficiency" #<<
    )
  • Add any labels in the labs() layer.

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”, add the subtitle “Dimensions for class”

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title="Engine Size vs. Fuel Efficiency",
    subtitle="Dimensions for class" #<<
    ) 

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”, add the subtitle “Dimensions for class”, label the x and y axes as “Engine displacement (litres)” and “Highway (mpg)”, respectively

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    subtitle = "Dimensions for class",
    x = "Engine displacement (litres)", #<<
    y = "Highway (mpg)" #<<
    ) 

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”, add the subtitle “Dimensions for class”, label the x and y axes as “Engine displacement (litres)” and “Highway (mpg)”, respectively, label the legend “Type of car”

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    subtitle = "Dimensions for class",
    x = "Engine displacement (litres)", 
    y = "Highway (mpg)",
    color = "Type of car" #<<
    ) 
  • The legend is generated when we map type of car (class) to color.

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”, add the subtitle “Dimensions for class”, label the x and y axes as “Engine displacement (litres)” and “Highway (mpg)”, respectively, label the legend “Type of car”, and add a caption for the data source.

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    subtitle = "Dimensions for class",
    x = "Engine displacement (litres)", 
    y = "Highway (mpg)",
    color = "Type of car",
    caption="Source: http://fueleconomy.gov" #<<
    ) 

Start with the mpg data frame, map engine displacement to the x-axis and map highway miles per gallon to the y-axis. Represent each observation with a point and map type of car (class) to the color of each point. Title the plot “Engine Size vs. Fuel Efficiency”, add the subtitle “Dimensions for class”, label the x and y axes as “Engine displacement (litres)” and “Highway (mpg)”, respectively, label the legend “Type of car”, and add a caption for the data source. Finally, use a discrete color scale that is designed to be perceived by viewers with common forms of color blindness.

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     color = class)) + 
  geom_point() +
  labs(
    title = "Engine Size vs. Fuel Efficiency",
    subtitle = "Dimensions for class",
    x = "Engine displacement (litres)", 
    y = "Highway (mpg)",
    color = "Type of car",
    caption = "Source: http://fueleconomy.gov"
    ) +
  scale_colour_viridis_d() #<<

11-ggplot2

In lab.qmd ## Lab 11 section,

  • Use readr::read_csv() to import the data penguins.csv into your R workspace.

  • Generate the following ggplot:

penguins <- read_csv(_________________)
________ |> 
  ggplot(mapping = ____(x = ______________,
                        y = ______________,
                        colour = ________)) +
  geom______() +
  ____(title = ____________________,
       _________ = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = _____________, y = _______________,
       _______ = "Species",
       _______ = "Source: Palmer Station LTER / palmerpenguins package")

Assign a Plot to an Object

p <- ggplot(data = mpg,
            mapping = 
                aes(x = displ, 
                    y = hwy, 
                    color = class)) + 
    geom_point()
class(p)
[1] "gg"     "ggplot"
p

p + labs(
      title = "Engine Size vs. Fuel Efficiency",
      subtitle = "Dimensions for class",
      x = "Engine displacement (litres)", 
      y = "Highway (mpg)",
      color = "Type of car",
      caption = "Source: http://fueleconomy.gov"
    )
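
Because + returns a new plot rather than changing p in place, one saved base plot can be reused for several variants. A minimal sketch, assuming ggplot2 is installed; the names p_base and p_titled are illustrative and not from the slides, and reading the labels element is an inspection shortcut whose details may vary across ggplot2 versions.

```r
library(ggplot2)

p_base <- ggplot(data = mpg,
                 mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Adding a layer creates a new object; p_base itself is left untouched.
p_titled <- p_base + labs(title = "Engine Size vs. Fuel Efficiency")

p_titled$labels$title         # the title stored on the new plot
is.null(p_base$labels$title)  # the base plot still has no title
```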

Theme Options

Options include

theme_grey() (default), theme_bw(), theme_dark(), theme_classic(), etc.

p + theme_bw()

p + theme_dark()

Add-on 📦: ggthemes

p + ggthemes::theme_economist()

p + ggthemes::theme_fivethirtyeight()

Customize Theme

  • Use theme() to tweak the display of the current theme, including title, axis labels, etc. Check ?theme.
p + theme(
    # linewidth replaces the size argument, deprecated for lines
    # and borders since ggplot2 3.4.0
    panel.background = 
        element_rect(fill = "#FFCC00",
                     colour = "blue",
                     linewidth = 2.5, 
                     linetype = "solid"),
    plot.background = 
        element_rect(fill = "lightblue"),
    axis.line = 
        element_line(linewidth = 0.5, 
                     linetype = "solid",
                     colour = "red")
    )

Aesthetics

Aesthetics options

Commonly used characteristics of the plotted points that can be mapped to a variable in the data are

  • colour
  • shape
  • size
  • alpha (transparency)
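
These slides spell the first aesthetic both as color and as colour. The short check below, assuming ggplot2 is installed, suggests the two spellings are interchangeable because aes() stores both under the same name; m_us and m_uk are illustrative names.

```r
library(ggplot2)

# aes() standardises the American spelling to ggplot2's internal name.
m_us <- aes(x = displ, y = hwy, color = class)
m_uk <- aes(x = displ, y = hwy, colour = class)

names(m_us)
names(m_uk)
identical(names(m_us), names(m_uk))
```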

Colour

ggplot(
    data = mpg,
    mapping = aes(
        x = displ, 
        y = hwy, 
        color = class)) + #<<
    geom_point()

Shape

Mapped to a different variable than colour

ggplot(
    data = mpg,
    mapping = aes(
        x = displ, 
        y = hwy, 
        color = class,
        shape = drv)) + #<<
    geom_point()

Shape

Mapped to same variable as colour

ggplot(
    data = mpg,
    mapping = aes(
        x = displ, 
        y = hwy, 
        color = class,
        shape = class)) + #<<
    geom_point()

Size

ggplot(
    data = mpg,
    mapping = aes(
        x = displ, 
        y = hwy, 
        color = class,
        shape = class,
        size = cty)) + #<<
    geom_point()

Alpha

ggplot(
    data = mpg,
    mapping = aes(
        x = displ, 
        y = hwy, 
        color = class,
        shape = class,
        size = cty,
        alpha = year)) + #<<
    geom_point()

Mapping vs. Setting

Mapping

  • Determine the size, alpha, etc. based on the values of a variable in the data.

  • goes into aes().

ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy, 
                     size = cty, #<<
                     alpha = year)) + #<<
    geom_point()

Mapping vs. Setting

Setting

  • Determine the size, alpha, etc. not based on the values of a variable in the data.

  • goes into geom_*(), outside aes().
ggplot(data = mpg,
       mapping = aes(x = displ, 
                     y = hwy)) +
    geom_point(size = 5, alpha = 0.5) #<<
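
A frequent mix-up between the two is putting a constant inside aes(). The sketch below, assuming ggplot2 is installed, contrasts setting a colour with accidentally mapping it; p_set and p_map are illustrative names, and the aes_params and mapping elements are read here only to show where each value ends up.

```r
library(ggplot2)

# Setting: a constant colour goes outside aes(), as an argument to the geom.
p_set <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Accidental mapping: inside aes(), "blue" is treated as a data value with a
# single level, so ggplot2 picks a colour from its default scale and adds a legend.
p_map <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = "blue"))

p_set$layers[[1]]$aes_params      # constant parameters set on the layer
names(p_map$layers[[1]]$mapping)  # aesthetics mapped in the layer
```

Only p_set draws blue points; p_map draws points in the default scale colour with a legend entry labelled "blue".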

Faceting

  • One way to show additional variables is with aesthetics, but putting all the information in one plot can make it hard to read.

  • Another way, particularly useful for categorical variables, is to split your plot into facets, smaller plots that each display one subset of the data.

  • Useful for exploring conditional relationships and large data.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_point() + 
    facet_grid(drv ~ cyl)  #<<

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_point() + 
    facet_wrap(~ cyl, ncol = 2) #<<
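
To anticipate how many panels each call produces, it can help to tabulate the faceting variables first. This is a rough sketch using base R's table() on the mpg data from ggplot2; the names combos, n_grid_panels, and n_wrap_panels are illustrative.

```r
library(ggplot2)

# Cross-tabulate the two faceting variables used in facet_grid(drv ~ cyl).
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# facet_grid() draws one panel per cell of this table, including cells
# with a count of zero, which appear as empty panels.
n_grid_panels <- length(combos)

# facet_wrap(~ cyl) draws one panel per observed value of cyl.
n_wrap_panels <- length(unique(mpg$cyl))

c(grid = n_grid_panels, wrap = n_wrap_panels)
```

Cells with a zero count in combos correspond to the empty panels in the facet_grid() plot, whereas facet_wrap() only lays out panels for values that occur in the data.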

Facet and Color

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy,
                     color = drv)) +
    geom_point() + 
    facet_grid(drv ~ cyl)

Facet and Color with no Legend

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy, color = drv)) +
    geom_point() + 
    facet_grid(drv ~ cyl) +
    guides(color = "none") #<<
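
The colour legend is redundant here because the facet rows already identify drv. As a hedged aside, assuming ggplot2 is installed, the sketch below sets guides() next to a second way of hiding legends through theme(); the object names are illustrative.

```r
library(ggplot2)

p_facets <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  facet_grid(drv ~ cyl)

# Option 1 (as on the slide): drop only the colour guide.
p_no_colour_guide <- p_facets + guides(color = "none")

# Option 2: hide every legend on the plot through the theme.
p_no_legend <- p_facets + theme(legend.position = "none")
```

Option 1 is the more selective choice when a plot has several legends and only one should disappear.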

12-Faceting

In lab.qmd ## Lab 12 section,

ggplot(data = _______, 
       mapping = aes(x = ______, y = ______, ______ = drv, shape = _____)) +
    geom______(______ = 3, ______ = 0.8) + 
    facet_grid(______ ~ _______) +
    guides(______ = "none")

ggplot for Python

  • plotnine package

  • The syntax is nearly the same as ggplot2 in R.

from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap