Visualizing Data 📈

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Categorical vs. Numerical Variables

A categorical (qualitative) variable provides non-numerical information which can be placed in one (and only one) category from two or more categories.

Gender (Male 👨, Female 👩, Other 🏳️‍🌈)
Class (Freshman, Sophomore, Junior, Senior, Graduate)
Country (USA 🇺🇸, Canada 🇨🇦, UK 🇬🇧, Germany 🇩🇪, Japan 🇯🇵, Korea 🇰🇷)

A numerical (quantitative) variable is recorded as a numerical value representing a count or a measurement.

GPA
The number of relationships you’ve had
Height
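
A minimal sketch of how the two kinds of variables usually show up in R: categorical variables are typically stored as factors, numerical ones as numeric vectors. The data frame `students` and its columns are made up for illustration and are not part of the course data.

```r
# Toy data, invented for illustration only
students <- data.frame(
  class_year = factor(c("Freshman", "Senior", "Junior", "Freshman")),  # categorical
  gpa        = c(3.2, 3.8, 2.9, 3.5),                                  # numerical
  height_cm  = c(170, 182, 165, 176)                                   # numerical
)

# Each observation falls in exactly one category
table(students$class_year)

# Arithmetic summaries make sense for numerical variables only
mean(students$gpa)

# Check which columns are categorical (factor) and which are numerical
sapply(students, is.factor)
sapply(students, is.numeric)
```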

Data: Lending Club

Lending Club is a platform that allows individuals to lend to other individuals.
Not all loans are created equal – ease of getting a loan depends on ability to pay back the loan.
The data include only loans that were actually made; they are not loan applications.

Take a Peek at Data

library(openintro) ## for loading the data set
dplyr::glimpse(loans_full_schema)

Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.7, NA, 13.1, NA, N…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.5, 167.5, 71.4, 664.2, 786.9, 153…
$ grade                            <ord> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27016, 4651, 1825, 18853, 21430, 4257…
$ paid_total                       <dbl> 1999, 499, 282, 3313, 2325, 873, 2731…
$ paid_principal                   <dbl> 984, 349, 175, 2747, 1570, 743, 1440,…
$ paid_interest                    <dbl> 1015.2, 150.5, 106.4, 566.1, 754.8, 1…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Selected Variables

loans <- loans_full_schema |> 
    dplyr::select(loan_amount, 
                  interest_rate, 
                  grade, 
                  homeownership, 
                  debt_to_income)
dplyr::glimpse(loans)

Rows: 10,000
Columns: 5
$ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
$ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
$ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
$ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
$ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…

Variable Description

variable	description
`loan_amount`	Amount of the loan received, in US dollars
`interest_rate`	Interest rate on the loan, in an annual percentage
`grade`	Loan grade, which takes values A through G and represents the quality of the loan
`homeownership`	Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income`	Debt-to-income ratio

Variable Types

variable	type
`loan_amount`	numerical, continuous
`interest_rate`	numerical, continuous
`grade`	categorical, ordinal
`homeownership`	categorical, nominal
`debt_to_income`	numerical, continuous
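
A small sketch of the ordinal versus nominal distinction in R, assuming toy values invented for illustration. An ordinal variable can be stored as an ordered factor (the `<ord>` type that `glimpse()` reports for `grade`), while a nominal variable is a plain factor.

```r
# Toy values, invented for illustration only
grade <- factor(c("C", "A", "D", "B", "A"),
                levels = c("A", "B", "C", "D", "E", "F", "G"),
                ordered = TRUE)   # ordinal: the levels have a natural order
homeownership <- factor(c("RENT", "OWN", "MORTGAGE", "RENT", "OWN"))  # nominal

is.ordered(grade)          # TRUE
is.ordered(homeownership)  # FALSE

# Order comparisons are meaningful only for ordinal variables
grade < "C"                # TRUE for the A and B loans
```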

Visualizing Categorical Data

Bar Chart

A bar chart shows the frequency table of a categorical variable.

## geom_bar() uses stat = "count" by default
loans |> ggplot(aes(x = homeownership)) +
    geom_bar()
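
To see that the bar heights are just the frequency table, here is a sketch on a tiny made-up data frame (`toy` is hypothetical). It compares `table()` with the counts that `geom_bar()` computes, extracted with `layer_data()`.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  homeownership = factor(c("RENT", "OWN", "RENT", "MORTGAGE", "RENT", "OWN"))
)

# Frequency table computed directly
freq <- table(toy$homeownership)
freq

# The counts computed internally by geom_bar() (one row per bar)
p <- ggplot(toy, aes(x = homeownership)) + geom_bar()
bar_heights <- layer_data(p)$count
bar_heights
```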

Stacked Bar Chart

(freq_tbl <- 
    as.data.frame(
        table(loans$homeownership)
        ))

      Var1 Freq
1             0
2      ANY    0
3 MORTGAGE 4789
4      OWN 1353
5     RENT 3858

## remove the two categories with zero count
freq_tbl <- freq_tbl[-c(1, 2), ]

## column names
names(freq_tbl) <- c("type","count")
freq_tbl

      type count
3 MORTGAGE  4789
4      OWN  1353
5     RENT  3858

bar <- freq_tbl |> 
    ggplot(aes(x = "", 
               y = count, 
               fill = type)) + 
    geom_bar(stat = "identity")
bar

Pie Chart

pie <- bar + 
    coord_polar(theta = "y")
pie

pie + theme_void()
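
In a pie chart each slice's angle is proportional to its count. A quick sketch of that arithmetic, using the three homeownership counts from the frequency table (the `prop` and `angle` columns are added here for illustration):

```r
# Counts taken from the homeownership frequency table
freq_tbl <- data.frame(
  type  = c("MORTGAGE", "OWN", "RENT"),
  count = c(4789, 1353, 3858)
)

# Share of each category and the angle (in degrees) of its slice
freq_tbl$prop  <- freq_tbl$count / sum(freq_tbl$count)
freq_tbl$angle <- 360 * freq_tbl$prop
freq_tbl
```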

Segmented Bar Plot: Stacked

loans |> ggplot(aes(x = homeownership, 
                    fill = grade)) + #<<
    geom_bar()

Segmented Bar Plot: Compare Proportions

position = "fill" makes each set of stacked bars the same height.

loans |> ggplot(aes(x = homeownership, fill = grade)) +
    geom_bar(position = "fill") #<<
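
What position = "fill" displays can be sketched in base R as row-wise proportions of a two-way table. The data frame `toy` below is made up for illustration; within each homeownership category the grade proportions add up to 1, which is why all bars have the same height.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  homeownership = c("RENT", "RENT", "RENT", "OWN", "OWN",
                    "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE"),
  grade         = c("A", "B", "B", "A", "C",
                    "A", "A", "B", "C")
)

# Two-way table of counts: rows = homeownership, columns = grade
counts <- table(toy$homeownership, toy$grade)

# Proportion of each grade within each homeownership category
props <- prop.table(counts, margin = 1)
props

rowSums(props)  # every row sums to 1
```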

Segmented Bar Plot: Compare Individual Values

position = "dodge" places overlapping objects directly beside one another.

loans |> ggplot(aes(x = homeownership, fill = grade)) +
  geom_bar(position = "dodge") #<<

Customizing Bar Plots

loans |> 
  ggplot(
    aes(x = homeownership,
        fill = homeownership)
  ) + 
  geom_bar(
    color = "blue", 
    width = 0.2, 
    alpha = 0.5
  ) + 
  labs(
    x = "Homeownership", 
    title = "Homeownership Counts"
  ) +
  geom_text(
    aes(label = after_stat(count)), 
    stat = 'count', 
    hjust = 3, 
    color = "red", 
    size = 5
  ) + 
  theme_minimal() + 
  coord_flip() ## same as mapping y = homeownership

stat is short for statistical transformation.
stat = 'count' uses stat_count() to compute the count of each homeownership category.
One reason ggplot2 is powerful is that you can customize your plot with great flexibility.
When you use the fill aesthetic, each bar is colored according to the variable assigned to it.
Because x and fill map to the same variable here, each bar gets its own color.
We can specify some settings in geom_bar(): color is the color of the bar's edge, and we can also control the width and transparency (alpha) of the bars.
We already know labs().
If you want to add text other than axis labels and titles to your plot, use geom_text().
To put the count on each bar, map label to the computed variable after_stat(count) (written ..count.. in older ggplot2 code).
A variable wrapped in after_stat(), or written as ..variable_name.., is an internal variable computed by ggplot2.
To use this internal count variable, we have to set the statistical transformation to stat = "count".
We can adjust the text horizontally through hjust, and we can also specify its color and size as usual.
Finally, I use theme_minimal() and flip the coordinates so that the bars become horizontal. This is the same as mapping y = homeownership.

Visualizing Numerical Data

Histogram

loans |> ggplot(aes(x = loan_amount)) +
    geom_histogram() #<<

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

binwidth = 1000
binwidth = 5000
binwidth = 20000

ggplot(loans, aes(x = loan_amount)) +
    geom_histogram(binwidth = 1000)

ggplot(loans, aes(x = loan_amount)) +
    geom_histogram(binwidth = 5000)

ggplot(loans, aes(x = loan_amount)) +
    geom_histogram(binwidth = 20000)
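
The effect of binwidth can also be checked numerically. The sketch below counts observations per bin with base R's hist() for the three binwidths, on made-up loan amounts. The helper `count_bins()` is hypothetical, and its bin edges start at 0, which may differ from ggplot2's default bin alignment, so treat it as an illustration of the idea rather than a reproduction of the plots.

```r
# Toy loan amounts, invented for illustration only
set.seed(3570)
amount <- round(runif(200, min = 1000, max = 40000))

# Count observations per bin for a given binwidth (hypothetical helper)
count_bins <- function(x, binwidth) {
  breaks <- seq(0, 40000, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

bins_1000  <- count_bins(amount, 1000)
bins_5000  <- count_bins(amount, 5000)
bins_20000 <- count_bins(amount, 20000)

# A wider binwidth means fewer bins: 40, 8, and 2 here
c(length(bins_1000), length(bins_5000), length(bins_20000))
```

A small binwidth shows more detail but is noisier; a large binwidth is smoother but can hide the shape of the distribution.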

Customizing Histograms

loans |> 
  ggplot(
    aes(x = loan_amount)
  ) +
  geom_histogram(
    binwidth = 5000,
    fill = "#003366",  
    colour = "#FFCC00",  
    alpha = 0.8,  
    linetype = "dashed"
  ) +  
  labs(
    x = "Loan amount ($)", 
    y = "Frequency", 
    title = "Lending Club loans"
  ) + 
  theme_light()

ggplot2::geom_histogram()

Fill with a Categorical Variable: Stack

loans |> 
  ggplot(
    aes(x = loan_amount,
        fill = homeownership) #<<
  ) +
  geom_histogram(
    binwidth = 5000
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency"
  ) +
  theme_linedraw()

Fill with a Categorical Variable: Identity (bad)

loans |> 
  ggplot(
    aes(x = loan_amount,
        fill = homeownership)
  ) +
  geom_histogram(
    binwidth = 5000,
    position = "identity" #<<
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Lending Club loans"
  ) +
  theme_minimal()

Why is such a plot bad?

Fill with a Categorical Variable: Identity

loans |> 
  ggplot(
    aes(x = loan_amount,
        fill = homeownership)
  ) +
  geom_histogram(
    binwidth = 5000,
    alpha = 0.5,  #<<
    position = "identity"
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Lending Club loans"
  ) +
  theme_minimal()

Facet with a Categorical Variable

loans |> 
  ggplot(
    aes(x = loan_amount,
        fill = homeownership)
  ) +
  geom_histogram(
    binwidth = 5000
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Lending Club loans"
  ) +
  facet_wrap(
    ~ homeownership, #<<
    nrow = 3  #<<
  )

13-Visualization

In lab.qmd ## Lab 13 section,

Import the data penguins.csv.
Generate the following

# library(tidyverse)
penguins <- read_csv(__________________)
________ |> ggplot(_______________________) +  ## mapping layer  
    ___________________ +  ## geometry layer
    _____________________________  ## label layer

________ |> ggplot(______________________________) +  ## mapping layer  
    _______________ +  ## geometry layer
    _______________ +  ## label layer
    ______________________________  +   ## facet layer
    ______________________________      ## theme layer (set legend.position = "none")

Density Plot

geom_density() uses kernel density estimation to smooth the histogram or our data. (MATH 4750 Statistical Computing)

ggplot(loans, aes(x = loan_amount)) +
    geom_density()  #<<

Let’s continue. Another way to visualize numerical data is with a density plot.
A density plot is basically a smoothed version of a histogram.
Remember that a continuous random variable has a continuous probability distribution, which is drawn as a density curve.
For example, a normal random variable has the bell-shaped normal density curve.
Given a data set of a continuous variable, we can only draw a histogram, which is an approximation to the density curve. The histogram closely resembles the true density curve only when the sample size is very large (approaching infinity).
A density plot smooths the histogram, showing what the true density curve might look like given the data we have.
In ggplot2, the geometry geom_density() creates a density plot.
ggplot2 uses kernel density estimation to smooth the histogram. The kernel here is not the Linux kernel from computer science; it is a probability distribution.
If you are interested, it will be taught in 4750 Computational Statistics, which is a new course starting this fall.
You can see in the graph that the density curve smooths the histogram, but it’s still not very smooth. It’s a little bit jagged.
ggplot2 chooses a level of smoothness for us, and we can adjust the smoothness ourselves.

ggplot(loans, aes(x = loan_amount)) +
    geom_density(adjust = 0.5)

ggplot(loans, aes(x = loan_amount)) +
    geom_density(adjust = 1) # default bandwidth

ggplot(loans, aes(x = loan_amount)) +
    geom_density(adjust = 2)
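
A minimal sketch of what kernel density estimation does, assuming a Gaussian kernel and made-up data. The helper `kde_at()` is hypothetical: the estimate at a point is the average of normal densities centred at each observation, and the bandwidth h controls the smoothness. Base R's density() multiplies its default bandwidth by adjust.

```r
# Toy data, invented for illustration only
set.seed(3570)
x <- c(rnorm(100, mean = 10, sd = 2), rnorm(100, mean = 20, sd = 3))

# Kernel density estimate at x0 with a Gaussian kernel and bandwidth h
# (hypothetical helper): average of scaled normal densities
kde_at <- function(x0, x, h) mean(dnorm((x0 - x) / h)) / h

h <- bw.nrd0(x)        # default bandwidth rule of density()
kde_at(15, x, h)       # estimate with the default smoothness
kde_at(15, x, 2 * h)   # estimate with twice the bandwidth (smoother curve)

# density() reports the bandwidth it actually used
d1 <- density(x, adjust = 1)
d2 <- density(x, adjust = 2)
c(d1$bw, d2$bw)        # the second is twice the first
```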

Customizing Density Plots

loans |> 
  ggplot(
    aes(x = loan_amount)
  ) +
  geom_density(
    adjust = 2,
    fill = "#FFCC00", 
    color = "#003366", 
    alpha = 0.5, 
    linetype = "dashed"
  ) + 
  labs(
    x = "Loan amount ($)", 
    y = "Density", 
    title = "Lending Club loans"
  )

ggplot2::geom_density()

Density Curve on a Histogram

loans |> 
  ggplot(
    aes(x = loan_amount)
  ) + 
  ## scale down to 
  ## match the density
  geom_histogram(
    binwidth = 5000, 
    aes(y = after_stat(density))
  ) + 
  geom_density(
    alpha = 0.1, 
    fill = "#FF6666"
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Lending Club loans"
  )
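
Why the histogram must be put on the density scale: a density curve has total area 1, so the bar heights need to be count / (n × binwidth) for the two layers to be comparable. A sketch of that rescaling on made-up amounts, checked against base R's hist():

```r
# Toy loan amounts, invented for illustration only
amount <- c(2000, 4000, 6000, 7000, 12000, 13000, 14000, 18000, 26000, 33000)
binwidth <- 5000

h <- hist(amount, breaks = seq(0, 35000, by = binwidth), plot = FALSE)

# Density-scale bar height = count / (n * binwidth)
manual_density <- h$counts / (length(amount) * binwidth)
manual_density

# On this scale the total area of the bars is 1
sum(manual_density * binwidth)
```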

Adding a Categorical Variable to Density Plots

loans |> 
  ggplot(
    aes(x = loan_amount,
        fill = homeownership)
  ) +
  geom_density(
    adjust = 2,
    alpha = 0.4
  ) +
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of loans",
    fill = "Homeownership"
  )

Box Plot

loans |> ggplot(aes(x = interest_rate)) +
  geom_boxplot() +
  labs(x = "interest rate (%)")
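
The numbers behind a box plot can be sketched by hand: the box spans the first and third quartiles, the line inside is the median, and points more than 1.5 × IQR beyond the box are drawn as outliers. The vector `rate` is made up for illustration.

```r
# Toy interest rates, invented for illustration only
rate <- c(5, 6, 7, 8, 9, 10, 11, 12, 30)

# Quartiles: edges of the box and the median line
q <- quantile(rate, probs = c(0.25, 0.5, 0.75))
q

# Fences at 1.5 * IQR beyond the box
iqr <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Observations outside the fences are plotted as outlier points
outliers <- rate[rate < lower_fence | rate > upper_fence]
outliers  # 30
```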

Customizing Box Plots

loans |> 
  ggplot(aes(x = interest_rate)) +
  geom_boxplot(
    # custom outliers
    outlier.colour = "red",
    outlier.shape = 8,
    outlier.size = 3,
    # Notch?
    notch = TRUE,
    notchwidth = 0.1,
    # custom boxes
    fill = "#FFCC00",
    colour = "#003366",
    alpha = 0.2
  ) +
  labs(
    x = "Interest rate (%)",
    title = "Interest rates of loans"
  ) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  )

ggplot2::geom_boxplot()

How do we customize our box plot?
For a box plot, we can change the outliers’ color, shape, and size using outlier.colour, outlier.shape, and outlier.size.
I showed you the point shapes and their corresponding numbers when I introduced base R plotting. The same point-shape numbering applies here.
You can also decide whether the box plot has a notch, although I don’t see the importance of this option.
Again, you can customize the box using the fill, colour, and alpha arguments.
One interesting point is that when we show a single box plot like this, the y-axis does not mean anything: the height of the box carries no information.
So we can remove the y-axis tick marks and labels.
How? Because the axis labels and ticks are part of the plotting theme, we go to theme() and set axis.ticks.y and axis.text.y to element_blank().
This is very ggplot2-specific syntax. It is a little harder to read and make sense of, so it takes time to get used to.

Adding a Categorical Variable to Box Plots

loans |> 
  ggplot(
    aes(x = interest_rate,
        y = grade) #<<
  ) + 
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest Rates of Loans",
    subtitle = "by grade of loan"
  )

Adding Two Categorical Variables to Box Plots

loans |> 
  ggplot(
    aes(x = interest_rate,
        y = grade,
        fill = homeownership) #<<
  ) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    fill = "Homeownership",
    title = "Interest Rates of Loans",
    subtitle = "by grade and ownership"
  ) +
  theme(
    legend.position = "bottom"
  )

Adding Points on Box Plots

ggplot(loans, aes(x = homeownership, y = interest_rate)) +
    geom_boxplot() +
    geom_point(alpha = 0.1, shape = 1)  ## alternative: geom_jitter(width = 0.2, alpha = 0.1, shape = 1)

Relationships Between Numerical Variables

Scatterplot

ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
    geom_point(shape = 23, fill = "blue", size = 0.8)  #<<

Hex Plot

ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
    geom_hex()  ## requires the hexbin package

Hex Plot Zoom-in

loans |> 
    filter(debt_to_income < 100) |> 
    ggplot(aes(x = debt_to_income, y = interest_rate)) +
    geom_hex() +
    viridis::scale_fill_viridis()

Other Plots

Line Chart for Time Series `ggplot2::geom_line()`

economics_long |> ggplot(aes(x = date, y = value01, 
                             colour = variable, linetype = variable)) +
    geom_line(linewidth = 1) + 
    theme_bw() +
    theme(legend.position = "bottom")

QQ-plots

Quantile-Quantile plots are used to check whether data are normally distributed (or follow any other specified distribution).

ggplot(mpg, aes(sample = hwy)) + geom_qq() + geom_qq_line()
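
What a normal QQ-plot puts on its two axes can be sketched in base R: the ordered data against the matching quantiles of a standard normal distribution. The data `x` are simulated for illustration, and the comparison below is with base R's qqnorm().

```r
# Toy data, simulated for illustration only
set.seed(3570)
x <- rnorm(50, mean = 25, sd = 5)

n <- length(x)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles (horizontal axis)
sample_q    <- sort(x)            # ordered data (vertical axis)

# If the data are close to normal, the points fall near a straight line
cor(theoretical, sample_q)

# qqnorm() computes the same pairs (returned in the original data order)
qq <- qqnorm(x, plot.it = FALSE)
```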

Violin Plots

Violin plots are similar to box plots, but show the smooth density of the data.

f <- ggplot(loans, aes(x = loan_amount, y = grade))

f + geom_boxplot()

f + geom_violin()

Add-on 📦: `ggridges` for Ridge Plots

library(ggridges)
ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) +
    geom_density_ridges(alpha = 0.9)

Add-on 📦: `ggrepel`

ggrepel provides geoms for ggplot2 to repel overlapping text labels.

library(ggrepel)
p <- mtcars |> filter(wt > 2.75 & wt < 3.45) |> rownames_to_column("car") |> 
    ggplot(aes(wt, mpg, label = car)) +
    geom_point(color = "red")

p + geom_text() + 
    labs(title = "geom_text()")

p + geom_text_repel() + 
    labs(title = "geom_text_repel()")

Wordcloud from `ggwordcloud` Package

library(ggwordcloud)
library(showtext)
where <- font_files()[which(str_detect(font_files()$family, "Arial Unicode MS")), ]
thankyou_words_small |> ggplot(aes(label = word, size = speakers, color = name)) + 
    geom_text_wordcloud(area_corr = TRUE, rm_outside = TRUE, 
                        family = where[1, ]$family) +
    scale_size_area(max_size = 24) + 
    theme_minimal() +
    theme(plot.margin = margin(t = 0,  # Top margin
                               r = 0,  # Right margin
                               b = 0,  # Bottom margin
                               l = 0)) # Left margin

Radar Chart from `fmsb` and `ggradar` Packages

library(fmsb)
radar_data <- readr::read_csv(
    file = "./data/radar_data.csv")
# Color vector
colors_border <- c(rgb(0.2,0.5,0.5,0.9), 
                   rgb(0.8,0.2,0.5,0.9), 
                   rgb(0.7,0.5,0.1,0.9))
colors_in <- c(rgb(0.2,0.5,0.5,0.4), 
               rgb(0.8,0.2,0.5,0.4), 
               rgb(0.7,0.5,0.1,0.4))
radarchart(radar_data, axistype = 1, 
           #custom polygon
           pcol = colors_border, 
           pfcol = colors_in, 
           plwd = 4, plty = 1,
           #custom the grid
           cglcol = "grey", cglty = 1, 
           axislabcol = "grey", 
           caxislabels = seq(0, 20, 5), 
           cglwd = 0.8,
           #custom labels
           vlcex = 1.2)

# legend("topright", legend = rownames(radar_data[-c(1, 2), ]), bty = "n", pch = 20 , 
#        col = colors_in, text.col = "grey", cex = 1.2, pt.cex = 3)

library(ggradar)

ggradar_data <- radar_data |>
    as_tibble(rownames = "group") |>
    ## rescale every column except group to [0, 1]
    mutate(across(-group, scales::rescale)) |>
    tail(3)

ggradar(ggradar_data,
        base.size = 5,
        grid.label.size = 6,
        axis.label.size = 5,
        group.point.size = 3,
        fill.alpha = 0.2,
        grid.line.width = 0.4,
        plot.legend = FALSE,
        fill = TRUE)

Network from `igraph` Package

library(igraph)
network_data <- read_rds(file = "./data/network_data.rds")

# build the graph object
network <- graph_from_adjacency_matrix(network_data)
 
# plot it
plot(network)

More R Graphics Resources