Linear Regression

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

What is Regression

  • Regression models the relationship between a numerical response variable \(Y\) and one or more numerical/categorical predictors \(X\). It is a supervised learning method in machine learning.

  • A regression function \(f(X)\) describes how a response variable \(Y\) generally changes as an explanatory variable \(X\) changes.

Examples:

  • college GPA \((Y)\) vs. ACT/SAT score \((X)\)

  • sales \((Y)\) vs. advertising expenditure \((X)\)

  • crime rate \((Y)\) vs. median income level \((X)\)

Simple Linear Regression

\[\begin{align*} y_i &= f(x_i) + \epsilon_i \\ &= \beta_0 + \beta_1~x_{i} + \epsilon_i, \quad i = 1, 2, \dots, n \end{align*}\]

  • \(\beta_0\) and \(\beta_1\) are unknown parameters to be learned or estimated.

What are the assumptions on \(\epsilon_i\)?

\(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\), and hence \(y_i \mid x_i \sim N(\beta_0+\beta_1x_i, \sigma^2)\)
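To make these assumptions concrete, here is a minimal R sketch (not from the slides; the values beta0 = 2, beta1 = 0.5, and sigma = 1 are arbitrary choices for illustration) that simulates data from this model:

set.seed(3570)
n <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 1
x <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)      ## errors: N(0, sigma^2)
y <- beta0 + beta1 * x + eps               ## the assumed data-generating model
plot(x, y, las = 1)
abline(a = beta0, b = beta1, col = "red")  ## the true regression line f(x)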

Simple Linear Regression Assumptions

Ordinary Least Squares (OLS)

Given the training data \((x_1, y_1), \dots, (x_n, y_n)\), use sample statistics \(b_0\) and \(b_1\) computed from the data to

  • inference: estimate \(\beta_0\) and \(\beta_1\)

  • fitting: estimate \(y_i\) or \(f(x_i)\) at \(x_i\) by its fitted value \[\hat{y}_{i} = \hat{f}(x_i) = b_0 + b_1~x_{i}\]

  • prediction: predict \(y_j\) or \(f(x_j)\) at \(x_j\) by its predicted value \[\hat{y}_{j} = \hat{f}(x_j) = b_0 + b_1~x_{j}\] where \((x_j, y_j)\) was not seen or used in training.

  • Ordinary Least Squares: Find \(b_0\) and \(b_1\), i.e., the regression line \(b_0 + b_1x\), that minimizes the sum of squared residuals (closed-form solution below).
  • The residual is \(e_i = y_i - \hat{y}_i\). The sample regression line minimizes \(\sum_{i = 1}^n e_i^2\).
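Setting the partial derivatives of \(\sum_{i=1}^n e_i^2\) with respect to \(b_0\) and \(b_1\) to zero yields the standard closed-form OLS solution:

\[\begin{align*} b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \end{align*}\]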

Visualizing Residuals

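The slides show these plots as figures. A minimal ggplot2 sketch along the same lines (an illustration, not the slides' exact code) draws each residual as a vertical segment from the observed point to the fitted line, using the mpg data introduced below:

library(tidymodels)
m <- lm(hwy ~ displ, data = mpg)
mpg_aug <- mpg |> mutate(fitted = fitted(m))

ggplot(mpg_aug, aes(x = displ, y = hwy)) +
    geom_segment(aes(xend = displ, yend = fitted), color = "grey60") +  ## residuals
    geom_point(alpha = 0.5) +
    geom_line(aes(y = fitted), color = "#003366", linewidth = 1)        ## fitted line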

Fitting and Interpreting Regression Models

Predict Highway MPG hwy from Displacement displ

\[\widehat{hwy}_{i} = b_0 + b_1 \times displ_{i}\]

Tidymodels

Step 1: Specify Model: linear_reg()

library(tidymodels)
parsnip::linear_reg()
Linear Regression Model Specification (regression)

Computational engine: lm 

The parsnip package provides a tidy, unified interface for fitting models.

Step 2: Set Model Fitting Engine

  • By default, linear_reg() uses lm() from the built-in stats package.
linear_reg() |> 
    set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 
show_engines("linear_reg")
# A tibble: 7 × 2
  engine mode      
  <chr>  <chr>     
1 lm     regression
2 glm    regression
3 glmnet regression
4 stan   regression
5 spark  regression
6 keras  regression
7 brulee regression

Step 3: Fit Model & Estimate Parameters

… using formula syntax

linear_reg() |> 
    fit(hwy ~ displ, data = mpg)
parsnip model object


Call:
stats::lm(formula = hwy ~ displ, data = data)

Coefficients:
(Intercept)        displ  
      35.70        -3.53  

\[\widehat{hwy}_{i} = 35.7 -3.53 \times displ_{i}\]

  • Slope: When the engine displacement of a car increases by one litre, the highway miles per gallon is expected to be lower, on average, by 3.53 miles per gallon.

Tidy Look at Model Output

  • Tidymodels output (tibble)
linear_reg() |> 
    fit(hwy ~ displ, data = mpg) |> 
    tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    35.7      0.720      49.6 2.12e-125
2 displ          -3.53     0.195     -18.2 2.04e- 46

\[\widehat{hwy}_{i} = 35.7 -3.53 \times displ_{i}\]

Lab 20: Simple Linear Regression

In the ## Lab 20 section of lab.qmd,

  • Use the mpg data to fit a simple linear regression where \(y\) is hwy and \(x\) is cty.

  • Produce a scatter plot with the fitted regression line by adding the layer geom_smooth(method = "lm", se = FALSE).

library(tidymodels); library(ggplot2)

Quantify Uncertainty about Coefficients

  • Uncertainty about regression coefficients \(\beta_0\) and \(\beta_1\)
reg_out <- linear_reg() |> 
    fit(hwy ~ displ, data = mpg)

reg_out_fit <- reg_out$fit   ## extract the underlying lm object

reg_out_fit$coefficients
(Intercept)       displ 
      35.70       -3.53 
confint(reg_out_fit)
            2.5 % 97.5 %
(Intercept) 34.28  37.12
displ       -3.91  -3.15
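The same estimates and intervals can also be obtained in tibble form; broom's tidy() method accepts a conf.int argument (a usage sketch, assuming the reg_out fit above):

reg_out |> 
    tidy(conf.int = TRUE, conf.level = 0.95)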

Quantify Uncertainty about Mean of \(y\)

  • Uncertainty about the mean value of \(y\) given \(X = x\) \[\mu_{Y \mid X = x} = \beta_0 + \beta_1x\]
new_input <- data.frame(displ = 3:6)

predict(reg_out_fit, newdata = new_input, interval = "confidence", level = 0.95)
   fit  lwr  upr
1 25.1 24.6 25.6
2 21.6 21.0 22.1
3 18.0 17.3 18.8
4 14.5 13.4 15.6

Quantify Uncertainty about Mean of \(y\) (cont.)

p <- ggplot(data = mpg, 
            aes(x = displ, y = hwy)) +
     geom_point(alpha = 0.3) + 
     labs(title = "Highway MPG vs. Engine Displacement",
          x = "Displacement (litres)",
          y = "Highway miles per gallon") +
      coord_cartesian(ylim = c(11, 44))

p_ci <- p + geom_smooth(method = "lm", 
                        color = "#003366", 
                        fill = "blue",
                        se = TRUE)   ## se = TRUE adds the 95% confidence band
p_ci

Quantify Uncertainty about Individual \(y\)

  • Uncertainty about an individual value of \(y\) given \(X = x\), i.e., \(Y \mid X = x\). Prediction intervals are wider than confidence intervals for the mean because they also account for the variability of individual observations around the mean.
predict(reg_out_fit, newdata = new_input, interval = "prediction", level = 0.95)
   fit   lwr  upr
1 25.1 17.53 32.7
2 21.6 14.00 29.2
3 18.0 10.45 25.6
4 14.5  6.88 22.1
## predict at current inputs
pred_y <- predict(reg_out_fit, interval = "prediction")
head(pred_y)
   fit  lwr  upr
1 29.3 21.7 36.9
2 29.3 21.7 36.9
3 28.6 21.0 36.2
4 28.6 21.0 36.2
5 25.8 18.2 33.4
6 25.8 18.2 33.4

Quantify Uncertainty about Individual \(y\) (cont.)

p_ci + 
    geom_line(aes(x = displ, y = pred_y[, "lwr"]), color = "red") +
    geom_line(aes(x = displ, y = pred_y[, "upr"]), color = "red")

Model Checking

Graphical Diagnostics: Residual Plot

  • Residuals should be distributed randomly around 0, with no visible pattern along the x or y axis.
  • Check this by plotting residuals against the fitted values of \(y\): \(e_i\) vs. \(\hat{y}_i\) (see the sketch below).
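A minimal way to draw this plot (a sketch using broom::augment(), not necessarily the slides' code):

aug <- broom::augment(reg_out_fit)   ## adds .fitted and .resid columns

ggplot(aug, aes(x = .fitted, y = .resid)) +
    geom_point(alpha = 0.3) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Fitted values", y = "Residuals")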

Not looking for…

  • Fan shapes
  • Groups of patterns
  • Residuals correlated with predicted values
  • Any patterns!

MPG Data Residuals

plot(reg_out_fit, which = 1, col = "blue", las = 1)

Models with Categorical Predictors

Categorical Predictor with 2 Categories

mpg |>  
  select(hwy, trans) |>  
  print(n = 4)
# A tibble: 234 × 2
    hwy trans     
  <int> <chr>     
1    29 auto(l5)  
2    29 manual(m5)
3    31 manual(m6)
4    30 auto(av)  
# ℹ 230 more rows
mpg_new <- mpg
mpg_new$trans[grepl("auto", mpg_new$trans)] <- "auto"
mpg_new$trans[grepl("manual", mpg_new$trans)] <- "manual"

mpg_new |> 
    select(hwy, trans) |> 
    print(n = 4)
# A tibble: 234 × 2
    hwy trans 
  <int> <chr> 
1    29 auto  
2    29 manual
3    31 manual
4    30 auto  
# ℹ 230 more rows
  • trans = auto: Automatic transmission
  • trans = manual: Manual transmission
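An equivalent tidyverse recoding in one step (a sketch, not the slides' code):

mpg_new <- mpg |> 
    mutate(trans = if_else(grepl("auto", trans), "auto", "manual"))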

Highway MPG & Transmission Type

  • Make sure that your categorical variable is of type character or factor.
typeof(mpg_new$trans)
[1] "character"
linear_reg() |> 
    fit(hwy ~ trans, data = mpg_new) |> 
    tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    22.3      0.458     48.7  8.60e-124
2 transmanual     3.49     0.798      4.37 1.89e-  5
  • The baseline level is chosen to be auto transmission.
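Under the hood, lm() encodes trans as a 0/1 indicator (dummy) variable, with the baseline level auto absorbed into the intercept. A quick way to inspect this (a sketch; the rows shown match the first four cars above):

head(model.matrix(hwy ~ trans, data = mpg_new), 4)
  (Intercept) transmanual
1           1           0
2           1           1
3           1           1
4           1           0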

Highway MPG & Transmission Type (cont.)

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    22.3      0.458     48.7  8.60e-124
2 transmanual     3.49     0.798      4.37 1.89e-  5

\[\widehat{hwy}_{i} = 22.3 + 3.49~trans_i\]

where \(trans_i = 1\) if car \(i\) has manual transmission and \(trans_i = 0\) if auto.

  • Slope: Cars with manual transmission are expected, on average, to get 3.49 more highway miles per gallon than cars with auto transmission.
    • Compare baseline level (trans = auto) to the other level (trans = manual)
  • Intercept: Cars with auto transmission are expected, on average, to have 22.3 highway miles per gallon.

sklearn.linear_model.LinearRegression

## In R: use reticulate to install and run Python
library(reticulate)
py_install("scikit-learn")
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


mpg1 = pd.read_csv('./data/mpg.csv')
# x = np.array(mpg1['displ']).reshape(-1, 1)
x = np.array(mpg1[['displ']]) ## 2d array with one column
y = np.array(mpg1['hwy'])
x[0:4]
array([[1.8],
       [1.8],
       [2. ],
       [2. ]])
y[0:4]
array([29, 29, 31, 30])

sklearn.linear_model.LinearRegression (cont.)

reg = LinearRegression().fit(x, y)  # fit OLS of hwy on displ
reg.coef_       # slope (b1)
array([-3.53058881])
reg.intercept_  # intercept (b0)
35.6976510518446


new_input = np.arange(3, 7, 1).reshape(-1, 1)  # displ = 3, 4, 5, 6
pred = reg.predict(new_input)
pred
array([25.10588463, 21.57529583, 18.04470702, 14.51411821])