MATH/COSC 3570 Introduction to Data Science
Regression is a supervised learning method in machine learning that models the relationship between a numerical response variable \((Y)\) and one or more numerical or categorical predictors \((X)\).
A regression function \(f(X)\) describes how a response variable \(Y\) generally changes as an explanatory variable \(X\) changes.
Examples:
college GPA \((Y)\) vs. ACT/SAT score \((X)\)
sales \((Y)\) vs. advertising expenditure \((X)\)
crime rate \((Y)\) vs. median income level \((X)\)
\[\begin{align*} y_i &= f(x_i) + \epsilon_i \\ &= \beta_0 + \beta_1~x_{i} + \epsilon_i, \quad i = 1, 2, \dots, n \end{align*}\]
What are the assumptions on \(\epsilon_i\)?
\(\epsilon_i \sim N(0, \sigma^2)\) and hence \(y_i \mid x_i \sim N(\beta_0+\beta_1x_i, \sigma^2)\)
Given the training data \((x_1, y_1), \dots, (x_n, y_n)\), use sample statistics \(b_0\) and \(b_1\) computed from the data to
inference: estimate \(\beta_0\) and \(\beta_1\)
fitting: estimate \(y_i\) or \(f(x_i)\) at \(x_i\) by its fitted value \[\hat{y}_{i} = \hat{f}(x_i) = b_0 + b_1~x_{i}\]
prediction: predict \(y_j\) or \(f(x_j)\) at a new \(x_j\) by its predicted value \[\hat{y}_{j} = \hat{f}(x_j) = b_0 + b_1~x_{j}\] where \((x_j, y_j)\) was never seen or used in training; see the simulation sketch after this list.
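A minimal simulation sketch of estimation, fitting, and prediction (the true values \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 2\) and all object names here are illustrative assumptions, not from the slides):

set.seed(3570)
n <- 100
x <- runif(n, 0, 10)
epsilon <- rnorm(n, mean = 0, sd = 2)        # epsilon_i ~ N(0, sigma^2)
y <- 2 + 0.5 * x + epsilon                   # y_i = beta_0 + beta_1 x_i + epsilon_i
fit <- lm(y ~ x)
coef(fit)                                    # b_0, b_1 estimate beta_0 = 2, beta_1 = 0.5
predict(fit, newdata = data.frame(x = 7))    # predicted value at a new x_j = 7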
Example: predict highway miles per gallon (hwy) from engine displacement (displ):
\[\widehat{hwy}_{i} = b_0 + b_1 \times displ_{i}\]
The parsnip package provides a tidy, unified interface for fitting models. linear_reg() creates a linear regression model specification whose default computational engine is lm() in the built-in stats package:

linear_reg()

Linear Regression Model Specification (regression)

Computational engine: lm
Fit the model using formula syntax:
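The parsnip model object printed below is consistent with a fit call like the following (the object name reg_fit and the use of the mpg data are assumptions; parsnip's internally constructed call is why the output shows data = data):

library(tidymodels)
# Specify a linear regression and fit it with formula syntax
reg_fit <- linear_reg() |>
  fit(hwy ~ displ, data = mpg)
reg_fit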
parsnip model object

Call:
stats::lm(formula = hwy ~ displ, data = data)

Coefficients:
(Intercept)        displ
      35.70        -3.53
\[\widehat{hwy}_{i} = 35.7 -3.53 \times displ_{i}\]
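The coefficient table below is the kind of tibble returned by broom's tidy() on the fitted model (assuming the reg_fit object sketched above):

tidy(reg_fit)   # one row per term: estimate, std.error, statistic, p.value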
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    35.7      0.720      49.6 2.12e-125
2 displ          -3.53     0.195     -18.2 2.04e- 46
library(ggplot2)
# Scatterplot of the raw data; mpg is assumed here in place of the
# slides' reg_out_fit object, which is not defined in this excerpt
p <- ggplot(data = mpg,
            aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  labs(title = "Highway MPG vs. Engine Displacement",
       x = "Displacement (litres)",
       y = "Highway miles per gallon") +
  coord_cartesian(ylim = c(11, 44))
# Overlay the least-squares line with its 95% confidence band
p_ci <- p + geom_smooth(method = "lm",
                        color = "#003366",
                        fill = "blue",
                        se = TRUE)
p_ci
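The two interval tables below look like output from stats::predict() with interval = "prediction". A sketch that would reproduce them, assuming a plain lm fit on mpg (the object name lm_fit is an assumption):

lm_fit <- lm(hwy ~ displ, data = mpg)
# 95% prediction intervals at four new displacement values
predict(lm_fit, newdata = data.frame(displ = c(3, 4, 5, 6)),
        interval = "prediction")
# 95% prediction intervals at the first six training observations
predict(lm_fit, newdata = head(mpg), interval = "prediction")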
Prediction intervals at new values displ = 3, 4, 5, 6:

   fit   lwr  upr
1 25.1 17.53 32.7
2 21.6 14.00 29.2
3 18.0 10.45 25.6
4 14.5  6.88 22.1

(The same four point predictions computed in Python print as array([25.10588463, 21.57529583, 18.04470702, 14.51411821]).)
Prediction intervals at the first six training observations (displ = 1.8, 1.8, 2.0, 2.0, 2.8, 2.8):

   fit  lwr  upr
1 29.3 21.7 36.9
2 29.3 21.7 36.9
3 28.6 21.0 36.2
4 28.6 21.0 36.2
5 25.8 18.2 33.4
6 25.8 18.2 33.4
Warning signs in a residual plot, any of which suggests the linear model is inadequate:
fan shapes (non-constant variance)
groups of patterns
residuals correlated with the predicted values
any systematic pattern!
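A sketch of a residual plot for checking these, using broom's augment() and the lm_fit object assumed in the prediction sketch above:

library(broom)
aug <- augment(lm_fit)   # adds .fitted and .resid columns
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted hwy", y = "Residual")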
A categorical predictor: trans, with two levels: trans = auto (automatic transmission) and trans = manual (manual transmission).

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    22.3      0.458      48.7 8.60e-124
2 transmanual     3.49     0.798      4.37 1.89e-  5
The reference (baseline) level is auto transmission, so the intercept 22.3 is the estimated average hwy for cars with an automatic transmission.
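A sketch of how this fit might be produced; collapsing mpg's detailed transmission codes to two levels, and all object names, are assumptions:

library(tidyverse)
library(broom)
# Collapse e.g. "auto(l5)" / "manual(m5)" to "auto" / "manual"
mpg2 <- mpg |>
  mutate(trans = if_else(str_detect(trans, "^auto"), "auto", "manual"))
trans_fit <- lm(hwy ~ trans, data = mpg2)
tidy(trans_fit)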
\[\widehat{hwy}_{i} = 22.3 + 3.49 \times transmanual_{i}\] where \(transmanual_i = 1\) for a manual transmission and \(0\) for an automatic.
The slope 3.49 is the estimated change in average hwy when moving from the baseline level (trans = auto) to the other level (trans = manual). For example, the predicted hwy for a manual car is \(22.3 + 3.49 \times 1 = 25.8\) mpg, versus \(22.3\) mpg for an automatic; see the check below.
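A quick check of the two predicted group means with predict(), reusing the trans_fit object assumed in the sketch above:

predict(trans_fit, newdata = data.frame(trans = c("auto", "manual")))
# approximately 22.3 (auto) and 25.8 (manual), matching the coefficients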