MATH/COSC 3570 Introduction to Data Science
Regression models the relationship between a numerical response variable and one or more predictor variables.
A regression function describes how the response changes, on average, as the predictors change.
Examples:
college GPA
sales
crime rate
What are the assumptions on the error term?
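Given the training data, we would like to do
inference: estimate the model coefficients
fitting: estimate the regression function
prediction: predict the response at new values of the predictor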
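For reference, the simple linear regression model is usually written as below. The symbols are common textbook notation and are an assumption here, since the original notation does not appear in this section; normal errors with constant variance are one common set of assumptions, not necessarily the one intended.

```latex
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n
```

In this notation, inference concerns $\beta_0$ and $\beta_1$, fitting concerns the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, and prediction concerns the response at a new value of $x$.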
Predict highway MPG (hwy) from engine displacement (displ)
linear_reg()
Linear Regression Model Specification (regression)
Computational engine: lm
The parsnip package provides a tidy, unified interface for fitting models.
The lm engine uses lm() in the built-in stats package.
Linear Regression Model Specification (regression)
Computational engine: lm
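A minimal sketch of the specification step, assuming the parsnip package is installed. The object name lm_spec is illustrative and does not come from the slides.

```r
library(parsnip)

# Specify a linear regression model and choose the lm engine.
# `lm_spec` is a hypothetical name used only for this sketch.
lm_spec <- linear_reg() |>
  set_engine("lm")

# Printing the specification shows the model type and the engine.
print(lm_spec)
```

Nothing is estimated at this point. The specification only records what kind of model will be fit and which function will do the fitting.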
… using formula syntax
parsnip model object
Call:
stats::lm(formula = hwy ~ displ, data = data)
Coefficients:
(Intercept) displ
35.70 -3.53
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 35.7 0.720 49.6 2.12e-125
2 displ -3.53 0.195 -18.2 2.04e- 46
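The fit and the coefficient table above could be produced along the following lines. This is a sketch, assuming the data are the mpg data set from ggplot2 (the estimates shown above agree with that data). The names mpg_data, displ_fit, and displ_tidy are illustrative.

```r
library(parsnip)
library(broom)

# Assumption: the data are ggplot2's mpg data set.
mpg_data <- ggplot2::mpg

# Specify the model, set the engine, then fit using formula syntax.
displ_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(hwy ~ displ, data = mpg_data)

# tidy() returns one row per coefficient: estimate, standard error,
# test statistic, and p-value.
displ_tidy <- tidy(displ_fit)
print(displ_tidy)
```

The underlying lm object is stored in displ_fit$fit, so base R tools such as summary() can still be applied to it.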
library(ggplot2)

# Scatterplot of highway MPG against engine displacement
p <- ggplot(data = reg_out_fit,
            aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  labs(title = "Highway MPG vs. Engine Displacement",
       x = "Displacement (litres)",
       y = "Highway miles per gallon") +
  coord_cartesian(ylim = c(11, 44))

# Add the least-squares line with its confidence band (se = TRUE)
p_ci <- p + geom_smooth(method = "lm",
                        color = "#003366",
                        fill = "blue",
                        se = TRUE)
fit lwr upr
1 25.1 17.53 32.7
2 21.6 14.00 29.2
3 18.0 10.45 25.6
4 14.5 6.88 22.1
fit lwr upr
1 29.3 21.7 36.9
2 29.3 21.7 36.9
3 28.6 21.0 36.2
4 28.6 21.0 36.2
5 25.8 18.2 33.4
6 25.8 18.2 33.4
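Interval output of this kind can be obtained from predict() with interval = "prediction". The sketch below calls base R lm() directly, which is the function the lm engine uses. Two things are assumed: the data are ggplot2's mpg, and the new displacement values are 3, 4, 5, and 6 litres (chosen because they reproduce the fitted values in the first table). All object names are illustrative.

```r
# Assumption: the data are ggplot2's mpg data set.
mpg_data <- ggplot2::mpg
displ_lm <- lm(hwy ~ displ, data = mpg_data)

# Prediction intervals at new displacement values (assumed values).
new_cars <- data.frame(displ = c(3, 4, 5, 6))
pred_new <- predict(displ_lm, newdata = new_cars, interval = "prediction")
print(round(pred_new, 2))

# Prediction intervals at the observed displacement values; show the first six.
pred_train <- predict(displ_lm, newdata = mpg_data, interval = "prediction")
print(head(round(pred_train, 1)))
```

Replacing "prediction" with "confidence" gives the narrower interval for the mean response rather than for a single new car.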
Fan shapes
Groups of patterns
Residuals correlated with predicted values
Any patterns!
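To look for these patterns, plot the residuals against the predicted values. A minimal sketch, assuming the hwy ~ displ model is fit to ggplot2's mpg data; displ_lm, resid_df, and p_resid are illustrative names.

```r
library(ggplot2)

# Assumption: the model is hwy ~ displ on ggplot2's mpg data.
displ_lm <- lm(hwy ~ displ, data = mpg)

# Collect predicted values and residuals in one data frame.
resid_df <- data.frame(
  fitted = fitted(displ_lm),
  resid  = resid(displ_lm)
)

# Residuals vs. predicted values, with a dashed reference line at zero.
p_resid <- ggplot(resid_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted highway MPG", y = "Residual")
```

A well-behaved fit shows a structureless band of points around the dashed line; a fan or a curve suggests the model assumptions are in doubt.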
trans = auto: Automatic transmission
trans = manual: Manual transmission
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 22.3 0.458 48.7 8.60e-124
2 transmanual 3.49 0.798 4.37 1.89e- 5
Cars with manual transmission are expected, on average, to get 3.49 more highway miles per gallon than cars with auto transmission.
The slope compares the baseline level (trans = auto) to the other level (trans = manual).
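The baseline comparison can be checked against the group means. This is a sketch under two assumptions: the data are ggplot2's mpg, and trans was recoded to the two levels auto and manual by collapsing labels such as "auto(l5)" and "manual(m5)". The recoding step and all object names are illustrative, not taken from the slides.

```r
# Assumption: the data are ggplot2's mpg data set.
mpg_data <- ggplot2::mpg

# Assumption: collapse the detailed transmission labels to two levels.
mpg_data$trans <- ifelse(startsWith(mpg_data$trans, "auto"), "auto", "manual")

# lm() treats the character column as a factor; "auto" sorts first,
# so it becomes the baseline level.
trans_lm <- lm(hwy ~ trans, data = mpg_data)
print(coef(trans_lm))

# The intercept is the mean of the baseline group, and the slope is the
# difference between the two group means.
group_means <- tapply(mpg_data$hwy, mpg_data$trans, mean)
print(group_means)
print(group_means[["manual"]] - group_means[["auto"]])
```

With a single two-level predictor, the regression is therefore a comparison of two group means.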