MATH/COSC 3570 Introduction to Data Science
Examples of classification problems: Normal vs. Spam/Phishing; Fake vs. True; Normal vs. COVID vs. Smoking.
The response \(Y\) in linear regression is numerical.
In many situations, \(Y\) is categorical!
The process of predicting a categorical response is known as classification.
Predict whether people will default on their credit card payment \((Y)\), yes or no, based on monthly credit card balance \((X)\).
Use the training sample \(\{(x_1, y_1), \dots, (x_n, y_n)\}\) to build a classifier.
Most of the time, we code categories using numbers! \(Y =\begin{cases} 0 & \quad \text{if not default}\\ 1 & \quad \text{if default} \end{cases}\)
First predict the probability of each category of \(Y\).
Predict the probability of default using an S-shaped curve.
Training data \((x_1, y_1), \dots, (x_n, y_n)\) where \(y_i = 1\) (default) or \(y_i = 0\) (not default).
First predict \(P(y_i = 1 \mid x_i) = \pi(x_i) = \pi_i\)
The probability \(\pi\) changes with the value of predictor \(x\)!
\(X =\) balance. \(x_1 = 2000\) has a larger \(\pi_1 = \pi(2000)\) than \(\pi_2 = \pi(500)\) with \(x_2 = 500\).
Credit cards with a higher balance are more likely to default.
\[\pi = \frac{1}{1+\exp(-(\beta_0 + \beta_1x))}\]
Does the logistic function guarantee that \(\pi \in (0, 1)\) for any value of \(\beta_0\), \(\beta_1\), and \(x\)?
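One way to explore the question above numerically is a minimal sketch that evaluates the logistic function over a few coefficient and predictor values. The helper name `logistic_prob` and all values below are illustrative assumptions, not part of the course materials.

```python
import math

def logistic_prob(x, beta0, beta1):
    """Return pi = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical (beta0, beta1) pairs and balances, chosen only for illustration.
settings = [(-3.0, 0.002), (0.0, 0.001), (5.0, -0.01), (-10.0, 0.005)]
xs = [0, 500, 1000, 2000]

probs = [logistic_prob(x, b0, b1) for (b0, b1) in settings for x in xs]

# Every value should fall strictly between 0 and 1.
print(min(probs), max(probs))

# With a positive slope, a larger balance gives a larger probability.
print(logistic_prob(2000, -3.0, 0.002) > logistic_prob(500, -3.0, 0.002))
```

Mathematically, \(\exp(\cdot) > 0\) makes the denominator strictly greater than 1, so \(\pi\) lies strictly between 0 and 1. In floating-point arithmetic, very large values of \(|\beta_0 + \beta_1 x|\) can still round to exactly 0 or 1.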
For \(i = 1, \dots, n\), and with one predictor \(X\): \[(Y_i \mid X = x_i) = \begin{cases} 1 & \quad \text{w/ prob } \pi(x_i)\\ 0 & \quad \text{w/ prob } 1 - \pi(x_i) \end{cases}\]
\[\pi(x_i) = \frac{1}{1+\exp(-(\beta_0+\beta_1 x_{i}))}\]
Goal: Get estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and therefore \(\hat{\pi}\)!
\[\small \hat{\pi} = \frac{1}{1+\exp(-\hat{\beta}_0-\hat{\beta}_1 x)}\]
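The slides do not say how the estimates are computed. One common approach is maximum likelihood, and the sketch below assumes it, using Newton's method on a small hypothetical data set. The function names and the data are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, n_iter=25):
    """Estimate (beta0, beta1) by Newton's method on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the gradient g and the information matrix H (negative Hessian).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: x is height in cm minus 170, y is the 0/1 class label.
xs = [-16, -10, -8, -5, -3, 0, 2, 2, 5, 9, 12, 16]
ys = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

beta0_hat, beta1_hat = fit_logistic(xs, ys)
pi_hat = [sigmoid(beta0_hat + beta1_hat * x) for x in xs]
print(beta0_hat, beta1_hat)
```

The toy classes overlap on purpose: if the two classes were perfectly separated by \(x\), the maximum likelihood estimates would not be finite and this simple loop would not settle.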
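library(tidyverse)
# Load the body measurements and keep only the response and the predictor
bodydata <- read_csv("./data/body.csv")
body <- bodydata |>
    select(GENDER, HEIGHT) |>
    mutate(GENDER = as.factor(GENDER))  # treat GENDER as categorical
body |> slice(1:4)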
# A tibble: 4 × 2
GENDER HEIGHT
<fct> <dbl>
1 0 172
2 1 186
3 0 154.
4 1 160.
GENDER = 1 if Male
GENDER = 0 if Female
Use HEIGHT (centimeter, 1 cm = 0.39 in) to predict/classify GENDER: whether one is male or female.
family = "binomial"
1 2 3 4 5 6
0.7369 0.9880 0.0383 0.1480 0.9383 0.4375
[1] 172 186 154 160 179 167
21-Logistic Regression
In the lab.qmd ## Lab 21 section, use our fitted logistic regression model to predict whether you are male or female! Change 175 to your height (cm).
Use the converter to get your height in cm!
array([[0.24154889]])
array([-40.51727266])
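The two arrays above look like a fitted slope and intercept for HEIGHT. This is an assumption, since the slides do not label them. Under that reading, a minimal sketch plugs them into the logistic function to get the estimated probability that GENDER = 1 at a given height. The 0.5 cutoff used to turn a probability into a class is also an assumption, and the function names are illustrative.

```python
import math

# Assumed reading of the arrays above: slope first, intercept second.
beta1_hat = 0.24154889
beta0_hat = -40.51727266

def predict_prob_male(height_cm):
    """Estimated P(GENDER = 1 | HEIGHT = height_cm) under the assumed coefficients."""
    return 1.0 / (1.0 + math.exp(-(beta0_hat + beta1_hat * height_cm)))

def classify(height_cm, cutoff=0.5):
    """Return 1 (male) if the estimated probability exceeds the cutoff, else 0."""
    return 1 if predict_prob_male(height_cm) > cutoff else 0

# Print height, estimated probability, and predicted class.
for h in [154, 172, 175, 186]:
    print(h, round(predict_prob_male(h), 4), classify(h))
```

Under this reading, heights of 172 cm and 186 cm give probabilities close to the 0.7369 and 0.9880 listed earlier for the same heights, which is consistent with the interpretation. Small differences between the two fits may remain.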
| | True 0 | True 1 |
|---|---|---|
| Predict 0 | True Negative (TN) | False Negative (FN) |
| Predict 1 | False Positive (FP) | True Positive (TP) |
Sensitivity (True Positive Rate) \(= P( \text{predict 1} \mid \text{true 1}) = \frac{TP}{TP+FN}\)
Specificity (True Negative Rate) \(= P( \text{predict 0} \mid \text{true 0}) = \frac{TN}{FP+TN}\)
Accuracy \(= \frac{TP + TN}{TP+FN+FP+TN}\)
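The three formulas above can be sketched as a small helper that takes the four confusion-matrix counts. The function name and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # P(predict 1 | true 1)
    specificity = tn / (fp + tn)                # P(predict 0 | true 0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of correct predictions
    return sensitivity, specificity, accuracy

# Hypothetical counts: 50 true 1s (40 caught) and 50 true 0s (45 caught).
sens, spec, acc = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(sens, spec, acc)
```

With these counts, sensitivity is 40/50, specificity is 45/50, and accuracy is 85/100. Two models can share the same accuracy while trading sensitivity against specificity, so comparing all three is often more informative than accuracy alone.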
Which model performs better?