MATH/COSC 3570 Introduction to Data Science
The model performs well on the training data, but predicts the response poorly on the new data we are interested in.
Often, we don’t have another unused data set to assess the performance of our model.
Solution: Pretend we have new data by splitting our data into a training set and a test set (validation set)!
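As a preview of how this is done in the lab code later, here is a minimal sketch using rsample (attached by tidymodels); the data frame name df is a placeholder:
library(tidymodels)
## hold out 20% of the rows as a test set
set.seed(2024)                                   # make the split reproducible
df_split <- initial_split(data = df, prop = 0.8) # df is a placeholder data frame
df_trn <- training(df_split)                     # 80% of rows: fit the model here
df_tst <- testing(df_split)                      # 20% of rows: assess the model here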
What is the value of \(J\) in our example?
KNN classification uses majority voting:
Look for the most popular class label among its neighbors.
When predicting at \(x = (x_1, x_2) = (8, 6)\),
\[\begin{align} \hat{\pi}_{3Blue}(x = (8, 6)) &= \hat{P}(Y = \text{Blue} \mid x = (8, 6))\\ &= \frac{2}{3} \end{align}\]
\[\begin{align} \hat{\pi}_{3Orange}(x = (8, 6)) &= \hat{P}(Y = \text{Orange} \mid x = (8, 6))\\ &= \frac{1}{3} \end{align}\]
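A small sketch of this voting rule in R: the toy coordinates below are made up for illustration (they are not the exact points in the figure), but they are arranged so that 2 of the 3 nearest neighbors of \((8, 6)\) are Blue.
library(tidymodels)
## toy training points (hypothetical coordinates)
toy <- tibble(
  x1 = c(7, 8.5, 9, 2, 3),
  x2 = c(6.5, 5, 7, 2, 9),
  class = c("Blue", "Blue", "Orange", "Orange", "Blue")
)
## Euclidean distance from each training point to x = (8, 6)
toy |>
  mutate(dist = sqrt((x1 - 8)^2 + (x2 - 6)^2)) |>
  slice_min(dist, n = 3) |>    # keep the K = 3 nearest neighbors
  count(class) |>
  mutate(prop = n / sum(n))    # estimated class probabilities: 2/3 Blue, 1/3 Orange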
The blue grid indicates the region in which a test response is assigned to the blue class.
We don’t know the true boundary (the true classification rule)!
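A sketch of how such a decision-region plot can be produced: predict the class on a fine grid of predictor values and shade each grid cell by the predicted class. This sketch reuses the knn_out fit and df_trn from the body-data code later in this lecture; for the Blue/Orange toy example the idea is identical, only the variable names differ.
library(tidymodels)
## grid of predictor values covering the training data
grid <- crossing(
  HEIGHT = seq(min(df_trn$HEIGHT), max(df_trn$HEIGHT), length.out = 200),
  WAIST = seq(min(df_trn$WAIST), max(df_trn$WAIST), length.out = 200)
)
grid <- grid |>
  mutate(pred = predict(knn_out, new_data = grid)$.pred_class)
ggplot(grid, aes(HEIGHT, WAIST, fill = pred)) +
  geom_tile(alpha = 0.3) +                                  # shaded decision regions
  geom_point(data = df_trn, aes(HEIGHT, WAIST, color = GENDER),
             inherit.aes = FALSE)                           # training points on top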
recipes::recipe()
Standardize predictors before doing KNN!
parsnip::nearest_neighbor()
workflows::workflow()
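The workflow below assumes a recipe and a model specification have already been created. A minimal sketch, using the body data from the lab (GENDER predicted from HEIGHT and WAIST):
library(tidymodels)
## recipe: declare the formula and standardize the predictors
knn_recipe <- recipe(GENDER ~ HEIGHT + WAIST, data = df_trn) |>
  step_normalize(all_predictors())
## model specification: 3-nearest-neighbor classifier (kknn engine by default)
knn_mdl <- nearest_neighbor(neighbors = 3, mode = "classification")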
knn_out <-
workflows::workflow() |>
add_recipe(knn_recipe) |>
add_model(knn_mdl) |>
fit(data = df_trn)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5))
Type of response variable: nominal
Minimal misclassification: 0.217
Best kernel: optimal
Best k: 3
\(K\)-nearest neighbors has no model parameters, but it does have a tuning parameter \(K\).
This is a parameter which determines how the model is trained, not a parameter that is learned through training.
Use \(v\)-fold Cross Validation (CV) to choose tuning parameters. (MATH 4750 Computational Statistics)
Usually use \(v = 5\) or \(10\).
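A minimal sketch of tuning \(K\) with 5-fold CV in tidymodels (not part of the lab; knn_recipe and df_trn are as defined in the lab code below, and the grid of candidate \(K\) values is arbitrary):
library(tidymodels)
## mark neighbors as a parameter to be tuned
knn_tune_mdl <- nearest_neighbor(neighbors = tune(), mode = "classification")
knn_tune_wf <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_mdl)
## 5-fold cross validation on the training set
set.seed(2024)
folds <- vfold_cv(df_trn, v = 5)
knn_res <- tune_grid(
  knn_tune_wf,
  resamples = folds,
  grid = tibble(neighbors = c(1, 3, 5, 7, 9, 11))   # candidate K values
)
show_best(knn_res, metric = "accuracy")    # K values ranked by CV accuracy
select_best(knn_res, metric = "accuracy")  # the K we would refit with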
knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)
KNeighborsClassifier(n_neighbors=3)
array([[22, 5],
[12, 21]])
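From this confusion matrix, the test accuracy is the proportion of correct predictions on the diagonal: \((22 + 21) / (22 + 5 + 12 + 21) = 43/60 \approx 0.72\).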
22 - K-Nearest Neighbors
In lab.qmd, under the ## Lab 22 section, use HEIGHT and WAIST to predict GENDER using KNN with \(K = 3\).
Generate the (test) confusion matrix.
Calculate the (test) accuracy rate.
Does using more predictors give better predictions? (A sketch for trying this follows the R solution below.)
library(tidymodels)
library(readr)   # read_csv() comes from readr, which tidymodels does not attach
## load data
bodydata <- read_csv("./data/body.csv")
body <- bodydata |>
select(GENDER, HEIGHT, WAIST) |>
mutate(GENDER = as.factor(GENDER))
## training and test data
set.seed(2024)
df_split <- initial_split(data = body, prop = 0.8)
df_trn <- training(df_split)
df_tst <- testing(df_split)
## KNN training
knn_recipe <- recipe(GENDER ~ HEIGHT + WAIST, data = df_trn) |>
step_normalize(all_predictors())
knn_mdl <- nearest_neighbor(neighbors = 3, mode = "classification")
knn_out <- workflow() |>
add_recipe(knn_recipe) |>
add_model(knn_mdl) |>
fit(data = df_trn)
## KNN prediction
knn_pred <- pull(predict(knn_out, df_tst))
table(knn_pred, df_tst$GENDER)
mean(knn_pred == df_tst$GENDER)
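To explore the last question (does adding predictors help?), only the select() step and the recipe formula need to change. A sketch, assuming body.csv contains additional measurement columns; WEIGHT below is hypothetical, so check names(bodydata) first:
## hypothetical third predictor: replace WEIGHT with a real column name,
## and also keep that column in the select() step when building `body` above
knn_recipe3 <- recipe(GENDER ~ HEIGHT + WAIST + WEIGHT, data = df_trn) |>
  step_normalize(all_predictors())
knn_out3 <- workflow() |>
  add_recipe(knn_recipe3) |>
  add_model(knn_mdl) |>
  fit(data = df_trn)
knn_pred3 <- pull(predict(knn_out3, df_tst))
mean(knn_pred3 == df_tst$GENDER)   # compare with the two-predictor accuracy above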
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
## load data
body = pd.read_csv('./data/body.csv')
X = body[['HEIGHT', 'WAIST']]
y = body['GENDER']
## training and test data
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=2024)
## KNN training
knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)
## KNN prediction
y_pred = knn.predict(X_tst)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_tst, y_pred)
np.mean(y_tst == y_pred)
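Note that this Python solution uses HEIGHT and WAIST on their original scales; to mirror the step_normalize() step in the R recipe, the predictors could also be standardized before fitting (scikit-learn provides StandardScaler for this).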