Data Splitting and K-Nearest Neighbors

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Training and Test Data

Prediction

  • Goal: Build a good regression function or classifier in terms of prediction accuracy.
  • The mechanics of prediction are easy (a small example follows this list):
    • Plug the values of the predictors into the model equation.
    • Calculate the predicted value of the response \(\hat{y}\).
  • Getting it right is hard! No guarantee that
    • the model estimates are close to the truth
    • your model performs as well on new data (test data) as it did on your sample data (training data)
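For example, a minimal sketch of these mechanics with a simple linear model on R's built-in mtcars data (not the course data):

## Fit a regression model on the available (training) data
fit <- lm(mpg ~ wt, data = mtcars)

## Prediction: plug a new predictor value (wt = 3) into the estimated
## model equation to get the predicted response y-hat
predict(fit, newdata = data.frame(wt = 3))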

Spending Our Data

  • Several steps to create a useful model:
    • Parameter estimation
    • Model selection
    • Performance assessment, etc.
  • Doing all of this with the entire data set may lead to overfitting:


The model performs well on the training data, but predicts the response poorly on the new data we are interested in.


Overfitting

The model performs well on the training data, but predicts the response poorly on the new data we are interested in.

  • Low error rate on observed data, but high prediction error rate on future unobserved data!

Source: modified from https://i.pinimg.com/originals/72/e2/22/72e222c1542539754df1d914cb671bd7.png

Splitting Data

  • Often, we don’t have another unused data set to assess the performance of our model.

  • Solution: Pretend we have new data by splitting our data into a training set and a test set (validation set)!

  • Training set:
    • Sandbox for model building
    • Spend most of our time using the training set to develop the model
    • Majority of the original sample data (75% - 80%)
  • Test set:
    • Held in reserve to determine efficacy of one or two chosen models
    • Critical to look at it once only, otherwise it becomes part of the modeling process
    • Remainder of the data (20% - 25%)

initial_split() in rsample

bodydata <- read_csv("./data/body.csv")
body <- bodydata |> 
    select(GENDER, HEIGHT, WAIST, BMI) |> 
    mutate(GENDER = as.factor(GENDER))


set.seed(2024)
df_split <- 
    rsample::initial_split(
        data = body, 
        prop = 0.8)

df_split
<Training/Testing/Total>
<240/60/300>
df_trn <- rsample::training(df_split)
df_tst <- rsample::testing(df_split)

dim(df_trn)
[1] 240   4
dim(df_tst)
[1] 60  4

body Data

df_trn
# A tibble: 240 × 4
  GENDER HEIGHT WAIST   BMI
  <fct>   <dbl> <dbl> <dbl>
1 1        188.  82.3  20.2
2 0        160   94.8  27.3
3 0        160.  88.7  26.1
4 1        180   88.3  22.6
5 0        168   81.2  21.9
6 1        176. 101.   29.2
# ℹ 234 more rows
df_tst
# A tibble: 60 × 4
  GENDER HEIGHT WAIST   BMI
  <fct>   <dbl> <dbl> <dbl>
1 1        181.  92.5  27.4
2 0        155. 101.   31.4
3 0        156. 103    29.3
4 0        164.  90.6  21.6
5 0        165.  95.2  27.1
6 1        178.  95.3  25.5
# ℹ 54 more rows

What Makes a Good Classifier: Test Accuracy Rate

  • The test accuracy rate associated with the test data \(\{x_j, y_j\}_{j=1}^J\) is \[ \frac{1}{J}\sum_{j=1}^J I(y_j = \hat{y}_j),\] where \(\hat{y}_j\) is the predicted label obtained by applying the classifier to the test predictor \(x_j\), and \(y_j\) is the corresponding observed test label. (A small numeric illustration follows below.)

What is the value of \(J\) in our example?

  • The best estimate \(\hat{C}(x)\) of the true classifier \(C(x)\), trained on the training data, can be defined as the one producing the highest test accuracy rate or, equivalently, the lowest test error rate.
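In R, this rate is just the average of element-wise comparisons between observed and predicted labels. A tiny illustration with made-up labels (y_tst and y_hat are hypothetical vectors here, with \(J = 5\)):

## Hypothetical observed and predicted test labels
y_tst <- c(0, 1, 1, 0, 1)
y_hat <- c(0, 1, 0, 0, 1)

## Test accuracy rate: average of the indicators I(y_j = y_hat_j)
mean(y_tst == y_hat)   # 0.8

## Test error rate
mean(y_tst != y_hat)   # 0.2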

K-Nearest Neighbors (KNN) Classifier

KNN classification uses majority voting:

Look for the most popular class label among the \(K\) nearest neighbors of the point we want to classify.

When predicting at \(x = (x_1, x_2) = (8, 6)\),

\[\begin{align} \hat{\pi}_{3Blue}(x = (8, 6)) &= \hat{P}(Y = \text{Blue} \mid x = (8, 6))\\ &= \frac{2}{3} \end{align}\]

\[\begin{align} \hat{\pi}_{3Orange}(x = (8, 6)) &= \hat{P}(Y = \text{Orange} \mid x = (8, 6))\\ &= \frac{1}{3} \end{align}\]
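The figure itself is not reproduced here, but a minimal base-R sketch of this vote follows; the coordinates below are made up, chosen so that the three nearest neighbors of \(x = (8, 6)\) are two Blue points and one Orange point, matching the \(2/3\) vs \(1/3\) probabilities above.

## Toy training points and labels (hypothetical coordinates)
x1 <- c(7, 9, 8.5, 2, 3)
x2 <- c(6.5, 5.5, 7, 2, 9)
y  <- c("Blue", "Blue", "Orange", "Orange", "Blue")

## New point to classify
x_new <- c(8, 6)

## Euclidean distance from the new point to every training point
d <- sqrt((x1 - x_new[1])^2 + (x2 - x_new[2])^2)

## Indices of the K = 3 nearest neighbors
nn <- order(d)[1:3]

## Estimated class probabilities (2/3 Blue, 1/3 Orange) and the majority vote
table(y[nn]) / 3
names(which.max(table(y[nn])))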

KNN Decision Boundary

  • The blue grid indicates the region in which a test observation is assigned to the blue class.

  • We don’t know the true boundary (the true classification rule)!

KNN Training: recipes

Standardize predictors before doing KNN! KNN is based on distances between observations, so predictors on larger scales would otherwise dominate the neighbor search.

knn_recipe <- recipes::recipe(GENDER ~ HEIGHT, data = df_trn) |> 
    step_normalize(all_numeric_predictors())
── Recipe ────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 1

── Operations 
• Centering and scaling for: all_numeric_predictors()

KNN Training: parsnip

(knn_mdl <- parsnip::nearest_neighbor(mode = "classification", neighbors = 3))
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3

Computational engine: kknn 

KNN Training: workflows


knn_out <- 
    workflows::workflow() |> 
    add_recipe(knn_recipe) |> 
    add_model(knn_mdl) |> 
    fit(data = df_trn)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3,     data, 5))

Type of response variable: nominal
Minimal misclassification: 0.217
Best kernel: optimal
Best k: 3

Prediction on Test Data

bind_cols(
    predict(knn_out, df_tst),
    predict(knn_out, df_tst, type = "prob")) |> 
    dplyr::sample_n(size = 8)
# A tibble: 8 × 3
  .pred_class .pred_0 .pred_1
  <fct>         <dbl>   <dbl>
1 0             1       0    
2 0             0.852   0.148
3 0             1       0    
4 0             1       0    
5 1             0.481   0.519
6 0             0.852   0.148
7 0             1       0    
8 1             0.370   0.630
knn_pred <- 
    pull(predict(knn_out, df_tst))

## Confusion matrix
table(knn_pred, df_tst$GENDER)
        
knn_pred  0  1
       0 25  7
       1  7 21
## Test accuracy rate
mean(knn_pred == df_tst$GENDER)
[1] 0.767
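The same confusion matrix and accuracy rate can also be computed with yardstick (loaded with tidymodels); this is only an alternative to the base-R table() and mean() above, and assumes the knn_out and df_tst objects defined earlier:

## Collect predictions and truth in one tibble
knn_res <- bind_cols(
    predict(knn_out, df_tst),
    df_tst |> select(GENDER))

## Confusion matrix and test accuracy rate via yardstick
conf_mat(knn_res, truth = GENDER, estimate = .pred_class)
accuracy(knn_res, truth = GENDER, estimate = .pred_class)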

Which K Should We Use?

  • \(K\)-nearest neighbors has no model parameters, but it does have a tuning parameter \(K\).

  • This is a parameter that determines how the model is trained, not a parameter that is learned through training.

\(v\)-fold Cross Validation

  • Use \(v\)-fold Cross Validation (CV) to choose tuning parameters. (MATH 4750 Computational Statistics)

  • Usually use \(v = 5\) or \(10\).

  • IDEA:
    • Prepare \(v\) CV data sets
    • Create a sequence of candidate values of \(K\)
    • For each value of \(K\), run CV and obtain an accuracy rate
    • Choose the \(K\) with the highest accuracy rate (a tidymodels sketch follows this list)
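A sketch of this procedure with tidymodels, reusing knn_recipe and df_trn from the earlier slides; the grid of odd \(K\) values from 1 to 61 is an arbitrary choice:

library(tidymodels)

## Model specification with K marked for tuning
knn_mdl_tune <- nearest_neighbor(mode = "classification", neighbors = tune())

knn_wf <- workflow() |> 
    add_recipe(knn_recipe) |> 
    add_model(knn_mdl_tune)

## Prepare v = 5 CV data sets from the training data
set.seed(2024)
folds <- vfold_cv(df_trn, v = 5)

## For each K in the grid, run CV and record the accuracy rate
knn_tune_res <- tune_grid(
    knn_wf,
    resamples = folds,
    grid = tibble(neighbors = seq(1, 61, by = 2)),
    metrics = metric_set(accuracy))

## The K with the highest CV accuracy rate
show_best(knn_tune_res, metric = "accuracy", n = 1)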

Final Model

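## neighbors = 29: the K selected by the cross-validation procedure above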
knn_mdl_best <- parsnip::nearest_neighbor(neighbors = 29, mode = "classification")

knn_out_best <- workflows::workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_mdl_best) |>
    fit(data = df_trn)

Final Model Performance

knn_pred_best <- pull(predict(knn_out_best, df_tst))

## Confusion matrix
table(knn_pred_best, df_tst$GENDER)
             
knn_pred_best  0  1
            0 28  7
            1  4 21
## Test accuracy rate
mean(knn_pred_best == df_tst$GENDER)
[1] 0.817

sklearn.neighbors

sklearn.model_selection

sklearn.model_selection.train_test_split

## In R: install scikit-learn for use via reticulate (run once)
library(reticulate); py_install("scikit-learn")
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


body = pd.read_csv('./data/body.csv')

X = body[['HEIGHT']]
y = body['GENDER']
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=2024)

sklearn.neighbors.KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)
KNeighborsClassifier(n_neighbors=3)

Prediction

y_pred = knn.predict(X_tst)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_tst, y_pred)
array([[22,  5],
       [12, 21]])


np.mean(y_tst == y_pred)
0.7166666666666667
np.mean(y_tst != y_pred)
0.2833333333333333

Lab 22: K-Nearest Neighbors

In lab.qmd ## Lab 22 section,

  1. Use HEIGHT and WAIST to predict GENDER using KNN with \(K = 3\).

  2. Generate the (test) confusion matrix.

  3. Calculate (test) accuracy rate.

  4. Does using more predictors predict better?

R Code

library(tidymodels)

## load data
bodydata <- read_csv("./data/body.csv")
body <- bodydata |> 
    select(GENDER, HEIGHT, WAIST) |> 
    mutate(GENDER = as.factor(GENDER))

## training and test data
set.seed(2024)
df_split <- initial_split(data = body, prop = 0.8)
df_trn <- training(df_split)
df_tst <- testing(df_split)

## KNN training
knn_recipe <- recipe(GENDER ~ HEIGHT + WAIST, data = df_trn) |> 
    step_normalize(all_predictors())
knn_mdl <- nearest_neighbor(neighbors = 3, mode = "classification")
knn_out <- workflow() |> 
    add_recipe(knn_recipe) |> 
    add_model(knn_mdl) |> 
    fit(data = df_trn)

## KNN prediction
knn_pred <- pull(predict(knn_out, df_tst))
table(knn_pred, df_tst$GENDER)
mean(knn_pred == df_tst$GENDER)

Python Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## load data
body = pd.read_csv('./data/body.csv')
X = body[['HEIGHT', 'WAIST']]
y = body['GENDER']

## training and test data
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=2024)

## KNN training
knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)

## KNN prediction
y_pred = knn.predict(X_tst)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_tst, y_pred)
np.mean(y_tst == y_pred)