Data Splitting and K-Nearest Neighbors

Training and Test Data


  • Goal: Build a good regression function or classifier in terms of prediction accuracy.
  • The mechanics of prediction are easy:
    • Plug in values of predictors to the model equation.
    • Calculate the predicted value of the response \(\hat{y}\)
  • Getting it right is hard! No guarantee that
    • the model estimates are close to the truth
    • your model performs as well with new data (test data) as it did with your sample data (training data)
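
The two mechanical steps can be sketched in a few lines. This is only an illustration, assuming a simple linear model equation; the names `b0`, `b1`, and `predict` and the coefficient values are hypothetical and are not estimated from any data in these slides.

```python
# Hypothetical fitted coefficients for y_hat = b0 + b1 * x (illustrative values only).
b0 = 2.0
b1 = 0.5

def predict(x):
    """Plug a predictor value into the fitted model equation."""
    return b0 + b1 * x

y_hat = predict(10.0)
print(y_hat)  # 7.0
```

The hard part is not this computation but whether `b0` and `b1` are close to the truth and whether the prediction holds up on new data.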

Spending Our Data

  • Several steps to create a useful model:
    • Parameter estimation
    • Model selection
    • Performance assessment, etc.
  • Doing all of this with the entire data set may lead to overfitting:

The model performs well on the training data but predicts the response poorly on the new data we are interested in.

  • Low error rate on observed data, but high prediction error rate on future unobserved data!
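
A contrived sketch of this gap, assuming a "memorizing" classifier that simply copies the label of the closest training point. The data, the true rule, and the names `memorizer` and `accuracy` are all made up for illustration, and the test points were deliberately placed near the noisy training labels to make the effect visible.

```python
# Contrived toy data: the true rule is "class 1 when x > 4.5",
# but two training labels (at x = 3 and x = 6) are noisy.
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]

# New observations, labeled by the true rule.
test_x = [2.9, 3.2, 5.8, 6.1, 1.5, 7.5]
test_y = [0, 0, 1, 1, 0, 1]

def memorizer(x):
    """Predict the label of the closest training point (a model that memorizes)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(xs, ys):
    """Fraction of points whose predicted label matches the given label."""
    return sum(memorizer(x) == y for x, y in zip(xs, ys)) / len(ys)

train_acc = accuracy(train_x, train_y)   # every training point is its own nearest point
test_acc = accuracy(test_x, test_y)      # noise that was memorized now hurts
print(train_acc, test_acc)
```

The training accuracy is perfect because the model reproduces the noise it saw, while the accuracy on the new points is much lower.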


Splitting Data

  • Often, we don’t have additional unused data to assess the performance of our model.

  • Solution: Pretend we have new data by splitting our data into a training set and a test set (validation set)!

  • Training set:
    • Sandbox for model building
    • Spend most of our time using the training set to develop the model
    • Majority of the original sample data (75% - 80%)
  • Test set:
    • Held in reserve to determine efficacy of one or two chosen models
    • Critical to look at it once only, otherwise it becomes part of the modeling process
    • Remainder of the data (20% - 25%)
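
The idea of a random 80/20 split can be sketched in plain Python. This is a minimal illustration, not how any particular package implements it; `split_rows`, the seed, and the 300 stand-in rows are assumptions made for the example.

```python
import random

def split_rows(rows, prop=0.8, seed=2024):
    """Randomly assign a proportion `prop` of rows to training, the rest to test."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_train = int(round(prop * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(300))            # stand-in for 300 observations
train, test = split_rows(rows)
print(len(train), len(test))       # 240 60
```

Every row lands in exactly one of the two sets, so the test set stays untouched while the model is developed on the training set.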

initial_split() in rsample

bodydata <- read_csv("./data/body.csv")
body <- bodydata |> 
    select(GENDER, HEIGHT, WAIST, BMI) |> 
    mutate(GENDER = as.factor(GENDER))

df_split <- rsample::initial_split(
        data = body, 
        prop = 0.8)

df_trn <- rsample::training(df_split)
df_tst <- rsample::testing(df_split)

dim(df_trn)
[1] 240   4
dim(df_tst)
[1] 60  4

body Data

# A tibble: 240 × 4
  GENDER HEIGHT WAIST   BMI
  <fct>   <dbl> <dbl> <dbl>
1 1        188.  82.3  20.2
2 0        160   94.8  27.3
3 0        160.  88.7  26.1
4 1        180   88.3  22.6
5 0        168   81.2  21.9
6 1        176. 101.   29.2
# ℹ 234 more rows
# A tibble: 60 × 4
  GENDER HEIGHT WAIST   BMI
  <fct>   <dbl> <dbl> <dbl>
1 1        181.  92.5  27.4
2 0        155. 101.   31.4
3 0        156. 103    29.3
4 0        164.  90.6  21.6
5 0        165.  95.2  27.1
6 1        178.  95.3  25.5
# ℹ 54 more rows

What Makes a Good Classifier: Test Accuracy Rate

  • The test accuracy rate associated with the test data \(\{(x_j, y_j)\}_{j=1}^J\): \[ \frac{1}{J}\sum_{j=1}^J I(y_j = \hat{y}_j),\] where \(\hat{y}_j\) is the predicted label obtained by applying the classifier to the test predictor \(x_j\), \(y_j\) is the observed test response, and \(I(\cdot)\) is the indicator function (1 if the condition holds, 0 otherwise).

What is the value of \(J\) in our example?

  • Among the classifiers \(\hat{C}(x)\) trained on the training data to estimate \(C(x)\), the best one can be defined as the one producing the highest test accuracy rate, or equivalently the lowest test error rate.
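
The accuracy formula translates directly into code. The sketch below uses made-up labels; `accuracy_rate` is a hypothetical helper name, and the test error rate is simply one minus the accuracy.

```python
def accuracy_rate(y_true, y_pred):
    """Fraction of test observations whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 1, 0, 1]   # made-up test labels
y_pred = [0, 1, 0, 0, 1]   # made-up predictions
acc = accuracy_rate(y_true, y_pred)
err = 1 - acc              # test error rate
print(acc, err)
```

Here four of the five predictions match, so the accuracy rate is 4/5.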

K-Nearest Neighbors (KNN) Classifier

KNN classification uses majority voting:

Look for the most popular class label among its neighbors.

When predicting at \(x = (x_1, x_2) = (8, 6)\),

\[\begin{align} \hat{\pi}_{3Blue}(x = (8, 6)) &= \hat{P}(Y = \text{Blue} \mid x = (8, 6))\\ &= \frac{2}{3} \end{align}\]

\[\begin{align} \hat{\pi}_{3Orange}(x = (8, 6)) &= \hat{P}(Y = \text{Orange} \mid x = (8, 6))\\ &= \frac{1}{3} \end{align}\]
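
Majority voting with \(K = 3\) can be sketched as follows. The training points are hypothetical (they are not the points in the figure) and were chosen so that two of the three nearest neighbors of \((8, 6)\) are Blue; `knn_predict` is an illustrative helper using Euclidean distance and an unweighted vote.

```python
from collections import Counter
from math import dist

# Hypothetical training points (x1, x2) with class labels; not the data in the figure.
train = [
    ((7.0, 6.0), "Blue"),
    ((8.0, 7.5), "Blue"),
    ((9.0, 6.5), "Orange"),
    ((2.0, 2.0), "Orange"),
    ((3.0, 9.0), "Orange"),
    ((1.0, 5.0), "Blue"),
]

def knn_predict(x, train, k=3):
    """Return (majority label, estimated class probabilities) from the k nearest points."""
    neighbors = sorted(train, key=lambda row: dist(row[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    probs = {label: count / k for label, count in votes.items()}
    return votes.most_common(1)[0][0], probs

label, probs = knn_predict((8.0, 6.0), train, k=3)
print(label, probs)
```

The estimated probabilities are the vote shares among the three neighbors, 2/3 for Blue and 1/3 for Orange, so the prediction is Blue.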

KNN Decision Boundary

  • Blue grid indicates the region in which a test response is assigned to the blue class.

  • We don’t know the true boundary (the true classification rule)!

KNN Training recipes

Standardize predictors before doing KNN!

(knn_recipe <- recipes::recipe(GENDER ~ HEIGHT, data = df_trn) |> 
    recipes::step_normalize(recipes::all_numeric_predictors()))
── Recipe ────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 1

── Operations 
• Centering and scaling for: all_numeric_predictors()
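
KNN is based on distances, so a predictor measured on a larger scale can dominate the neighbor search; centering and scaling puts predictors on a comparable footing. The sketch below shows the arithmetic only, assuming the centering and scaling constants are the training mean and sample standard deviation. The helper names and the height values are made up.

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn the centering and scaling constants from the training data only."""
    return mean(values), stdev(values)

def standardize(values, center, scale):
    """Apply (x - center) / scale using constants learned on the training set."""
    return [(v - center) / scale for v in values]

# Made-up heights (cm) standing in for a training and a test predictor column.
height_trn = [150.0, 160.0, 170.0, 180.0, 190.0]
height_tst = [165.0, 175.0]

center, scale = fit_scaler(height_trn)
z_trn = standardize(height_trn, center, scale)
z_tst = standardize(height_tst, center, scale)   # reuse the training constants
print(z_trn, z_tst)
```

The test values are transformed with the training constants, so no information from the test set leaks into the preprocessing.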

KNN Training parsnip

(knn_mdl <- parsnip::nearest_neighbor(mode = "classification", neighbors = 3))
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3

Computational engine: kknn 

KNN Training workflows

knn_out <- 
    workflows::workflow() |> 
    add_recipe(knn_recipe) |> 
    add_model(knn_mdl) |> 
    fit(data = df_trn)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3,     data, 5))

Type of response variable: nominal
Minimal misclassification: 0.217
Best kernel: optimal
Best k: 3

Prediction on Test Data

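## Predicted classes and class probabilities side by side, for 8 random test rows
dplyr::bind_cols(
    predict(knn_out, df_tst),
    predict(knn_out, df_tst, type = "prob")) |> 
    dplyr::sample_n(size = 8)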
# A tibble: 8 × 3
  .pred_class .pred_0 .pred_1
  <fct>         <dbl>   <dbl>
1 0             1       0    
2 0             0.852   0.148
3 0             1       0    
4 0             1       0    
5 1             0.481   0.519
6 0             0.852   0.148
7 0             1       0    
8 1             0.370   0.630
knn_pred <- 
    pull(predict(knn_out, df_tst))

## Confusion matrix
table(knn_pred, df_tst$GENDER)
knn_pred  0  1
       0 25  7
       1  7 21
## Test accuracy rate
mean(knn_pred == df_tst$GENDER)
[1] 0.767
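
The accuracy rate can also be read off the confusion matrix: correct predictions sit on the diagonal. A small sketch using the counts reported above (the variable names are illustrative):

```python
# Confusion matrix reported above: rows = predicted class, columns = true class.
confusion = [[25, 7],
             [7, 21]]

correct = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal entries
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(correct, total, round(accuracy, 3))   # 46 60 0.767
```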

Which K Should We Use?

  • \(K\)-nearest neighbors has no model parameters, but it does have a tuning parameter \(K\).

  • This is a parameter which determines how the model is trained, not a parameter that is learned through training.

\(v\)-fold Cross Validation

  • Use \(v\)-fold Cross Validation (CV) to choose tuning parameters. (MATH 4750 Computational Statistics)

  • Usually use \(v = 5\) or \(10\).

  • IDEA:
    • Prepare \(v\) CV data sets
    • Create a sequence of values of \(K\)
    • For each value of \(K\), run CV, and obtain an accuracy rate
    • Choose the \(K\) with the highest accuracy rate
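
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's implementation: the data are simulated (not the body data), there is a single predictor, the vote is unweighted, and the helpers `knn_predict`, `make_folds`, and `cv_accuracy` are hypothetical names. Cross validation is run on the training data only; the test set is not involved.

```python
import random
from collections import Counter

def knn_predict(x, train, k):
    """Majority vote among the k training points closest to x (one predictor)."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def make_folds(n, v, seed=2024):
    """Shuffle the indices 0..n-1 and deal them into v roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]

def cv_accuracy(data, k, folds):
    """Average held-out accuracy of KNN with k neighbors over the folds."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        fit_rows = [row for i, row in enumerate(data) if i not in held_out]
        hits = sum(knn_predict(data[i][0], fit_rows, k) == data[i][1] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

# Simulated training data (an assumption, not the body data): one predictor, two classes.
rng = random.Random(1)
data = [(rng.gauss(162, 7), 0) for _ in range(100)] + \
       [(rng.gauss(176, 7), 1) for _ in range(100)]

folds = make_folds(len(data), v=5)
k_grid = [1, 3, 5, 7, 9, 11, 13, 15]           # odd values avoid tied votes with two classes
cv_results = {k: cv_accuracy(data, k, folds) for k in k_grid}
best_k = max(cv_results, key=cv_results.get)   # ties go to the smallest K in the grid
print(cv_results)
print("best K:", best_k)
```

Each observation is held out exactly once, so every candidate \(K\) is scored on data that was not used to find its neighbors.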

Final Model

knn_mdl_best <- parsnip::nearest_neighbor(neighbors = 29, mode = "classification")

knn_out_best <- workflows::workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_mdl_best) |>
    fit(data = df_trn)

Final Model Performance

knn_pred_best <- pull(predict(knn_out_best, df_tst))

## Confusion matrix
table(knn_pred_best, df_tst$GENDER)
knn_pred_best  0  1
            0 28  7
            1  4 21
## Test accuracy rate
mean(knn_pred_best == df_tst$GENDER)
[1] 0.817


library(reticulate); py_install("scikit-learn")

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

body = pd.read_csv('./data/body.csv')

X = body[['HEIGHT']]
y = body['GENDER']
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=2024)


knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)
y_pred = knn.predict(X_tst)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_tst, y_pred)
array([[22,  5],
       [12, 21]])

np.mean(y_tst == y_pred)
np.mean(y_tst != y_pred)

22-K Nearest Neighbors

In lab.qmd ## Lab 22 section,

  1. Use HEIGHT and WAIST to predict GENDER using KNN with \(K = 3\).

  2. Generate the (test) confusion matrix.

  3. Calculate (test) accuracy rate.

  4. Does using more predictors predict better?

R Code


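## load packages (read_csv() and the dplyr verbs come from tidyverse)
library(tidyverse)
library(tidymodels)

## load data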
bodydata <- read_csv("./data/body.csv")
body <- bodydata |> 
    select(GENDER, HEIGHT, WAIST) |> 
    mutate(GENDER = as.factor(GENDER))

## training and test data
df_split <- initial_split(data = body, prop = 0.8)
df_trn <- training(df_split)
df_tst <- testing(df_split)

## KNN training
knn_recipe <- recipe(GENDER ~ HEIGHT + WAIST, data = df_trn) |> 
    step_normalize(all_numeric_predictors())
knn_mdl <- nearest_neighbor(neighbors = 3, mode = "classification")
knn_out <- workflow() |> 
    add_recipe(knn_recipe) |> 
    add_model(knn_mdl) |> 
    fit(data = df_trn)

## KNN prediction
knn_pred <- pull(predict(knn_out, df_tst))
table(knn_pred, df_tst$GENDER)
mean(knn_pred == df_tst$GENDER)

Python Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## load data
body = pd.read_csv('./data/body.csv')
X = body[['HEIGHT', 'WAIST']]
y = body['GENDER']

## training and test data
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=2024)

## KNN training
knn = KNeighborsClassifier(n_neighbors = 3)
X_trn = np.array(X_trn)
X_tst = np.array(X_tst)
knn.fit(X_trn, y_trn)

## KNN prediction
y_pred = knn.predict(X_tst)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_tst, y_pred)
np.mean(y_tst == y_pred)