
MATH/COSC 3570 Introduction to Data Science
Clustering: unsupervised learning technique for finding subgroups or clusters in a data set.
GOAL: Homogeneous within groups; heterogeneous between groups

Partition observations into \(K\) distinct, non-overlapping clusters: assign each to exactly one of the \(K\) clusters.
Must pre-specify the number of clusters \(K \ll n\).
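One common way to make "homogeneous within groups" precise is the within-cluster sum of squares, the quantity R's kmeans() output reports. The notation below is an assumption introduced here for illustration, not taken from the slides: \(C_k\) is the set of observations in cluster \(k\), \(x_i\) is the \(i\)-th observation, and \(\bar{x}_k\) is the centroid (mean) of cluster \(k\).

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2,
\qquad
\bar{x}_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i .
\]
```

A smaller value means the observations sit closer to their own cluster centroid.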

Data (Let’s choose \(K=2\))

K-Means Algorithm
Random assignment
Compute the cluster centroid
Do new assignment
Compute the cluster centroid …
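The steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation behind stats::kmeans(): the function name simple_kmeans, the max_iter argument, the rule for re-seeding an empty cluster, and the simulated data are all hypothetical choices made for this example.

```r
# Minimal K-means sketch (illustrative only; names are hypothetical).
simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: random assignment. Shuffling a repeated 1..K sequence
  # guarantees every cluster starts non-empty when n >= K.
  cluster <- sample(rep_len(seq_len(K), n))
  for (iter in seq_len(max_iter)) {
    # Step 2: compute each cluster centroid (column means of its members).
    centers <- do.call(rbind, lapply(seq_len(K), function(k) {
      members <- x[cluster == k, , drop = FALSE]
      # Assumption: an empty cluster is re-seeded at a random observation.
      if (nrow(members) == 0) x[sample(n, 1), ] else colMeans(members)
    }))
    # Step 3: new assignment, each observation goes to its nearest centroid
    # (squared Euclidean distance); d2 is an n x K matrix.
    d2 <- sapply(seq_len(K), function(k) {
      rowSums(sweep(x, 2, centers[k, ])^2)
    })
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once the assignment no longer changes.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Usage on simulated data: two groups of 20 points in two dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
fit <- simple_kmeans(x, K = 2)
table(fit$cluster)
fit$centers
```

Because the starting assignment is random, different seeds can lead to different final clusters, which is one reason to run the algorithm from several starts.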


Note

kmeans()
K-means clustering with 3 clusters of sizes 67, 88, 85
Cluster means:
    age income
1 -0.29   1.33
2  1.04  -0.32
3 -0.85  -0.71
Clustering vector:
[1] 1 2 2 3 2 3 2 3 1 2 2 3 3 2 3 3 1 3 1 1 3 3 3 2 3 3 2 2 3 1 3 2 1 1 2 1 3
[38] 3 3 1 3 3 2 1 3 1 2 3 2 2 2 3 1 3 2 3 3 2 1 2 2 3 2 2 2 3 3 2 3 2 2 3 1 1
[75] 3 2 3 2 2 3 2 2 3 2 1 1 1 2 2 2 3 3 3 3 1 2 1 3 3 2 3 3 2 3 1 2 2 3 1 1 1
[112] 2 1 2 2 2 2 3 2 2 2 1 3 1 3 2 2 2 1 1 2 1 3 2 3 1 1 1 1 3 2 1 1 3 3 3 2 2
[149] 3 2 1 2 1 3 1 3 2 3 1 2 2 1 3 3 3 2 2 1 3 3 2 3 1 1 2 3 2 3 1 3 1 3 1 2 3
[186] 2 2 1 2 2 2 1 3 2 2 2 2 1 1 1 2 1 1 3 1 3 3 1 2 1 3 3 2 2 1 3 3 3 2 1 2 1
[223] 2 3 3 3 2 1 3 1 1 1 3 2 2 2 1 3 3 1
Within cluster sum of squares by cluster:
[1] 66 41 39
(between_SS / total_SS = 69.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
kmeans()
# A tibble: 240 × 3
     age income .cluster
   <dbl>  <dbl> <fct>
1 -0.878  1.00  1
2  1.44  -0.601 2
3  1.00  -0.668 2
4 -1.38  -1.63  3
5  1.64  -0.831 2
6 -0.137 -0.851 3
# ℹ 234 more rows
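Output of this shape can be produced along the following lines. This is only a sketch: the slides do not show the call or the data, so the data frame df is simulated here as a hypothetical stand-in, and standardizing with scale() and setting nstart = 20 are assumptions rather than the author's settings.

```r
set.seed(2024)
# Hypothetical stand-in for the age/income data (240 observations).
df <- data.frame(age = rnorm(240, mean = 40, sd = 12),
                 income = rnorm(240, mean = 60, sd = 20))

# Standardize so both variables contribute on a comparable scale.
df_scaled <- scale(df)

# K-means with K = 3; nstart = 20 keeps the best of 20 random starts.
km <- kmeans(df_scaled, centers = 3, nstart = 20)

km$size                    # number of observations in each cluster
km$centers                 # cluster means (centroids)
km$withinss                # within-cluster sum of squares, per cluster
km$betweenss / km$totss    # share of total variation between clusters

# Attach the labels to the data as a factor column named .cluster.
clustered <- data.frame(df_scaled, .cluster = factor(km$cluster))
head(clustered)
```

The component names used here (size, centers, withinss, betweenss, totss, cluster) are the ones listed under "Available components" in the printed output. broom::augment() is one way to get a tibble with a .cluster column, though the slides do not say which function was used.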

Clustering does not help decision making or strategic planning if the clusters found cannot be interpreted meaningfully in terms of their features.
The clusters found may be heavily distorted due to outliers that do not belong to any cluster.
Clustering methods are not very robust to perturbations of the data.
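The sensitivity to outliers can be explored with a small experiment. The sketch below uses simulated data and a single artificial outlier at (30, 30); both are hypothetical choices for illustration, and the exact results depend on the random seed and starts.

```r
set.seed(1)
# Two simulated groups of 50 points each, centered at (0, 0) and (4, 4).
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Same data plus one extreme point that belongs to neither group.
x_out <- rbind(x, c(30, 30))

fit_clean <- kmeans(x, centers = 2, nstart = 20)
fit_out   <- kmeans(x_out, centers = 2, nstart = 20)

# Compare centroids and cluster sizes with and without the outlier.
fit_clean$centers
fit_out$centers
table(fit_clean$cluster)
table(fit_out$cluster)
```

Depending on the run, the outlier may pull one centroid toward it or take a cluster for itself, so that the two real groups are merged. Either way, the partition no longer reflects the two underlying groups as cleanly.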
23-K-Means Clustering
In lab.qmd ## Lab 24 section,
Install R package palmerpenguins at https://allisonhorst.github.io/palmerpenguins/
Perform K-Means with \(K = 3\) to cluster penguins based on bill_length_mm and flipper_length_mm of data peng.

