MATH/COSC 3570 Introduction to Data Science
Supervised Learning: response \(Y\) and features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.
Unsupervised Learning: only features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.
English and Math measure overall academic performance.
English and Math measure different abilities.
One variable represents one dimension.
With many variables in the data, we live in a high-dimensional world.
GOAL:
Find a low-dimensional (usually 2D) representation of the data that captures as much of the information all of those variables provide as possible.
Use two created variables to represent all \(p\) variables, and make a scatter plot of the two created variables to learn what our observations look like as if they lived in the high-dimensional space.
Why and when can we omit dimensions?
PCA is a dimension reduction tool that finds a low-dimensional representation of a data set that retains as much of the variation as possible.
Each of the observations lives in a high-dimensional space (lots of variables), but not all of these dimensions (variables) are equally interesting/important.
The concept of interesting/important is measured by the amount that the data vary along each dimension.
The best line maximizes the variance of the data points projected onto it. This line is called the 1st Principal Component (PC1).
PC1 is the line in the Eng-Math space that is closest to the \(n\) observations
PC1 is the best 1D representation of the 2D data
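To make the variance-maximizing idea concrete, here is a minimal sketch in plain Python. The six (English, Math) score pairs are invented for illustration and are not course data, and the helper name projected_variance is hypothetical. The sketch centers the data, computes the closed-form direction of the first principal axis for a 2x2 covariance matrix, and checks by brute force that no direction on a 1-degree grid gives a larger projected variance.

```python
import math

# Hypothetical (English, Math) scores, invented for illustration only.
scores = [(60, 55), (70, 72), (80, 78), (90, 95), (65, 60), (85, 88)]

n = len(scores)
mean_e = sum(e for e, _ in scores) / n
mean_m = sum(m for _, m in scores) / n
centered = [(e - mean_e, m - mean_m) for e, m in scores]

def projected_variance(theta):
    """Sample variance of the points projected onto the line at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    proj = [x * ux + y * uy for x, y in centered]
    return sum(p * p for p in proj) / (n - 1)  # projections of centered data have mean 0

# Sample variances and covariance of the two variables.
s_ee = sum(x * x for x, _ in centered) / (n - 1)
s_mm = sum(y * y for _, y in centered) / (n - 1)
s_em = sum(x * y for x, y in centered) / (n - 1)

# Closed-form angle of the first principal axis of a 2x2 covariance matrix.
theta_pc1 = 0.5 * math.atan2(2 * s_em, s_ee - s_mm)
var_pc1 = projected_variance(theta_pc1)
var_pc2 = projected_variance(theta_pc1 + math.pi / 2)  # perpendicular direction

# Brute-force check over a 1-degree grid of candidate directions.
best_scanned = max(projected_variance(math.radians(d)) for d in range(180))

print(f"PC1 direction: ({math.cos(theta_pc1):.3f}, {math.sin(theta_pc1):.3f})")
print(f"variance along PC1: {var_pc1:.2f}, along PC2: {var_pc2:.2f}")
print(f"total variance: {s_ee + s_mm:.2f}")
```

The two projected variances add up to the total variance of the original two variables, which is why the PC variances can be read as shares of the total variation.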
If the variance for PC1 is \(17\) and the variance for PC2 is \(2\), the total variation present in the data is \(17+2=19\).
PC1 accounts for \(17/19 \approx 89\%\) of the total variation, and PC2 accounts for \(2/19 \approx 11\%\) of the total variation.
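The share-of-variation arithmetic can be sketched in a few lines of Python. The two variances are the ones from the example above; the variable names are only illustrative.

```python
# Variances of the two principal components, taken from the example above.
pc_variances = {"PC1": 17, "PC2": 2}

total = sum(pc_variances.values())  # total variation in the data
shares = {pc: v / total for pc, v in pc_variances.items()}

for pc, share in shares.items():
    print(f"{pc}: {share:.1%} of the total variation")
```

The shares always sum to 1, so once PC1's share is known, the remaining components split whatever is left.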
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21
Alaska 10.0 263 48 44
Arizona 8.1 294 80 31
Arkansas 8.8 190 50 20
California 9.0 276 91 41
Colorado 7.9 204 78 39
Connecticut 3.3 110 77 11
Delaware 5.9 238 72 16
Florida 15.4 335 80 32
Georgia 17.4 211 60 26
Hawaii 5.3 46 83 20
Idaho 2.6 120 54 14
Illinois 10.4 249 83 24
Indiana 7.2 113 65 21
Iowa 2.2 56 57 11
Kansas 6.0 115 66 18
USArrests
pca_output <- prcomp(USArrests, scale = TRUE)
## rotation matrix provides PC loadings
(pca_output$rotation <- -pca_output$rotation)
PC1 PC2 PC3 PC4
Murder 0.54 0.42 -0.34 -0.649
Assault 0.58 0.19 -0.27 0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape 0.54 -0.17 0.82 -0.089
-pca_output$rotation gives us the same PCs as pca_output$rotation does.
\(\text{PC1} = 0.54 \times \text{Murder} + 0.58 \times \text{Assault} + 0.28 \times \text{UrbanPop} + 0.54 \times \text{Rape}\)
\(\text{PC2} = 0.42 \times \text{Murder} + 0.19 \times \text{Assault} - 0.87 \times \text{UrbanPop} - 0.17 \times \text{Rape}\)
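As a rough illustration of how these formulas turn one observation into PC scores, the sketch below applies the rounded loadings to a set of standardized feature values. Because prcomp() was called with scale = TRUE, the weights apply to standardized (centered and scaled) features, not the raw ones. The z-scores here are hypothetical and do not belong to any real state, and pc_score is an illustrative helper, not part of any library.

```python
# Loadings copied from the rotation matrix above (rounded to two decimals).
loadings = {
    "PC1": {"Murder": 0.54, "Assault": 0.58, "UrbanPop": 0.28, "Rape": 0.54},
    "PC2": {"Murder": 0.42, "Assault": 0.19, "UrbanPop": -0.87, "Rape": -0.17},
}

# Hypothetical standardized values (z-scores) for one observation; not real data.
z = {"Murder": 1.0, "Assault": 0.5, "UrbanPop": -0.5, "Rape": 0.0}

def pc_score(weights, z_values):
    """Weighted sum of standardized features: one PC score for one observation."""
    return sum(weights[name] * z_values[name] for name in weights)

scores = {pc: pc_score(w, z) for pc, w in loadings.items()}

# Flipping the sign of a loading vector flips the score but describes the same line.
flipped_pc1 = pc_score({k: -v for k, v in loadings["PC1"].items()}, z)

# A loading vector has unit length; the small gap here comes from rounding.
pc1_length_sq = sum(v * v for v in loadings["PC1"].values())

print(scores)
print(flipped_pc1, round(pc1_length_sq, 3))
```

The flipped score is the exact negative of the original, which is why negating the rotation matrix changes only the orientation of each axis and not the components themselves.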
pca_output$x
PC1 PC2 PC3 PC4
Alabama 0.98 1.12 -0.44 -0.15
Alaska 1.93 1.06 2.02 0.43
Arizona 1.75 -0.74 0.05 0.83
Arkansas -0.14 1.11 0.11 0.18
California 2.50 -1.53 0.59 0.34
Colorado 1.50 -0.98 1.08 0.00
Connecticut -1.34 -1.08 -0.64 0.12
Delaware 0.05 -0.32 -0.71 0.87
Florida 2.98 0.04 -0.57 0.10
Georgia 1.62 1.27 -0.34 -1.07
Hawaii -0.90 -1.55 0.05 -0.89
Idaho -1.62 0.21 0.26 0.49
Illinois 1.37 -0.67 -0.67 0.12
Indiana -0.50 -0.15 0.23 -0.42
Iowa -2.23 -0.10 0.16 -0.02
Kansas -0.79 -0.27 0.03 -0.20
PC1 PC2 PC3 PC4
Murder 0.54 0.42 -0.34 -0.649
Assault 0.58 0.19 -0.27 0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape 0.54 -0.17 0.82 -0.089
PCs are less interpretable than original features.
The first loading vector places approximately equal weight on Assault, Murder and Rape, with much less weight on UrbanPop.
PC1 roughly corresponds to an overall serious crime rate.
The second loading vector places most of its weight on UrbanPop, and much less weight on the other 3 features.
PC2 roughly corresponds to the level of urbanization.
The crime-related variables (Assault, Murder and Rape) are located close to each other.
UrbanPop is far from the other three.
Assault, Murder and Rape are more correlated, and UrbanPop is less correlated with the other three.
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.57 0.995 0.5971 0.4164
Proportion of Variance 0.62 0.247 0.0891 0.0434
Cumulative Proportion 0.62 0.868 0.9566 1.0000
Look for a point at which the proportion of variance explained by each subsequent PC drops off.
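The proportions in the summary table can be recomputed from the standard deviations alone, which also makes the drop-off easy to see. This is a minimal sketch using the rounded values printed above, so the results agree with the table only up to rounding; the variable names are illustrative.

```python
# Standard deviations of the four PCs, copied from the summary table above.
sdev = [1.57, 0.995, 0.5971, 0.4164]

variances = [s ** 2 for s in sdev]    # variance of each PC
total = sum(variances)                # close to 4: four standardized features
pve = [v / total for v in variances]  # proportion of variance explained

cumulative = []
running = 0.0
for p in pve:
    running += p
    cumulative.append(running)

# Drop in PVE from each PC to the next; a scree plot shows the large drops as an elbow.
drops = [pve[i] - pve[i + 1] for i in range(len(pve) - 1)]

for i, (p, c) in enumerate(zip(pve, cumulative), start=1):
    print(f"PC{i}: PVE = {p:.3f}, cumulative = {c:.3f}")
print("drops between consecutive PCs:", [round(d, 3) for d in drops])
```

Here the largest drop is between PC1 and PC2, and the first two components together explain roughly 87% of the variance, which supports a 2D representation of these data.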
23-Principal Component Analysis
In the ## Lab 23 section of lab.qmd,
Use slice() to print the first six rows of the iris data.
Perform PCA on Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
Generate a biplot, and explain it.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
import pandas as pd
# https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/USArrests.csv
USArrests = pd.read_csv('./data/USArrests.csv')
USArrests.head(4)
rownames Murder Assault UrbanPop Rape
0 Alabama 13.2 236 58 21.2
1 Alaska 10.0 263 48 44.5
2 Arizona 8.1 294 80 31.0
3 Arkansas 8.8 190 50 19.5
(R: prcomp(USArrests, scale = TRUE))
(R: pca_output$x)
# pca, X, and USArr are defined in setup code that is not shown here
X_pc = np.round(pca.transform(X), 2)
pd.DataFrame(X_pc, columns=['PC1', 'PC2', 'PC3', 'PC4'], index=USArr.index)
PC1 PC2 PC3 PC4
rownames
Alabama 0.99 1.13 -0.44 0.16
Alaska 1.95 1.07 2.04 -0.44
Arizona 1.76 -0.75 0.05 -0.83
Arkansas -0.14 1.12 0.11 -0.18
California 2.52 -1.54 0.60 -0.34
Colorado 1.51 -0.99 1.10 0.00
Connecticut -1.36 -1.09 -0.64 -0.12
Delaware 0.05 -0.33 -0.72 -0.88
Florida 3.01 0.04 -0.58 -0.10
Georgia 1.64 1.28 -0.34 1.08
Hawaii -0.91 -1.57 0.05 0.90
Idaho -1.64 0.21 0.26 -0.50
Illinois 1.38 -0.68 -0.68 -0.12
Indiana -0.51 -0.15 0.23 0.42
Iowa -2.25 -0.10 0.16 0.02
Kansas -0.80 -0.27 0.03 0.21
Kentucky -0.75 0.96 -0.03 0.67
Louisiana 1.56 0.87 -0.78 0.45
Maine -2.40 0.38 -0.07 -0.33
Maryland 1.76 0.43 -0.16 -0.56
Massachusetts -0.49 -1.47 -0.61 -0.18
Michigan 2.11 -0.16 0.38 0.10
Minnesota -1.69 -0.63 0.15 0.07
Mississippi 1.00 2.39 -0.74 0.22
Missouri 0.70 -0.26 0.38 0.23
Montana -1.19 0.54 0.25 0.12
Nebraska -1.27 -0.19 0.18 0.02
Nevada 2.87 -0.78 1.16 0.31
New Hampshire -2.38 -0.02 0.04 -0.03
New Jersey 0.18 -1.45 -0.76 0.24
New Mexico 1.98 0.14 0.18 -0.34
New York 1.68 -0.82 -0.64 -0.01
North Carolina 1.12 2.23 -0.86 -0.95
North Dakota -2.99 0.60 0.30 -0.25
Ohio -0.23 -0.74 -0.03 0.47
Oklahoma -0.31 -0.29 -0.02 0.01
Oregon 0.06 -0.54 0.94 -0.24
Pennsylvania -0.89 -0.57 -0.40 0.36
Rhode Island -0.86 -1.49 -1.37 -0.61
South Carolina 1.32 1.93 -0.30 -0.13
South Dakota -1.99 0.82 0.39 -0.11
Tennessee 1.00 0.86 0.19 0.65
Texas 1.36 -0.41 -0.49 0.64
Utah -0.55 -1.47 0.29 -0.08
Vermont -2.80 1.40 0.84 -0.14
Virginia -0.10 0.20 0.01 0.21
Washington -0.22 -0.97 0.62 -0.22
West Virginia -2.11 1.42 0.10 0.13
Wisconsin -2.08 -0.61 -0.14 0.18
Wyoming -0.63 0.32 -0.24 -0.17
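The Python scores above are slightly larger in magnitude than the R scores shown earlier (for example 0.99 versus 0.98 for Alabama on PC1). One plausible explanation, stated here as an assumption because the Python setup code is not shown: the Python features were standardized with the population standard deviation (divisor \(n\), the convention of scikit-learn's StandardScaler), while prcomp(..., scale = TRUE) uses the sample standard deviation (divisor \(n-1\)). Under that assumption the scores differ by a factor of \(\sqrt{n/(n-1)}\), which the sketch below checks against a few values from both tables.

```python
import math

n = 50  # number of states in the table above
factor = math.sqrt(n / (n - 1))  # ratio of sample SD to population SD

# PC1 scores from the R output earlier in this section (rounded to 2 decimals).
r_pc1 = {"Alabama": 0.98, "Alaska": 1.93, "Florida": 2.98, "Iowa": -2.23}
# PC1 scores from the Python output above.
py_pc1 = {"Alabama": 0.99, "Alaska": 1.95, "Florida": 3.01, "Iowa": -2.25}

for state, r_score in r_pc1.items():
    rescaled = r_score * factor
    print(f"{state}: R {r_score:+.2f} -> rescaled {rescaled:+.3f}, Python {py_pc1[state]:+.2f}")
```

The PC4 column also has the opposite sign in the two outputs. That is the same sign ambiguity noted for the rotation matrix: each component is determined only up to a flip of direction.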
(R: pca_output$sdev ^ 2)