Principal Component Analysis

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Unsupervised Learning

Supervised Learning: response $Y$ and features $X_1, X_2, \dots, X_p$ measured on $n$ observations.
Unsupervised Learning: only features $X_1, X_2, \dots, X_p$ measured on $n$ observations.
- Not interested in prediction (no response to be predicted)
- Discover interesting patterns or relationships among these features.

Dimension reduction supports effective data visualization and extracts the most important information the features contain.
- plot the points $\boldsymbol{x} = (x_1, x_2, \dots, x_p)$ in a 2-D scatter plot (manifold), reducing the dimension from $p$ to 2
- use 2 variables to explain most of the variation, or to represent the regions of high data density, in the $p$ variables

Clustering discovers unknown subgroups/clusters in data
- find 3 sub-groups of people based on variables income, occupation, age, etc

Background: Dimensions

Going to be very SIMPLE!

But you’ll be happy I did this.

Because PCA is about reducing dimensions!

One-Dimension (1D) Number line

# A tibble: 50 × 1
   English
     <int>
 1      41
 2      65
 3      55
 4      94
 5      66
 6      85
 7      44
 8      44
 9      67
10      73
# ℹ 40 more rows

One-Dimension (1D) Number line: Uniform students

# A tibble: 50 × 1
   English
     <int>
 1      41
 2      65
 3      55
 4      94
 5      66
 6      85
 7      44
 8      44
 9      67
10      73
# ℹ 40 more rows

1D Number line: Non-uniform students

# A tibble: 50 × 1
   English
     <int>
 1      77
 2      78
 3      81
 4      78
 5      52
 6      62
 7      47
 8      58
 9      43
10      59
# ℹ 40 more rows

Two-Dimensions (2D) X-Y Scatter plot: Highly Correlated

English and Math both measure overall academic performance.

# A tibble: 50 × 2
   English  Math
     <int> <dbl>
 1      41  33.2
 2      65  63.6
 3      55  44.6
 4      94  95  
 5      66  65.6
 6      85  73.1
 7      44  46.6
 8      44  51  
 9      67  69.4
10      73  66.5
# ℹ 40 more rows

Two-Dimensions (2D) X-Y Scatter plot: Uncorrelated

English and Math measure different abilities.

# A tibble: 50 × 2
   English  Math
     <int> <int>
 1      41    27
 2      65    47
 3      55    32
 4      94    44
 5      66    20
 6      85    30
 7      44    16
 8      44    72
 9      67    86
10      73    82
# ℹ 40 more rows

Three-Dimensions (3D) X-Y-Z Scatter plot

# A tibble: 50 × 3
   English  Math Biology
     <int> <dbl>   <dbl>
 1      41  33.2    39.2
 2      65  63.6    61.6
 3      55  44.6    41.6
 4      94  95      92  
 5      66  65.6    73.6
 6      85  73.1    71.1
 7      44  46.6    56.6
 8      44  51      56  
 9      67  69.4    79.4
10      73  66.5    59.5
# ℹ 40 more rows

Four-Dimensions (4D) X-Y-Z-? Scatter plot

# A tibble: 50 × 4
   English  Math Biology History
     <int> <dbl>   <dbl>   <dbl>
 1      41  33.2    39.2      51
 2      65  63.6    61.6      53
 3      55  44.6    41.6      63
 4      94  95      92        83
 5      66  65.6    73.6      51
 6      85  73.1    71.1      74
 7      44  46.6    56.6      34
 8      44  51      56        33
 9      67  69.4    79.4      76
10      73  66.5    59.5      74
# ℹ 40 more rows

How about Pair Plots?

Tooooo Many Pair Plots!

If we have $p$ variables, there are ${p \choose 2} = p(p-1)/2$ pairs.
If $p = 10$, we have 45 such scatter plots to look at!
In real data science work, we may encounter over 100 variables!!
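The pair count above can be checked with a short Python sketch. The helper name `n_pair_plots` is hypothetical and introduced only for illustration.

```python
from math import comb


def n_pair_plots(p: int) -> int:
    """Number of distinct pairwise scatter plots among p variables: p choose 2."""
    return comb(p, 2)


# The count grows quadratically with the number of variables.
for p in (4, 10, 100):
    print(p, n_pair_plots(p))
```

With 100 variables there would be 4950 scatter plots, which is why pair plots stop being practical.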

Dimension Reduction

One variable represents one dimension.
With many variables in the data, we live in a high dimensional world.

GOAL:

Find a low-dimensional (usually 2D) representation of the data that captures as much of the information all of those variables provide as possible.
Use two created variables to represent all $p$ variables, and make a scatter plot of the two created variables to learn what our observations look like as if they lived in the high dimensional space.

Why and when can we omit dimensions?

To represent the relationships among all variables meaningfully, we need a technique: dimension reduction.
Mathematically, one variable represents one dimension, so with many variables in the data we live in a high-dimensional world.
We would like to find a low-dimensional (usually 2D) representation of the data that captures as much of the information all of those variables provide as possible.
We use two created variables to represent all $p$ variables, and make a scatter plot of the two created variables to learn what our observations look like as if they lived in the high-dimensional space.
Of course, it is not always a good idea to use just two variables to represent all $p$ variables. Some information is lost when two variables stand in for all the relationships among $p$ variables.
But sometimes a low-dimensional representation looks very much like the high-dimensional space and loses little information. In this situation, a low-dimensional representation is very useful.
So let's see why and when we can omit dimensions.

Variation mostly from One Variable

Almost all of the variation in the data is from left to right.

Variation mostly from One Variable

If we flattened the data, the graph would not look much different.

Variation mostly from One Variable

If we flattened the data, we could graph it with a 1D number line!

Variation mostly from One Variable

Both graphs say “the important variation is left to right.”

Principal Component Analysis (PCA)

Idea of PCA

PCA is a dimension reduction tool that finds a low-dimensional representation of a data set that retains as much of the variation as possible.
Each of the observations lives in a high-dimensional space (lots of variables), but not all of these dimensions (variables) are equally interesting/important.
The concept of interesting/important is measured by the amount that the data vary along each dimension.

PCA is a dimension reduction tool that finds a low-dimensional representation of a data set that retains as much as possible of the variation stored in the data set.
As we've seen before, each of the observations lives in a high-dimensional space, meaning that each observation has lots of variables associated with it, but not all of these dimensions (variables) are equally interesting/important.
The concept of interesting/important is measured by the amount that the observations vary along each dimension.
A characteristic or attribute of observations is called a variable because its value varies from observation to observation.
If a variable does not vary, it is irrelevant or unimportant because we cannot use it to differentiate or distinguish observations. If everyone in this class gets an A, then the data science grade is not a useful variable for learning which students perform academically better than others.

PCA Illustration: 2 Variable Example

# A tibble: 50 × 2
   English  Math
     <int> <dbl>
 1      41  33.2
 2      65  63.6
 3      55  44.6
 4      94  95  
 5      66  65.6
 6      85  73.1
 7      44  46.6
 8      44  51  
 9      67  69.4
10      73  66.5
11      47  58.9
12      66  66.5
13      57  51  
14      57  33.2
15      77  83.6
16      83  78.8
# ℹ 34 more rows

Step 1: Shift (or standardize) the Data

Shift the data so that both variables have mean 0. If the variables are measured in different units, consider standardizing them too: $\frac{x_i - \bar{x}}{s_x}$.
Shifting does not change how the data points are positioned relative to each other.
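A minimal NumPy sketch of Step 1, using the first four (English, Math) rows from the table above. The helper `pairwise_distances` is hypothetical and only serves to check that shifting leaves the relative positions of the points unchanged.

```python
import numpy as np

# First four (English, Math) rows from the 2-variable example.
scores = np.array([[41, 33.2], [65, 63.6], [55, 44.6], [94, 95.0]])

# Shift: subtract each column's mean so both variables have mean 0.
centered = scores - scores.mean(axis=0)

# Standardize: also divide by the sample standard deviation (n - 1 divisor).
standardized = centered / scores.std(axis=0, ddof=1)


def pairwise_distances(a):
    """Euclidean distance between every pair of rows."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


print(centered.mean(axis=0))  # both column means are (numerically) zero
print(np.allclose(pairwise_distances(scores), pairwise_distances(centered)))
```

Standardizing, unlike shifting, does change the distances between points, which is the point of using it when units differ.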

Step 2: Find a Line that Fits the Data the Best

Start with a line going through the origin.
Rotate the line until it fits the data as well as it can, given that it goes through the origin.

The Meaning of the Best line

The best line maximizes the variance of the points projected from the data onto the line. It is called the 1st Principal Component (PC1).
PC1 is the line in the Eng-Math space that is closest to the $n$ observations
- PC1 minimizes the sum of squared perpendicular distances between the data points and the line.
PC1 is the best 1D representation of the 2D data

To quantify how well a line fits the data, PCA projects the data onto it.
We can then measure the distances from the data points to the line and find the line that minimizes the sum of those squared distances.
Or we can find the line that maximizes the sum of squared distances from the projected points to the origin.
The two criteria are equivalent: by the Pythagorean theorem, each point's squared distance to the origin is fixed, and it equals its squared distance to the line plus the squared distance of its projection from the origin.
DEMO
So we are finding the line that maximizes the variation of the points projected onto it.
The best line is called Principal Component 1 (PC1); it maximizes the sum of squared distances between the projected points and the origin.
Regression line: minimizes the sum of squared residuals (vertical distances from the data points to the line).
Principal Component 1 (PC1): maximizes the sum of squared distances between the projected points and the origin.
PC1 is the line in the 2-dimensional Eng-Math space that is closest to the $n$ observations, i.e., PC1 minimizes the sum of squared perpendicular distances between the data points and the line.
PC1 is the best 1D representation of the 2D data.
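The equivalence of the two criteria can be illustrated by brute force: rotate a line through the origin over a grid of angles and record both quantities. This is only a sketch using the first ten (English, Math) rows shown earlier. The grid search is an illustration of "rotate the line until it fits", not how PCA software computes PC1.

```python
import numpy as np

# First ten (English, Math) rows from the 2-variable example.
scores = np.array([
    [41, 33.2], [65, 63.6], [55, 44.6], [94, 95.0], [66, 65.6],
    [85, 73.1], [44, 46.6], [44, 51.0], [67, 69.4], [73, 66.5],
])
centered = scores - scores.mean(axis=0)
total_ss = (centered ** 2).sum()  # fixed: does not depend on the line

# Candidate lines through the origin, one unit direction per angle.
angles = np.linspace(0.0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Signed position of each projected point along each line: shape (n, n_angles).
projections = centered @ directions.T
projected_ss = (projections ** 2).sum(axis=0)

# Perpendicular residual vectors from each point to each line.
residuals = centered[:, None, :] - projections[:, :, None] * directions[None, :, :]
residual_ss = (residuals ** 2).sum(axis=(0, 2))

best = projected_ss.argmax()
print("angle maximizing projected spread:", np.degrees(angles[best]))
print("angle minimizing distance to line:", np.degrees(angles[residual_ss.argmin()]))
```

For every angle, `projected_ss + residual_ss` equals `total_ss`, so the angle that makes one term largest must make the other smallest.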

The Meaning of the Best line

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

PC1 and PC2

The data points are also spread out a little above and below the PC1.
There is some variation that is not explained by PC1.
Find the second PC, PC2, that
- explains the remaining variation
- is the line through the origin and perpendicular to PC1.

Linear Combinations

PC1 = 0.68 $\times$ English $+$ 0.74 $\times$ Math
PC2 = 0.74 $\times$ English $-$ 0.68 $\times$ Math
PC1 is like an overall intelligence index as it is a weighted combination of verbal and quantitative abilities.
PC2 captures the difference between a student's English and Math scores, since its two weights have opposite signs.
The combination weights 0.68, 0.74, etc. are called PC loadings.

Now let's look at PC1 and PC2 a little more carefully.
First, PC1 and PC2 are just vectors in 2-dimensional space.
In other words, they are linear combinations of the two standard basis vectors, here the English axis and the Math axis.
PCA helps us find these linear combinations.
Here PC1 = 0.68 $\times$ English + 0.74 $\times$ Math
PC2 = 0.74 $\times$ English - 0.68 $\times$ Math
So to make PC1, we mix 0.68 parts of the English score with 0.74 parts of the Math score.
One unit of PC1 consists of 0.68 parts of English and 0.74 parts of Math.
Because the weight on the Math score is a little larger, the Math score is a little more important when it comes to describing how the data are spread out.
PC1 can be viewed as an overall intelligence index because it is a weighted combination of both verbal and quantitative reasoning abilities.
PC2 accounts for individual differences between English and Math scores.
$0.68^2 + 0.74^2 \approx 1$: each loading vector has unit length (the small excess over 1 comes from rounding the loadings to two decimals).
The combination weights 0.68, 0.74, etc. are called the loadings of the PC.
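The properties of the loadings can be checked directly from the rounded values quoted above. The `student` vector below is a hypothetical centered observation (10 points above the English mean, 5 above the Math mean) added only to show how a PC score is formed.

```python
import math

# Loadings quoted above, rounded to two decimals.
pc1 = (0.68, 0.74)
pc2 = (0.74, -0.68)


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


length_pc1 = math.sqrt(dot(pc1, pc1))  # close to 1, up to rounding of the loadings
inner = dot(pc1, pc2)                  # 0: PC2 is perpendicular to PC1

# Hypothetical centered scores (English, Math) for one student.
student = (10.0, 5.0)
score_pc1 = dot(pc1, student)  # 0.68 * 10 + 0.74 * 5
score_pc2 = dot(pc2, student)  # 0.74 * 10 - 0.68 * 5

print(length_pc1, inner, score_pc1, score_pc2)
```

A student who is above average in both subjects gets a large PC1 score, while PC2 only reflects how much stronger the student is in English than in Math.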

Variation

If the variance for PC1 is $17$ and the variance for PC2 is $2$, the total variation present in the data is $17+2=19$.
PC1 accounts for $17/19 \approx 89\%$ of the total variation, and PC2 accounts for $2/19 \approx 11\%$ of the total variation.
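The same arithmetic as a tiny Python sketch, using the illustrative variances 17 and 2 from above:

```python
# Illustrative PC variances from the example above.
pc_variances = [17.0, 2.0]

total_variation = sum(pc_variances)
proportions = [v / total_variation for v in pc_variances]

# Percentages rounded to one decimal place.
print([round(100 * p, 1) for p in proportions])  # [89.5, 10.5]
```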

How about 3 or More Variables?

PC1 spans the direction of the most variation

PC2 spans the direction of the 2nd most variation

PC3 spans the direction of the 3rd most variation

PC4 spans the direction of the 4th most variation

If we have $n$ observations and $p$ variables (dimensions), there are at most $\min(n - 1, p)$ PCs.
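One way to see the $\min(n - 1, p)$ bound is that centering removes one degree of freedom from the $n$ rows, so the centered data matrix has rank at most $n - 1$. The sketch below checks this on random data. The function name `n_nonzero_pcs` and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def n_nonzero_pcs(n: int, p: int) -> int:
    """Rank of a centered n-by-p random matrix, i.e. the number of PCs with nonzero variance."""
    X = rng.normal(size=(n, p))
    centered = X - X.mean(axis=0)
    return int(np.linalg.matrix_rank(centered))


print(n_nonzero_pcs(3, 5))   # fewer observations than variables: n - 1 limits the count
print(n_nonzero_pcs(50, 4))  # the USArrests shape: p limits the count
```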

US Arrest Data in 1973

dim(USArrests)

[1] 50  4

head(USArrests, 16)

            Murder Assault UrbanPop Rape
Alabama       13.2     236       58   21
Alaska        10.0     263       48   44
Arizona        8.1     294       80   31
Arkansas       8.8     190       50   20
California     9.0     276       91   41
Colorado       7.9     204       78   39
Connecticut    3.3     110       77   11
Delaware       5.9     238       72   16
Florida       15.4     335       80   32
Georgia       17.4     211       60   26
Hawaii         5.3      46       83   20
Idaho          2.6     120       54   14
Illinois      10.4     249       83   24
Indiana        7.2     113       65   21
Iowa           2.2      56       57   11
Kansas         6.0     115       66   18

PC Loading Vectors on `USArrests`

pca_output <- prcomp(USArrests, scale = TRUE)

## rotation matrix provides PC loadings
(pca_output$rotation <- -pca_output$rotation)

          PC1   PC2   PC3    PC4
Murder   0.54  0.42 -0.34 -0.649
Assault  0.58  0.19 -0.27  0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape     0.54 -0.17  0.82 -0.089

PCs are unique up to a sign change, so -pca_output$rotation gives us the same PCs as pca_output$rotation does.

$\text{PC1} = 0.54 \times \text{Murder} + 0.58 \times \text{Assault} + 0.28 \times \text{UrbanPop} + 0.54 \times \text{Rape}$

$\text{PC2} = 0.42 \times \text{Murder} + 0.19 \times \text{Assault} - 0.87 \times \text{UrbanPop} - 0.17 \times \text{Rape}$

We have 4 PCs because $\min(n-1, p) = \min(50-1, 4) = 4$.

Performing PCA in R could hardly be easier.
We just call the function prcomp() on the data set, and it returns everything we need.
Here I choose to scale the data because the variables are not measured on the same scale. For example, Murder rate and UrbanPop are measured in different units.
Scaling makes sure that every variable has variance 1, so our analysis is not affected by units.
The PCA results are stored as a list in the pca_output object.
First we look at the rotation matrix because it provides the PC loadings, which tell us what PC1 and PC2 are.
We flip the signs for easier interpretation of the PCs.
The PC loadings define how we rotate the coordinate axes to obtain the PCs.
Again, PC1 is just a linear combination (a weighted sum) of the 4 variables, and so is PC2.
PCs are unique up to a sign change, so -pca_output$rotation gives us the same PCs as pca_output$rotation does.
We have 4 PCs because $\min(n-1, p) = \min(50-1, 4) = 4$.
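To see why a sign change gives "the same PCs", the sketch below computes loadings and scores from a singular value decomposition and flips both. Random data stands in for USArrests here, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in with the same shape as USArrests (50 rows, 4 columns).
X = rng.normal(size=(50, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and scale

# SVD of the standardized data: the rows of Vt are the loading vectors.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
rotation = Vt.T        # columns play the role of PC1..PC4 loadings
scores = Z @ rotation  # PC scores

# Flip the sign of every loading vector and of the matching scores.
flipped_rotation = -rotation
flipped_scores = -scores

# Same variance along each PC, and the same data when mapped back.
print(np.allclose(scores.var(axis=0, ddof=1), flipped_scores.var(axis=0, ddof=1)))
print(np.allclose(scores @ rotation.T, flipped_scores @ flipped_rotation.T))
```

Flipping a loading vector only reverses the direction in which the axis is read, so the variance explained and the fitted line are unchanged.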

PC Scores

The rotated data, i.e., the value of each observation on each PC (the PC scores), are stored in pca_output$x.

head(pca_output$x <- -pca_output$x, 16) |> round(2)

              PC1   PC2   PC3   PC4
Alabama      0.98  1.12 -0.44 -0.15
Alaska       1.93  1.06  2.02  0.43
Arizona      1.75 -0.74  0.05  0.83
Arkansas    -0.14  1.11  0.11  0.18
California   2.50 -1.53  0.59  0.34
Colorado     1.50 -0.98  1.08  0.00
Connecticut -1.34 -1.08 -0.64  0.12
Delaware     0.05 -0.32 -0.71  0.87
Florida      2.98  0.04 -0.57  0.10
Georgia      1.62  1.27 -0.34 -1.07
Hawaii      -0.90 -1.55  0.05 -0.89
Idaho       -1.62  0.21  0.26  0.49
Illinois     1.37 -0.67 -0.67  0.12
Indiana     -0.50 -0.15  0.23 -0.42
Iowa        -2.23 -0.10  0.16 -0.02
Kansas      -0.79 -0.27  0.03 -0.20
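PC scores are the standardized data multiplied by the loading matrix. The sketch below reproduces that computation with an eigendecomposition of the correlation matrix on synthetic data (a stand-in for USArrests, with hypothetical mixing weights). It also checks two properties of the scores: they are uncorrelated, and their variances add up to the number of standardized variables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: 50 observations of 4 correlated variables.
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 0.8, 0.2, 0.6]]) + rng.normal(size=(50, 4))

# Standardize with the sample standard deviation (n - 1 divisor).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, rotation = eigvals[order], eigvecs[:, order]

# PC scores: one row per observation, one column per PC.
scores = Z @ rotation
score_cov = np.cov(scores, rowvar=False)

print(np.round(np.diag(score_cov), 3))  # PC variances, in decreasing order
print(np.round(eigvals.sum(), 3))       # equals the number of variables
```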

Interpretation of PCs

pca_output$rotation

          PC1   PC2   PC3    PC4
Murder   0.54  0.42 -0.34 -0.649
Assault  0.58  0.19 -0.27  0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape     0.54 -0.17  0.82 -0.089

PCs are less interpretable than original features.
The first loading vector places approximately equal weight on Assault, Murder and Rape, with much less weight on UrbanPop.
PC1 roughly corresponds to an overall serious crime rate.

The second loading vector places most of its weight on UrbanPop, and much less weight on the other 3 features.
PC2 roughly corresponds to the level of urbanization.

Interpretability decreases with the order of PCs.
So it's easier to give PC1 a meaningful name than PC2, and PC2 is more meaningful than PC3, and so on, because the PCs after the first 2 usually explain only a small part of the variation in the data, and some of them may be just noise.
Let's see if we can interpret these PCs.
First keep in mind that PCs are less interpretable than original features. Sometimes we don't even know how to interpret them, especially PCs that explain little variation. This is the price we pay for dimension reduction.
But let's look at this example.
The first loading vector places approximately equal weight on Assault, Murder and Rape, with much less weight on UrbanPop.
So PC1 roughly corresponds to an overall serious crime rate because PC1 explains the variation in the data caused by the crime variables Assault, Murder and Rape.
In contrast, the second loading vector places most of its weight on UrbanPop, and much less weight on the other 3 features.
So we can say PC2 roughly corresponds to the level of urbanization.
So you get the idea: Assault, Murder and Rape are similar to each other because they are all measures of crime rate.
So when reducing dimensions, we combine the three similar variables into a single index that measures an overall crime rate.
Urban population measures a totally different thing. So the variation created by this variable cannot be explained well by the crime rate, and it is absorbed in PC2.

2D Representation of the 4D data

pca_output$x |> tail(2) |> round(2)

            PC1   PC2   PC3   PC4
Wisconsin -2.06 -0.61 -0.14 -0.18
Wyoming   -0.62  0.32 -0.24  0.16

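A higher value of PC1 means a higher crime rate (roughly).
PC2 reflects the level of urbanization (roughly). With the signs shown here, UrbanPop has a negative loading ($-0.87$) on PC2, so a lower value of PC2 means a higher level of urbanization.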

2D Representation of the 4D data: biplot

biplot(pca_output, xlabs = state.abb, 
       scale = 0)

Top axis: PC1 loadings
Right axis: PC2 loadings
Red arrows: PC1 and PC2 loading vectors, e.g., (0.28, -0.87) for UrbanPop.
Crime-related variables (Assault, Murder and Rape) are located close to each other.
UrbanPop is far from the other three.
Assault, Murder and Rape are more correlated with each other, and UrbanPop is less correlated with the other three.

We can simply use the function biplot() to show the 2D representation of the data.
This function also draws the PC1 and PC2 loading vectors, which give us an idea of which direction corresponds to a large value of each variable/feature.
This is why it is called a biplot: we plot two things in one single plot.
Here, the top axis is for PC1 loadings.
The right axis is for PC2 loadings.
Red arrows: PC1 and PC2 loading vectors, e.g., (0.28, -0.87) for UrbanPop.
So NJ and CA have fairly high urban population percentages because they lie far along the UrbanPop arrow direction.
Crime-related variables (Assault, Murder and Rape) are located close to each other.
UrbanPop is far from the other three.
Assault, Murder and Rape are more correlated with each other, and UrbanPop is less correlated with the other three.
Assault, Murder and Rape point roughly in the direction of PC1, and UrbanPop points roughly along the PC2 axis (toward negative PC2 here, since its PC2 loading is -0.87).

Proportion of Variance Explained

summary(pca_output)

Importance of components:
                        PC1   PC2    PC3    PC4
Standard deviation     1.57 0.995 0.5971 0.4164
Proportion of Variance 0.62 0.247 0.0891 0.0434
Cumulative Proportion  0.62 0.868 0.9566 1.0000

PC1 explains $62\%$ of the variance in the data, and PC2 explains $24.7\%$.
Together, PC1 and PC2 explain about $87\%$ of the variance, and the last two PCs explain only about $13\%$.
A 2D plot therefore provides a fairly accurate summary of the data.
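The "Proportion of Variance" and "Cumulative Proportion" rows can be recomputed from the "Standard deviation" row by squaring and normalizing. Because the sketch below starts from the rounded standard deviations printed above, its results agree with the summary only up to rounding.

```python
from itertools import accumulate

# "Standard deviation" row of summary(pca_output), as printed above.
sdev = [1.57, 0.995, 0.5971, 0.4164]

# The variance of each PC is its squared standard deviation.
variances = [s ** 2 for s in sdev]
total = sum(variances)  # close to 4, the number of standardized variables

# Proportion of variance explained (PVE) and its running total.
pve = [v / total for v in variances]
cumulative = list(accumulate(pve))

print([round(p, 3) for p in pve])
print([round(c, 3) for c in cumulative])
```

The per-PC proportions in `pve` are also the values a scree plot displays, one point per PC.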

Scree Plot

Look for a point at which the proportion of variance explained by each subsequent PC drops off.

23-Principal Component Analysis

In lab.qmd ## Lab 23 section,

Use slice() to print the first six rows of iris data.
Perform PCA on Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
Generate a biplot and explain it.

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

sklearn.decomposition

sklearn.preprocessing

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/USArrests.csv
USArrests = pd.read_csv('./data/USArrests.csv')
USArrests.head(4)

   rownames  Murder  Assault  UrbanPop  Rape
0   Alabama    13.2      236        58  21.2
1    Alaska    10.0      263        48  44.5
2   Arizona     8.1      294        80  31.0
3  Arkansas     8.8      190        50  19.5

USArr = USArrests.drop(['rownames'], axis = 1)
USArr.index = USArrests['rownames']
USArr.head(4)

          Murder  Assault  UrbanPop  Rape
rownames                                 
Alabama     13.2      236        58  21.2
Alaska      10.0      263        48  44.5
Arizona      8.1      294        80  31.0
Arkansas     8.8      190        50  19.5

Standardization

scaler = StandardScaler()
X = scaler.fit_transform(USArr.values) ## Array

Perform PCA (prcomp(USArrests, scale = TRUE))

pca = PCA(n_components=4)
pca.fit(X)

PCA(n_components=4)

PCA Components (pca_output$rotation)

result = np.round(pca.components_.T, 2)
pd.DataFrame(result, columns=['PC1', 'PC2', 'PC3', 'PC4'], index=USArr.columns)

           PC1   PC2   PC3   PC4
Murder    0.54  0.42 -0.34  0.65
Assault   0.58  0.19 -0.27 -0.74
UrbanPop  0.28 -0.87 -0.38  0.13
Rape      0.54 -0.17  0.82  0.09

Data on PCs (pca_output$x)

X_pc = np.round(pca.transform(X), 2)
pd.DataFrame(X_pc, columns=['PC1', 'PC2', 'PC3', 'PC4'], index=USArr.index)

                 PC1   PC2   PC3   PC4
rownames                              
Alabama         0.99  1.13 -0.44  0.16
Alaska          1.95  1.07  2.04 -0.44
Arizona         1.76 -0.75  0.05 -0.83
Arkansas       -0.14  1.12  0.11 -0.18
California      2.52 -1.54  0.60 -0.34
Colorado        1.51 -0.99  1.10  0.00
Connecticut    -1.36 -1.09 -0.64 -0.12
Delaware        0.05 -0.33 -0.72 -0.88
Florida         3.01  0.04 -0.58 -0.10
Georgia         1.64  1.28 -0.34  1.08
Hawaii         -0.91 -1.57  0.05  0.90
Idaho          -1.64  0.21  0.26 -0.50
Illinois        1.38 -0.68 -0.68 -0.12
Indiana        -0.51 -0.15  0.23  0.42
Iowa           -2.25 -0.10  0.16  0.02
Kansas         -0.80 -0.27  0.03  0.21
Kentucky       -0.75  0.96 -0.03  0.67
Louisiana       1.56  0.87 -0.78  0.45
Maine          -2.40  0.38 -0.07 -0.33
Maryland        1.76  0.43 -0.16 -0.56
Massachusetts  -0.49 -1.47 -0.61 -0.18
Michigan        2.11 -0.16  0.38  0.10
Minnesota      -1.69 -0.63  0.15  0.07
Mississippi     1.00  2.39 -0.74  0.22
Missouri        0.70 -0.26  0.38  0.23
Montana        -1.19  0.54  0.25  0.12
Nebraska       -1.27 -0.19  0.18  0.02
Nevada          2.87 -0.78  1.16  0.31
New Hampshire  -2.38 -0.02  0.04 -0.03
New Jersey      0.18 -1.45 -0.76  0.24
New Mexico      1.98  0.14  0.18 -0.34
New York        1.68 -0.82 -0.64 -0.01
North Carolina  1.12  2.23 -0.86 -0.95
North Dakota   -2.99  0.60  0.30 -0.25
Ohio           -0.23 -0.74 -0.03  0.47
Oklahoma       -0.31 -0.29 -0.02  0.01
Oregon          0.06 -0.54  0.94 -0.24
Pennsylvania   -0.89 -0.57 -0.40  0.36
Rhode Island   -0.86 -1.49 -1.37 -0.61
South Carolina  1.32  1.93 -0.30 -0.13
South Dakota   -1.99  0.82  0.39 -0.11
Tennessee       1.00  0.86  0.19  0.65
Texas           1.36 -0.41 -0.49  0.64
Utah           -0.55 -1.47  0.29 -0.08
Vermont        -2.80  1.40  0.84 -0.14
Virginia       -0.10  0.20  0.01  0.21
Washington     -0.22 -0.97  0.62 -0.22
West Virginia  -2.11  1.42  0.10  0.13
Wisconsin      -2.08 -0.61 -0.14  0.18
Wyoming        -0.63  0.32 -0.24 -0.17

Explained Variance (pca_output$sdev ^ 2)

np.round(pca.explained_variance_, 2)

array([2.53, 1.01, 0.36, 0.18])
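These values are slightly larger than the squared standard deviations reported by R. A likely explanation is the divisor used when standardizing: `StandardScaler` divides by the population standard deviation (divisor $n$), while R's scaling uses the sample standard deviation (divisor $n - 1$), so the scikit-learn variances come out larger by a factor of $n/(n-1) = 50/49$. The sketch below checks this against the rounded numbers printed in this document, so agreement is only approximate.

```python
n = 50  # number of states

# pca.explained_variance_ from scikit-learn, as printed above.
sklearn_var = [2.53, 1.01, 0.36, 0.18]

# "Standard deviation" row of summary(pca_output) from R, as printed earlier.
r_sdev = [1.57, 0.995, 0.5971, 0.4164]
r_var = [s ** 2 for s in r_sdev]

# Undo the n / (n - 1) inflation caused by the population-sd scaling.
rescaled = [v * (n - 1) / n for v in sklearn_var]

for a, b in zip(rescaled, r_var):
    print(round(a, 3), round(b, 3))
```

The same factor, under a square root, would account for the small differences between the two score tables (for example 0.99 versus 0.98 for Alabama on PC1). Proportions of variance explained are unaffected because the factor cancels.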
