PCA and SVD - the math explained simply

Transcript

0:00

In this video, we'll talk about the mathematical details behind principal component analysis (PCA). I will assume that you have some basic knowledge about matrix operations and vectors; if not, check out these videos. I will first explain the main idea behind PCA, then show the math step by step to compute PCA by the eigendecomposition method and discuss its relation to the singular value decomposition (SVD) method, before we have a look at how to compute these methods using some simple R and Python code. Finally, I will try to explain why the eigenvectors are useful in PCA.

0:45

To explain how PCA works, we will use the following simple example data. For example, person number one has a diastolic blood pressure of 78 and a systolic blood pressure of 126, whereas this data point corresponds to the blood pressure of person number two, and so forth. PCA is all about combining variables so that we can reduce the dimensions of the data set.

1:14

One simple way to combine these two variables would be to just add them: 126 + 78 is 204, and so forth. This would then be our combined variable, which we call blood pressure. Adding the variables can be seen as using a linear combination of the two variables in which both weights are equal to one. The idea of PCA is to put different weights on the variables so that we maximize the variance of the combined variable. The reason we want to maximize the variance is that we then keep as much information as possible in the combined variable. Note that you will always lose some information when you combine variables. Since one could simply use very large weights to increase the variance, we need some constraint. PCA uses the constraint that the sum of the squared weights is equal to one. By using this constraint, the total variance of the original data will be equal to the total variance of the principal components that we will compute later on.

2:31

For example, the weights 0.8 and 0.6 are valid weights for combining the two variables, because the sum of their squared values is equal to one. Let's try to combine the two variables using these weights. Note that we here put more weight on the systolic blood pressure, because the weight 0.8 is larger than the weight 0.6 that we multiply by the diastolic blood pressure. Using these weights, the combined blood pressure for the first person is 147.6 and the combined blood pressure for the second person is 150.4, and so forth. This is the combined blood pressure for all six individuals. The variance of the combined blood pressure is 11.072.
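
As a minimal sketch of this step in Python: the data arrays below are illustrative placeholders (only person 1's readings, 126/78, are given in the transcript), not the video's exact six-person dataset.

```python
import numpy as np

# Illustrative blood-pressure data (placeholder values, not the video's full dataset)
systolic = np.array([126.0, 128.0, 123.0, 125.0, 127.0, 130.0])
diastolic = np.array([78.0, 82.0, 80.0, 84.0, 85.0, 83.0])

# Weights satisfying the PCA constraint: 0.8**2 + 0.6**2 == 1
w_sys, w_dia = 0.8, 0.6

combined = w_sys * systolic + w_dia * diastolic  # person 1: 0.8*126 + 0.6*78 = 147.6
print(combined)
print(combined.var(ddof=1))  # sample variance of the combined variable
```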

3:27

Let's try many different weights for our linear combination to see which combination results in the maximum variance of the combined variable. We know that the weights 0.8 and 0.6 result in a variance of the combined variable of 11.07. If we instead put more weight on the diastolic blood pressure, we see that the combined variable has a higher variance. But if we put too much weight on the diastolic blood pressure, the variance is reduced. To maximize the variance, we should therefore use these two weights. But how do we find such weights?

4:11

Well, there are basically two methods for finding the weights that maximize the variance. We will start by looking at the eigendecomposition method and then see how the SVD method works. To compute PCA using the eigendecomposition method, we perform the following steps: first we center or scale our data; then we calculate the covariance matrix of our centered or scaled data; next we compute the eigenvalues and eigenvectors of the covariance matrix; finally we order the eigenvectors and calculate the principal components.

4:54

Usually one starts by centering or scaling the data. In this case we will only center the data. We therefore subtract the mean systolic blood pressure from the individual observations. These are the centered values, which tell us how far the original values are from the mean. We then do the same calculations for the diastolic blood pressure, which has a mean of 82. We can summarize the centered data in the following table. If you would like to scale or standardize the data, you simply divide the centered data by the standard deviation of each variable. Centering the data means that we center the data points around the origin, which will help us later when we rotate the data. After we have centered the data, we have the following values, which can be plotted like this.

5:59

Next, we calculate the covariance matrix based on the centered data. If you instead scale or standardize the data, the covariance matrix will be equal to the correlation matrix. Note that the main diagonal of the covariance matrix contains the variance of each variable.

The sample variance of the systolic blood pressure is calculated like this. When we calculate the variance of the centered data, the calculations become a bit simpler, since the mean of the centered data is always equal to zero. To calculate the sample variance of the centered data, we therefore simply sum the squared centered values and divide by the sample size minus one. Then we calculate the variance of the diastolic blood pressure. Finally, we calculate the covariance, which is a measure of how much the two variables spread together. The sample covariance is calculated by multiplying the centered values of the two variables: for example, we multiply the centered values for person number one, add that to the product of the centered values for person number two, and so forth. Finally, we divide the sum of the products by the sample size minus one. We see that the spread in the diastolic blood pressure is a bit higher than the spread in the systolic blood pressure, and the covariance is somewhere between these two values.
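
A minimal numpy sketch of the centering and covariance steps described above; X holds the two variables as columns (systolic, diastolic) and uses the same illustrative placeholder values as before, not the video's literal data or code.

```python
import numpy as np

# n x 2 data matrix, columns = (systolic, diastolic); illustrative placeholder values
X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])

X_centered = X - X.mean(axis=0)            # subtract each column's mean

# Sample covariance matrix: X_c^T X_c / (n - 1)
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)
cov_numpy = np.cov(X_centered, rowvar=False)   # np.cov divides by n - 1 by default

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```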

7:39

Next, we calculate the eigenvalues of the covariance matrix. We substitute A by the covariance matrix, and this term by lambda times the identity matrix, which has the same number of rows and columns as the covariance matrix. Subtracting these two matrices results in the following matrix. Next we calculate the determinant of this matrix, which is the product of this diagonal minus the product of this diagonal. After some simplifications we have the following quadratic equation. Quadratic equations like this can be solved in different ways, which will not be discussed here. Anyway, if you plot how the left-hand side changes as a function of different values of lambda, we see that the left-hand side is equal to zero when lambda is equal to either about 0.32 or 12.08.
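
Using the covariance matrix stated later in the video (variances 4.4 and 8, covariance 5.6), the two roots of this characteristic polynomial can be checked numerically; this is a sketch, not code shown in the video.

```python
import numpy as np

S = np.array([[4.4, 5.6],
              [5.6, 8.0]])          # covariance matrix from the video (systolic, diastolic)

# Characteristic polynomial: lambda^2 - trace(S)*lambda + det(S) = 0
trace, det = np.trace(S), np.linalg.det(S)
print(np.roots([1.0, -trace, det]))   # approx [12.08, 0.32]

# Same values directly from the eigenvalue routine (ascending order)
print(np.linalg.eigvalsh(S))          # approx [0.32, 12.08]
```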

8:51

These two values are the eigenvalues of the covariance matrix. Next, we calculate the eigenvectors corresponding to these two eigenvalues. We will start by calculating the eigenvector of the covariance matrix with the corresponding eigenvalue 12.08. To calculate the eigenvectors of the covariance matrix, we use the following equation, which we discussed in the previous video about eigenvectors. We plug in the covariance matrix and one of the two eigenvalues. If we multiply the covariance matrix by this column vector, and multiply the eigenvalue by the same vector, we get the following system of equations. We move these two terms to the right-hand side, and after some simplifications we have the following system of equations. Solving for y in the two equations shows that y is equal to 1.37x. For example, if we set x equal to 1, y is equal to 1.37. This vector is therefore an eigenvector of the covariance matrix. We can illustrate this vector in the plot by drawing an arrow from the origin to the coordinates 1 and 1.37.

We will now normalize this vector to unit length, which means that it should have a length of one. The length of this vector is about 1.7, and if we divide its values by 1.7 we get the eigenvector with unit length. This vector is one of the two eigenvectors of the covariance matrix. To find the second eigenvector, we do the same calculations as before based on the second eigenvalue. After some calculations, this vector is our second eigenvector with unit length. Since the covariance matrix is symmetric, its eigenvectors corresponding to distinct eigenvalues are orthogonal; therefore, their angle is 90°.
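
A sketch of the same eigenvector computation with numpy, continuing from the covariance matrix above; np.linalg.eigh returns unit-length eigenvectors directly, possibly with flipped signs.

```python
import numpy as np

S = np.array([[4.4, 5.6],
              [5.6, 8.0]])

eigvals, eigvecs = np.linalg.eigh(S)   # eigh: symmetric matrices, eigenvalues ascending
print(eigvals)                          # approx [0.32, 12.08]
print(eigvecs)                          # columns are unit-length eigenvectors

# The eigenvector for 12.08 has direction (1, 1.37); normalizing by its length
# sqrt(1 + 1.37**2) ~ 1.7 gives the same unit vector (up to sign).
v = np.array([1.0, 1.37])
print(v / np.linalg.norm(v))
```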

11:20

Next, we order the eigenvectors based on their corresponding eigenvalues, where the eigenvector associated with the largest eigenvalue becomes our first eigenvector. Since this eigenvector is associated with the largest eigenvalue, it will be our first eigenvector, and we therefore rename it v1 instead of v2. Let's put these two eigenvectors together into a matrix that we call V, where the first column is the first eigenvector and the second column is the second eigenvector.

We now use this matrix to transform our centered data so that the two variables are completely uncorrelated. Let's define our centered data as matrix X. Next, we multiply our data matrix X by matrix V, which gives a new matrix with the transformed data. This transformed data is called the principal component scores, or just scores. Going from the centered data to the transformed data can be seen as rotating the data clockwise until the two eigenvectors point in the same directions as the x and y axes of the plot. The rotated data now looks like this. Note that the labels of the axes have been changed to principal components one and two. Let's call the two columns of the transformed data PC1 and PC2. If we were to plot this data, we would get the following plot.
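
The transformation to scores is just the matrix product of the centered data and V; a self-contained sketch with the same illustrative placeholder data, where V is built by ordering the eigenvectors by decreasing eigenvalue.

```python
import numpy as np

X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])  # illustrative, not the video's data
X_centered = X - X.mean(axis=0)

S = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)

order = np.argsort(eigvals)[::-1]       # sort so the largest eigenvalue comes first
V = eigvecs[:, order]                   # column 1 = v1, column 2 = v2

scores = X_centered @ V                 # principal component scores (PC1, PC2)
print(scores)
print(scores.var(axis=0, ddof=1))       # variances of PC1, PC2 equal the sorted eigenvalues
```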

13:13

Let's compare the centered data with the transformed data. The variance of the systolic blood pressure is 4.4, whereas the variance of the diastolic blood pressure is 8. This is the covariance matrix of the data. We see that the covariance is 5.6, which tells us that there is a positive correlation between the two variables. When we transform the data, the first variable, called PC1, has a variance of 12.08, whereas PC2 has a variance of only 0.32. This means that almost all of the variance is kept in the first principal component. If we divide the variance of the first principal component by the total variance, we see that the first principal component captures 97.4% of the total variance.

Note that the variances of the principal components correspond to the two eigenvalues we calculated earlier: the eigenvalues of the covariance matrix therefore represent the variances of the principal components. When we study the covariance matrix of our transformed data, we see that the covariance between PC1 and PC2 is equal to zero, which means that PC1 and PC2 are completely uncorrelated. Note that the total variance of PC1 and PC2 is about 12.4, which corresponds to the total variance of the original variables.
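
The proportion of variance captured is simply the first eigenvalue divided by the sum of the eigenvalues; with the values from the video:

```python
eigenvalues = [12.08, 0.32]                      # variances of PC1 and PC2 from the video
explained = eigenvalues[0] / sum(eigenvalues)
print(f"{explained:.1%}")                        # about 97.4%
```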

14:58

Remember that when we multiplied our centered data by the matrix that holds the eigenvectors as columns, we got the transformed data. This is the same as using the equation we saw at the beginning of this video to calculate the principal components, where the weights for the first principal component come from the first eigenvector and the weights for the second principal component come from the second eigenvector. For example, to calculate the corresponding score for person number six, we would multiply the centered blood pressure values of person number six by the corresponding weights. Adding these products gives a principal component score of five.

15:51

Note that the general aim of using PCA is to reduce the dimensionality of the data; in other words, we want to reduce the number of variables we have. But so far we have not reduced the number of variables, since we have the same number of principal components as the number of variables we started with. Since the first principal component captures almost all of the variance, which can be interpreted as storing almost all of the information about the two variables, we can simply delete the second principal component, because it contains almost no information. As we have seen, by using the following equation we can combine the two variables into just one variable in a way that maximizes the variance of the linear combination.

16:39

We have so far used the eigendecomposition of the covariance matrix to compute the PCA. We will now have a look at how to compute the principal component scores by instead using the singular value decomposition. The SVD method does not involve computing the covariance matrix. Instead, SVD solves the following equation, using numerical methods, so that the product of these three matrices is as similar as possible to the data matrix X, which holds our centered values. Solving this equation is generally more accurate than the eigendecomposition method, because it works directly on the data matrix X without computing the covariance matrix, which may introduce rounding errors. I will not go into all the details of SVD; instead, I will try to explain what this equation means.

17:42

We will start by calculating the so-called Gram matrix, where we multiply the transpose of the centered data by the centered data, which results in the following matrix. Note that multiplying, for example, the second column of the centered data by the first row of its transpose is exactly the same calculation that we do in the numerator when we calculate the covariance between the two variables. This means that we get the corresponding covariance matrix if we divide the Gram matrix by n minus one.

We now compute the eigenvalues of the Gram matrix, which results in the following values. If we divide these eigenvalues by n minus one, we get exactly the same eigenvalues as when we computed them from the covariance matrix. If we now compute the eigenvector based on the first eigenvalue, we end up with the same unit-length eigenvector as the one we computed from the covariance matrix. So when we compute the eigenvectors and eigenvalues of the Gram matrix instead of the covariance matrix, we get the same eigenvectors but different eigenvalues. We can now create an eigenvalue matrix of these two eigenvalues, like this. Note that when the eigenvectors are multiplied by the eigenvalue matrix and then by the transpose of the eigenvectors, we get back the Gram matrix. If we take the square root of these eigenvalues, we get the so-called singular values.
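
A sketch relating the Gram matrix, the covariance matrix, and the singular values; the data matrix is the same illustrative placeholder as before, so the printed numbers differ from the video's.

```python
import numpy as np

X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])  # illustrative, not the video's data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

gram = Xc.T @ Xc                         # Gram matrix
cov = gram / (n - 1)                     # dividing by n - 1 gives the covariance matrix

gram_eigvals = np.linalg.eigvalsh(gram)  # ascending order
print(gram_eigvals / (n - 1))            # equals the eigenvalues of the covariance matrix
print(np.sqrt(gram_eigvals))             # singular values of the centered data (ascending)
print(np.linalg.svd(Xc, compute_uv=False))  # same values from the SVD routine (descending)
```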

19:50

The main idea with the SVD method in PCA

19:53

is to factoriize a matrix

19:57

into three parts

19:59

where the rows in their transpose of V,

20:02

the vectors tell us how to rotate the

20:06

coordinate axis to point in the

20:08

directions of maximum variance.

20:12

Matrix D tell us how the data stretch

20:15

along each of these new axis. Whereas

20:18

matrix U tell us where each observation

20:21

lies after the rotation before

20:25

multiplying by D which may either

20:27

stretch or compress the data points

20:30

along the new axis. But how do we

20:33

calculate matrix U?

20:36

Well, the SVD method computes this

20:38

numerically. But let's try to do this by

20:41

hand. If you multiply both sides by a

20:44

matrix V,

20:46

this product will result in the identity

20:49

matrix because the igen vector matrix is

20:52

orthonormal. Next, we multiply both

20:56

sides by the inverse of the matrix D

20:59

where this product also results in the

21:02

identity matrix.

21:05

So we now know how to calculate matrix

21:08

U.

21:09

This is the inverse of matrix D. So if

21:13

you multiply matrix X our center data by

21:19

the igen vector matrix V and then by the

21:22

inverse of matrix D we will get matrix

21:26

U.
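
A sketch of computing U as X V D⁻¹ and checking the factorization, again with placeholder data; note that individual columns of U and V may differ in sign between methods.

```python
import numpy as np

X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])  # illustrative, not the video's data
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix, ordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]

# Singular values (square roots of the Gram-matrix eigenvalues) and the diagonal matrix D
d = np.sqrt(np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1])
D = np.diag(d)

U = Xc @ V @ np.linalg.inv(D)            # U = X V D^{-1}

print(np.allclose(U @ D, Xc @ V))        # the scores: U D equals X V
print(np.allclose(U @ D @ V.T, Xc))      # X = U D V^T reconstructs the centered data
```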

21:30

So this is our centered data. The rows of the transposed eigenvector matrix tell us how the data will be rotated, whereas matrix U gives the coordinates of the data points in the rotated principal component space, before they are stretched by the singular values in D. Note that when we multiply these two matrices, the x coordinates of the data points in the new space are multiplied by 7.77, whereas the y coordinates of the points are multiplied by 1.26, so that the data is stretched out like this. So if we multiply matrix U by matrix D, we get the final principal component scores, and if we multiply these by the transpose of the eigenvector matrix, we recreate the original centered data points, which is exactly what this equation tells us. To compute the final principal component scores, we may therefore either multiply matrix U by matrix D, or simply multiply matrix X by the eigenvector matrix V.

22:59

So this is how you would compute the decomposition in R. We first create a matrix of our two variables. Then we center the data with the function scale; if you instead would like to scale your data, you should set this argument to TRUE. Then we compute the covariance matrix and the eigenvectors of the covariance matrix. Finally, we multiply the centered data by the eigenvector matrix to get the principal component scores. We can also use the SVD method by using the svd function, which gives us the D, U, and V matrices. If we multiply the centered data by the eigenvector matrix, we get the principal component scores, which are identical to these scores. Depending on your data, you might see some small differences between the scores of the two methods. Note that the prcomp function in R uses the SVD method and not the eigendecomposition.

24:12

The corresponding code for computing the eigendecomposition in Python can be written like this. Note that we here need to sort the eigenvectors so that the first eigenvector is the one associated with the largest eigenvalue.
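
The video's actual Python code is not included in the transcript; a numpy sketch of the steps it describes (with the same illustrative placeholder data) could look like this.

```python
import numpy as np

X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])  # illustrative, not the video's data

Xc = X - X.mean(axis=0)                      # center the data
S = np.cov(Xc, rowvar=False)                 # covariance matrix

eigvals, eigvecs = np.linalg.eig(S)

# Sort so the first eigenvector is the one with the largest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # principal component scores
print(scores)
```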

24:30

Also note that Python returns negative values for the first eigenvector, which explains why the scores have opposite signs compared to the example in this video. To perform PCA with the SVD method, we can for example use the PCA class from scikit-learn.
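
A sketch using scikit-learn's PCA class (which is SVD-based), assuming the same placeholder data; PCA centers the data internally and does not scale it.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[126.0, 78.0], [128.0, 82.0], [123.0, 80.0],
              [125.0, 84.0], [127.0, 85.0], [130.0, 83.0]])  # illustrative, not the video's data

pca = PCA(n_components=2)        # centers the data internally (no scaling)
scores = pca.fit_transform(X)    # principal component scores

print(scores)
print(pca.explained_variance_)        # eigenvalues, i.e. variances of the components
print(pca.explained_variance_ratio_)  # proportion of variance captured by each PC
```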

24:51

Before we end this lecture, let's try to understand why the first eigenvector points in the direction that has maximal variance. From our previous calculations, we know that the first principal component has a variance of 12.08. We got these principal component scores by multiplying the centered values by the first eigenvector, so we can write it like this, where the variance of matrix X times the first eigenvector is 12.08. From the previous calculations, we also know that the eigenvalue we computed from the covariance matrix is 12.08.

25:38

The variance of the first principal component is therefore equal to the first eigenvalue. This is the definition of an eigenvector, where matrix A in PCA is our covariance matrix, which is usually denoted by this symbol. If we solve this equation for the eigenvalue, we obtain this equation, given that vector v has unit length. If we plug in the right-hand side here, we get this equation. We want to find a vector v with unit length that maximizes the variance. If the length of the vector is equal to one, the transpose of the vector times the vector itself is equal to one, as I showed at the beginning of this video. When we have a constraint like this, one method is to use a Lagrange multiplier, where this term enforces the values of the vector to fulfill the condition. If we take the derivative of this equation with respect to v1, we obtain the following equation. We now set the left-hand side to zero, because the derivative is equal to zero when we reach the maximum of this function. We can now rearrange this equation and cancel the twos, so that we end up with an equation that is identical to the defining equation of an eigenvector.
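
Written out compactly, the argument just described looks like this, where Σ is the covariance matrix, v a unit-length weight vector, and λ the Lagrange multiplier, which turns out to be the eigenvalue:

```latex
\max_{v}\ \operatorname{Var}(Xv) = v^{\top}\Sigma v
\quad \text{subject to}\quad v^{\top}v = 1

L(v,\lambda) = v^{\top}\Sigma v - \lambda\,(v^{\top}v - 1)

\frac{\partial L}{\partial v} = 2\Sigma v - 2\lambda v = 0
\;\Longrightarrow\; \Sigma v = \lambda v

\operatorname{Var}(Xv) = v^{\top}\Sigma v = \lambda\, v^{\top}v = \lambda
```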

27:18

So this explains why the first eigenvector points in the direction with maximal variance. This was the end of this video about the fundamental math behind PCA. Thanks for watching.

Summary

This video explains the mathematical details behind Principal Component Analysis (PCA). It covers the main idea of PCA, which is to reduce data dimensionality by combining variables while maximizing variance. The explanation is step-by-step, detailing the eigendecomposition method, including data centering, covariance matrix calculation, eigenvalue and eigenvector computation, and finally, the calculation of principal components. The video also discusses the relationship between PCA and Singular Value Decomposition (SVD), and provides examples of how to compute PCA using R and Python code. Finally, it clarifies why eigenvectors are crucial in PCA by showing they point in the direction of maximal variance.
