UMAP explained simply

Transcript

0:00

We will here have a look at UMAP, which is a method similar to the t-SNE method that I covered in a previous video. UMAP stands for Uniform Manifold Approximation and Projection, and it is a dimension reduction method that captures both local and global structures of high-dimensional data. The UMAP method was published in 2018.

0:26

I will first show some applications where UMAP can be used and compare it with principal component analysis. Then we will have a look at a simple example that explains the basic math behind UMAP.

0:39

We'll first see how UMAP and PCA perform on the classical MNIST data set, which contains thousands of images of handwritten digits. Every row in this data set contains pixel values between 0 and 255 for each image. The first column shows which digit the image contains. For example, the first row represents an image of a handwritten five, whereas the second row represents an image of a handwritten zero, and so forth. Each image is a square that contains 784 pixels, which means that each image, or row, has 784 columns in addition to the first column. Suppose that we would like to generate an image of the data located on row number six. We would then put the elements in column numbers 2 to 29 here, then the next 28 elements here, and so forth, until we have a square matrix with 28 rows and columns. This image is supposed to represent a handwritten two.

1:51

We'll now use UMAP and PCA to compress these 784 columns into just two columns and plot them in a two-dimensional plot. We will then label the points in the plot so that we can see which point corresponds to which digit. Similar to PCA, UMAP is an unsupervised method because it does not use these labels in its computation to separate groups or clusters. This is an example of a UMAP plot of 10,000 images in the MNIST data set. Each point in this plot represents one image. For example, this point represents the following handwritten eight, whereas this point represents a four. We can see that UMAP separates the images of the 10 digits quite well. In comparison, this is the corresponding PCA plot of the MNIST data set, which fails to separate the digits. I will explain why in a minute.
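
The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the umap-learn and scikit-learn packages are available and using OpenML's "mnist_784" copy of MNIST; it is not code from the video itself.

```python
# Sketch: UMAP vs. PCA on 10,000 MNIST images (umap-learn and scikit-learn assumed).
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# 70,000 rows x 784 pixel columns, plus a label column with the digit.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:10000] / 255.0, y[:10000].astype(int)   # subsample 10,000 images, scale pixels

emb_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, title in zip(axes, [emb_umap, emb_pca], ["UMAP", "PCA"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=2)
    ax.set_title(title)
plt.show()
```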

3:01

UMAP plots are commonly used in single-cell RNA sequencing to visualize high-dimensional gene expression data in two dimensions and to identify clusters of cells with similarly expressed genes. Each point in this UMAP plot represents a single cell. Points in the same cluster can be seen as cells that have a similar gene expression profile. Such a cluster of cells therefore usually represents cells of the same cell type. For example, the clusters may represent different cells in our immune system. If you were to use PCA on the same data set, the clusters would not separate well.

3:44

So why is this the case? Well, one reason is that a PCA plot usually shows only the first two principal components, because these explain most of the variance. By illustrating only the first two components, which in this case explain only about 5% of the total variance, we lose a lot of information that is stored in the other components. For example, we can see that clusters two and six are not well separated based on the first two principal components. If we instead plot component one versus component five, clusters two and six are now well separated. The UMAP method does not hold important information in several components like this. Instead, it includes all the information in just two dimensions, which explains why the UMAP plot better identifies the clusters that exist in the high-dimensional data.

4:44

However, the UMAP method is computationally expensive, which is why it is sometimes combined with PCA. One can therefore compress the variables by first using PCA and selecting, for example, the first 30 to 50 components that store most of the information, and then computing the UMAP based on these components instead of all the variables.
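
As a rough sketch of that shortcut, the two steps can be chained with scikit-learn's pipeline; the choice of 50 components simply follows the example above.

```python
# Sketch: PCA to 50 components first, then UMAP on those components
# (umap-learn's UMAP follows the scikit-learn transformer interface).
import umap
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca_then_umap = make_pipeline(
    PCA(n_components=50),                       # keep the most informative directions
    umap.UMAP(n_components=2, random_state=42),
)
emb = pca_then_umap.fit_transform(X)            # X: the pixel matrix from the earlier sketch
```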

5:09

Another reason why PCA might fail to separate existing clusters is that it is a linear method. For example, if we were to project this nonlinear two-dimensional data set into one dimension, the clusters would not separate well with PCA, whereas UMAP would do a good job of separating the existing clusters because it preserves local nonlinear relationships. Although these points are fairly close in the global geometry, UMAP also tries to preserve the local structure. Close points are assigned a high connectivity weight, indicating a strong likelihood of being neighbors, which means that these points also have a high connectivity weight, and so forth.

6:04

A UMAP is controlled by two hyperparameters that are set by the user. The parameter n_neighbors can be seen as the number of neighbors each data point selects for its local neighborhood in the high-dimensional space. The effect of this parameter is shown on the MNIST data set. A small value focuses more on local structures, which might fragment some clusters. A larger value focuses the plot more on the global structure, where clusters may overlap a lot. The default value of 15 is usually a good value that captures both local and global structures.

6:49

The parameter min_dist controls the minimum distance between points in the low-dimensional space, which I will explain later on. A low value results in tight clusters, whereas a too high value might cause overlapping clusters. Setting these two parameters to 15 and 0.2 gives a nice UMAP plot of the MNIST data set.

7:16

The parameter spread was not described in the original UMAP paper but has been included in several implementations to control the distance between the clusters. Increasing the value of this parameter increases the distance between the clusters in the MNIST data set.
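
In the umap-learn implementation these knobs map directly onto constructor arguments; a sketch with the settings mentioned above:

```python
# Sketch: the hyperparameters discussed above, using umap-learn's own
# parameter names (n_neighbors, min_dist, spread).
import umap

reducer = umap.UMAP(
    n_neighbors=15,   # size of each point's local neighborhood
    min_dist=0.2,     # minimum spacing between points in the embedding
    spread=1.0,       # overall scale; larger values push clusters further apart
    n_components=2,
    random_state=42,
)
emb = reducer.fit_transform(X)   # X: the MNIST pixel matrix from the first sketch
```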

7:34

We will now discuss the basic math behind UMAP. The math that I will show comes mainly from Appendix C of the paper, where UMAP is explained in a way similar to the t-SNE method, which is a lot simpler to understand. We will use a similar three-dimensional data set as in the video about t-SNE and use the same random initial coordinates in two dimensions. UMAP may either use random initial points like this, or points initialized with the method of spectral embedding. We'll here see how UMAP reduces the dimensions of this high-dimensional data into low-dimensional data; in other words, how it reduces three-dimensional data into two-dimensional data.

8:25

In comparison to t-SNE, which uses a Gaussian-like distribution that I explained in the t-SNE video, UMAP uses an exponential kernel based on some distance metric. I will here use the Euclidean distance. UMAP uses this equation to compute a symmetric matrix, whereas t-SNE uses this equation. UMAP starts by calculating, for example, the Euclidean distances between all data points of the original data in the high-dimensional space.

8:59

For example, to calculate the Euclidean distance between data points one and five, we plug the x, y and z coordinates into this equation and do the math. We then plug the rounded Euclidean distance into a distance matrix. If we calculate the distances between all data points in our data set, we get the following values. For example, we can see that the Euclidean distance between data points two and four is 5.19. Let's put the Euclidean distances based on the high-dimensional data set here.
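
As a small sketch of this step: the transcript does not list the actual three-dimensional coordinates, so the five points below are made-up placeholders; only the pairwise-distance computation itself matters.

```python
# Sketch: pairwise Euclidean distance matrix for a small 3-D data set.
# The coordinates are illustrative placeholders, not the video's values.
import numpy as np

X_high = np.array([
    [1.0, 2.0, 0.5],
    [4.0, 0.5, 1.0],
    [1.5, 2.5, 0.0],
    [5.0, 4.5, 3.0],
    [4.5, 1.0, 1.5],
])

# d_ij = sqrt(sum_k (x_ik - x_jk)^2), collected into an n x n matrix
diff = X_high[:, None, :] - X_high[None, :, :]
D_high = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D_high, 2))
```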

9:42

Next, we determine the value of k, also called n_neighbors. The default value is usually 15, but since we have a small data set, we set this value to three. This means that when we consider, for example, the distances between data point one and all the other four data points, we only focus on the three closest points to data point number one. For example, these are the three closest neighbors to data point number two, and these are the three closest neighbors to data point number three, and so forth.

10:22

For data point one, we plug in its shortest distance here, which is its distance to data point number three. Then we plug in the distance between data points one and two. Here we set sigma to one for now and do the math. Then we plug in the distance between data points one and three. We skip this distance because data point four is not one of the three closest points to data point one. Finally, we plug in the distance between data points one and five.

10:59

If we do the math in the numerator, we get these values. This is an exponential distribution, where sigma determines the area under the curve. Since sigma is set to one, the area under the curve is now one. If we plug the values calculated in the numerators into the exponential distribution, we can see that the heights of the curve correspond to these values.

11:33

Sigma is now optimized so that the sum of the heights is equal to the log base 2 of the hyperparameter n_neighbors. Remember that we previously set the hyperparameter n_neighbors to three. Sigma is therefore adjusted so that the sum of the heights is equal to log2(3) ≈ 1.58. If we set sigma to 0.648, the sum will be equal to about 1.58. Note that the area below the curve is now 0.648. Since the area below the curve must be equal to one for it to be defined as a probability distribution, the unnormalized exponential function used in UMAP is instead referred to as an exponential kernel. These are the adjusted weights for the different optimized values of sigma for each row. Note that every row now sums to about 1.58.
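
A minimal sketch of this row-wise construction, assuming the kernel form from the UMAP paper, exp(-(d_ij - rho_i) / sigma_i), where rho_i is the distance to the nearest neighbor, and a simple binary search for each sigma_i so that the row of weights sums to log2(k):

```python
# Sketch: high-dimensional weights with a per-row binary search for sigma,
# following the description above (kernel form taken from the UMAP paper).
import numpy as np

def high_dim_weights(D, k=3, n_iter=64):
    n = D.shape[0]
    V = np.zeros((n, n))
    target = np.log2(k)                        # each row of weights should sum to this
    for i in range(n):
        neighbors = np.argsort(D[i])[1:k + 1]  # k nearest neighbors, excluding i itself
        rho = D[i, neighbors[0]]               # shortest distance (to the closest neighbor)
        lo, hi = 1e-8, 100.0
        for _ in range(n_iter):                # binary search: the weight sum grows with sigma
            sigma = (lo + hi) / 2.0
            w = np.exp(-(D[i, neighbors] - rho) / sigma)
            if w.sum() > target:
                hi = sigma
            else:
                lo = sigma
        V[i, neighbors] = np.exp(-(D[i, neighbors] - rho) / sigma)
    return V

V_dir = high_dim_weights(D_high, k=3)          # D_high from the previous sketch
print(np.round(V_dir.sum(axis=1), 2))          # every row sums to about log2(3) = 1.58
```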

12:41

These dots represent the distances that we did not consider because they were not among the three closest neighbors. This is therefore a sparse matrix. A sparse matrix is more computationally efficient to work with because the memory use is n × k instead of n × n. These elements, which correspond to data points that are not among the three closest neighbors, are set to zero.

13:11

Next, we symmetrize this matrix with this equation, where the symmetric value between, for example, data points two and four is equal to 0.2993. If we do the same calculations for the other values, we get these values. Note that the upper triangle holds the same values as the lower triangle, which means that we have a symmetric matrix. These are the normalized values for the high-dimensional data that will be compared with the corresponding values for the low-dimensional data.
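
The symmetrization equation itself is only shown on screen; the sketch below uses the fuzzy set union from the UMAP paper, v_ij = v_j|i + v_i|j − v_j|i · v_i|j, which is the rule that produces this kind of symmetric matrix.

```python
# Sketch: symmetrizing the directed weights with the fuzzy set union
# v_ij = v_j|i + v_i|j - v_j|i * v_i|j (the rule used in the UMAP paper).
import numpy as np

V = V_dir + V_dir.T - V_dir * V_dir.T   # V_dir from the previous sketch
assert np.allclose(V, V.T)              # upper and lower triangles now match
```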

13:50

Next, we compute the distances for the low-dimensional data with this equation. This equation has two parameters, a and b, that are optimized based on the hyperparameter min_dist that is set by the user. I will explain how these values are calculated at the end of this video. If we set the hyperparameter min_dist to 0.5, a will be equal to 0.576 and b will be equal to 1.363. If a and b are set to one, we get the same expression as in the numerator of the corresponding equation for the t-SNE method.

14:30

We start by calculating the Euclidean distance between the points in the low-dimensional space. These are our initial random coordinates in two dimensions. To calculate the Euclidean distance between, for example, data points one and three in the low-dimensional space, we plug in the x and y coordinates, do the math, and plug the resulting value into the distance matrix. If we calculate the distances between all data points, we get this matrix.

15:07

We can now plug the Euclidean distances into this equation and compute the first Q value like this, which we plug into the Q matrix. If we compute the other values, we get the following Q matrix, in which the main diagonal is set to zero.
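
A sketch of this step, taking the low-dimensional curve to be q_ij = 1 / (1 + a · d_ij^(2b)) as in the UMAP paper, with the a and b values quoted above; the random starting coordinates are placeholders, not the video's values.

```python
# Sketch: low-dimensional similarities q_ij = 1 / (1 + a * d^(2b)).
# The 2-D starting coordinates are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.uniform(-2.0, 2.0, size=(5, 2))     # initial low-dimensional coordinates

a, b = 0.576, 1.363                         # values for min_dist = 0.5, as quoted above

def low_dim_q(Y, a, b):
    diff = Y[:, None, :] - Y[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distances in 2-D
    Q = 1.0 / (1.0 + a * D ** (2 * b))
    np.fill_diagonal(Q, 0.0)                # main diagonal set to zero
    return Q, D

Q, D_low = low_dim_q(Y, a, b)
```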

15:27

These values will be adjusted by comparing them with the values in our previous symmetric V matrix based on the high-dimensional data, using the following cross-entropy cost function that we minimize with the gradient descent method. We here only need to include the values in the upper triangles. If we compute the value of the cost function based on the upper triangles, we get a value of about 7.7.
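
The cost function is only shown on screen; the sketch below uses the standard UMAP cross entropy, CE = Σ_{i<j} [ v_ij log(v_ij / q_ij) + (1 − v_ij) log((1 − v_ij) / (1 − q_ij)) ], with a small clipping constant to avoid log(0) for the zero entries of the sparse V matrix.

```python
# Sketch: cross-entropy cost over the upper triangles of V and Q.
import numpy as np

def cross_entropy(V, Q, eps=1e-12):
    iu = np.triu_indices_from(V, k=1)      # upper triangle, i < j
    v = np.clip(V[iu], eps, 1.0 - eps)
    q = np.clip(Q[iu], eps, 1.0 - eps)
    return np.sum(v * np.log(v / q) + (1.0 - v) * np.log((1.0 - v) / (1.0 - q)))

print(cross_entropy(V, Q))                 # V and Q from the earlier sketches
```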

15:58

We'll now change the coordinates of the points in the low-dimensional data and update the values in the Q matrix to reduce the value of the cost function. UMAP uses stochastic gradient descent to do this, but to simplify the calculations, I have here just used the gradient descent method. To compute the gradient descent update, we need to find the derivative of this function, which looks like this. We know the values of a and b, the values in the V and Q matrices, the coordinates of the points in the low-dimensional data, and the corresponding Euclidean distances. We'll here see how to calculate the derivative based on the first data point.

16:49

We plug in these values here, and then these values here. Then we plug in the Euclidean distances between data point one and all the other data points here. Next, we plug in the x and y coordinates of data point one and the coordinates of the other data points, and do the math. Note that this calculation is based on more exact numbers than the rounded values shown in the Q matrix and the distance matrix.

17:29

If we do the calculations for all rows, we get these values, which we plug into the gradient descent formula, where alpha is the learning rate that we here set to 0.01. These are our current coordinates in the low-dimensional space. So if we multiply these values by the learning rate and subtract the resulting values from these values, we get these updated values, and the coordinates of the data points are therefore updated like this.
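
A sketch of this update loop is shown below, built on the earlier sketches (Y, V, low_dim_q and cross_entropy). The closed-form derivative is only shown on screen, so this sketch substitutes a finite-difference gradient as a stand-in (much slower, but clearly labeled) while keeping the same update rule, Y ← Y − α · gradient, with α = 0.01 and 200 iterations as in the video.

```python
# Sketch: full-batch gradient descent on the cross entropy, using a
# finite-difference gradient in place of the closed-form derivative.
import numpy as np

def numerical_grad(Y, V, a, b, h=1e-6):
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += h
        Ym[idx] -= h
        grad[idx] = (cross_entropy(V, low_dim_q(Yp, a, b)[0]) -
                     cross_entropy(V, low_dim_q(Ym, a, b)[0])) / (2.0 * h)
    return grad

alpha = 0.01
for step in range(200):                      # 200 iterations, as in the video
    Y = Y - alpha * numerical_grad(Y, V, a, b)

Q_final, _ = low_dim_q(Y, a, b)              # should now be close to the V matrix
```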

18:12

Note that data points four and five have now moved closer to each other, so that the locations of the points in the low-dimensional space are more similar to the ones we see in the high-dimensional space. With these updated values, we have reduced the cost function from 7.7 to 6.9. If we repeat the same calculations 200 times, we get the following plot, which closely resembles what we see in the high-dimensional space. After 200 iterations, these are the values in the Q matrix, which closely match those in the V matrix.

18:56

Finally, I will explain how the parameter min_dist is related to the parameters a and b. The red curve is the target curve, whose exact shape is determined by the parameter min_dist. The min_dist parameter controls the length of the flat part of this curve. If the distance is shorter than or equal to min_dist, the value of the curve is equal to one, and if the distance is longer, the curve follows an exponential decay function.

19:31

Now we use nonlinear regression to fit this function to the points on the target curve for a range of different distances, so that we get the following fitted curve, where a and b are estimated to the following values. If we plug these values in here and vary d between zero and six, we get this blue curve. So when we plug the distances into this function, it can be seen as computing the heights of this fitted curve based on the distances between, for example, data point one and the other data points.
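
A sketch of that fit, assuming a spread of 1 so that the target curve equals 1 up to min_dist and then decays as exp(-(d − min_dist)); this mirrors what umap-learn's find_ab_params helper does, and the exact estimates depend slightly on the fitting range.

```python
# Sketch: estimating a and b from min_dist by nonlinear least squares,
# fitting 1 / (1 + a * d^(2b)) to the piecewise target curve described above.
import numpy as np
from scipy.optimize import curve_fit

def fitted_curve(d, a, b):
    return 1.0 / (1.0 + a * d ** (2 * b))

def fit_ab(min_dist, spread=1.0):
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(fitted_curve, d, target)
    return a, b

print(fit_ab(0.5))   # roughly the a and b values quoted earlier for min_dist = 0.5
```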

20:12

If we increase min_dist from, for example, 0.5 to 1, the flat part of the fitted curve gets longer, and the distances between the data points increase so that we get about the same optimal Q values. The points are therefore forced to keep more distance between each other when we increase min_dist. This explains why the data points are more spread out within the clusters when we increase the parameter min_dist.

20:44

This was the end of this video about UMAP. Thanks for watching.

Summary

This video explains UMAP (Uniform Manifold Approximation and Projection), a dimension reduction technique that captures both local and global structures of high-dimensional data. It compares UMAP with Principal Component Analysis (PCA), demonstrating UMAP's superior performance in separating clusters in datasets like handwritten digits (MNIST) and single-cell RNA sequencing data. The explanation delves into the mathematical underpinnings of UMAP, including how it constructs a high-dimensional graph using an exponential kernel and then optimizes a low-dimensional representation using cross-entropy. Key hyperparameters like n_neighbors and min_dist are discussed, along with their impact on the resulting visualizations. The video also touches upon how UMAP can be combined with PCA for computational efficiency and explains the role of the spread parameter.
