t-SNE explained: Visualizing High-Dimensional Data

Transcript

0:00

In this video, we will have a look at the t-SNE method, which can be used for visualizing high-dimensional data in, for example, two dimensions. t-SNE stands for t-distributed stochastic neighbor embedding, a method that captures the local structure of high-dimensional data while also revealing global structure, such as the presence of clusters. The t-SNE method was published in 2008, and the equations that I will show at the end of this video are given according to this paper and to Wikipedia.

0:35

I will first show some applications where t-SNE can be used and compare it with principal component analysis. Then we will have a look at a simple example that explains the basic idea behind t-SNE before we go into the mathematical details.

0:56

We will first see how t-SNE and PCA perform on the classical MNIST data set, which contains thousands of images of handwritten digits. Every row in this data set contains the pixel values of one image. The first column shows which digit the image contains. For example, the first row represents an image of a handwritten five, whereas the second row represents an image of a handwritten zero, and so forth. Each image is a square that contains 784 pixels, which means that each image, or row, has 784 columns in addition to the first column.

1:42

Suppose that we would like to generate an image of the data located on row number six. We would then put the elements in column numbers 2 to 29 here, then the next 28 elements here, and so forth, until we have a square matrix with 28 rows and columns. This image is supposed to represent a handwritten two.
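A minimal sketch of this reshaping step, assuming the data has already been loaded into a NumPy array laid out as described above (label in the first column, then 784 pixel columns); the file name below is a placeholder, not from the video:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical load: one row per image, column 0 = digit label,
# columns 1..784 = pixel values, as described in the video.
data = np.loadtxt("mnist_train.csv", delimiter=",")

row = data[5]                    # row number six (0-based index 5)
label = int(row[0])              # first column: which digit the image shows
image = row[1:].reshape(28, 28)  # remaining 784 values -> 28 x 28 square matrix

plt.imshow(image, cmap="gray")
plt.title(f"Label: {label}")
plt.show()
```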

2:08

We will now use t-SNE and PCA to compress these 784 columns into just two columns, plot them in a two-dimensional plot, and label the points so that we can see which point corresponds to which digit. Just like PCA, t-SNE is an unsupervised method, because it does not use these labels in its computation to identify the groups.

2:34

This is an example of a t-SNE plot of 10,000 images from the MNIST data set. Each point in this plot represents one image. For example, this point represents the following handwritten eight, whereas this point represents a four. We can see that the t-SNE plot separates the images of the 10 digits quite well. In comparison, this is the corresponding PCA plot of the MNIST data set, which fails to separate the digits. I will explain why in a minute.
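A minimal sketch of how such plots can be produced with scikit-learn, continuing the hypothetical `data` array from the snippet above (the parameter choices are illustrative, not the video's exact settings):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = data[:10000, 1:]        # pixel columns of the first 10,000 images
labels = data[:10000, 0]    # digit labels, used only to color the points

# Compress the 784 columns into two columns with each method.
X_tsne = TSNE(n_components=2).fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap="tab10", s=3)
ax1.set_title("t-SNE")
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="tab10", s=3)
ax2.set_title("PCA (first two components)")
plt.show()
```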

3:14

t-SNE plots are commonly used in single-cell RNA sequencing to visualize high-dimensional gene expression data in two dimensions and to identify clusters of cells with similarly expressed genes. Each point in this t-SNE plot represents a single cell. Points in the same cluster can be seen as cells that have a similar gene expression profile. Such a cluster of cells therefore usually represents cells of the same cell type. For example, the clusters may represent different cells in our immune system.

3:53

If you were to use PCA on the same data set, the clusters would not separate well. So why is this the case?

4:02

Well, one reason is that a PCA plot usually shows only the first two principal components, because these explain most of the variance. By illustrating only the first two components, which in this case explain only about 5% of the total variance, we lose a lot of the information that is stored in the other components. For example, we can see that clusters two and six are not well separated based on the first two components.

4:34

If you instead plot component one versus component five, clusters two and six are now well separated. The t-SNE method does not spread important information across several components like this. Instead, it includes all of the information in just two dimensions, which explains why the t-SNE plot clearly shows the clusters that exist in the high-dimensional data.

5:01

However, the t-SNE method is computationally expensive, which is why it is sometimes combined with PCA. One can therefore compress the variables by first using PCA, selecting for example the first 30 to 50 components that store most of the information, and then computing the t-SNE plot based on these components instead of on all the variables.
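A minimal sketch of this PCA-then-t-SNE pipeline with scikit-learn, under the same assumptions as the earlier snippet (50 components is just one example from the 30-to-50 range mentioned above):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# First reduce the 784 pixel columns to 50 principal components,
# then run t-SNE on those components instead of on all variables.
X_reduced = PCA(n_components=50).fit_transform(X)
X_tsne = TSNE(n_components=2).fit_transform(X_reduced)
```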

5:26

Another reason why PCA might fail to separate the existing clusters is that it is a linear method. For example, if you were to project this nonlinear two-dimensional data set onto one dimension, the clusters would not separate well with PCA, whereas the t-SNE method would do a good job of separating the existing clusters because it preserves local nonlinear relationships.

5:55

All of these points are fairly close in the global geometry. t-SNE tries to preserve the local neighborhoods: points that are close have a high probability of being neighbors, which means that these points also have a high probability of being neighbors, and so forth.

6:18

Another example where t-SNE is commonly used is to visualize word embeddings. This is the same data that I showed in the video about transformers. The GPT model had about 12,000 embedding dimensions. If we were to compress thousands of word embedding dimensions into just two by using t-SNE, we would get a plot like this, where every point represents a word or a token. For example, this plot shows 5,000 words. Similar words are close in this t-SNE plot because their high-dimensional embeddings have similar features, or contextual meanings.

7:02

So when you see a t-SNE plot like this, you can interpret it as follows: points that are close together in the t-SNE plot were likely close neighbors in the original high-dimensional space.

7:15

However, you should not use the plot to compare distances between clusters. For example, the eights and sixes in the MNIST data set are far apart here, although they would be expected to be close in the high-dimensional space, since such handwritten digits should be quite similar. Also, do not try to interpret the numbers on the axes, because these values do not mean anything.

7:39

We will now have a look at a very simple example that illustrates the basics of how t-SNE works before we go into its exact mathematical details.

7:53

Suppose that we have the following simple three-dimensional data set with five data points, which can be visualized like this in a three-dimensional scatter plot. We see that data points four and five are close, whereas data points one, two, and three are close. The idea is to project these data points into two dimensions so that we can visualize the data in a simple two-dimensional scatter plot. In t-SNE, the high-dimensional space refers to the original data, whereas the low-dimensional space is a simplified representation that preserves the local similarities between the points.

8:34

The method starts by calculating the Euclidean distance between all data points of the original data.

8:43

For example, if we were to calculate the distance between data points one and five, we would plug the x, y, and z coordinates into this equation and do the math. We then enter the rounded Euclidean distance into a distance matrix. If we calculate the distances between all data points in our data set, we get the following values. For example, we can see that the Euclidean distance between data points 2 and 4 is 5.2. Then we might normalize these distances by, for example, dividing all the values by the maximum value in this matrix, so that we get the following values.
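A minimal sketch of this step in NumPy; the coordinates below are made up for illustration, since the video's actual values are not reproduced in the transcript:

```python
import numpy as np

# Hypothetical 3-D coordinates of the five data points
# (rows are points; columns are x, y, z).
X3 = np.array([
    [1.0, 2.0, 1.5],
    [1.2, 1.8, 1.4],
    [0.9, 2.2, 1.6],
    [5.0, 5.5, 4.8],
    [5.2, 5.4, 5.0],
])

# Pairwise Euclidean distances between all data points.
diff = X3[:, None, :] - X3[None, :, :]
D_high = np.sqrt((diff ** 2).sum(axis=-1))

# Normalize by the maximum value so that these distances can be compared
# with the ones we will compute in the low-dimensional space.
D_high_norm = D_high / D_high.max()
print(np.round(D_high_norm, 2))
```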

9:29

Next, we place five data points at random coordinates in a two-dimensional space and calculate the Euclidean distances between the points. For example, to calculate the distance between data points one and three, we plug in the x and y coordinates of these two data points and do the math. Next, we normalize the distances by dividing by the maximum value in this matrix, so that we get these values. The reason why we normalize is so that we can better compare the distances.

10:07

Data points in high dimensions generally have longer distances to each other than in low dimensions, which explains why we normalize the distances. The idea is now to move these data points around so that the normalized distances in two dimensions are as close as possible to the normalized distances in the high-dimensional space. For example, we can see that the normalized distances between data points two and three differ quite a lot between the high- and low-dimensional spaces. Let's move this data point from here to here and recalculate the normalized distances. We see that the distances now become more similar to the normalized distances in three dimensions, which means that the points in the two-dimensional plot now better resemble the points in the high-dimensional space.

11:04

If we continue to move the data points around, we might end up with the following plot, which looks quite similar to the plot in three dimensions. We also see that these two matrices now have similar values.

11:20

We will now have a look at the math behind t-SNE. The method starts by calculating the Euclidean distances between the data points in the high-dimensional space, just as we did before. The distances are shown here with two decimal places. Note that I also fill in the upper triangle of this matrix; you will see why soon. These two values are the same because they both represent the distance between data points one and three. Next, we square the distances.

11:55

We will now calculate the conditional probabilities, where these terms correspond to our squared distances. This formula can be used to calculate the normal distribution, so what we have up here is similar to this part of the formula for the normal distribution. The numerator can therefore be seen as the height of a normal, or Gaussian, distribution for a given distance between two data points, where sigma squared represents the variance, which controls the width of the normal distribution. To simplify the calculation, I will here use a fixed value of sigma so that these two terms are equal to one. In t-SNE, the value of sigma is not fixed and is related to the so-called perplexity parameter, which I will explain at the end of this video.
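For reference, the conditional probability being computed here is defined in the original paper (and on Wikipedia) as follows, where sigma_i is the bandwidth of the Gaussian centered on point x_i; in the walkthrough above, sigma is fixed so that 2*sigma^2 = 1:

$$
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^{2} / 2\sigma_i^{2}\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^{2} / 2\sigma_i^{2}\right)}, \qquad p_{i\mid i} = 0
$$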

12:53

Let's apply this formula to calculate the conditional probability in row two, column one. We plug in the squared Euclidean distance between data points one and two here. What we get in the numerator can be seen as the height of the Gaussian curve at the squared Euclidean distance between data points one and two. Then we plug the values of the second row into the denominator, but not the distance between a data point and itself. This term can be seen as the height of the Gaussian curve based on the distance between data points one and two, and this is the height based on the distance between data points two and three. The heights that correspond to the distances from data point two to data points four and five will be very small.

13:45

If we do the math, we get a value of about 0.39, which we plug into a matrix of conditional probabilities. If we compute the whole matrix, we get these values, where each row now sums to one.
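A minimal sketch of this matrix of conditional probabilities, continuing the hypothetical coordinates from the earlier snippet (so the numbers will not match the video's 0.39), with sigma fixed so that 2*sigma^2 = 1:

```python
# Conditional probabilities p_{j|i} with a fixed sigma (2 * sigma^2 = 1).
# D_high is the pairwise Euclidean distance matrix computed earlier.
D2 = D_high ** 2
heights = np.exp(-D2)                  # Gaussian heights for the squared distances
np.fill_diagonal(heights, 0.0)         # a point is never its own neighbor
P_cond = heights / heights.sum(axis=1, keepdims=True)  # each row sums to one
print(np.round(P_cond, 2))
```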

14:03

This row can be seen as the conditional probabilities that data point one would pick one of the other four data points as its neighbor. For nearby data points, the conditional probability is high, whereas for data points further away, the conditional probabilities will be very small for reasonable values of sigma. The t-SNE method can therefore be seen as modeling dissimilar data points with large pairwise distances and similar data points with small pairwise distances.

14:40

Note that the values for data points one and three are different. One therefore computes the average of these two values and also divides by the sample size. In this case, we get a value of 0.11, which we plug in here. Then we compute the same for the other elements. We now have a symmetric matrix whose values sum up to one. We are now done with all the calculations for the high-dimensional data, so let's put it here as a reference.
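In symbols, this symmetrization is p_ij = (p_{j|i} + p_{i|j}) / (2n). Continuing the sketch above:

```python
# Average the two conditional probabilities and divide by the sample size n,
# so that the symmetric matrix P sums to one: p_ij = (p_{j|i} + p_{i|j}) / (2n).
n = P_cond.shape[0]
P = (P_cond + P_cond.T) / (2 * n)
print(np.round(P, 2), P.sum())   # P.sum() is 1.0
```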

15:17

Next, we place five data points at random coordinates in a two-dimensional space, which explains why t-SNE is a stochastic method: the initial coordinates will be different every time we create a new plot. Then we calculate the squared Euclidean distances between the five data points in the low-dimensional space.

15:41

We could now use the same equation based on the Gaussian distribution as we used previously, with sigma set to one over the square root of two so that the equation no longer depends on sigma.

15:56

However, that would generate an SNE plot, which tends to squeeze the points together too much and causes the crowding problem explained in the t-SNE paper. The authors therefore used a formula based on the t-distribution with one degree of freedom, which generates the t-SNE plot. The t in the name t-SNE thus refers to the t-distribution. The reason for instead using the t-distribution with one degree of freedom is that it is wider, with heavier tails, than the Gaussian distribution.

16:33

To compute a value of the Q matrix, for example the one in the third row, first column, we plug the squared Euclidean distance in here, and all the other values in this matrix that are not on the main diagonal into the denominator. We then fill this value into the Q matrix. If we do the same calculations for the other elements of this matrix, it will contain these values, which should sum up to one.

17:03

The idea is now to move these data points around so that the values in the Q matrix become as similar as possible to the ones in the P matrix.
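A minimal sketch of the Q matrix, continuing the code above; following the paper, q_ij = (1 + ||y_i - y_j||^2)^(-1) divided by the sum of that quantity over all pairs (the random initialization here is an illustrative choice):

```python
rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(P.shape[0], 2))   # random initial 2-D coordinates

def q_matrix(Y):
    """Low-dimensional similarities based on a t-distribution with one
    degree of freedom, normalized over all pairs so the matrix sums to one."""
    diff = Y[:, None, :] - Y[None, :, :]
    num = 1.0 / (1.0 + (diff ** 2).sum(axis=-1))   # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(num, 0.0)                     # exclude the main diagonal
    return num / num.sum()

Q = q_matrix(Y)
print(np.round(Q, 2))
```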

17:17

This can be done by minimizing this cost function, the Kullback-Leibler divergence between P and Q. If the values in the two matrices are identical, this ratio will be equal to one, and since the log of one is zero, the cost function would then be equal to zero, which means that the two matrices are identical.

17:41

Let's plug in our values, except for the values on the main diagonal, to compute the value of this cost function. We see that our cost function results in a value of 0.77.
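The cost function is C = KL(P || Q) = sum over i != j of p_ij * log(p_ij / q_ij). A sketch of the computation, continuing the code above (with the made-up coordinates it will not reproduce the video's 0.77):

```python
def kl_cost(P, Q, eps=1e-12):
    """Kullback-Leibler divergence: sum over i != j of p_ij * log(p_ij / q_ij)."""
    mask = P > 0                        # skips the zeros on the main diagonal
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

print(kl_cost(P, Q))
```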

17:57

What is interesting about this cost function is that it generates a large penalty for a large p_ij whose corresponding q_ij is too small. This means that the cost function mainly focuses on the errors for points that are close in the high-dimensional space, which is one reason why t-SNE strongly preserves local neighborhoods.

18:23

To reduce the value of this cost function, we can use the method of gradient descent. We therefore need to compute the derivative of this function. This is the derivative of the cost function that we can use in the gradient descent method; how to derive it can be found in the original t-SNE paper. These are our P and Q matrices, and the y's are the coordinates of the given data points in the low-dimensional space.
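For reference, the gradient given in the original t-SNE paper is:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^{2}\right)^{-1}
$$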

18:55

We now plug our values into this function for the first row, where i is equal to one and j is equal to two. Then j is increased by one, which means that we take the corresponding values in the third column. Then we increase j to four and five and do the math; the values shown here come from calculations with more decimals.

19:35

Then we increase i to two and do the same calculations. If we do the calculations for all rows, we get these values, which we plug into the gradient descent formula, where alpha is the learning rate, here set to one. These are our current coordinates in the low-dimensional space. In the paper, they also added a momentum term for a smoother update, but we will not use that term here. So if we subtract these values from these values, we get these updated values, and the coordinates of the data points are updated like this. With these updated values, we have reduced the cost function from 0.77 to 0.56.
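A minimal sketch of this update loop, continuing the code above: plain gradient descent with a learning rate of one and no momentum term, as in the narration (real t-SNE implementations also use momentum and other tricks):

```python
def gradient(P, Q, Y):
    """Gradient from the paper:
    dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]                # y_i - y_j for all pairs
    inv_dist = 1.0 / (1.0 + (diff ** 2).sum(axis=-1))   # heavy-tailed weights
    pq = (P - Q) * inv_dist
    return 4.0 * (pq[:, :, None] * diff).sum(axis=1)

alpha = 1.0                          # learning rate
for step in range(100):              # iterate until convergence or a max count
    Q = q_matrix(Y)
    Y = Y - alpha * gradient(P, Q, Y)

print(round(kl_cost(P, q_matrix(Y)), 2))   # the cost should have dropped
```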

20:30

We then iterate like this until we reach convergence or until we hit the maximum number of iterations. After 100 iterations, I got the following plot, which closely resembles what we see in the high-dimensional space. These are the values in the Q matrix after 100 iterations, and they closely match those in the P matrix.

20:59

So the main idea of the t-SNE method is to minimize the difference between the probabilities computed from the distances in the high- and low-dimensional spaces, which preserves both local and global structure in the data.

21:15

Finally, we will discuss the perplexity hyperparameter. Remember that we previously used a fixed value of sigma. t-SNE does not use the same sigma for all points. Instead, for each point we find a value of sigma such that two to the power of the Shannon entropy of the conditional probabilities is approximately equal to the chosen perplexity parameter. A larger perplexity generally results in larger values of sigma. The recommended value of the perplexity parameter is in the range of 5 to 50.
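In symbols, following the paper, the perplexity of the conditional distribution P_i of point i is

$$
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j\mid i} \log_2 p_{j\mid i},
$$

and sigma_i is found, typically by a binary search, so that Perp(P_i) matches the user-chosen perplexity.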

21:57

Here are three plots of the MNIST data set that show how the perplexity parameter changes the plots. If you set the perplexity parameter to five, the clusters are much closer together than if you set a larger value of the perplexity parameter. So you need to try a few values of this parameter until you find an appropriate plot that you would like to show.

22:22

Note that there is a related method called UMAP, which produces plots similar to those of the t-SNE method. I will explain the UMAP method in a future video. This was the end of this video about the t-SNE method. Thanks for watching.
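In scikit-learn, the perplexity is exposed directly as a parameter, so trying a few values might look like this sketch (continuing the earlier MNIST snippet):

```python
from sklearn.manifold import TSNE

# Try a few perplexity values (the recommended range is roughly 5 to 50)
# and compare the resulting two-dimensional embeddings.
embeddings = {p: TSNE(n_components=2, perplexity=p).fit_transform(X)
              for p in (5, 30, 50)}
```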

Interactive Summary

This video explains the t-distributed stochastic neighbor embedding (t-SNE) method, a technique used for visualizing high-dimensional data in lower dimensions, typically two. It contrasts t-SNE with principal component analysis (PCA), highlighting t-SNE's ability to preserve local structure and reveal global patterns like clusters, which PCA often fails to do. The explanation covers applications to data sets like MNIST digits and single-cell RNA sequencing, as well as word embeddings. The video then delves into the mathematical details of t-SNE, including the calculation of conditional probabilities, the use of a cost function based on the Kullback-Leibler divergence, and the optimization process using gradient descent. Finally, it discusses the perplexity hyperparameter and its effect on the resulting visualization, mentioning UMAP as a related method.
