
kNN and SNN graphs, and Louvain clustering | How Seurat clusters single cells


Transcript


0:00

In this video, we'll see how to build kNN and SNN graphs, and how to identify clusters in such graphs using the Louvain method, similar to what is done in the R package Seurat to identify clusters of cells from single-cell RNA-seq data. We will start by creating a k-nearest neighbor graph, or kNN graph, based on some simple data. Note that we may have data in several dimensions, because the graph is built on distances between the data points in a multi-dimensional space.

0:33

I will here use the Euclidean distance as a distance measure. For example, to calculate the Euclidean distance between data points one and six, we plug in their x and y coordinates and do the math. If we had, for example, three variables with points in a three-dimensional space, we would instead use the three-dimensional version of the equation to calculate the distance. We then fill in the value in the distance matrix that shows the distance between data points one and six. Next, we compute the Euclidean distances between all data points, so that we get the following distance matrix.
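As a sketch, the distance matrix can be reproduced in R with the built-in dist() function, which applies the Euclidean formula √(Σ(aᵢ − bᵢ)²) in any number of dimensions. The coordinates below are made up (the video's exact data are not in the transcript), but they are chosen so that the neighbor relations described later hold, for example d(1, 2) ≈ 0.14.

```r
# Six made-up 2-D points; rows are data points, columns are x and y.
# Chosen to reproduce the neighbor relations narrated in the video.
pts <- matrix(c(0.0, 0.1,   # point 1
                0.1, 0.0,   # point 2
                0.5, 0.0,   # point 3
                1.5, 0.0,   # point 4
                2.5, 0.0,   # point 5
                2.6, 0.5),  # point 6
              ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances between all data points.
D <- as.matrix(dist(pts, method = "euclidean"))
round(D, 2)  # D[1, 2] is about 0.14
```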

1:17

Next, we determine the number of closest neighbors a data point should connect to. As an example, I will here set this value to two.

1:25

So for data point one, we see that its closest neighbors are data points two and three. We now create an adjacency matrix that indicates that data point one is connected to data points two and three. So this means that data point one is connected to data points two and three, but not to the other data points. We can now start to draw a directed graph, where the arrows mean that data point one is connected to data points two and three.

2:00

Next, we see that data point two has data points one and three as its closest neighbors, which means that we add an arrow that points from two to three and an arrow that points from two to one. The two closest neighbors to data point three are data points one and two, which means that we add arrows from data point three to data points one and two. We now continue to build the graph.

2:34

So this is our final kNN graph. Note that a data point in this graph is called a node, and that the line between two nodes is called an edge. Sometimes one also adds edges that point from a node to itself, to show that the point has its closest distance to itself, which means that there will be ones in the main diagonal of the adjacency matrix. Anyway, we will not use such self-connections for our graph.
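A minimal sketch of this step in R, continuing from the distance matrix D above: each row of the adjacency matrix marks the k closest neighbors of one data point, excluding the point itself since we are not using self-connections.

```r
# Directed kNN adjacency matrix: A[i, j] = 1 if j is one of the k
# nearest neighbors of i; the diagonal stays zero (no self-connections).
k <- 2
n <- nrow(D)
A <- matrix(0L, n, n)
for (i in 1:n) {
  nb <- setdiff(order(D[i, ]), i)[1:k]  # sort by distance, drop self, keep k
  A[i, nb] <- 1L
}
A
```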

3:07

So what we have here is a so-called directed graph, because it includes arrows that show which data points are the closest neighbors of a given data point. From the arrows we can, for example, see that data point four has data points three and five as its closest neighbors. One can convert this directed graph to an undirected graph by simply removing the arrows, like this. This edge tells us that there is a connection between points three and four, but not whether point three has point four among its two closest neighbors.

3:51

When we go from a directed graph to an undirected graph, we need to update the adjacency matrix. For example, we should put a one here to define that there is a link between points three and four, because we know that there is a connection between these two points. The same applies here, because there is a connection between points four and six. So this is our updated adjacency matrix based on the undirected graph. Note that this matrix is symmetric, because the elements in the upper triangle hold the same values as the ones in the lower triangle.
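In R, this symmetrization is one line: an undirected edge exists wherever either direction exists in the directed matrix.

```r
# Undirected adjacency: 1 wherever A or its transpose has a 1,
# which makes the matrix symmetric.
A_und <- 1L * ((A + t(A)) > 0)
all(A_und == t(A_und))  # TRUE: upper and lower triangles match
```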

4:32

We'll now see how to create a simple weighted kNN graph. One simple way to put weights on our graph would be to compute the inverse of the Euclidean distances. For example, the weight between data points one and two is 1 / 0.14, which is 7.14. If we were to compute the inverse of these distances directly, we would run into a problem, because we cannot divide by zero. One therefore usually replaces the zeros with a very small value before doing such a computation. Then we simply multiply the reciprocal values by the adjacency matrix element-wise, so that we get the weighted adjacency matrix. We can now fill in the weights in our graph, like this. A higher value means that we put more weight on a connection when we interpret the graph.
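A sketch of the inverse-distance weighting in R, replacing the zero self-distances with a tiny value exactly as described above:

```r
# Replace zeros (the self-distances) with a very small value so the
# reciprocal is defined, then keep only entries where an edge exists.
D_safe <- D
D_safe[D_safe == 0] <- 1e-10
W <- (1 / D_safe) * A_und  # element-wise product with the adjacency matrix
round(W, 2)                # with the toy coordinates, W[1, 2] is about 7.1
```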

5:37

We'll now see how to build a shared nearest neighbor graph, or SNN graph, which is, for example, used in Seurat's single-cell RNA-seq pipeline to identify clusters of cells. An SNN graph starts from a directed kNN graph. We basically copy the kNN graph but leave out its directions, so that we have an undirected kNN graph. Note that an SNN graph can be built in many different ways, and this is only one way to create such a graph.

6:10

To create the SNN graph, we add an edge between all pairwise points that share at least one neighbor. Since points one and four share the same neighbor, which is data point three, we add an edge between points one and four. And since data points two and four also share the same neighbor, we add an edge between points two and four. Point five, for example, does not share any neighbors with points one, two, and three, which means that no edges are added between these nodes. So this is our final SNN graph.
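A sketch of this construction in base R: start from the undirected kNN edges and add an edge for every pair of points that shares at least one neighbor in the directed kNN graph.

```r
# SNN adjacency: keep the undirected kNN edges and connect any pair of
# points whose neighbor sets (rows of A) overlap in at least one point.
S <- A_und
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    if (sum(A[i, ] & A[j, ]) > 0) S[i, j] <- S[j, i] <- 1L
  }
}
S
```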

6:52

We can now add weights to this graph. One way is to compute the inverse of the Euclidean distances, as we did before, but I will here show another method, based on the number of shared nearest neighbors. Such weights span between zero and k + 1, since the node itself is included in its own nearest-neighbor set. In our case this means that the weights span between zero and three.

7:20

So, for edges that point in both directions, the weight starts at two, and then we add one because data points one and two share one neighbor. So we fill in this weight here. We get the same weights here. Since this arrow only points in one direction, and data points three and four do not share any closest neighbors, this edge gets the weight one. Since there is no edge between points one and four in the kNN graph, but they share one neighbor, that edge also gets the weight one, which is likewise true for points two and four. Since points four and five do not share any nearest neighbors but have an arrow that points in both directions, their edge gets a weight of two. Since there is one arrow that points in one direction between points four and six, and they share one nearest neighbor, which is five, that edge also gets a weight of two. Finally, the weight between points five and six is equal to three, because the arrow points in both directions and they both have point four as their closest neighbor. So this is our final weighted SNN graph.
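The counting rule above amounts to intersecting extended neighbor sets, where each point counts itself plus its k nearest neighbors. A base-R sketch that reproduces all the edge weights just described:

```r
# Extended neighbor sets: each node counts itself among its neighbors,
# so weights range from 0 to k + 1 (here 0 to 3).
A_self <- A
diag(A_self) <- 1L

# Weight of each SNN edge = size of the intersection of the two
# extended neighbor sets.
W_num <- matrix(0L, n, n)
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    if (S[i, j] == 1) {
      W_num[i, j] <- W_num[j, i] <- sum(A_self[i, ] & A_self[j, ])
    }
  }
}
W_num  # e.g. W_num[1, 2] = 3, W_num[3, 4] = 1, W_num[4, 5] = 2
```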

8:45

So we have a strong weight here, because points one and two have each other among their closest neighbors and they share one neighbor in the directed kNN graph. There is a weak weight here, because points two and four were not among each other's closest neighbors, but they shared one neighbor in the directed kNN graph. These are the same numbers we would get if we used the function makeSNNGraph in the R package bluster and set type to "number".

9:18

to your card,

9:21

it will compute the your cardlike

9:23

weights based on this equation where n

9:25

is number we just calculated and k is

9:28

the same k we used previously which was

9:31

set to two.

9:34

So the yakard weights for these four

9:36

edges are equal to one and these three

9:40

edges will be equal to 0.2 two whereas

9:44

these two edges will get a weight of

9:47

0.5.
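Both weighting schemes are available through makeSNNGraph() in bluster (a Bioconductor package). A sketch of the calls; the exact weights depend on the input coordinates, and the Jaccard-like formula above is simply the intersection size n divided by the size of the union of the two extended neighbor sets.

```r
# install.packages("BiocManager"); BiocManager::install("bluster")
library(bluster)

# Shared-neighbor counts, as computed by hand above.
g_num <- makeSNNGraph(pts, k = 2, type = "number")

# Jaccard-like weights n / (2 * (k + 1) - n), spanning 0 to 1:
# n = 3 -> 1, n = 2 -> 0.5, n = 1 -> 0.2.
g_jac <- makeSNNGraph(pts, k = 2, type = "jaccard")
igraph::E(g_jac)$weight
```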

9:49

So this is our weighted SNN graph based on Jaccard-like indices, which span between zero and one. This graph actually has the same weights as I get if I use the function FindNeighbors in the Seurat package, version 5.3.0. However, I was not able to find out the exact details of how this function computes the graph and the weights. Anyway, I got the same graph and weights when I tested a range of different inputs and used different values of k, so I assume that they use the same method. Note that Seurat includes the node itself as its closest neighbor, as we discussed previously, which means that we need to set the k parameter to three, and set the ones in the main diagonal to zero, to generate the same SNN graph.

10:48

Seurat also includes an additional pruning parameter, which by default is set to 1/15. If you set this to 0.25, we would in our example delete these edges, because they have weights that are less than 0.25, so that we get the following pruned graph.
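A sketch of the corresponding Seurat call, assuming (as the video does) that FindNeighbors() is given the coordinate matrix directly; since the function's internals are not fully documented, this mirrors the settings described above rather than a confirmed recipe.

```r
library(Seurat)

# k.param = 3 because Seurat counts each point as its own first
# neighbor; prune.SNN (default 1/15) removes SNN edges whose Jaccard
# weight falls below the cutoff.
rownames(pts) <- paste0("cell", 1:6)  # Seurat expects named observations
nn <- FindNeighbors(pts, k.param = 3, prune.SNN = 1/15)
nn$snn  # sparse, weighted SNN adjacency matrix
```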

11:10

I will now briefly show how Louvain clustering can be used to identify clusters, or communities, in a graph. Louvain clustering is a graph-based method that tries to maximize the modularity score Q, which is high when nodes have many edges inside their cluster and fewer edges between clusters than expected by chance.
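The transcript refers to an on-screen equation; the standard modularity that Louvain maximizes, whose terms match the quantities named below, is:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

where A_ij is the edge weight between nodes i and j, k_i = Σ_j A_ij is the sum of the edge weights attached to node i, m is the total edge weight in the graph, and δ(c_i, c_j) is one when i and j are in the same cluster and zero otherwise.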

11:34

Suppose that we calculate the modularity score given that points one, two, and three are in one cluster, and points four, five, and six are in a second cluster. The delta term will then be equal to one whenever two nodes i and j are within the same cluster, and zero otherwise. The sum of the weights for all edges is, in our example, equal to five. For example, the edge weight between nodes one and two is one, and the sum of the edge weights attached to node one is two, which is true also for node two.

12:17

Based on this equation, the modularity score for a certain cluster can be calculated with a simplified equation in which Σin is the sum of the edge weights between nodes within cluster C, where each edge is counted twice, and Σtot is the sum of the edge weights of all edges attached to the nodes in cluster C, including edges that go to other clusters. Since nodes in different clusters do not contribute to the modularity score Q, we only need to sum the scores for each cluster to get the final score.

12:56

So the modularity score for cluster one is 0.24: the sum of the edge weights in cluster one is three, and since each edge is counted twice, the sum within the cluster is six. And since no node within this cluster is connected to nodes in the other cluster, the total sum is also six. The modularity score for the second cluster is also 0.24. The sum of these scores is therefore 0.48, which is our final modularity score.
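Plugging the narrated numbers into the simplified per-cluster form makes the 0.24 explicit (total edge weight m = 5, so 2m = 10):

```latex
Q_c = \frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2,
\qquad
Q_1 = \frac{6}{10} - \left(\frac{6}{10}\right)^2 = 0.6 - 0.36 = 0.24
```

The remaining edge weight of two sits inside cluster two, so Q_2 = 4/10 − (4/10)² = 0.4 − 0.16 = 0.24, and Q = 0.24 + 0.24 = 0.48.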

13:33

Suppose that there were an edge with a weight of one between two nodes in different clusters. The modularity score would then be lower, because edges between clusters reduce the modularity score. If we instead used a clustering where nodes one and two are in one cluster and nodes three to six are in the second cluster, we would get a much lower modularity score, because the edges between clusters increase the value in the numerator of this term, which reduces the score.

14:09

If you try to use three clusters, the modularity score is also relatively low, because a cluster with a single node gets a value of zero for the Σin term, while its connections to cluster one still contribute to the Σtot term, which results in a negative score for that cluster. And since the nodes in cluster one now connect to the node in cluster three, the modularity score is almost zero for that cluster as well.

14:40

Finally, if you have just one cluster, the total modularity score is zero, because the two terms are then both equal to one. By evaluating all possible clusterings, we find that two clusters with these nodes give the highest modularity score. Louvain clustering therefore not only puts the nodes in the correct clusters, it also identifies the most appropriate number of clusters.

15:08

However, one can include a so-called resolution parameter in this equation. This is a hyperparameter that is set by the user to adjust the number of generated clusters: a larger value of this parameter results in more clusters, whereas a smaller value results in fewer clusters.
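A sketch of running Louvain on the weighted SNN graph built earlier, using igraph; note that the resolution argument of cluster_louvain() requires igraph 1.3 or later.

```r
library(igraph)

# Build an undirected, weighted graph from the SNN weight matrix and
# run Louvain; a higher resolution favors more, smaller clusters.
g <- graph_from_adjacency_matrix(W_num, mode = "undirected", weighted = TRUE)
cl <- cluster_louvain(g, resolution = 1)
membership(cl)  # cluster label for each node
modularity(cl)  # the modularity score Q of the partition found
```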

15:29

This was the end of this video about kNN and SNN graphs and Louvain clustering. Thanks for watching.

Summary

This video explains how to build K-Nearest Neighbor (KNN) and Shared Nearest Neighbor (SNN) graphs, and how to identify clusters within these graphs using the Louvain clustering method. It starts by detailing the process of creating a KNN graph, including calculating Euclidean distances between data points, determining neighbors, and constructing adjacency matrices for both directed and undirected graphs. Weighted KNN graphs are then introduced, where edge weights are derived from the inverse of Euclidean distances. The video proceeds to explain the construction of SNN graphs, which are based on shared neighbors between data points, and how to assign weights to these SNN graphs using methods like the count of shared neighbors or a Jaccard-like calculation. Finally, the Louvain clustering algorithm is presented as a method to identify communities within a graph by optimizing a modularity score, with an explanation of how the resolution parameter can influence the number of clusters found.
