The diffusion model (DDPM) – the basics | Generative AI

Transcript

0:00

Today we will discuss the diffusion model that is used in generative AI to produce, for example, images and videos. I will first explain what diffusion is and then explain how this equation works to apply forward diffusion to an image. This equation is the key to understanding the diffusion model. Next, we will discuss how a simple neural network can be trained to learn the added noise from the forward diffusion, and finally see how the trained network can recreate an image from complete noise by backward diffusion. I will mainly focus on the DDPM model that was published in 2020 and show the basic equations from this paper at the end of this video.

0:50

Suppose that we add a drop of ink to water. The dye molecules in the ink then start to diffuse and will eventually be evenly spread out due to the diffusion. What you see here is a simulation of diffusion that might represent a particle diffusing in a two-dimensional space. The random movement of the particle is caused by collisions with other particles, which make the particle move along a random path.

1:26

One way we can simulate such random movements is to place a point in a two-dimensional plot. The initial position of the point is here placed at an x-coordinate of 10 and a y-coordinate of 20. Then we draw two random values from a standard normal, or Gaussian, distribution, which has a mean of zero and a standard deviation of one. This notation tells us that epsilon is distributed according to a normal distribution with a mean of zero and a standard deviation of one. Remember that the variance is the square of the standard deviation, which means that the standard normal distribution also has a variance of one. Suppose that we draw these two random values from the distribution.

2:20

To update the x and y positions of the point, we add the random values to the current x and y coordinates and do the math. Then we move the point to the updated position. Next, we draw two new random values from the distribution, which happen to be 1 and -3. Note that it is more likely to draw values between -1 and +1 from the normal distribution than values in the tails. Then we add these two random values to the current coordinates, calculate the new position of the point, and update its coordinates. Then we just continue like this.
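
The update rule just described can be sketched in a few lines of Python (the starting point of (10, 20) follows the example; the step count of 100 and the use of NumPy are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Start the point at x = 10, y = 20, as in the example.
position = np.array([10.0, 20.0])

for _ in range(100):
    # Draw one standard-normal value per coordinate: epsilon ~ N(0, 1).
    epsilon = rng.standard_normal(2)
    # Add the random values to the current coordinates.
    position += epsilon

print(position)  # final position after 100 random steps
```

Each iteration is one collision-driven step, so re-running with a different seed traces a different random path.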

3:21

Now let's say that a particle can only diffuse in one dimension, which is here the x dimension. This means that we now just draw one random value from the distribution and add this to the current x-coordinate. Then we only need to update its position in x, since the y-coordinate is fixed. Then we draw a new random value and update its position, and so forth.

3:52

To more easily track the trajectory over time in one dimension, let's put the x-coordinate on the y-axis and t, for time, on the x-axis. So at time point zero, the initial position of the point is 10. Then we draw a random value from the standard normal distribution and add that to the current position, which means that the x position for the next time point is 8. Then we draw a new random value and update the current x position, and so forth.

4:39

If we simulate for 500 time steps, the trajectory of the x position over time might look something like this. Note that every time we simulate, we will get a different trajectory due to the stochastic process. The distribution of the x values can be illustrated by the following histogram. We see that the x values span between 0 and 45, which corresponds to the range in this plot. If we instead simulate 5,000 time steps, the distribution becomes wider: the values now span between -80 and +40. So the variance increases with time for this kind of process. However, in diffusion models, we would like to have the following diffusion process instead.

5:34

I've here started at 100 so that we can clearly see how the signal goes down to zero. This initial signal may represent a pixel value in an image. So every plot like this may represent the value in a pixel over time, and each pixel will have its own random trajectory. We want the signal, or the value in the pixel, to go down to zero, while the variance of the noise should stabilize around one, because the image then contains pure noise.

6:14

So this pixel may have a value of 100, which represents white. After some time, the pixel may have a value of zero, which represents gray. This process is called forward diffusion, because we add noise to the image over time. This is an example of forward diffusion on a real image. After some time, the image will contain pure noise.

6:52

The distribution of the noise after about 2,000 time steps reflects the normal distribution with a mean of zero and a variance of one. To generate this type of noisy curve, we can use the following equation. To simplify things, I will here use a fixed value of beta, but in diffusion models beta is varied over time; in the DDPM model, beta increases linearly over time. Note that the value of beta should always be somewhere between zero and one.

7:32

If we plug in our fixed value of beta and simplify, we see that the previous x value will be multiplied by a value that is smaller than one, which means that the subsequent x values will approach zero. One can see this as removing 0.5% of the signal at each time step. If we compute the product of 1,000 of these values and multiply it with the initial signal, we see that the initial signal has been reduced close to zero after 1,000 time steps. After 2,000 time steps, the initial signal is very close to zero, which means that the signal in the image has been almost completely lost and that the image now contains pure noise. In a diffusion model, we would like to train a neural network to predict a specific noise sample, drawn from a normal distribution, that was added to the current x value.
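
The per-step update just described can be sketched in Python. Here beta = 0.01 is an assumption, chosen so that sqrt(1 - beta) ≈ 0.995 matches the 0.5% signal loss per step mentioned above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

beta = 0.01   # assumed fixed beta: sqrt(1 - 0.01) ~ 0.995, i.e. ~0.5% loss per step
x = 100.0     # initial signal, e.g. a pixel value

signal_factor = np.sqrt(1.0 - beta)  # multiplies the previous x value each step
for t in range(2000):
    # One forward-diffusion step: shrink the signal, then add fresh Gaussian noise.
    x = signal_factor * x + np.sqrt(beta) * rng.standard_normal()

# After 2,000 steps the deterministic part, 100 * 0.995**2000, is essentially
# zero, so x is (almost) pure noise with a variance close to one.
print(x)
```

Re-running the loop with different seeds gives different trajectories, but all of them settle around zero with unit variance.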

8:41

So for training we do not need the full trajectory over time. We only need the current x value and the added noise at this time point, because the network can then predict the added noise at the given time point. If we generate 1,000 of these random trajectories, we can see how they spread out. If we plot the 1,000 values at time point 5,000, we see that they are normally distributed with a mean of zero and a variance of one. If we delete this term, we can plot how the signal decays over time without the noise.

9:22

We can see that the signal has gone down to zero after enough time steps. Instead of creating many trajectories like this, we can generate similar random values around the curve with the following closed-form formula, assuming a fixed value of beta. Note that we here use only the initial value of x, instead of iterating over time.
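
A sketch of this closed-form sampler (still assuming the fixed beta = 0.01 used for illustration): for a fixed beta, x_t can be drawn directly as sqrt((1 - beta)^t) * x0 plus noise with standard deviation sqrt(1 - (1 - beta)^t):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def forward_diffusion(x0, t, beta=0.01):
    """Sample x_t directly from the initial value x0 (fixed-beta closed form)."""
    signal_scale = np.sqrt((1.0 - beta) ** t)       # multiplies the initial signal
    noise_scale = np.sqrt(1.0 - (1.0 - beta) ** t)  # std of the accumulated noise
    epsilon = rng.standard_normal()                 # epsilon ~ N(0, 1)
    return signal_scale * x0 + noise_scale * epsilon

# At t = 5,000 the signal term is essentially gone, so x_t is pure N(0, 1) noise.
x_t = forward_diffusion(x0=100.0, t=5000)
print(x_t)
```

This is why no iteration is needed: the formula depends only on the initial value and the current time point.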

9:51

If we set t to, for example, 5,000, we can compute a random value around the curve directly, without the need to iterate until we reach that time step, because this equation depends only on the initial value and the current time point. If we plug in the value of beta, which is here assumed to be fixed, and simplify, we see that if time goes to infinity, this term will go to zero, whereas this term will be equal to one. This explains why we have pure Gaussian noise with a mean of zero and a variance of one after many time steps: the forward diffusion has eliminated the information about the initial signal, the initial value of x. When t is small, the value of x for the next time step will depend a lot on the initial signal, and very little noise is added, because this term will be almost zero for small t's.

10:59

The variance of the noise added at a given time point can be defined like this, and the standard deviation of the added noise is therefore the square root of the variance. For example, the variance of the noise that has been added is about 0.18 at time step 20, about 0.95 at 300 time steps, and about one at 1,000 time steps.
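
These variance numbers can be checked directly. With a fixed beta of 0.01 (an assumption consistent with the earlier 0.5% remark), the variance of the accumulated noise at time t is 1 - (1 - beta)^t:

```python
beta = 0.01  # assumed fixed beta

def noise_variance(t, beta=beta):
    # Variance of the noise accumulated after t forward-diffusion steps.
    return 1.0 - (1.0 - beta) ** t

for t in (20, 300, 1000):
    print(t, round(noise_variance(t), 2))  # 0.18, 0.95, 1.0
```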

11:28

Note that in diffusion models, beta is allowed to change during the forward diffusion. In the DDPM model, beta starts at a small value, which is increased at each time step. This means that we need to compute the cumulative product for the current time point, instead of simply taking 1 minus beta to the power of t. If we use the cumulative product, we instead use this equation for the forward diffusion. I will come back to this equation at the end of this video, when we have a look at the DDPM paper. Anyway, since we use a fixed value of beta, we can use the simpler form of the equation. A smaller value of beta means that it will take longer to reach pure noise.
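
When beta grows over time, the per-step factors differ, so the cumulative product replaces the simple power. A sketch with a DDPM-style linear schedule (the endpoints, 1e-4 to 0.02 over 1,000 steps, follow the paper's setup discussed later):

```python
import numpy as np

# DDPM-style linear beta schedule: from 1e-4 up to 0.02 over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)

alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product up to each time point

# sqrt(alpha_bar[t]) scales the initial signal, and 1 - alpha_bar[t] is the
# noise variance at time t, exactly as in the fixed-beta case.
print(alpha_bar[0], alpha_bar[-1])
```

By the last step alpha_bar is close to zero, which is the "pure noise" regime the forward diffusion is designed to reach.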

12:22

Now, let's see how we can train a simple neural network. I will here use a super simple neural network without any hidden layers and with a linear activation function. Note that this network is way too simple to learn the forward diffusion, but it will help us understand the basics.

12:44

As input, the network takes the value of x at a random time point and the time at that time point. It then uses these inputs to predict the noise that was added at this time point. The two weights and the bias weight will now be updated by the method of gradient descent to minimize this loss function, which is the squared difference between the predicted noise and the true noise. Let's use some numbers to understand this better.

13:19

During training, a random time point is selected. To select a random time point in this example, we draw a random integer from a uniform distribution between 1 and 5,000, which means that every time point is equally likely to be selected. In this case, the time point 2,000 was randomly selected. Note that the time points are usually normalized so that the values span between zero and one, but to keep things simple, we here use the raw time points. This time point is now plugged into this equation, together with a random value drawn from a normal distribution with a mean of zero and a variance of one. Suppose that this value happens to be 0.5. The computed value of x is therefore about 0.504 at time point 2,000, because we still have some small signal left at this time point. This value can be seen as picking a random value around the curve at time point 2,000. We now plug in the value of x and the value of t as inputs to the network.

14:45

Now suppose that the current values of the weights are equal to these values. If we plug in these weights and the input values, we see that the network predicts the added noise to be about 0.8. We can now calculate the loss, where we plug in the true noise that was added and the noise predicted by the network. We see that the loss is 0.09. The network will now try to reduce this loss by updating the values of these weights with the method of gradient descent. A new random time point is then selected, and the network again predicts the added noise at this time point and updates its weights to reduce the loss function. This process is repeated thousands of times.
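
The steps above can be sketched as a training loop. This is a minimal sketch, not the video's exact setup: a linear model with two weights and a bias, hand-rolled gradient descent, the fixed beta = 0.01 from before, and raw (unnormalized) time points, which force a tiny learning rate:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

beta, x0, T = 0.01, 100.0, 5000
w = np.zeros(3)   # [w_x, w_t, bias] of the no-hidden-layer network
lr = 1e-9         # tiny learning rate, since raw t values reach 5,000

for step in range(10_000):
    t = rng.integers(1, T + 1)              # random time point, uniform in [1, T]
    eps = rng.standard_normal()             # true noise added at this time point
    # Closed-form forward diffusion gives the noisy input x_t.
    x_t = np.sqrt((1 - beta) ** t) * x0 + np.sqrt(1 - (1 - beta) ** t) * eps

    pred = w[0] * x_t + w[1] * t + w[2]     # linear prediction of the noise
    loss = (pred - eps) ** 2                # squared-error loss

    # Gradient of the loss w.r.t. each weight, then a gradient-descent update.
    grad = 2 * (pred - eps) * np.array([x_t, t, 1.0])
    w -= lr * grad
```

As the transcript notes, such a linear model cannot really learn the noise; in practice the predictor is a U-Net and the time points are normalized or embedded.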

15:44

Once the network has been trained, we will use the trained network to perform backward diffusion. To perform backward diffusion, we start at the final time step, where we have pure noise. Since the trained network has now optimized these weights, the values of these weights will be fixed during the backward diffusion.

16:10

We know that the signal is zero at this time step and that the noise has a variance of one. The initial value for the backward diffusion process can therefore be drawn from a normal distribution with a mean of zero and a variance of one. It is therefore important that we run the forward diffusion long enough to make sure that the signal is close to zero, so that the backward diffusion can be initialized by drawing a value from a normal distribution with a mean of zero. For example, suppose that this value happens to be 0.3.

16:48

We now plug in the value of x_t and the current time point in the input nodes. Let's assume that the network predicts that the added noise at this time point was 0.4. We can now use the following equation to compute the backward diffusion. If we use a fixed value of beta, one can interpret this equation as removing the predicted noise from the signal at time t, so that we predict the signal, without the noise, one time step backward in time. These terms can be seen as scaling factors that are used so that the backward diffusion reflects the forward diffusion.

17:42

But why do we use this term? z is actually a random value drawn from a normal distribution. The reason why we add new noise in the backward diffusion process is that the network was trained to predict the noise during the forward diffusion, and it must therefore receive a similar type of data as it was trained on. Also, by adding new noise, we will end up with different values of x0 every time we use backward diffusion, which produces slightly different images and therefore results in different outputs from the same trained model.

18:19

Let's plug in our numbers and do the math. We now plug in this as input to the network, which predicts the noise that we plug in here. We update the current time point and the current value of x, draw a new random value, and calculate the predicted value of x at time point 4,998, which we plug in here. We then continue to iterate like this until we end up with a predicted value of x at time point zero, which should be close to the initial signal. Note that in the last time step of the iteration, when we compute x0, we do not add any noise. So when we apply backward diffusion to a real image, it may look like this.
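
The sampling loop can be sketched as follows, using the DDPM update with the fixed beta assumed throughout. To keep it self-contained, predict_noise is a stand-in for the trained network; the toy predictor passed in below is the ideal one for the degenerate case where the clean signal is zero, so the loop should return a value very close to zero:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

beta = 0.01  # assumed fixed beta

def backward_diffusion(predict_noise, T=5000, beta=beta):
    """DDPM-style backward diffusion with a fixed beta.

    predict_noise(x_t, t) stands in for the trained network.
    """
    x = rng.standard_normal()              # start from pure N(0, 1) noise at t = T
    for t in range(T, 0, -1):
        alpha_bar_t = (1.0 - beta) ** t    # cumulative factor for fixed beta
        eps = predict_noise(x, t)
        # Remove the scaled predicted noise and rescale one step backward...
        x = (x - beta / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(1.0 - beta)
        # ...then add fresh noise z, except in the last step that produces x0.
        if t > 1:
            x += np.sqrt(beta) * rng.standard_normal()
    return x

# Toy predictor: if the clean signal is zero, x_t is all noise with standard
# deviation sqrt(1 - alpha_bar_t), so the noise estimate is x_t rescaled.
x_zero = backward_diffusion(lambda x_t, t: x_t / np.sqrt(1.0 - (1.0 - beta) ** t))
print(x_zero)  # close to zero, the clean signal this toy predictor corresponds to
```

In a real model, predict_noise would be the trained U-Net, and x would be a whole image rather than a single value.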

19:27

To explain how all this works for image generation, let's assume that we have the following image with 20 pixels that is supposed to represent the digit four. Each pixel in this image holds a certain normalized value. In this case, a white color is displayed in the image if the pixel value is equal to 1, whereas a black color is shown if the pixel value is equal to -1.

19:59

We now apply forward diffusion to this image. Each pixel with white color will start at the value 1 and decay down to zero, whereas the pixels with black color start at -1 and increase up to zero. After about 200 time steps, the pixels with white color have changed value from 1 to about 0.8, which means that the color has changed towards a light gray. A pixel with black color has changed value from -1 to about -0.8, which means that it now has a dark gray color. After about 500 time steps, the image might look something like this. After about 800 time steps, we can barely see the original signal. And after about 1,500 time steps, the image contains just noise, which is expected because the forward diffusion is now in the phase with pure noise.

21:10

Once we have trained the network, we can use it to perform backward diffusion, where we start with an image of pure noise and move backward in time until we reach time zero. Note that this image will probably not look like the original image, because noise is added in the backward process.

21:39

The type of network that is used in diffusion models is usually a so-called U-Net, which I have explained in a previous video. A U-Net is commonly used for image segmentation, where it can take, for example, a satellite image and generate a segmented image that identifies roads, lakes, and houses. In diffusion models, the input to the U-Net during training is instead the image at a certain time point; the U-Net also integrates the current time point in its training. Some noise has been added to the pixels in the image at this time point, and the initial values in the pixels have been reduced toward zero due to the forward diffusion. The U-Net tries to predict the added noise at this time point, which is compared to the true noise that was added in order to compute the loss. The network then updates the weights that are used during the down- and upsampling to minimize the loss.

22:51

Once the U-Net has been trained, it can take an image of pure noise and predict the added noise, which is removed from the current image, where also some new noise is added. This new image is then used as input to move backward in time, until it reaches time zero.

23:21

Okay, let's see if you can now understand the basic things in the DDPM paper. On page five, we can see that they used 1,000 time steps for the forward diffusion, and that beta starts at 0.0001 and ends at 0.02 at the last time step. We can also see that they used a U-Net. On page two, they show that alpha bar t is the cumulative product of 1 minus beta t up to the given time point, which is similar to the equation that I showed previously. Since the value of beta is not fixed, they use this equation for the forward diffusion.

24:11

But what does this equation mean? Well, it denotes the probability distribution of x at time t given the value of x0, which is equal to a normal distribution of x for a certain time point, with a mean equal to the original signal multiplied by the square root of alpha bar t and a variance equal to 1 minus alpha bar t. This is similar to the equation that I showed previously, but where we used a fixed value of beta for simplicity. I is just an identity matrix, which indicates that the noise is independent in each pixel.
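
Written out in the paper's notation, with alpha bar t as the cumulative product described above, the forward-diffusion distribution is:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, \mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)
```

With a fixed beta this reduces to the earlier closed form, since the cumulative product becomes simply (1 - beta)^t.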

24:56

On page four, we see that the pixel values in the image, which usually span between 0 and 255, were scaled so that the values span between -1 and 1, similar to our previous example.

25:13

If we move to the table on page four, we can find the training procedure. We start by randomly selecting an image from our data set, where x0 is the normalized pixel values in that image. Then we draw a random time point from a uniform distribution, which means that all time points are equally likely to be selected. Then we draw random values, one for each pixel in the image, from a normal distribution with a mean of zero and a variance of one; this is the added noise in the forward diffusion process. Next, we compute the squared difference between the added noise and the noise predicted by the network, which is computed based on its current weights and the inputs x_t and the time point t, to obtain the loss. These steps are repeated over all images in the data set for several epochs, until the network converges and the loss has stopped decreasing.

26:25

Once we have trained the network, we can use backward diffusion to generate an image. The initial values in the pixels at the last time point are drawn from a normal distribution with a mean of zero and a variance of one. Then we go backward from the last time point, stepwise, to time point one; when t is equal to one, we compute x0. In each step, we draw random values that will be added to the pixels as long as t is greater than one; no noise is added in the final step, when we compute x0. In each iteration, we compute x, the pixel values, one step backward in time. Sigma t is computed like this. Once we have ended the loop, we will have our final AI-generated image.

27:26

This was the end of this video about the basics of diffusion models. Thanks for watching.

Summary

This video explains the fundamentals of diffusion models used in generative AI for producing images and videos. It starts by defining diffusion as a process where particles spread out over time, illustrating it with a simulation of random particle movement. The explanation then delves into how this concept is applied to images, introducing the forward diffusion process where noise is gradually added to an image until it becomes pure noise. The video details the mathematical equations governing this process, emphasizing how the signal decays to zero and the noise variance stabilizes around one. It explains the role of a neural network trained to predict this added noise. Subsequently, the video covers the backward diffusion process, where a trained network reconstructs an image from pure noise by iteratively removing predicted noise and adding a small amount of new noise. The explanation uses a simple neural network and a U-Net architecture, detailing the training and generation phases. Finally, it references the DDPM paper, highlighting key aspects like the number of time steps, the varying beta values, the specific equations used for forward diffusion, pixel value scaling, and the training and generation procedures described in the paper.
