Monte Carlo simulations in R

Watch on YouTube

Now Playing

Transcript

527 segments

0:00

Okay, so let's start to generate some

0:02

random numbers from a uniform

0:04

distribution in the range 0 to one. This

0:08

means that the numbers between zero and

0:10

one are equally likely to be drawn. To

0:14

generate such numbers, we use the

0:16

function run if

0:19

r stands for random and unif for

0:22

uniform.

0:24

If you run this line, we happen to

0:27

generate the following random number. If

0:30

you run this line several times,

0:34

we get new random numbers between zero

0:37

and one. If you change this one here to

0:41

five and run the code again,

0:45

we will generate five random numbers at

0:48

once.

0:49

And if you run this line again, we get

0:52

five new random numbers. Sometimes we

0:56

like to reproduce our Monte Carlo

0:58

simulations. We then need to generate

1:01

the same random numbers over and over.

1:04

To do this, we can use the function set

1:09

before we generate the random numbers.

1:11

If you now run these two lines, we'll

1:14

generate the following five random

1:16

numbers.

1:18

Note that if you run these lines again,

1:22

we will generate the exact same random

1:25

numbers. If you change the value of the

1:27

seed and run the lines again, we will

1:31

generate five new random numbers. If you

1:34

run these two lines again, we will

1:37

generate the same five random numbers.

1:41

As long as you use the same seed, you

1:43

will generate the same random numbers

1:46

over and over.

1:48

Let's save our five random numbers in

1:51

the variable R

1:54

and create the histogram of the random

1:56

numbers by using a green color for the

1:59

bars.

2:07

We see that we have one value in the

2:09

range between 0 and 0.2 which means that

2:13

the height of this bar will be equal to

2:15

one. We have two random values in the

2:18

range between 0.4 and 0.6

2:22

which explains why the height of this

2:24

bar is equal to two. We have for example

2:28

no random values between 0.2 and 0.4

2:32

which explains why the height of this

2:34

bar is equal to zero. Let's try to run

2:38

these two lines without including the

2:41

seed.

2:45

Note that this histogram changes every

2:48

time we run these two lines because we

2:51

generate five new random values every

2:54

time.

2:57

If you now run all three lines including

3:01

the function set, we will generate the

3:04

same histogram every time because the

3:07

same five random numbers will now be

3:09

generated over and over. Let's now try

3:13

to generate 50,000 random numbers

3:16

instead of just five.

3:22

Note that the width of each bar is now

3:24

0.05.

3:26

We now therefore have 20 bars and the

3:29

height of each bar is about 2500.

3:33

2500 * 20 is 50,000 which is equal to

3:39

the number of random values we generated

3:42

in comparison to the previous histogram

3:44

with only five numbers using 50,000

3:48

random numbers now clearly shows that

3:51

this is a uniform distribution where all

3:54

values between zero and one are equally

3:56

likely to be drawn.

3:59

Note that the distribution only slightly

4:01

changes when we run these two lines many

4:05

times.

4:07

Again, if you also include the seed when

4:10

we run the lines, we will generate the

4:13

same histogram every time we run the

4:15

code. If you like, we can set the range

4:19

from which the numbers should be drawn.

4:21

For example, we will here draw 50,000

4:24

random numbers between zero and 10.

4:28

Note that the range of the histogram now

4:31

spans between zero and 10 and that each

4:34

bar has a width of 0.5.

4:39

Okay, we will now use the function

4:42

sample from which we draw five numbers

4:46

without replacement from the numbers 1 2

4:49

3 4 5 and six.

4:53

Note that we here happened to get the

4:55

numbers 4 5 3 2 and 1. The value six was

5:00

not selected. Since we draw numbers

5:03

without replacement, we cannot draw the

5:06

same number several times because once a

5:10

value has been drawn, it cannot be

5:12

selected again. If we run again, we see

5:16

that we get the following five numbers

5:19

and that the value one was not selected

5:21

among the five numbers. If you run

5:24

again, the value five was not selected.

5:29

Now let's set replace to true, which

5:32

means that we can now select the same

5:35

number many times.

5:38

If we run this code,

5:41

we now see that the value two happened

5:44

to be selected three times and that the

5:47

value four happened to be selected two

5:50

times. These numbers might represent the

5:53

values we may get if we roll a six-sided

5:56

die five times.

6:00

Let's save the random numbers into the

6:02

variable R and create a table of these

6:05

random numbers.

6:08

If you run these two lines,

6:11

we can see that we happen to get the

6:13

value one one time and the value four

6:16

one time and the value five one time and

6:19

the value six two times out of the five

6:23

rows. Let's make a bar plot of this

6:26

table.

6:33

If we run these lines, we will see the

6:36

simulated outcome of five rows.

6:44

If you include the function set when we

6:47

run the lines,

6:49

we will get the same outcome every time.

6:52

By using this seed, we get two ones, one

6:56

two, one, four, and one five. From

7:00

simulating five rolls of a six-sided

7:03

die,

7:04

we will now compute the probability of

7:07

getting three sixes in three rolls.

7:12

To do this, we will simulate 10,000

7:15

experiments where we roll the dice three

7:18

times in each experiment. In each of

7:22

these 10,000 simulations, we will count

7:25

how many times we get three sixes.

7:29

To run 10,000 simulations, we will here

7:32

create a for loop

7:35

that will be iterated 10,000 times.

7:40

In each loop, we will simulate three

7:42

rows of a six-sided die.

7:52

If we happen to get three sixes from one

7:55

experiment,

7:59

we increase our counter by one.

8:03

To compute the probability of getting

8:06

three sixes in three rows, we divide the

8:09

number of times we got three sixes by

8:12

10,000, which is the number of

8:14

simulations we do.

8:18

Finally, we print the probability

8:21

and run the code.

8:25

So, in 4.6% of the cases, we happen to

8:30

get three sixes.

8:32

This should be close to the theoretical

8:34

probability of getting three sixes in

8:37

three rows.

8:40

Let's try to run the simulations many

8:42

times. The computer probabilities should

8:46

vary around the theoretical probability.

8:53

This is the theoretical probability.

8:58

Note that if you increase the number of

8:59

simulations from 10,000 to 100,000,

9:04

we will get closer to the theoretical

9:06

probability.

9:20

Okay. Now let's compute the probability

9:23

of winning a lottery.

9:26

As an example, we'll compute the chance

9:28

of getting these five numbers.

9:31

The order of the numbers does not

9:33

matter.

9:35

We will perform 100,000 simulations.

9:47

In each simulation, we select five

9:50

random numbers between 1 and 35.

9:54

We here draw the five numbers without

9:56

replacement, which means that the

9:59

numbers can only be drawn once.

10:02

If our five random numbers are equal to

10:05

the correct five numbers,

10:12

we will increase the counter by one.

10:22

In the first simulation, none of the

10:26

100,000 iterations happened to generate

10:28

the correct five numbers.

10:31

If you try again, the counter is equal

10:34

to zero, which means that we did not win

10:37

the lottery, although we tried 100,000

10:40

times.

10:42

But in this simulation, we happened to

10:44

win the lottery two times. Which means

10:47

that we happened to get the numbers 1,

10:49

12, 20, 25, and 31 times out of the

10:54

100,000 iterations.

10:58

Let's simulate 1 million experiments.

11:07

We now won the lottery three times out

11:10

of the 1 million tries. If you try to

11:13

run again, we now won eight times.

11:18

We can calculate the theoretical

11:20

probability of winning this lottery by

11:22

using the following code.

11:27

So we expect to win this lottery about

11:29

three times if you try 1 million times.

11:36

Okay, let's now generate some random

11:38

numbers from a normal distribution with

11:41

the function r norm.

11:44

We will here generate 10 such random

11:47

numbers. Note that by default the mean

11:50

of the distribution that we sample from

11:53

is zero and has a standard deviation of

11:56

one. We will therefore draw numbers from

11:58

a standard normal distribution.

12:02

If you run this slide, we happen to get

12:05

these 10 random numbers. We expect that

12:08

half of these numbers will be negative

12:10

and half will be positive. Let's save

12:13

the generated 10 random numbers

12:17

and make a histogram of these.

12:27

Note that when we have only 10 values,

12:30

it is quite hard to see that this data

12:33

comes from a normal distribution.

12:35

This sample looks a bit better, but

12:38

let's increase the sample size to

12:40

10,000.

12:44

When we use a bigger sample size, it is

12:47

now easier to see that this sample comes

12:50

from a normal distribution.

12:54

Let's take a sample of 1 million.

12:59

Let's also use 100 bins in the histogram

13:03

to better see the distribution.

13:07

This histogram perfectly resembles the

13:10

shape of a normal distribution.

13:14

If you calculate the mean of these 1

13:16

million random values, we should expect

13:19

a value close to zero

13:23

and a standard deviation close to one

13:26

because we draw values from a standard

13:28

normal distribution.

13:30

If you like, we can change the mean to,

13:33

for example, 10.

13:36

and the standard deviation to three of

13:39

the distribution that we draw the random

13:41

values from. If you now run these lines,

13:45

the distribution should center around 10

13:49

and the mean and the standard deviation

13:51

of the random values should be close to

13:53

10 and three.

13:56

Now let's generate 1 million values

13:59

again from a standard normal

14:01

distribution which has a mean of zero

14:04

and a standard deviation of one.

14:09

Let's sum how many of these 1 million

14:11

values that are greater than 1.96.

14:15

We see that we have about 25,000 values

14:18

that are greater than 1.96.

14:22

This is what we expect because the upper

14:25

tail on the right hand side of 1.96 in

14:28

the standard normal distribution should

14:31

theoretically cover 2.5% of the

14:34

distribution. We can compute the

14:36

proportional values that are greater

14:38

than 1.96. like this.

14:53

We see that this value is close to the

14:55

expected 2.5%

14:57

of this upper tail.

15:00

Similarly, the proportion of values that

15:02

are less than -1.96

15:05

should be close to 2.5%.

15:13

Okay, let's do a simulation that

15:15

explains the central limit theorem which

15:18

states that if the sample size becomes

15:21

large, the distribution of the sample

15:24

means approaches a normal distribution

15:27

regardless of the distribution we sample

15:29

from. We will here do 100,000

15:32

simulations.

15:34

In each simulation, we will draw five

15:37

random values from a distribution.

15:40

We will save the mean values from the

15:43

100,000 samples in the following vector

15:46

that we initialize with zeros.

15:50

Then we create a for loop

15:56

and in each loop we'll simulate five

15:59

rolls of a six-sided die.

16:02

This means that we sample from a uniform

16:05

distribution.

16:07

We save the mean value from the five

16:10

rows of the die in each iteration.

16:16

Finally, we generate a histogram of the

16:19

100,000 sample means.

16:24

If you run this code, we see that the

16:27

sample means are normally distributed,

16:30

which is exactly what the central limit

16:32

theorem tells us.

16:37

Let's change the sample size to 15 and

16:40

run the code.

16:42

We will then also see that the means

16:44

from such samples are normally

16:46

distributed.

16:48

Let's generate 100,000 random values

16:51

from an exponential distribution

16:57

with a mean equal to 1 over 0.3.

17:02

If you generate a histogram of these

17:04

100,000 random values, we can see the

17:08

exponential distribution.

17:11

By using 100 bins in the histogram, we

17:14

can see the distribution even more

17:16

clearly.

17:18

The mean of this distribution is 1 /

17:21

0.3.

17:24

Let's now simulate how the distribution

17:27

of 100,000 sample means from the

17:29

exponential distribution would look

17:31

like.

17:44

We change the variable name from roles

17:46

to R and simulate.

17:52

Note that the distribution is a bit

17:54

skewed and that it deviates from the

17:56

bellshaped normal distribution.

18:00

If we reduce the sample size to five,

18:03

the distribution of the sample means

18:06

will be even more skewed.

18:09

And if the sample size is just one,

18:13

we get the exponential distribution.

18:17

Now let's try a sample size of 50.

18:29

We see that the distribution of the

18:31

sample means now approaches the normal

18:34

distribution.

18:37

If you use a sample size of 500, the

18:41

distribution of the 100,000 sample means

18:44

resembles a normal distribution.

18:47

This explains the central limit theorem.

18:51

Even though we sample from any

18:53

distribution like the exponential

18:55

distribution, the sample means will be

18:57

normally distributed if the sample size

19:00

is large enough.

19:02

Okay, we will now simulate the risk to

19:05

commit a type one error.

19:12

We will draw four random values from a

19:15

normal distribution with a standard

19:17

deviation of two and a mean of 10. We

19:20

will do this for both groups.

19:25

We will compute 100,000 t tests. We save

19:29

the p values from these 100,000 tests in

19:33

the following vector.

19:41

In each loop, we draw two samples X and

19:45

Y from a normal distribution with a mean

19:48

of 10 and a standard deviation of two.

19:56

Then we compute a t test based on the

19:58

two samples

20:01

and use a t test that assumes an equal

20:03

variance of the two groups. From the t

20:07

test function, we extract the p value.

20:10

Finally, we create the histogram of the

20:13

100,000 p values

20:21

and run the code.

20:24

In this case, we know that the null

20:26

hypothesis is true because both samples

20:30

are drawn from the same distribution.

20:33

If the null hypothesis is true, the p

20:37

values follow a uniform distribution in

20:40

the interval 0 to 1. Let's calculate the

20:44

number of p values that are less than

20:46

0.05.

20:53

We see that about 5,000 p values are

20:56

less than 0.05.

20:59

If we divide by the number of tests that

21:01

we have done, we see that 5% of the p

21:04

values are less than 0.05.

21:08

If you run this code several times, we

21:11

see that the proportion varies around

21:13

the expected proportion of 5%.

21:18

So this means that there is a 5% risk

21:21

that we incorrectly reject a true null

21:24

hypothesis when we set the significance

21:27

level to 0.05.

21:31

Let's now try to compute the risk for a

21:34

type two error which means that we will

21:36

compute the proportion of times when we

21:39

do not reject a false null hypothesis.

21:42

To simulate a false null hypothesis, the

21:45

two samples will be drawn from two

21:48

distributions with different means.

21:52

We therefore know that the null

21:53

hypothesis of equal means is false.

21:58

The two distributions will now therefore

22:00

have different mean values.

22:03

Let's run the code and see how the p

22:06

values are now distributed.

22:10

So when the null hypothesis is false,

22:13

the distribution is now skewed because

22:16

most p values will be close to zero.

22:19

Although the null hypothesis is false,

22:22

due to chance, we may draw two similar

22:25

samples that generate a p value that is

22:29

greater than 0.05, which means that we

22:32

have committed a type two error since we

22:35

did not reject the false null

22:37

hypothesis.

22:39

If you compute the proportion of P

22:41

values that are less than 0.05,

22:45

we see that about 43% are less than

22:48

0.05.

22:50

This actually corresponds to the

22:52

statistical power because there is a 43%

22:55

chance that we do not commit a type two

22:58

error.

23:00

To compute the theoretical statistical

23:02

power by using a formula, we can load

23:05

the PVR library

23:08

and use the PVR.T.est

23:11

function

23:15

where we set N to our sample size, D to

23:18

the absolute difference between the

23:20

means

23:22

divided by the standard deviation.

23:26

We set the significance level to 0.05.

23:30

and leave the parameter power blank

23:33

which will be computed.

23:36

Then we say that we like to use a

23:38

two-sided two sample t test.

23:42

Given our parameters,

23:45

the theoretical statistical power is

23:47

about 43%.

23:51

Which is what we computed with our Monte

23:53

Carlo simulations.

23:56

If we run the code again, we see that

23:58

the statistical power is close to 43%.

24:03

Which again is close to the theoretical

24:05

value.

24:15

We will now set up a code to compute the

24:18

false discovery rate. We'll here

24:20

simulate 5,000 tests where the null

24:24

hypothesis is true and 5,000 tests where

24:29

the null hypothesis is false.

24:33

In total, we will therefore compute

24:35

10,000 t tests.

24:38

For each test, we use a sample size of

24:41

four. When the null hypothesis is false,

24:45

we will draw samples from two

24:47

distributions with different means. We

24:50

here set the standard deviation equal to

24:53

one.

24:55

The significance level is here set to

24:57

5%.

25:00

We will save the p values from the 5,000

25:03

tests where the null hypothesis is true

25:06

in the following vector.

25:12

and the p values from the 5,000 test

25:15

where the null hypothesis is false in

25:17

the following vector.

25:20

Next, we create a for loop that iterates

25:23

5,000 times.

25:30

In each loop, we draw two samples from a

25:34

normal distribution with the same mean

25:36

where the null hypothesis is therefore

25:39

true

25:47

and save the p values from such a t

25:50

list.

25:57

Then we draw random values from a normal

26:00

distribution where the mean is equal to

26:03

26:11

and save the p values

26:15

that are applied on two samples drawn

26:17

from two distributions with different

26:20

means where the null hypothesis is

26:23

therefore false.

26:25

After the loop, we combine all the

26:28

10,000p values in a vector.

26:43

Let's run the code and generate the

26:46

histogram of the 10,000p values.

26:57

Note that the flat part of this

26:59

histogram is expected to include the p

27:02

values simulated from where the null

27:05

hypothesis is true. Whereas the other p

27:08

values are expected to include the p

27:10

values simulated from where the null

27:13

hypothesis is false.

27:16

Let's calculate how many false positives

27:19

we have, which is the number of t tests

27:22

from which the p values are less than

27:25

0.05

27:26

when the null hypothesis is true.

27:30

Then we compute the number of true

27:31

positives we have, which is the number

27:34

of tests from which the p values are

27:37

less than 0.05

27:39

when the null hypothesis is false.

27:43

We see that the number of false

27:45

positives is 248

27:48

and that the number of true positives is

27:51

3,244.

27:55

We can now compute the false discovery

27:57

rate which is the number of false

28:00

positives divided by the total number of

28:03

positive results.

28:07

We see that the false discovery rate is

28:09

about 7%.

28:12

One method to control the false

28:14

discovery rate is to adjust the P values

28:17

by using the Benyamin hawkbar method.

28:23

After we have adjusted the p values,

28:26

we extract the adjusted p values based

28:29

on the true null hypothesis

28:39

and calculate how many false positives

28:42

we now have.

28:47

We do the same for the adjusted p values

28:50

from the tests where we know that the

28:52

null hypothesis is false

28:57

which are the last 5,000 p values in

29:00

this vector.

29:15

So when we did not adjust the p values,

29:18

we got 248 false positives and 3,244

29:23

true positives.

29:26

And when we adjust the P values,

29:37

we now get 34 false positives

29:43

and 1,081 true positives.

29:50

The false discovery rate is now 3%.

29:55

Let's try to run the code a few times.

29:59

So by adjusting the P values using the

30:02

Benjamin Hawkbar method, we control the

30:05

expected force discovery rate at 5%.

30:09

This was the end of this video. Thanks

30:12

for watching.

Interactive Summary

Ask follow-up questions or revisit key timestamps.

This video explains how to generate random numbers from uniform and normal distributions, and how to use these to simulate various statistical scenarios. It covers setting seeds for reproducibility, creating histograms to visualize distributions, and understanding sampling without replacement. The video also demonstrates how to simulate experiments like rolling dice and winning lotteries, and touches upon the Central Limit Theorem, explaining how sample means tend to follow a normal distribution. Finally, it delves into concepts of hypothesis testing, specifically type I and type II errors, and introduces the False Discovery Rate and methods to control it using p-value adjustments like the Benjamini-Hochberg method.