Monte Carlo simulations in R
527 segments
Okay, so let's start to generate some
random numbers from a uniform
distribution in the range 0 to one. This
means that the numbers between zero and
one are equally likely to be drawn. To
generate such numbers, we use the
function run if
r stands for random and unif for
uniform.
If you run this line, we happen to
generate the following random number. If
you run this line several times,
we get new random numbers between zero
and one. If you change this one here to
five and run the code again,
we will generate five random numbers at
once.
And if you run this line again, we get
five new random numbers. Sometimes we
like to reproduce our Monte Carlo
simulations. We then need to generate
the same random numbers over and over.
To do this, we can use the function set
before we generate the random numbers.
If you now run these two lines, we'll
generate the following five random
numbers.
Note that if you run these lines again,
we will generate the exact same random
numbers. If you change the value of the
seed and run the lines again, we will
generate five new random numbers. If you
run these two lines again, we will
generate the same five random numbers.
As long as you use the same seed, you
will generate the same random numbers
over and over.
Let's save our five random numbers in
the variable R
and create the histogram of the random
numbers by using a green color for the
bars.
We see that we have one value in the
range between 0 and 0.2 which means that
the height of this bar will be equal to
one. We have two random values in the
range between 0.4 and 0.6
which explains why the height of this
bar is equal to two. We have for example
no random values between 0.2 and 0.4
which explains why the height of this
bar is equal to zero. Let's try to run
these two lines without including the
seed.
Note that this histogram changes every
time we run these two lines because we
generate five new random values every
time.
If you now run all three lines including
the function set, we will generate the
same histogram every time because the
same five random numbers will now be
generated over and over. Let's now try
to generate 50,000 random numbers
instead of just five.
Note that the width of each bar is now
0.05.
We now therefore have 20 bars and the
height of each bar is about 2500.
2500 * 20 is 50,000 which is equal to
the number of random values we generated
in comparison to the previous histogram
with only five numbers using 50,000
random numbers now clearly shows that
this is a uniform distribution where all
values between zero and one are equally
likely to be drawn.
Note that the distribution only slightly
changes when we run these two lines many
times.
Again, if you also include the seed when
we run the lines, we will generate the
same histogram every time we run the
code. If you like, we can set the range
from which the numbers should be drawn.
For example, we will here draw 50,000
random numbers between zero and 10.
Note that the range of the histogram now
spans between zero and 10 and that each
bar has a width of 0.5.
Okay, we will now use the function
sample from which we draw five numbers
without replacement from the numbers 1 2
3 4 5 and six.
Note that we here happened to get the
numbers 4 5 3 2 and 1. The value six was
not selected. Since we draw numbers
without replacement, we cannot draw the
same number several times because once a
value has been drawn, it cannot be
selected again. If we run again, we see
that we get the following five numbers
and that the value one was not selected
among the five numbers. If you run
again, the value five was not selected.
Now let's set replace to true, which
means that we can now select the same
number many times.
If we run this code,
we now see that the value two happened
to be selected three times and that the
value four happened to be selected two
times. These numbers might represent the
values we may get if we roll a six-sided
die five times.
Let's save the random numbers into the
variable R and create a table of these
random numbers.
If you run these two lines,
we can see that we happen to get the
value one one time and the value four
one time and the value five one time and
the value six two times out of the five
rows. Let's make a bar plot of this
table.
If we run these lines, we will see the
simulated outcome of five rows.
If you include the function set when we
run the lines,
we will get the same outcome every time.
By using this seed, we get two ones, one
two, one, four, and one five. From
simulating five rolls of a six-sided
die,
we will now compute the probability of
getting three sixes in three rolls.
To do this, we will simulate 10,000
experiments where we roll the dice three
times in each experiment. In each of
these 10,000 simulations, we will count
how many times we get three sixes.
To run 10,000 simulations, we will here
create a for loop
that will be iterated 10,000 times.
In each loop, we will simulate three
rows of a six-sided die.
If we happen to get three sixes from one
experiment,
we increase our counter by one.
To compute the probability of getting
three sixes in three rows, we divide the
number of times we got three sixes by
10,000, which is the number of
simulations we do.
Finally, we print the probability
and run the code.
So, in 4.6% of the cases, we happen to
get three sixes.
This should be close to the theoretical
probability of getting three sixes in
three rows.
Let's try to run the simulations many
times. The computer probabilities should
vary around the theoretical probability.
This is the theoretical probability.
Note that if you increase the number of
simulations from 10,000 to 100,000,
we will get closer to the theoretical
probability.
Okay. Now let's compute the probability
of winning a lottery.
As an example, we'll compute the chance
of getting these five numbers.
The order of the numbers does not
matter.
We will perform 100,000 simulations.
In each simulation, we select five
random numbers between 1 and 35.
We here draw the five numbers without
replacement, which means that the
numbers can only be drawn once.
If our five random numbers are equal to
the correct five numbers,
we will increase the counter by one.
In the first simulation, none of the
100,000 iterations happened to generate
the correct five numbers.
If you try again, the counter is equal
to zero, which means that we did not win
the lottery, although we tried 100,000
times.
But in this simulation, we happened to
win the lottery two times. Which means
that we happened to get the numbers 1,
12, 20, 25, and 31 times out of the
100,000 iterations.
Let's simulate 1 million experiments.
We now won the lottery three times out
of the 1 million tries. If you try to
run again, we now won eight times.
We can calculate the theoretical
probability of winning this lottery by
using the following code.
So we expect to win this lottery about
three times if you try 1 million times.
Okay, let's now generate some random
numbers from a normal distribution with
the function r norm.
We will here generate 10 such random
numbers. Note that by default the mean
of the distribution that we sample from
is zero and has a standard deviation of
one. We will therefore draw numbers from
a standard normal distribution.
If you run this slide, we happen to get
these 10 random numbers. We expect that
half of these numbers will be negative
and half will be positive. Let's save
the generated 10 random numbers
and make a histogram of these.
Note that when we have only 10 values,
it is quite hard to see that this data
comes from a normal distribution.
This sample looks a bit better, but
let's increase the sample size to
10,000.
When we use a bigger sample size, it is
now easier to see that this sample comes
from a normal distribution.
Let's take a sample of 1 million.
Let's also use 100 bins in the histogram
to better see the distribution.
This histogram perfectly resembles the
shape of a normal distribution.
If you calculate the mean of these 1
million random values, we should expect
a value close to zero
and a standard deviation close to one
because we draw values from a standard
normal distribution.
If you like, we can change the mean to,
for example, 10.
and the standard deviation to three of
the distribution that we draw the random
values from. If you now run these lines,
the distribution should center around 10
and the mean and the standard deviation
of the random values should be close to
10 and three.
Now let's generate 1 million values
again from a standard normal
distribution which has a mean of zero
and a standard deviation of one.
Let's sum how many of these 1 million
values that are greater than 1.96.
We see that we have about 25,000 values
that are greater than 1.96.
This is what we expect because the upper
tail on the right hand side of 1.96 in
the standard normal distribution should
theoretically cover 2.5% of the
distribution. We can compute the
proportional values that are greater
than 1.96. like this.
We see that this value is close to the
expected 2.5%
of this upper tail.
Similarly, the proportion of values that
are less than -1.96
should be close to 2.5%.
Okay, let's do a simulation that
explains the central limit theorem which
states that if the sample size becomes
large, the distribution of the sample
means approaches a normal distribution
regardless of the distribution we sample
from. We will here do 100,000
simulations.
In each simulation, we will draw five
random values from a distribution.
We will save the mean values from the
100,000 samples in the following vector
that we initialize with zeros.
Then we create a for loop
and in each loop we'll simulate five
rolls of a six-sided die.
This means that we sample from a uniform
distribution.
We save the mean value from the five
rows of the die in each iteration.
Finally, we generate a histogram of the
100,000 sample means.
If you run this code, we see that the
sample means are normally distributed,
which is exactly what the central limit
theorem tells us.
Let's change the sample size to 15 and
run the code.
We will then also see that the means
from such samples are normally
distributed.
Let's generate 100,000 random values
from an exponential distribution
with a mean equal to 1 over 0.3.
If you generate a histogram of these
100,000 random values, we can see the
exponential distribution.
By using 100 bins in the histogram, we
can see the distribution even more
clearly.
The mean of this distribution is 1 /
0.3.
Let's now simulate how the distribution
of 100,000 sample means from the
exponential distribution would look
like.
We change the variable name from roles
to R and simulate.
Note that the distribution is a bit
skewed and that it deviates from the
bellshaped normal distribution.
If we reduce the sample size to five,
the distribution of the sample means
will be even more skewed.
And if the sample size is just one,
we get the exponential distribution.
Now let's try a sample size of 50.
We see that the distribution of the
sample means now approaches the normal
distribution.
If you use a sample size of 500, the
distribution of the 100,000 sample means
resembles a normal distribution.
This explains the central limit theorem.
Even though we sample from any
distribution like the exponential
distribution, the sample means will be
normally distributed if the sample size
is large enough.
Okay, we will now simulate the risk to
commit a type one error.
We will draw four random values from a
normal distribution with a standard
deviation of two and a mean of 10. We
will do this for both groups.
We will compute 100,000 t tests. We save
the p values from these 100,000 tests in
the following vector.
In each loop, we draw two samples X and
Y from a normal distribution with a mean
of 10 and a standard deviation of two.
Then we compute a t test based on the
two samples
and use a t test that assumes an equal
variance of the two groups. From the t
test function, we extract the p value.
Finally, we create the histogram of the
100,000 p values
and run the code.
In this case, we know that the null
hypothesis is true because both samples
are drawn from the same distribution.
If the null hypothesis is true, the p
values follow a uniform distribution in
the interval 0 to 1. Let's calculate the
number of p values that are less than
0.05.
We see that about 5,000 p values are
less than 0.05.
If we divide by the number of tests that
we have done, we see that 5% of the p
values are less than 0.05.
If you run this code several times, we
see that the proportion varies around
the expected proportion of 5%.
So this means that there is a 5% risk
that we incorrectly reject a true null
hypothesis when we set the significance
level to 0.05.
Let's now try to compute the risk for a
type two error which means that we will
compute the proportion of times when we
do not reject a false null hypothesis.
To simulate a false null hypothesis, the
two samples will be drawn from two
distributions with different means.
We therefore know that the null
hypothesis of equal means is false.
The two distributions will now therefore
have different mean values.
Let's run the code and see how the p
values are now distributed.
So when the null hypothesis is false,
the distribution is now skewed because
most p values will be close to zero.
Although the null hypothesis is false,
due to chance, we may draw two similar
samples that generate a p value that is
greater than 0.05, which means that we
have committed a type two error since we
did not reject the false null
hypothesis.
If you compute the proportion of P
values that are less than 0.05,
we see that about 43% are less than
0.05.
This actually corresponds to the
statistical power because there is a 43%
chance that we do not commit a type two
error.
To compute the theoretical statistical
power by using a formula, we can load
the PVR library
and use the PVR.T.est
function
where we set N to our sample size, D to
the absolute difference between the
means
divided by the standard deviation.
We set the significance level to 0.05.
and leave the parameter power blank
which will be computed.
Then we say that we like to use a
two-sided two sample t test.
Given our parameters,
the theoretical statistical power is
about 43%.
Which is what we computed with our Monte
Carlo simulations.
If we run the code again, we see that
the statistical power is close to 43%.
Which again is close to the theoretical
value.
We will now set up a code to compute the
false discovery rate. We'll here
simulate 5,000 tests where the null
hypothesis is true and 5,000 tests where
the null hypothesis is false.
In total, we will therefore compute
10,000 t tests.
For each test, we use a sample size of
four. When the null hypothesis is false,
we will draw samples from two
distributions with different means. We
here set the standard deviation equal to
one.
The significance level is here set to
5%.
We will save the p values from the 5,000
tests where the null hypothesis is true
in the following vector.
and the p values from the 5,000 test
where the null hypothesis is false in
the following vector.
Next, we create a for loop that iterates
5,000 times.
In each loop, we draw two samples from a
normal distribution with the same mean
where the null hypothesis is therefore
true
and save the p values from such a t
list.
Then we draw random values from a normal
distribution where the mean is equal to
12
and save the p values
that are applied on two samples drawn
from two distributions with different
means where the null hypothesis is
therefore false.
After the loop, we combine all the
10,000p values in a vector.
Let's run the code and generate the
histogram of the 10,000p values.
Note that the flat part of this
histogram is expected to include the p
values simulated from where the null
hypothesis is true. Whereas the other p
values are expected to include the p
values simulated from where the null
hypothesis is false.
Let's calculate how many false positives
we have, which is the number of t tests
from which the p values are less than
0.05
when the null hypothesis is true.
Then we compute the number of true
positives we have, which is the number
of tests from which the p values are
less than 0.05
when the null hypothesis is false.
We see that the number of false
positives is 248
and that the number of true positives is
3,244.
We can now compute the false discovery
rate which is the number of false
positives divided by the total number of
positive results.
We see that the false discovery rate is
about 7%.
One method to control the false
discovery rate is to adjust the P values
by using the Benyamin hawkbar method.
After we have adjusted the p values,
we extract the adjusted p values based
on the true null hypothesis
and calculate how many false positives
we now have.
We do the same for the adjusted p values
from the tests where we know that the
null hypothesis is false
which are the last 5,000 p values in
this vector.
So when we did not adjust the p values,
we got 248 false positives and 3,244
true positives.
And when we adjust the P values,
we now get 34 false positives
and 1,081 true positives.
The false discovery rate is now 3%.
Let's try to run the code a few times.
So by adjusting the P values using the
Benjamin Hawkbar method, we control the
expected force discovery rate at 5%.
This was the end of this video. Thanks
for watching.
Ask follow-up questions or revisit key timestamps.
This video explains how to generate random numbers from uniform and normal distributions, and how to use these to simulate various statistical scenarios. It covers setting seeds for reproducibility, creating histograms to visualize distributions, and understanding sampling without replacement. The video also demonstrates how to simulate experiments like rolling dice and winning lotteries, and touches upon the Central Limit Theorem, explaining how sample means tend to follow a normal distribution. Finally, it delves into concepts of hypothesis testing, specifically type I and type II errors, and introduces the False Discovery Rate and methods to control it using p-value adjustments like the Benjamini-Hochberg method.
Videos recently processed by our community