Missing Data Imputation: Mean, Median & KNN Explained
In this video, we will have a look at
how to impute missing data with various
methods and especially focus on the KNN (k-nearest neighbors)
method. Missing data refers to the
absence of values for variables in a
data set where values were expected but
not recorded. Imputation means that we
fill in missing data with plausible
values. We usually only impute missing
data if the method we would like to use, such
as PCA, cannot handle missing data. The
imputation methods shown in this video
should primarily be used for data
that are missing at random. Also,
imputing a variable with more than 30 to
50% missing observations is generally
not recommended. Such a variable should
be removed before the analysis.
Let's assume that we were supposed to
measure the weight of six individuals
who were included in our study. Due to
various reasons, we were not able to
measure the weight of person number six,
which means that we here have a missing
data point. Since we only have one
variable in our data set, it is hard to
estimate a plausible weight for person
number six. However, if the variable is
normally distributed, a plausible weight
for this person could be the average
weight of the individuals in our data
set because such a value is the most
likely value for a random person. The
average weight of the five individuals
is 74 kilos, which we can use to fill in
for the missing data. This is called
mean imputation.
If the data is skewed, a better estimate
of the missing value would instead be
the median.
If you have just a few missing values
and only one variable, it makes sense to
use the mean or median imputation.
But if you have several missing data
points, mean or median imputation will
result in many identical values which
will reduce the variance of the data.
This will cause a bias in statistical
estimates that involve the calculation
of the sample variance.
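The shrinking variance can be checked with a small sketch. The five weights below are made up for illustration (only their mean of 74 kilos matches the example), and the helper name `impute` is hypothetical.

```python
import statistics

# Hypothetical weights in kilos; None marks the missing measurement.
# The five observed values are invented so that their mean is 74.
weights = [62.0, 70.0, 74.0, 78.0, 86.0, None]

def impute(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

mean_imputed = impute(weights, "mean")
median_imputed = impute(weights, "median")

# Sample variance of the observed values versus the mean-imputed values.
observed = [w for w in weights if w is not None]
var_before = statistics.variance(observed)
var_after = statistics.variance(mean_imputed)

print(mean_imputed[-1], var_before, var_after)
```

The imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while the sample size grows, which is why the sample variance drops.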
Suppose that we now have two variables
in our data set: in addition to the
weight, we now also have the body
heights of the individuals.
If we were to use mean imputation in
this case, we would get a data point
that looks a bit strange
because this data point is quite far
away from the other data points and it
is not likely that this person who is
quite tall has a weight of only 74
kilos. A more plausible value given the
data is that the person has a weight of
around 90 kilos.
When there is a dependency between the
variables in our data set where weight
and height seem to be positively
correlated, we can make use of the
height of person number six to better
estimate their weight. One way is to fit a
regression line to the data and use this
regression line to estimate a reasonable
weight for person number six, which in
this case would be about 95 kilos.
By using regression imputation, we would
fill in the following weight here, which
seems more reasonable given the height
of the person. If we had several
missing data points, we would get the
same problem as we saw with the mean and
median imputation because the spread of
the data points around the line would be
reduced when we impute by placing the
points directly on the line. One
therefore usually adds some randomness
so that the data points are not placed
exactly on the line to preserve the
natural variability in the data. The
problem with imputation which relies on
linear regression is that it will fail
to predict good values of the missing
data if the data is nonlinear.
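Regression imputation with added randomness could look roughly like the sketch below. The heights and weights are invented, the function name `regression_impute` is hypothetical, and drawing the noise from a normal distribution with the residual standard deviation is one common choice, not necessarily the one used in the video.

```python
import random
import statistics

# Hypothetical data: heights in cm and weights in kg; None marks the missing weight.
heights = [160.0, 165.0, 170.0, 175.0, 180.0, 190.0]
weights = [62.0, 70.0, 74.0, 78.0, 86.0, None]

# Ordinary least squares fit of weight on height, using complete rows only.
pairs = [(h, w) for h, w in zip(heights, weights) if w is not None]
x_mean = statistics.mean(h for h, _ in pairs)
y_mean = statistics.mean(w for _, w in pairs)
slope = (sum((h - x_mean) * (w - y_mean) for h, w in pairs)
         / sum((h - x_mean) ** 2 for h, _ in pairs))
intercept = y_mean - slope * x_mean

# Residual standard deviation: spread of the observed points around the line.
residuals = [w - (intercept + slope * h) for h, w in pairs]
resid_sd = (sum(r ** 2 for r in residuals) / (len(pairs) - 2)) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def regression_impute(height, add_noise=False):
    """Predict a weight from the line, optionally adding random residual noise."""
    value = intercept + slope * height
    if add_noise:
        value += rng.gauss(0.0, resid_sd)
    return value

on_the_line = regression_impute(heights[-1])
with_noise = regression_impute(heights[-1], add_noise=True)
print(round(on_the_line, 1), round(with_noise, 1))
```

Without noise every imputed point lands exactly on the fitted line. With noise the imputed points scatter around the line about as much as the observed points do.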
This leads us to the method of KNN
imputation.
KNN imputation is a simple
nonparametric method that can capture
nonlinear relationships and roughly
preserve the natural variance in the
data. To illustrate how KNN imputation
works, we will use the following data
set of diastolic and systolic blood
pressure as well as the body mass index
of six individuals. We will use the
method to impute a plausible value for
the missing body mass index of person
number six. The method starts by looking
at the values of the other variables for
the person with missing BMI.
Then it finds the k closest neighbors to
this point. One usually sets this value
to five, but since we have a small data
set, I will here set it to three. The
method then identifies the BMI of the
three closest neighbors and computes the
mean of these values which is used to
fill in a reasonable value. One can also
use the median value of the three
closest neighbors.
So how do we find the three closest
neighbors?
Well, to calculate the distances in
space between data points, one commonly
uses the Euclidean distance, which can be
calculated by the following equation in
two dimensions. If we were to calculate
the distance in three dimensions, we
would simply add this term to the
equation, and the same equation can be
used for as many variables as we want to
include in the KNN imputation. One
problem with the Euclidean distance is
that variables with big values will
contribute more to the distance compared
to variables with small values. One
therefore first standardizes or scales
the data so that all variables
contribute equally to the distance.
To standardize this variable,
for example,
we compute its mean and its sample
standard deviation and plug in these
values here. Then we plug in the first
value here and do the math. This is the
standardized diastolic blood pressure of
the first person and this is the
standardized value for person number two
and so forth. We can now calculate the
distance between data point number six
and all the other data points based on
the standardized data.
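A minimal sketch of these two steps follows: standardizing each variable with its mean and sample standard deviation, then taking Euclidean distances to person number six. The blood pressure values are invented for illustration, and the function names are hypothetical.

```python
import math
import statistics

# Hypothetical diastolic and systolic blood pressures (mmHg) for six persons.
diastolic = [70.0, 85.0, 75.0, 90.0, 95.0, 88.0]
systolic = [110.0, 135.0, 120.0, 140.0, 150.0, 138.0]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# One standardized (x, y) point per person.
points = list(zip(standardize(diastolic), standardize(systolic)))

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

target = points[5]  # person number six
distances = {i + 1: euclidean(points[i], target) for i in range(5)}

# Person numbers of the three closest neighbors, nearest first.
nearest = sorted(distances, key=distances.get)[:3]
print(nearest)
```

Because `euclidean` sums over all coordinates it receives, the same function works for any number of variables.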
For example, to calculate the Euclidean
distance between data points four and
six, we plug in their X and Y
coordinates and do the math. If we
calculate the Euclidean distances between
data point six and all the other data
points, we will get the following
values. We can see that the three
closest data points to data point number
six are data points two, four and five. We
have therefore identified the k closest
points. We now calculate the mean of the
BMI of these three data points which we
can use to fill in for the missing
value. One can also calculate the median
value. So what would happen if we had a
missing value also here?
Well, this point does not exist because
person number two has a missing
diastolic blood pressure. The three
closest points are now points three, four
and five. And the average BMI of these
individuals is 28.33,
which we fill in for the missing data.
One can also use the median.
To impute this missing value, we check
the distance between data point two and
all the other rows with complete values
and compute the average value of the
diastolic blood pressure of the three
closest neighbors, which we use to replace the
missing data. In this example, we
only computed the distances based on
complete cases. However, to impute this
missing value, we can also include
the distance between these points
when evaluating the closest neighbors,
although this second person has a
missing value for their diastolic blood
pressure.
Including rows that do not have
complete values is important when there
is a lot of missing data in the data set,
because if we do not include them, only
these two rows will be used for imputing
all the missing values.
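The two variants, complete cases only versus also allowing incomplete rows as neighbors, can be sketched as follows. The data are invented and are not the values shown in the video, although with these numbers the complete-case result is also 28.33. Measuring the distance to an incomplete row over the variables both rows share, and rescaling it by the share of variables used, is an assumed convention. The function names are hypothetical.

```python
import math
import statistics

# Hypothetical rows: [diastolic, systolic, BMI]; None marks a missing value.
rows = [
    [70.0, 110.0, 22.0],
    [None, 135.0, 27.0],
    [75.0, 120.0, 24.0],
    [90.0, 140.0, 29.0],
    [95.0, 150.0, 32.0],
    [88.0, 138.0, None],
]

def standardize_columns(data):
    """Z-score each column using only its observed values."""
    scaled = [list(r) for r in data]
    for j in range(len(data[0])):
        observed = [r[j] for r in data if r[j] is not None]
        mean, sd = statistics.mean(observed), statistics.stdev(observed)
        for r in scaled:
            if r[j] is not None:
                r[j] = (r[j] - mean) / sd
    return scaled

def partial_distance(p, q, columns):
    """Euclidean distance over columns observed in both rows,
    rescaled by the share of columns used (assumed convention)."""
    shared = [j for j in columns if p[j] is not None and q[j] is not None]
    if not shared:
        return math.inf
    squared = sum((p[j] - q[j]) ** 2 for j in shared)
    return math.sqrt(squared * len(columns) / len(shared))

def knn_impute(data, row, col, k=3, complete_only=False):
    """Mean of column `col` over the k nearest donor rows of `row`."""
    scaled = standardize_columns(data)
    predictors = [j for j in range(len(data[0])) if j != col]
    donors = []
    for i, r in enumerate(data):
        if i == row or r[col] is None:
            continue  # a donor must have the value we want to borrow
        if complete_only and any(v is None for v in r):
            continue  # skip incomplete rows in the complete-case variant
        donors.append((partial_distance(scaled[i], scaled[row], predictors), i))
    nearest = sorted(donors)[:k]
    return statistics.mean(data[i][col] for _, i in nearest)

bmi_complete = knn_impute(rows, row=5, col=2, complete_only=True)
bmi_all_rows = knn_impute(rows, row=5, col=2)
dia_person_two = knn_impute(rows, row=1, col=0, complete_only=True)
print(round(bmi_complete, 2), round(bmi_all_rows, 2), round(dia_person_two, 2))
```

With `complete_only=True`, person two cannot serve as a neighbor, so the three donors are persons three, four and five. Allowing incomplete rows brings person two back in as a donor for the BMI.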
I will here show how the R package VIM
computes the distances in KNN
imputation. In this package, they use a
weighted mean distance where the
following formula is used to compute the
so-called Gower distance, which divides the
absolute difference between two values
of one variable by the range of that
variable. This distance also works on
categorical variables such as a binary
variable. Have a look at the paper to
see how they deal with other types of
categorical variables. If two
individuals belong to the same category,
delta is equal to zero, and it is equal to
one if the two individuals belong to
different categories.
Let's try to calculate the Gower distance
between data points five and six based
on the two variables diastolic blood
pressure and systolic blood pressure. We
plug in the diastolic blood pressures
here. The range is simply the maximum
value minus the minimum value of the
variable.
Then we compute the same for the second
variable.
We now plug in the distances in this
equation.
We here put equal weights on the two
variables.
Which means that we just compute the
average of the two distances.
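The calculation just described can be sketched in a few lines. The blood pressure values are invented, and the function names are hypothetical. This follows the description in the video and is not the VIM package's actual code.

```python
# Hypothetical diastolic and systolic blood pressures (mmHg) for six persons.
diastolic = [70.0, 85.0, 75.0, 90.0, 95.0, 88.0]
systolic = [110.0, 135.0, 120.0, 140.0, 150.0, 138.0]

def numeric_part(a, b, value_range):
    """Absolute difference divided by the range of the variable."""
    return abs(a - b) / value_range

def categorical_part(a, b):
    """0 if both individuals are in the same category, 1 otherwise."""
    return 0.0 if a == b else 1.0

# The range is the maximum value minus the minimum value of the variable.
range_dia = max(diastolic) - min(diastolic)
range_sys = max(systolic) - min(systolic)

def gower(i, j):
    """Equal-weight Gower distance between persons i and j (1-based)."""
    d_dia = numeric_part(diastolic[i - 1], diastolic[j - 1], range_dia)
    d_sys = numeric_part(systolic[i - 1], systolic[j - 1], range_sys)
    return (d_dia + d_sys) / 2  # equal weights: plain average

d_56 = gower(5, 6)
distances = {i: gower(i, 6) for i in range(1, 6)}
nearest = sorted(distances, key=distances.get)[:3]
print(round(d_56, 3), nearest)
```

`categorical_part` is not needed for the two blood pressure variables. It is included only to show how a binary variable would enter the same average.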
If we compute the distances between
data point six and all the other data
points, we will get these values.
These are the three closest points.
We therefore calculate the mean or the
median of the BMI for persons two, four
and five, which we use to replace the
missing value. Previously we used
an equal weight to compute the mean of
the two distances.
But if we think that the systolic blood
pressure is more important than the
diastolic blood pressure in predicting
the BMI, we can put more weight on this
variable. One way to find appropriate
weights is to use some kind of machine
learning method that can compute
variable importance. For example, if we
used a random forest model to
predict the BMI based on these
predictors, the model can compute the
importance of these variables which can
be used as weights in this equation.
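As a small sketch of the weighting step, the per-variable distances are combined as a weighted mean. The per-variable distances come from invented blood pressure values, and the importance weights are made up. They are not the output of an actual random forest.

```python
def weighted_gower(parts, weights):
    """Weighted mean of per-variable distances: sum(w * d) / sum(w)."""
    return sum(w * d for w, d in zip(weights, parts)) / sum(weights)

# Per-variable distances between persons five and six for hypothetical data:
# diastolic |95 - 88| / 25 and systolic |150 - 138| / 40.
parts = [7.0 / 25.0, 12.0 / 40.0]

# Equal weights reduce to the plain average of the two distances.
equal = weighted_gower(parts, [1.0, 1.0])

# Made-up importances, with systolic blood pressure weighted higher.
importance = [0.3, 0.7]
weighted = weighted_gower(parts, importance)

print(round(equal, 3), round(weighted, 3))
```

Dividing by the sum of the weights means the importances do not have to add up to one.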
This was the end of this video about
KNN imputation. Thanks for watching.
The video explains various methods for imputing missing data, focusing on situations where data is missing at random and the proportion of missing data isn't too high. It starts with simple methods like mean and median imputation for single variables, then moves to regression imputation for correlated variables, highlighting their limitations (reduced variance, inability to handle nonlinearity). The core of the video introduces KNN (k-nearest neighbors) imputation, a nonparametric method suitable for nonlinear relationships. It details how KNN works by finding k closest neighbors using distances (Euclidean for numerical, Gower for mixed data types including categorical) and then calculating the mean or median of their values. The video also covers data standardization, handling incomplete cases, and using weighted distances to improve imputation accuracy.