Missing Data Imputation: Mean, Median & KNN Explained

Transcript

0:00

In this video, we will have a look at

0:02

how to impute missing data with various

0:05

methods and especially focus on the KNN

0:08

method. Missing data refers to the

0:11

absence of values for variables in a

0:14

data set where values were expected but

0:17

not recorded. Imputation means that we

0:20

fill in missing data with plausible

0:23

values. We usually only impute missing

0:26

data if the method we would like to use,

0:29

such as PCA, cannot handle missing data. The

0:32

imputation methods shown in this video

0:35

should primarily only be used for data

0:38

that are missing at random. Also,

0:41

imputing a variable with more than 30 to

0:44

50% missing observations is generally

0:47

not recommended. Such a variable should

0:50

be removed before the analysis.

0:53

Let's assume that we were supposed to

0:55

measure the weight of six individuals

0:58

who were included in our study. Due to

1:00

various reasons, we were not able to

1:03

measure the weight of person number six,

1:06

which means that we here have a missing

1:08

data point. Since we only have one

1:11

variable in our data set, it is hard to

1:14

estimate a plausible weight for person

1:17

number six. However, if the variable is

1:20

normally distributed, a plausible weight

1:23

for this person could be the average

1:26

weight of the individuals in our data

1:28

set because such a value is the most

1:30

likely value for a random person. The

1:34

average weight of the five individuals

1:36

is 74 kilos, which we can use to fill in

1:40

for the missing data. This is called

1:42

mean imputation.

1:45

If the data is skewed, a better estimate

1:48

of the missing value would instead be to

1:51

calculate the median.

1:53

If you have just a few missing values

1:56

and only one variable, it makes sense to

1:59

use the mean or median imputation.
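
As a minimal sketch of the two methods just described (using hypothetical weights, with `None` marking the missing entry), mean and median imputation amount to:

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical weights (kg) for six individuals; person 6 is missing.
weights = [70, 78, 66, 80, 76, None]
print(impute(weights))          # mean imputation fills in 74
print(impute(weights, median))  # median imputation fills in 76
```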

2:02

But if you have several missing data

2:04

points, mean or median imputation will

2:07

result in many identical values which

2:11

will reduce the variance of the data.

2:14

This will cause a bias in statistical

2:16

estimates that involve the calculation

2:19

of the sample variance.

2:22

Suppose that we now have two variables

2:24

in our data set in addition to the

2:27

weight. We now also have the body

2:30

heights of the individuals.

2:32

If we were to use mean imputation in

2:35

this case, we would get a data point

2:38

that looks a bit strange

2:41

because this data point is quite far

2:43

away from the other data points and it

2:46

is not likely that this person who is

2:49

quite tall has a weight of only 74

2:52

kilos. A more plausible value given the

2:56

data is that the person has a weight of

2:58

around 90 kilos.

3:01

When there is a dependency between the

3:03

variables in our data set where weight

3:06

and height seem to be positively

3:08

correlated, we can make use of the

3:11

height of person number six to better

3:14

estimate their weight. One way is to fit a

3:18

regression line to the data and use this

3:22

regression line to estimate a reasonable

3:24

weight for person number six, which in

3:27

this case would be about 95 kilos.

3:31

By using regression imputation, we would

3:34

fill in the following weight here, which

3:37

seems more reasonable given the height

3:39

of the person. If we had several

3:42

missing data points, we would get the

3:45

same problem as we saw with the mean and

3:47

median imputation because the spread of

3:50

the data points around the line would be

3:53

reduced when we impute by placing the

3:55

points directly on the line. One

3:58

therefore usually adds some randomness

4:00

so that the data points are not placed

4:03

exactly on the line to preserve the

4:05

natural variability in the data. The

4:08

problem with imputation which relies on

4:11

linear regression is that it will fail

4:14

to predict good values of the missing

4:16

data if the data is nonlinear.
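
The regression imputation with added randomness described above can be sketched as follows. The heights and weights here are hypothetical, and the noise is drawn from a normal distribution with the residual standard deviation, which is one common choice (stochastic regression imputation):

```python
import random
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    return slope, my - slope * mx

def regression_impute(heights, weights, noise=True):
    """Fill missing weights from a line fitted on the complete pairs.

    Adding Gaussian noise keeps imputed points off the line, which
    preserves the natural spread of the data.
    """
    complete = [(h, w) for h, w in zip(heights, weights) if w is not None]
    xs, ys = zip(*complete)
    slope, intercept = fit_line(xs, ys)
    resid_sd = stdev([w - (slope * h + intercept) for h, w in complete])
    return [w if w is not None
            else slope * h + intercept + (random.gauss(0, resid_sd) if noise else 0)
            for h, w in zip(heights, weights)]

# Hypothetical heights (cm) and weights (kg); person 6's weight is missing.
heights = [165, 172, 160, 180, 175, 190]
weights = [70, 78, 66, 80, 76, None]
print(regression_impute(heights, weights))
```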

4:19

This leads us to the method of KNN

4:22

imputation.

4:24

KNN imputation is a simple

4:26

nonparametric method that can capture

4:29

nonlinear relationships and roughly

4:32

preserve the natural variance in the

4:34

data. To illustrate how KNN imputation

4:38

works, we will use the following data

4:41

set of diastolic and systolic blood

4:43

pressure as well as the body mass index

4:46

of six individuals. We will use the

4:48

method to impute a plausible value for

4:51

the missing body mass index of person

4:54

number six. The method starts by looking

4:57

at the values of the other variables for

5:00

the person with missing BMI.

5:03

Then it finds the k closest neighbors to

5:06

this point. One usually sets this value

5:09

to five, but since we have a small data

5:12

set, I will here set it to three. The

5:15

method then identifies the BMI of the

5:18

three closest neighbors and computes the

5:21

mean of these values which is used to

5:24

fill in a reasonable value. One can also

5:28

use the median value of the three

5:30

closest neighbors.

5:32

So how do we find the three closest

5:35

neighbors?

5:37

Well, to calculate the distances in

5:40

space between data points, one commonly

5:43

uses the Euclidean distance, which can be

5:45

calculated by the following equation in

5:48

two dimensions. If we were to calculate

5:51

the distance in three dimensions, we

5:54

would simply add this term to the

5:56

equation and the same equation can be

5:58

used for as many variables as we want to

6:01

include in the KNN imputation. One

6:04

problem with the Euclidean distance is

6:06

that variables with big values will

6:09

contribute more to the distance compared

6:12

to variables with small values. One

6:15

therefore first standardizes or scales

6:18

the data so that all variables

6:21

contribute equally to the distance.
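
The Euclidean distance formula extends to any number of variables by adding one squared-difference term per variable; a minimal sketch:

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points of any dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))        # 2-D -> 5.0
print(euclidean((1, 2, 3), (4, 6, 3)))  # 3-D -> 5.0
```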

6:24

To for example standardize this

6:26

variable,

6:28

we compute its mean and its sample

6:31

standard deviation and plug in these

6:34

values here. Then we plug in the first

6:38

value here and do the math. This is the

6:42

standardized diastolic blood pressure of

6:45

the first person and this is the

6:47

standardized value for person number two

6:50

and so forth. We can now calculate the

6:53

distance between data point number six

6:56

and all the other data points based on

6:58

the standardized data.
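
The standardization step just described (subtract the mean, divide by the sample standard deviation) can be written as a small helper; the blood pressure values here are hypothetical:

```python
from statistics import mean, stdev

def standardize(values):
    """z-score scaling: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical diastolic blood pressures (mmHg).
diastolic = [80, 85, 75, 95, 90]
print([round(z, 2) for z in standardize(diastolic)])
```

After scaling, every variable has mean 0 and sample standard deviation 1, so each one contributes on the same scale to the distance.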

7:01

For example, to calculate the Euclidean

7:03

distance between data points four and

7:06

six, we plug in their X and Y

7:10

coordinates and do the math. If we

7:13

calculate the Euclidean distances between

7:15

data point six and all the other data

7:18

points, we will get the following

7:21

values. We can see that the three

7:23

closest data points to data point number

7:26

six are data points 2, 4 and 5. We

7:30

have therefore identified the k closest

7:33

points. We now calculate the mean of the

7:36

BMI of these three data points which we

7:40

can use to fill in for the missing

7:42

value. One can also calculate the median

7:45

value. So what would happen if we had a

7:49

missing value also here?

7:52

Well, this point does not exist because

7:55

person number two has a missing

7:57

diastolic blood pressure. The three

8:00

closest points are now points 3, 4

8:03

and 5, and the average BMI of these

8:07

individuals is 28.33

8:10

which we fill in for the missing data.

8:13

One can also use the median

8:16

to impute this missing value. We check

8:20

the distance between data point 2 and

8:23

all the other rows with complete values

8:27

and compute the average value of the

8:29

diastolic blood pressure of the three

8:32

closest neighbors that we replace the

8:35

missing data with. In this example, we

8:38

only computed the distances based on

8:41

complete cases. However, to compute this

8:45

missing value, we can also include the

8:48

distance between these points

8:51

when evaluating the closest neighbors.

8:54

Although this second person has a

8:56

missing value for their diastolic blood

8:58

pressure,

8:59

also including rows that do not have

9:02

complete values is important when there

9:05

is a lot of missing data in the data set

9:08

because if we do not include them, only

9:11

these two rows will be used for imputing

9:15

all the missing values.
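
Putting the pieces together, a simple KNN imputation over complete rows can be sketched as below. The blood pressure and BMI values are hypothetical, both features are z-scored before computing Euclidean distances, and the missing BMI is filled with the mean BMI of the k = 3 nearest complete rows:

```python
from math import sqrt
from statistics import mean, stdev

def knn_impute_bmi(rows, k=3):
    """Fill a missing BMI with the mean BMI of the k nearest complete rows.

    rows: list of (diastolic, systolic, bmi); None marks a missing BMI.
    Both features are z-scored first so they contribute equally.
    """
    cols = list(zip(*[(r[0], r[1]) for r in rows]))
    scaled = []
    for col in cols:
        m, s = mean(col), stdev(col)
        scaled.append([(v - m) / s for v in col])
    points = list(zip(*scaled))  # one scaled (diastolic, systolic) per row

    out = []
    for i, r in enumerate(rows):
        if r[2] is not None:
            out.append(r)
            continue
        # Distances from row i to every complete row, smallest first.
        dists = sorted(
            (sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j]))), j)
            for j, other in enumerate(rows) if other[2] is not None)
        neighbours = [rows[j][2] for _, j in dists[:k]]
        out.append((r[0], r[1], mean(neighbours)))
    return out

# Hypothetical (diastolic, systolic, BMI) rows; person 6's BMI is missing.
data = [(80, 120, 24), (85, 130, 27), (75, 115, 22),
        (95, 145, 30), (90, 140, 28), (92, 142, None)]
filled = knn_impute_bmi(data)[-1]
print(round(filled[2], 2))  # -> 28.33, the mean BMI of the three nearest rows
```

Replacing `mean(neighbours)` with `median(neighbours)` gives the median variant mentioned in the video.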

9:17

I will here show how the R package VIM

9:21

computes the distances in KNN

9:24

imputation. In this package, they use a

9:27

weighted mean distance where the

9:30

following formula is used to compute the

9:32

so-called Gower distance, which divides the

9:36

absolute difference between two values

9:39

of one variable by the range of that

9:42

variable. This distance also works on

9:45

categorical variables such as a binary

9:49

variable. Have a look at the paper to

9:51

see how they deal with other types of

9:54

categorical variables. If two

9:56

individuals belong to the same category,

9:59

delta is equal to zero and is equal to

10:03

one if the two individuals belong to

10:06

different categories.

10:08

Let's try to calculate the Gower distance

10:10

between data points five and six based

10:13

on the two variables diastolic blood

10:16

pressure and systolic blood pressure. We

10:18

plug in the diastolic blood pressures

10:21

here. The range is simply the maximum

10:24

value minus the minimum value of the

10:27

variable.

10:29

Then we compute the same for the second

10:31

variable.

10:34

We now plug in the distances in this

10:37

equation.

10:38

We here put equal weights on the two

10:41

variables.

10:43

Which means that we just compute the

10:45

average of the two distances.
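
For numeric variables, the Gower distance just computed can be sketched as below. The blood pressure values and ranges are hypothetical, and the optional weights anticipate the weighted variant discussed next:

```python
def gower(p, q, ranges, weights=None):
    """Weighted Gower distance for numeric variables.

    Each per-variable distance is |p_i - q_i| divided by that variable's
    range (max minus min); the result is the weighted mean of these
    distances. With equal weights, it is simply their average.
    """
    weights = weights or [1] * len(p)
    dists = [abs(a - b) / r for a, b, r in zip(p, q, ranges)]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

# Hypothetical diastolic/systolic values for persons 5 and 6.
p5, p6 = (90, 140), (92, 142)
ranges = (95 - 75, 145 - 115)  # max minus min of each variable
print(gower(p5, p6, ranges))   # equal weights: the mean of 2/20 and 2/30
```

For a categorical variable, the per-variable distance would instead be 0 when the two individuals share a category and 1 otherwise, as the video describes.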

10:48

If we compute the distances between

10:51

data point 6 and all the other data

10:53

points, we will get these values.

10:56

These are the three closest points.

11:00

We therefore calculate the mean or the

11:03

median of the BMI for persons 2, 4

11:06

and 5, with which we replace the

11:09

missing value. Previously we used

11:11

an equal weight to compute the mean of

11:14

the two distances.

11:16

But if we think that the systolic blood

11:19

pressure is more important than the

11:21

diastolic blood pressure in predicting

11:23

the BMI, we can put more weight on this

11:27

variable. One way to find appropriate

11:30

weights is to use some kind of machine

11:32

learning method that can compute

11:34

variable importance. For example, if you

11:38

would use a random forest model to

11:40

predict the BMI based on these

11:43

predictors, the model can compute the

11:46

importance of these variables which can

11:49

be used as weights in this equation.

11:52

This was the end of this video about

11:54

KNN imputation. Thanks for watching.

Interactive Summary

The video explains various methods for imputing missing data, focusing on situations where data is missing at random and the proportion of missing data isn't too high. It starts with simple methods like mean and median imputation for single variables, then moves to regression imputation for correlated variables, highlighting their limitations (reduced variance, inability to handle nonlinearity). The core of the video introduces KNN (k-nearest neighbors) imputation, a nonparametric method suitable for nonlinear relationships. It details how KNN works by finding the k closest neighbors using distances (Euclidean for numerical, Gower for mixed data types including categorical) and then calculating the mean or median of their values. The video also covers data standardization, handling incomplete cases, and using weighted distances to improve imputation accuracy.
