Imputing missing data is the act of replacing missing values with nonmissing values. Mean imputation replaces missing values in a numerical variable with the mean of the nonmissing values. This article shows how to perform mean imputation in SAS. It also presents three statistical drawbacks of mean imputation.
How to perform mean imputation in SAS
The easiest way to perform mean imputation in SAS is to use PROC STDIZE. PROC STDIZE supports the REPONLY and METHOD=MEAN options, which tell it to replace missing values with the mean for the variables on the VAR statement. To demonstrate mean imputation, the following statements randomly add missing values to the Sashelp.Class data set. The call to PROC STDIZE then replaces the missing values and creates a data set called IMPUTED that contains the results:
/* Create "original data" by randomly inserting missing values for some heights */
data Have;
   set sashelp.class;
   call streaminit(12345);
   Replaced = rand("Bernoulli", 0.4);  /* indicator variable is 1 about 40% of time */
   if Replaced then Height = .;
run;

/* Mean imputation: Use PROC STDIZE to replace missing values with mean */
proc stdize data=Have out=Imputed
      oprefix=Orig_         /* prefix for original variables */
      reponly               /* only replace; do not standardize */
      method=MEAN;          /* or MEDIAN, MINIMUM, MIDRANGE, etc. */
   var Height;              /* you can list multiple variables to impute */
run;

proc print data=Imputed;
   format Orig_Height Height BESTD8.1;
   var Name Orig_Height Height Weight Replaced;
run;
The output shows that the missing data (such as observations 6 and 8) are replaced by 61.5, which is the mean value of the observed heights. For a subsequent visualization, I have included a binary variable (Replaced) that indicates whether an observation was originally missing. The METHOD= option in PROC STDIZE supports several statistics. You can use METHOD=MEDIAN to replace missing values by the median, METHOD=MINIMUM to replace by the minimum value, and so forth.
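To make the METHOD= statistics concrete, here is a quick sketch (in Python with pandas rather than SAS, and the data values are invented for illustration) that computes the replacement values that METHOD=MEAN, MEDIAN, MINIMUM, and MIDRANGE would use for a toy variable, then performs the mean imputation:

```python
import pandas as pd

# Toy numerical variable with two missing values (invented for illustration)
height = pd.Series([62.0, None, 59.5, 64.0, None, 61.0])

# All replacement statistics are computed from the nonmissing values only
observed = height.dropna()
replacements = {
    "MEAN":     observed.mean(),
    "MEDIAN":   observed.median(),
    "MINIMUM":  observed.min(),
    "MIDRANGE": (observed.min() + observed.max()) / 2,  # midpoint of the range
}

# Mean imputation: fill each missing cell with the observed mean
imputed = height.fillna(replacements["MEAN"])
print(replacements)
print(imputed.tolist())
```

For these toy values the observed mean is 61.625, so both missing cells become 61.625; switching the key to "MEDIAN" or "MINIMUM" mimics the corresponding METHOD= choice.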
Problems with mean imputation
Most software packages deal with missing data by using listwise deletion: observations that have missing data are dropped from the analysis. Throwing away hard-collected data is painful and can result in a substantial loss of power for statistical tests. Mean imputation, which is easy to implement, enables analysts to use every observation. However, mean imputation has three serious disadvantages that can lead to problems in your statistical analysis. Mean imputation is a univariate method that ignores the relationships between variables and makes no effort to represent the inherent variability in the data. In particular, when you replace missing data by a mean, you commit three statistical sins:
- Mean imputation reduces the variance of the imputed variables.
- Mean imputation shrinks standard errors, which invalidates most hypothesis tests and the calculation of confidence intervals.
- Mean imputation does not preserve relationships between variables such as correlations.
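All three sins are easy to see numerically. The following sketch (in Python with NumPy rather than SAS; the simulated data, seed, and coefficients are invented for illustration) knocks out 40% of a variable at random, mean-imputes it, and compares the variance, the standard error of the mean, and a correlation before and after:

```python
import numpy as np

rng = np.random.default_rng(12345)

# Simulated bivariate data (invented for illustration): y is correlated with x
n = 200
x = rng.normal(60, 5, n)
y = 0.8 * x + rng.normal(0, 3, n)

# Delete about 40% of x completely at random, then mean-impute
missing = rng.random(n) < 0.4
x_imputed = x.copy()
x_imputed[missing] = x[~missing].mean()

# Sin 1: the variance of the imputed variable shrinks
var_full = x.var(ddof=1)
var_imp  = x_imputed.var(ddof=1)

# Sin 2: the standard error of the mean shrinks along with the variance
se_full = np.sqrt(var_full / n)
se_imp  = np.sqrt(var_imp / n)

# Sin 3: the correlation with other variables is attenuated toward 0
corr_full = np.corrcoef(x, y)[0, 1]
corr_imp  = np.corrcoef(x_imputed, y)[0, 1]

print(var_full, var_imp)    # imputed variance is noticeably smaller
print(se_full, se_imp)      # imputed standard error is too small
print(corr_full, corr_imp)  # imputed correlation is weaker
```

Because the imputed cells contribute zero deviation from the mean, roughly 40% of the sum of squares vanishes, which is exactly why the variance, standard errors, and correlations all shrink.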
These problems are discussed further in a subsequent article. Most experts agree that the drawbacks far outweigh the advantages, especially since most software supports modern alternatives to single imputation, such as multiple imputation. My advice: don't use mean imputation if you can use a more sophisticated alternative.
Epilogue
When I was in college, an actor friend smoked cigarettes. He knew that he should stop, but his addiction was too strong. When he lit up he would recite the following verse and dramatically punctuate the final phrase by blowing a smoke ring:
If you don't smoke, don't start.
If you do smoke, stop.
If you do smoke and won't stop, smoke with style. (*blows smoke ring*)
I don't recommend mean imputation. It is bad for the health of your data. But if I can't dissuade you from using mean imputation, remember the following verse:
If you don't use mean imputation, don't start.
If you do use mean imputation, stop.
If you do use mean imputation and won't stop, use PROC STDIZE.
9 Comments
Love the epilogue, very powerful analogy and advice. Quote of the day: "Mean imputation is bad for the health of your data". Instead, one should use "nice" imputation.
:-) Funny joke.
Another great post, Rick. Thank you so much. I can't wait to read the follow-up.
Nice text! Note that when you calculate the variation in the mean, you should ONLY use the original values.
Imputation has been studied a lot in Mathematical Statistics. My knowledge tells me that I do not know the subject.
Another question is: Why are some values missing? Systematic effect?
/ Br Anders
p.s. Jens Malmros has studied this. His thesis about this subject won him a scientific prize.
He now works at SCB Statistics Sweden.
Thanks for your comments. I wanted to keep this article short, but the follow-up article contains a few references to the extensive literature on this topic.
Do you have a reference for his work in English? It sounds interesting, but my Swedish is sadly deficient.
My thought was similar to Anders - are the data missing at random? For example, I once analyzed a large data set examining, among other things, high school dropout. A small percentage of the students did not know their mother's educational level and that had been set to missing. In further analysis, those students did not live with their mothers, which is very unusual, and, on top of that, apparently had little contact - even if your dad has custody you usually know if your mom graduated from high school or not. My point is that you can miss some interesting findings when you just gloss over missing data. Looking forward to your next post.
Thanks for your thoughts and anecdote. I agree that an analyst should look into causes of missingness before blindly proceeding with the analysis.
Since you mentioned the missing at random (MAR) assumption, I want to add a few thoughts:
1. The assumption is often used to assess the bias of estimators. For the regression example, I believe that if the X are MAR, then the expected value of the intercept for the imputed variable is the same as the intercept for the complete data. Unfortunately, even if a theorem shows that an estimate is unbiased "on average," for a particular set of data (such as the regression example) the missing X values might correspond to Y values that are larger (or smaller) than expected.
2. For small data sets, it can be hard to verify whether values are MAR.
3. Although the MAR assumption can help you assess bias in point estimates, it doesn't change the most damning aspect of mean imputation, which is the shrunken variance estimates.
Thanks for mentioning MAR. There is much more that can (and has!) been said on this topic.
It would be nice if the example showed how to perform mean imputation (I've used it before, and it may be too late to stop) within BY groups rather than using the mean of the whole data set. Thanks.