Mean imputation in SAS

Imputing missing data is the act of replacing missing data by nonmissing values. Mean imputation replaces missing data in a numerical variable by the mean value of the nonmissing values. This article shows how to perform mean imputation in SAS. It also presents three statistical drawbacks of mean imputation.

How to perform mean imputation in SAS

The easiest way to perform mean imputation in SAS is to use PROC STDIZE. PROC STDIZE supports the REPONLY and METHOD=MEAN options, which together tell it to replace missing values with the mean for each variable on the VAR statement, without standardizing the data. To demonstrate mean imputation, the following statements randomly add missing values to the Sashelp.Class data set. The call to PROC STDIZE then replaces the missing values and creates a data set called IMPUTED that contains the results:

/* Create "original data" by randomly inserting missing values for some heights */
data Have;
set sashelp.class;
call streaminit(12345);
Replaced = rand("Bernoulli", 0.4);  /* indicator variable is 1 about 40% of time */
if Replaced then Height = .;        
run;
 
/* Mean imputation: Use PROC STDIZE to replace missing values with mean */
proc stdize data=Have out=Imputed 
      oprefix=Orig_         /* prefix for original variables */
      reponly               /* only replace; do not standardize */
      method=MEAN;          /* or MEDIAN, MINIMUM, MIDRANGE, etc. */
   var Height;              /* you can list multiple variables to impute */
run;
 
proc print data=Imputed;
   format Orig_Height Height BESTD8.1;
   var Name Orig_Height Height Weight Replaced;
run;
The output shows that the missing data (such as observations 6 and 8) are replaced by 61.5, which is the mean value of the observed heights. For a subsequent visualization, I have included a binary variable (Replaced) that indicates whether an observation was originally missing. The METHOD= option in PROC STDIZE supports several statistics. You can use METHOD=MEDIAN to replace missing values by the median, METHOD=MINIMUM to replace by the minimum value, and so forth.
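As a quick check, the following sketch computes the mean of the observed heights directly and then repeats the imputation with the median. It assumes the Have data set created above; the output data set name ImputedMedian is an arbitrary choice for illustration and does not appear in the original example.

```sas
/* Sketch: confirm that the imputed value is the mean of the observed heights.
   PROC MEANS computes the mean from the nonmissing values only. */
proc means data=Have n nmiss mean;
   var Height;
run;

/* Median imputation: same call as before, but with METHOD=MEDIAN.
   ImputedMedian is a hypothetical name chosen for this sketch. */
proc stdize data=Have out=ImputedMedian
      oprefix=Orig_         /* keep the original variable as Orig_Height */
      reponly               /* only replace; do not standardize */
      method=MEDIAN;
   var Height;
run;
```

The MEAN statistic reported by PROC MEANS should match the value that PROC STDIZE used to fill in the missing heights.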

Problems with mean imputation

Most software packages deal with missing data by using listwise deletion: observations that have missing data are dropped from the analysis. Throwing away hard-collected data is painful and can result in a substantial loss of power for statistical tests. Mean imputation, which is easy to implement, enables analysts to use every observation. However, mean imputation has three serious disadvantages that can lead to problems in your statistical analysis. Mean imputation is a univariate method that ignores the relationships between variables and makes no effort to represent the inherent variability in the data. In particular, when you replace missing data by a mean, you commit three statistical sins:

  • Mean imputation reduces the variance of the imputed variables.
  • Mean imputation shrinks standard errors, which invalidates most hypothesis tests and the calculation of confidence intervals.
  • Mean imputation does not preserve relationships between variables such as correlations.
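The three drawbacks in the list above can be checked on the example data. The following is a minimal sketch, assuming the Imputed data set from the earlier PROC STDIZE call, which contains both Orig_Height (with missing values) and Height (imputed). The choice of Weight as the comparison variable is only for illustration.

```sas
/* Sketch: compare the spread and the standard error before and after imputation.
   Statistics for Orig_Height use only the nonmissing values. */
proc means data=Imputed n mean std stderr;
   var Orig_Height Height;
run;

/* Sketch: compare the correlation with Weight before and after imputation.
   By default, PROC CORR uses pairwise deletion, so the Orig_Height
   correlation is based on the observations that have a nonmissing height. */
proc corr data=Imputed nosimple;
   var Orig_Height Height;
   with Weight;
run;
```

The mean is the same for both variables, but each imputed value adds nothing to the sum of squared deviations while increasing the sample size, so the standard deviation and standard error of Height are smaller than those of Orig_Height.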

These problems are discussed further in a subsequent article. Most experts agree that the drawbacks far outweigh the advantages, especially since most software supports modern alternatives to single imputation, such as multiple imputation. My advice: don't use mean imputation if you can use a more sophisticated alternative.

Epilogue

When I was in college, an actor friend smoked cigarettes. He knew that he should stop, but his addiction was too strong. When he lit up he would recite the following verse and dramatically punctuate the final phrase by blowing a smoke ring:

     If you don't smoke, don't start.
     If you do smoke, stop.
     If you do smoke and won't stop, smoke with style. (*blows smoke ring*)

I don't recommend mean imputation. It is bad for the health of your data. But if I can't dissuade you from using mean imputation, remember the following verse:

     If you don't use mean imputation, don't start.
     If you do use mean imputation, stop.
     If you do use mean imputation and won't stop, use PROC STDIZE.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

9 Comments

  1. Anders Sköllermo on

    Nice text! Note that when you calculate the variation in the mean, you should ONLY use the original values.
    Imputation has been studied a lot in Mathematical Statistics. My knowledge tells me that I do not know the subject.

    Another question is: Why are some values missing ? Systematic effect ?
    / Br Anders

    p.s. Jens Malmros has studied this. His thesis about this subject won him a scientific prize.
    He now works at SCB Statistics Sweden.

  2. My thought was similar to Anders - are the data missing at random? For example, I once analyzed a large data set examining, among other things, high school dropout. A small percentage of the students did not know their mother's educational level and that had been set to missing. In further analysis, those students did not live with their mothers, which is very unusual, and, on top of that, apparently had little contact - even if your dad has custody you usually know if your mom graduated from high school or not. My point is that you can miss some interesting findings when you just gloss over missing data. Looking forward to your next post.

    • Rick Wicklin

      Thanks for your thoughts and anecdote. I agree that an analyst should look into causes of missingness before blindly proceeding with the analysis.
      Since you mentioned the missing at random (MAR) assumption, I want to add a few thoughts:

      1. The assumption is often used to assess the bias of estimators. For the regression example, I believe that if the X are MAR, then the expected value of the intercept for the imputed variable is the same as the intercept for the missing data. Unfortunately, even if a theorem shows that an estimate is unbiased "on average," for a particular set of data (such as the regression example) the missing X values might correspond to Y values that are larger (or smaller) than expected.
      2. For small data sets, it can be hard to verify whether values are MAR.
      3. Although the MAR assumption can help you assess bias in point estimates, it doesn't change the most damning aspect of mean imputation, which is the shrunken variance estimates.

      Thanks for mentioning MAR. There is much more that can be (and has been!) said on this topic.

  3. It would be nice if the example showed how to do mean imputation (I've used it before, and it may be too late to stop) according to BY variables rather than the mean of the whole data set. Thanks.
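One possible approach to the BY-group question is sketched below. It is not part of the original article. It assumes the Have data set from the example and uses Sex as the grouping variable purely for illustration; the data set names HaveSorted and ImputedBySex are hypothetical. PROC STDIZE accepts a BY statement, so each missing height is replaced by the mean of the observed heights in its own group.

```sas
/* Sketch: BY-group processing requires data sorted by the BY variable */
proc sort data=Have out=HaveSorted;
   by Sex;
run;

/* Sketch: impute within each level of Sex instead of using the overall mean */
proc stdize data=HaveSorted out=ImputedBySex
      oprefix=Orig_         /* keep the original variable as Orig_Height */
      reponly               /* only replace; do not standardize */
      method=MEAN;
   by Sex;
   var Height;
run;
```

Group-wise imputation still has the drawbacks listed in the article: within each group, every imputed value is identical, so the variance within the group is understated.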