Winsorization: The good, the bad, and the ugly

13

On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfy the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?

Winsorization: Definition, pitfalls, and alternatives #StatWisdom Click To Tweet

What is Winsorization?

The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.

The graph to right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006) which has an article "Trimming and Winsorization " by David Ruppert (Vol 14, p. 8765). According to the article:

  • Winsorizaion is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal to k2, which means that you are potentially changing more values in one tail than in the other.

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization:

Winsorization: The ugly

If the data contains extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigator of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or]'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.

Summary

Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such the ones in the ROBUSTREG and QUANTREG procedures.

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

13 Comments

  1. Peter Clemmensen on

    Very nice article and different points of views.

    A quick question. You write: "You do the same for the largest values, replacing the k largest values with the (k+1)st largest value."

    Wouldn't this be the (k-1)st largest value that you should replace the k largest values with?

    Regards Peter

    • Rick Wicklin

      It depends from which direction you are counting. I was counting "from the outside in." It is the greatest data value that is less than or equal to the k_th largest value.

    • Rick Wicklin

      Thanks for asking this question, Chris. For the trimmed mean, you EXCLUDE the k largest and k smallest values and compute the mean of the remaining N - 2k values. For the trimmed mean, extreme values have NO EFFECT on the estimate of the mean.

      For the Winsorized mean, you REPLACE the extreme values by another (not as extreme) data value. You then compute the mean of the modified N values. Thus extreme values still have SOME effect on the estimate, but not as large as they did before being modified.

      I think some practitioners prefer to Winsorize data (rather than trimming) because it keeps the number of observations constant. If you replace extreme values by missing values, you get trimmed data, but the missing values wreak havoc on multivariate analyses.

      • I think practitioners prefer Winzorized over trimmed because it keeps the weight of extreme observations in the tails of the distribution and thus has a lesser effect on estimates of scale.

  2. Great post. Love the quotes. In my first stats class I was lucky to have an instructor who repeatedly emphasized the importance of investigating outliers rather than simply discarding them (or blindly including them). He had a couple key examples. One was Bob Beamon's famous long jump at the 1968 Olympics where he set a world record by nearly two feet (http://blog.minitab.com/blog/fun-with-statistics/visualizing-the-greatest-olympic-outlier-of-all-time). The other was a college that reported one year that the average salary for the graduating class was $200,000 (they happened to have an outlier graduate who was signed to a multi-million dollar pro basketball contract).

  3. i am winsorizing some data but having the problem of handling missing values. My codes are replacing most of the missing values. can you suggest some efficient micro or any simple codes which donot replace any missing value rather consider only available values but keep missing values as it is?
    Any help will be appreciated.
    Good day

Leave A Reply

Back to Top