Winsorization: The good, the bad, and the ugly

On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmers want to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfies the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls, and what are some alternative methods?

What is Winsorization?

The process of replacing a specified number of extreme values with a less extreme data value has become known as Winsorization, or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
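
The following SAS/IML statements sketch this count-based definition. This is a minimal illustration, not a production-quality function: it assumes a column vector that contains no missing values.

proc iml;
/* Minimal sketch of the count-based, symmetric Winsorization described
   above. Assumes a column vector with no missing values. */
start WinsorizeK(x, k);
   y = x;                        /* copy the data */
   call sort(y);                 /* sort in ascending order */
   n = nrow(y);
   y[1:k] = y[k+1];              /* replace k smallest with (k+1)st smallest */
   y[(n-k+1):n] = y[n-k];        /* replace k largest with (k+1)st largest */
   return( y );
finish;

x = {2, 5, 3, 100, 7, 4, 6, 1, 8, 9};
WinsorizedMean = mean( WinsorizeK(x, 2) );  /* mean of the modified data */
print WinsorizedMean;                       /* 5.5 for this example */
quit;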

The graph to the right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al., Eds., 2nd Ed., 2006), which has an article "Trimming and Winsorization" by David Ruppert (Vol. 14, p. 8765). According to the article:

  • Winsorization is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces both the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal k2, which means that you potentially change more values in one tail than in the other, as the sketch after this list illustrates.
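
The following SAS/IML sketch (not from the original article, and assuming SAS's default percentile definition) shows how tied data can make a percentile-based rule asymmetric:

proc iml;
/* Sketch: with tied data, percentile cutoffs can flag unequal counts
   in the two tails (assumes the default percentile definition) */
x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
call qntl(q, x, {0.05, 0.95});   /* 5th and 95th percentiles */
k1 = sum( x < q[1] );            /* count of values below the 5th percentile */
k2 = sum( x > q[2] );            /* count of values above the 95th percentile */
print (q`)[colname={"P5" "P95"}], k1 k2;
quit;

For this sample, one value falls below the 5th percentile (k1=1), but because of the ties at the maximum, no value exceeds the 95th percentile (k2=0), so a percentile-based rule modifies the two tails unequally.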

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an asymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization, such as the robust estimation and regression methods in the ROBUSTREG and QUANTREG procedures.
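
For example, the following sketch (the data set and variables are merely placeholders) fits a regression by M estimation with PROC ROBUSTREG, which downweights outliers automatically instead of requiring you to modify the data by hand:

/* Sketch: robust regression via M estimation; Sashelp.Class and its
   variables are used only for illustration */
proc robustreg data=sashelp.class method=m;
   model Weight = Height;   /* outliers are downweighted, not altered */
run;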

Winsorization: The ugly

If the data contain extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigators of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.

Summary

Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods, such as the ones in the ROBUSTREG and QUANTREG procedures.
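
For example, a call like the following (a sketch; the data set and variable are merely illustrative) reports the Winsorized mean for k=2 values in each tail without modifying the data set:

/* Sketch: the WINSORIZED= option reports the Winsorized mean directly;
   WINSORIZED=2 replaces k=2 values in each tail before averaging */
proc univariate data=sashelp.heart winsorized=2;
   var Cholesterol;
run;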

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

38 Comments

  1. Peter Clemmensen

    Very nice article and different points of view.

    A quick question. You write: "You do the same for the largest values, replacing the k largest values with the (k+1)st largest value."

    Wouldn't this be the (k-1)st largest value that you should replace the k largest values with?

    Regards Peter

    • Rick Wicklin

      It depends on which direction you are counting from. I was counting "from the outside in." It is the greatest data value that is less than or equal to the k-th largest value.

    • Christopher Lewis

      I had the same thought. I've never heard of anyone "counting 'from the outside in'". It seems like it would only generate confusion (as it has here) and errors.

      • Rick Wicklin

        I'm sorry that my response was confusing. When you sort the data from smallest to largest, you replace the k largest values with the (k+1)st largest value. For example, if there are 10 data values and k=3, then the first 3 are replaced by x[4] and the last three (x[8], x[9], and x[10]) are replaced by x[7]. The value x[7] is the 4th largest value, not the third largest.

        For a more mathematical description, see the documentation for PROC UNIVARIATE.

    • Rick Wicklin

      Thanks for asking this question, Chris. For the trimmed mean, you EXCLUDE the k largest and k smallest values and compute the mean of the remaining N - 2k values. For the trimmed mean, extreme values have NO EFFECT on the estimate of the mean.

      For the Winsorized mean, you REPLACE the extreme values by another (not as extreme) data value. You then compute the mean of the modified N values. Thus extreme values still have SOME effect on the estimate, but not as large as they did before being modified.

      I think some practitioners prefer to Winsorize data (rather than trimming) because it keeps the number of observations constant. If you replace extreme values by missing values, you get trimmed data, but the missing values wreak havoc on multivariate analyses.
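
      For example, the following sketch (the data set and variable are only placeholders) requests both statistics from PROC UNIVARIATE so that you can compare them:

      /* Sketch: compare the trimmed and Winsorized means for k=2 values
         in each tail. TRIMMED= excludes the extremes; WINSORIZED=
         replaces them and keeps the number of observations constant */
      proc univariate data=sashelp.heart trimmed=2 winsorized=2;
         var Cholesterol;
      run;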

      • I think practitioners prefer Winsorized over trimmed because it keeps the weight of extreme observations in the tails of the distribution and thus has a lesser effect on estimates of scale.

  2. Great post. Love the quotes. In my first stats class I was lucky to have an instructor who repeatedly emphasized the importance of investigating outliers rather than simply discarding them (or blindly including them). He had a couple key examples. One was Bob Beamon's famous long jump at the 1968 Olympics where he set a world record by nearly two feet (http://blog.minitab.com/blog/fun-with-statistics/visualizing-the-greatest-olympic-outlier-of-all-time). The other was a college that reported one year that the average salary for the graduating class was $200,000 (they happened to have an outlier graduate who was signed to a multi-million dollar pro basketball contract).

  3. I am Winsorizing some data but am having trouble handling missing values. My code is replacing most of the missing values. Can you suggest an efficient macro or some simple code that does not replace any missing values, but instead uses only the available values and keeps the missing values as they are?
    Any help will be appreciated.
    Good day

  4. A couple of (hopefully) amusing anecdotes.

    1. A colleague of mine working in toxicology in the pharmaceutical industry was always trying to identify the unusual cases, where the drug caused extremely bad reactions. He said, "All this talk of truncation and Winsorization is backwards. In my work, I throw away all the 'good' data and study the outliers."

    2. Another colleague (not a statistician) was a consultant for a medical device start-up company and was seeking advice from me. His client company was making a device that purports to measure the same thing as electroencephalograph (EEG) data on the human heart, but one that can be used easily at home to give early warning signals of heart problems in people with heart conditions. The gold standard is the EEG, but the patient has to be in the medical office hooked up to electrodes to get such data. The medical device company has lots of data comparing the EEG with the device, and found that most of the time, the device works very well, giving measurements within 1% of the gold standard EEG. However, on the rare occasions when the heart is unusually stressed, the device misses the EEG target badly. My colleague had heard that in some academic disciplines it is considered "OK" to delete or Winsorize the outliers. So he deleted the rare occasions where the device badly missed the EEG target. After deletion, he found there were no cases where the device performed badly. Based on that analysis, he was thinking of recommending to his client company that they continue with product development, "as is"!

    • Marco Stamazza

      Funny story; I wouldn't be surprised to learn that the same approach has been used elsewhere. Just a detail: he wanted to use electroencephalograph (EEG) data for the heart? The EEG is about the head; electrocardiogram (ECG), maybe?

  5. I am thinking of Winsorising at the Tukey outlier bounds, i.e., replacing all outliers and extreme values (on both sides of the distribution): values below Q1 - 1.5*IQR are replaced by Q1 - 1.5*IQR, and values above Q3 + 1.5*IQR are replaced by Q3 + 1.5*IQR. That way the Tukey outliers are not discarded but downweighted. I wonder what the influence of such a "Winsorisation" would be on the new mean and its precision.

  6. Thomas Wolff

    Hey! Great article, that really helps a lot!

    Do you have any recommendations on which percentiles should be used for winsorizing? Or how to find out which percentile to use?

    Thanks!

    • Rick Wicklin

      My advice is:
      1. Use robust statistics, rather than Winsorization, when possible.
      2. If you must Winsorize, use the smallest percentile that eliminates the problematic outliers. I feel more comfortable with using the 0.01 quantile (1% in each tail) than using the 0.05 quantile (5% in each tail).
      3. Realize that when you Winsorize, you obtain statistics that have less variance than the true data. That means any inferences you make (for example, confidence intervals) are likely to be too narrow.

  7. Virginia Wesner

    Thank you for your wonderful explanation. I'm looking at the CMS proposed bundled payment project for radiation and noted that they use Winsorization as part of their model. It was such a pleasure to find something that quickly described the process and explained its limitations and good points. Now I can explain to my team, in layman's terms, what Winsorization means, how it affects the model, and why it is important that all of us have a basic understanding of the way the calculations are being done.

  8. Hi Rick, thank you for a very informative post, as always. I have a variable that will be log-transformed for regression. Should I Winsorize the variable before or after the log transformation? Thanks in advance!

    • Rick Wicklin

      Assuming that all values are positive, it doesn't matter. If you Winsorize and then take the log, you'll get the same values as if you take the log and then Winsorize. In fact, that result is true for any monotonic increasing transformation of the data.
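
      Here is a quick numerical check (a sketch that redefines a simple count-based Winsorizing function; it assumes complete, positive data):

      proc iml;
      /* Sketch: verify that Winsorizing commutes with the log transform.
         Assumes positive, nonmissing data. */
      start WinsorizeK(x, k);
         y = x;  call sort(y);  n = nrow(y);
         y[1:k] = y[k+1];  y[(n-k+1):n] = y[n-k];
         return( y );
      finish;

      x = {2, 5, 3, 100, 7, 4, 6, 1, 8, 9};
      diff = max( abs( WinsorizeK(log(x), 2) - log(WinsorizeK(x, 2)) ) );
      print diff;   /* 0: the two orders of operations agree */
      quit;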

    • Rick Wicklin

      Using the term "z-score" makes me nervous. It is true that you can standardize any variable by subtracting its mean and dividing the centered data by the standard deviation. However, if you use a robust estimate of the mean and a robust estimate of the standard deviation, you obtain a standardized variable that has neither mean 0 nor standard deviation 1. Therefore it is not a z-score in the classical sense.

      Furthermore, the term "z-score" is usually used when the distribution is normal. The normal CDF and quantiles are used with z-scores to obtain probabilities and to test hypotheses. Standardized values from non-normal distributions are not usually called z-scores, and definitely not if they do not have mean=0 and std dev=1.

  9. Thank you for your post. This question is related to above questions:

    I need to take logs and difference a panel data set. Then I standardize (I typically demean both the cross section and the time series, and then add back the grand mean so that the data are centered in both dimensions), because I will be running principal component analysis to extract factors.

    At the moment, I winsorize before all steps listed above. Do you believe I should winsorize after PCA?

  10. Quick question: would it make any sense to run PROC ROBUSTREG in SAS on just the intercept term and retain the observations that are identified as outliers using, say, the M estimation method, as opposed to using the MAD or IQR approaches? Thanks for any thoughts!

  11. I'd welcome your thoughts on this situation: We have a dataset of patients with a rare condition (n=~10 for each of 2 groups). The participants completed a battery of standardized tests that use a mean of 100 and a standard deviation of 15. In some cases, participants did not pass the 'practice items', so they did not start the task, and therefore there was no raw data to convert to a standard score. This resulted in 'missing' data throughout the dataset for different tests. We want to represent the performance of these participants in these cases. I was advised to enter the lowest standard score (e.g., 55) in these cases to be able to represent them and therefore have a full dataset. My questions are: (1) Is this a version of Winsorizing if the missing data were replaced with a stand-in (such as a standard score of 55), recognizing that we do not change scores on the other end (since the rationale does not extend to that group)? (2) If this is statistically sound, are there citations or papers you could share so we could refer to them? (3) If this is statistically sound, would you recommend that we use the same score across standardized tests (e.g., 55), or look up the exact lowest possible standard score for each person based on their age and subtest?

      • Thanks for taking the time to respond! You have great resources and I've been learning a lot. I actually *did* get the advice to use the approach I mentioned from a senior statistician who often works with clinical scientists and has seen the method used in published work and among colleagues. I look forward to learning more about the different approaches that different researchers use so that I can find the best one. I can cross Winsorization off the list of options!

  12. Thanks for the article! How would you then handle an extremely right-skewed dataset with several close to zero but nonzero observations?

      • Rick Wicklin

        I don't know. The article is from a collection of essays.
        Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University Press, Stanford, 448-485.
