How to Winsorize data in SAS

Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.

The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.

In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th largest value. You then take the mean of the new n observations.

Winsorize data in SAS

In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.

Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:

%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix. 
   Input proportion of observations to Winsorize: prop < 0.5. 
   Ex:  y = Winsorize(x, 0.1) computes the two-side 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      if n < 2 then continue; /* too few observations; leave unchanged */
      k = ceil(prop*n);    /* number of obs to trim from each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)]; 
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)]; 
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;
 
/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, use %let DSName = sashelp.class;, which is a small data set.) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:

/* Compute Winsorized mean, which is  mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

With this data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.

As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:

/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
 
proc print data=winMeans;
   var VarName Mean;
run;

Unsymmetric Winsorization

The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.

Some software enables you to "Winsorize" data in an unsymmetric manner. Specifically, you can specify quantiles α < 0.5 and β > 0.5 and the software will replace values x < x_(α) with x_(α) and values x > x_(β) with x_(β), where x_(α) is the value of the αth quantile. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.

However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.

12 Comments

Ksharp on July 15, 2015 9:25 pm

Rick,
I think using RANKTIE() is better than RANK() ?

Xia Keshan

- Rick Wicklin on July 16, 2015 7:25 am
  
  No. The goal is to replace replace the k smallest values with the (k+1)st ordered value, so you have to either sort the data or (equivalently) use RANK. You also need to be able to handle the case where x[k]=x[k+1].
  
Liz on October 31, 2016 6:26 pm

Thank you for the very useful code and instructions! I am still a beginner so sorry if my question is silly but if I am working with a data set in which there are identifiers (i.e. company name, ticker, etc) for each observation, is there a way to keep the identifier variables with the corresponding winsorized variables for each observation in the output data set.

Thank you in advance for your time and your help!

- Rick Wicklin on November 1, 2016 5:41 am
  
  Sure. Write the Winsorized data to a SAS data set and merge it with the original data that contains the ID variables.
  
  - Liz on November 1, 2016 11:12 am
    
    Thank you very much!
    
Pingback: Winsorization: The good, the bad, and the ugly - The DO Loop
Rick Wicklin on January 25, 2019 6:04 am
Someone asked me how to get the Winsorized data into a SAS data set:
winX = Winsorize(X, 0.1); create WinData from winX[colnama=varNames]; append from winX;; close;
Then you can call PROC MEANS to get Winsorized means, standard deviations, percentiles, and so forth:
proc means data=WinData; run;
Ali on January 25, 2019 10:34 am

Thank you Rick. It works just fine.

Yasmine on January 23, 2020 12:35 am

Hi Rick,
Can you please tell me how to add a code to keep a character variable along with the numeric variables. I am new to SAS IM and I tried a couple of different tricks but I must be missing something. My data is panel data so I need to keep a variable as an ID. My ID variable is currently a numeric one but since your code winsorizes all the numeric variables it does it for my ID variable as well. I thought the best thing is to use my date variable as an ID which is a character variable. any help would be appreciated.

- Rick Wicklin on January 23, 2020 5:48 am
  You don't have to read all numeric variables into the matrix. You can specify the variables to read. For example:
  read all var {x y z w} into X; /* create a matrix with 4 columns */ read all var {ID}; /* create a vector from the ID variable */
  If that doesn't answer your question, post a portion of your program to the Support Community for SAS/IML.
  
Lassana on February 3, 2020 11:51 am

Hi Rick,
I was able to use proc univariate to winsorize my variable (which has many outliers) using the below code: How do I save the winsorized variable to a new dataset? I do not have proc iml.

proc univariate data=new.cov_psmd_models_1 winsor=1;
var psmdv1;
ods select BasicMeasures TrimmedMeans WinsorizedMeans;
run;

Thanks so much!

- Rick Wicklin on February 3, 2020 11:57 am
  
  PROC UNIVARIATE computes STATISTICS (mean, standard error, and CI) for the Winsorized data, but it does not output the transformed data.

Blogs

Blogs

How to Winsorize data in SAS

Winsorize data in SAS

Unsymmetric Winsorization

About Author

12 Comments

Leave A Reply Cancel Reply

Follow Us

What is...