How to Winsorize data in SAS

8

Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.

The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.

In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th largest value. You then take the mean of the new n observations.

Winsorize data in SAS

In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.

Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:

%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix. 
   Input proportion of observations to Winsorize: prop < 0.5. 
   Ex:  y = Winsorize(x, 0.1) computes the two-side 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      k = ceil(prop*n);    /* number of obs to trim from each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)]; 
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)]; 
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;
 
/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);

The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, use %let DSName = sashelp.class;, which is a small data set.) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:

/* Compute Winsorized mean, which is  mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];
t_winsorize

With this data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.

As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:

/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;
 
proc print data=winMeans;
   var VarName Mean;
run;

Unsymmetric Winsorization

The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.

Some software enables you to "Winsorize" data in an unsymmetric manner. Specifically, you can specify quantiles α < 0.5 and β > 0.5 and the software will replace values x < x(α) with x(α) and values x > x(β) with x(β), where x(α) is the value of the αth quantile. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.

However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

8 Comments

    • Rick Wicklin

      No. The goal is to replace replace the k smallest values with the (k+1)st ordered value, so you have to either sort the data or (equivalently) use RANK. You also need to be able to handle the case where x[k]=x[k+1].

  1. Thank you for the very useful code and instructions! I am still a beginner so sorry if my question is silly but if I am working with a data set in which there are identifiers (i.e. company name, ticker, etc) for each observation, is there a way to keep the identifier variables with the corresponding winsorized variables for each observation in the output data set.

    Thank you in advance for your time and your help!

  2. Pingback: Winsorization: The good, the bad, and the ugly - The DO Loop

  3. Rick Wicklin

    Someone asked me how to get the Winsorized data into a SAS data set:

    winX = Winsorize(X, 0.1);
    create WinData from winX[colnama=varNames];
    append from winX;;
    close;

    Then you can call PROC MEANS to get Winsorized means, standard deviations, percentiles, and so forth:

    proc means data=WinData;
    run;

Leave A Reply

Back to Top