Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.
The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.
In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th largest value. You then take the mean of the new n observations.
Winsorize data in SAS
In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.
Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:
%let DSName = sashelp.heart; proc iml; /* SAS/IML module to Winsorize each column of a matrix. Input proportion of observations to Winsorize: prop < 0.5. Ex: y = Winsorize(x, 0.1) computes the two-side 10% Winsorized data */ start Winsorize(x, prop); p = ncol(x); /* number of columns */ w = x; /* copy of x */ do i = 1 to p; z = x[,i]; /* copy i_th column */ n = countn(z); /* count nonmissing values */ if n < 2 then continue; /* too few observations; leave unchanged */ k = ceil(prop*n); /* number of obs to trim from each tail */ r = rank(z); /* rank values in i_th column */ /* find target values and obs with smaller/larger values */ lowIdx = loc(r<=k & r^=.); lowVal = z[loc(r=k+1)]; highIdx = loc(r>=n-k+1); highVal = z[loc(r=n-k)]; /* Winsorize (replace) k smallest and k largest values */ w[lowIdx,i] = lowVal; w[highIdx,i] = highVal; end; return(w); finish; /* test the algorithm on numerical vars in a data set */ use &DSName; read all var _NUM_ into X[colname=varNames]; close; winX = Winsorize(X, 0.1); |
The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, use %let DSName = sashelp.class;, which is a small data set.) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:
/* Compute Winsorized mean, which is mean of the Winsorized data */ winMean = mean(winX); print winMean[c=varNames f=8.4]; |
With this data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.
As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:
/* Validation: compute Winsorized means by using UNIVARIATE */ ods exclude all; proc univariate data=&dsname winsorized=0.1; ods output WinsorizedMeans=winMeans; run; ods exclude none; proc print data=winMeans; var VarName Mean; run; |
Unsymmetric Winsorization
The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.
Some software enables you to "Winsorize" data in an unsymmetric manner. Specifically, you can specify quantiles α < 0.5 and β > 0.5 and the software will replace values x < x(α) with x(α) and values x > x(β) with x(β), where x(α) is the value of the αth quantile. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.
However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.
12 Comments
Rick,
I think using RANKTIE() is better than RANK() ?
Xia Keshan
No. The goal is to replace replace the k smallest values with the (k+1)st ordered value, so you have to either sort the data or (equivalently) use RANK. You also need to be able to handle the case where x[k]=x[k+1].
Thank you for the very useful code and instructions! I am still a beginner so sorry if my question is silly but if I am working with a data set in which there are identifiers (i.e. company name, ticker, etc) for each observation, is there a way to keep the identifier variables with the corresponding winsorized variables for each observation in the output data set.
Thank you in advance for your time and your help!
Sure. Write the Winsorized data to a SAS data set and merge it with the original data that contains the ID variables.
Thank you very much!
Pingback: Winsorization: The good, the bad, and the ugly - The DO Loop
Someone asked me how to get the Winsorized data into a SAS data set:
Then you can call PROC MEANS to get Winsorized means, standard deviations, percentiles, and so forth:
Thank you Rick. It works just fine.
Hi Rick,
Can you please tell me how to add a code to keep a character variable along with the numeric variables. I am new to SAS IM and I tried a couple of different tricks but I must be missing something. My data is panel data so I need to keep a variable as an ID. My ID variable is currently a numeric one but since your code winsorizes all the numeric variables it does it for my ID variable as well. I thought the best thing is to use my date variable as an ID which is a character variable. any help would be appreciated.
You don't have to read all numeric variables into the matrix. You can specify the variables to read. For example:
If that doesn't answer your question, post a portion of your program to the Support Community for SAS/IML.
Hi Rick,
I was able to use proc univariate to winsorize my variable (which has many outliers) using the below code: How do I save the winsorized variable to a new dataset? I do not have proc iml.
proc univariate data=new.cov_psmd_models_1 winsor=1;
var psmdv1;
ods select BasicMeasures TrimmedMeans WinsorizedMeans;
run;
Thanks so much!
PROC UNIVARIATE computes STATISTICS (mean, standard error, and CI) for the Winsorized data, but it does not output the transformed data.