Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.
The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.
In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th ordered value. You then take the mean of the new n observations.
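As a compact restatement of the two definitions, the formulas below write x_(1) ≤ x_(2) ≤ ... ≤ x_(n) for the sorted nonmissing values. The order-statistic notation and the subscripts "trim" and "win" are conventions chosen for this sketch, not notation from the text above.

```latex
\begin{align*}
\bar{x}_{\text{trim}} &= \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)} \\
\bar{x}_{\text{win}}  &= \frac{1}{n} \left( (k+1)\,x_{(k+1)}
                         + \sum_{i=k+2}^{n-k-1} x_{(i)}
                         + (k+1)\,x_{(n-k)} \right)
\end{align*}
```

For example, take the ten values 1, 2, 3, 4, 5, 6, 7, 8, 20, 100 with k = 1. The ordinary mean is 156/10 = 15.6. The trimmed mean drops 1 and 100 and gives 55/8 = 6.875. The Winsorized mean replaces 1 with 2 and 100 with 20 and gives 77/10 = 7.7.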
Winsorize data in SAS
In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.
Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:
%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix.
   Input proportion of observations to Winsorize: prop < 0.5.
   Ex: y = Winsorize(x, 0.1) computes the two-sided 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      k = ceil(prop*n);    /* number of obs to Winsorize in each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)];
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)];
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return(w);
finish;

/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close;
winX = Winsorize(X, 0.1);
The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, point the DSName macro variable at a small data set, for example %let DSName = sashelp.class;) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:
/* Compute Winsorized mean, which is mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];
With these data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.
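As a minimal sketch of those follow-on statistics, the following self-contained program applies the SAS/IML MEAN, STD, and CORR functions to a small matrix. The matrix values are hypothetical and merely stand in for the winX matrix computed earlier; they are not results from any data set mentioned in this article.

```sas
proc iml;
/* Hypothetical stand-in for an already-Winsorized data matrix (10 obs, 2 vars).
   In each column the smallest and largest values appear twice, as they
   would after Winsorizing with k=1. */
winX = {  2 11,
          2 12,
          3 11,
          4 15,
          5 14,
          6 18,
          7 17,
          8 21,
         20 19,
         20 21 };
winMean = mean(winX);   /* column means of the Winsorized data */
winStd  = std(winX);    /* column standard deviations of the Winsorized data */
winCorr = corr(winX);   /* Pearson correlation of the Winsorized columns */
print winMean, winStd, winCorr;
quit;
```

These are descriptive statistics of the Winsorized values. Inferential quantities, such as the standard error that PROC UNIVARIATE reports for a Winsorized mean, account for the replaced tails, so they should not be expected to equal a naive computation from std(winX). If the data contain missing values, check how each function treats them before comparing results.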
As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:
/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&dsname winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;

proc print data=winMeans;
   var VarName Mean;
run;
The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.
Some software enables you to "Winsorize" data in an asymmetric manner. Specifically, you can specify probabilities α < 0.5 and β > 0.5 and the software will replace values x < x(α) with x(α) and values x > x(β) with x(β), where x(α) and x(β) are the values of the αth and βth quantiles. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.
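The quantile-based replacement can be sketched in SAS/IML as follows. The module name WinsorizeQntl, its arguments, and the test vector are hypothetical and chosen only for illustration. The sketch relies on the QNTL subroutine's default quantile definition, so the cutoff values depend on that definition.

```sas
proc iml;
/* Hypothetical helper (not from the article): for each column of x, replace
   values below the alpha quantile with that quantile and values above the
   beta quantile with that quantile. Assumes alpha < beta. */
start WinsorizeQntl(x, alpha, beta);
   call qntl(q, x, alpha || beta);   /* q is 2 x p: row 1 = lower, row 2 = upper quantiles */
   w = x;                            /* copy of x */
   do i = 1 to ncol(x);
      z = x[,i];
      lowIdx = loc(z < q[1,i] & z ^= .);   /* nonmissing values below the lower cutoff */
      if ncol(lowIdx) > 0 then w[lowIdx,i] = q[1,i];
      highIdx = loc(z > q[2,i]);           /* values above the upper cutoff */
      if ncol(highIdx) > 0 then w[highIdx,i] = q[2,i];
   end;
   return(w);
finish;

x = {1, 2, 3, 4, 5, 6, 7, 8, 20, 100};   /* hypothetical test data */
w = WinsorizeQntl(x, 0.05, 0.8);
print x w;
quit;
```

Unlike the rank-based module, this version can replace a different number of values in each tail, and the replacement values are quantile estimates rather than observed order statistics. The mean of each column of w is then an asymmetric Winsorized mean.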
However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.