Odds ratio plots with a logarithmic scale in SAS

I recently read an argument by Andrew Wheeler for using a logarithmic axis for plotting odds ratios. I found his argument convincing. Accordingly, this blog post shows how to create an odds ratio plot in SAS where the ratio axis is displayed on a log scale. Thanks to Bob Derr who read and commented on an early draft of this article.

As the name implies, the odds ratio is a ratio of two odds. You can look up a detailed explanation, but essentially it is the odds of an event occurring in one group divided by the odds of it occurring in another group. The odds ratio is always positive, and an odds ratio of 1 means that the odds of the event occurring in the two groups are the same.
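To make the definition concrete, here is a small Python sketch that computes an odds ratio from a 2×2 table of counts. (The counts here are hypothetical and are not from the SAS example later in this post.)

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
         group 1: a events, b non-events
         group 2: c events, d non-events"""
    return (a / b) / (c / d)

# Hypothetical counts: 40 of 50 females have the disease (odds 40/10),
# 20 of 50 males have it (odds 20/30)
print(odds_ratio(40, 10, 20, 30))   # 6.0
# Reversing the comparison groups gives the reciprocal ratio:
print(odds_ratio(20, 30, 40, 10))   # 1/6
```

Notice that swapping the two groups inverts the ratio, which is the asymmetry discussed next.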

When plotting an odds ratio, the relevant fact is that it is a ratio. A ratio is not symmetric, and reversing the comparison group results in the reciprocal of the ratio. For example, suppose the odds ratio of a disease is 10 when comparing females to males. This means that the odds of getting the disease for females is 10 times greater than for males. However, it is just as correct to say that the odds ratio is 0.1 when you reverse the groups and compare males to females.

On a linear scale, the distance between 0.1 and 1 appears much smaller than the distance between 1 and 10. However, on a log scale, the distance between 10⁻¹ and 10⁰ (= 1) is the same as the distance between 10⁰ and 10¹; on a log10 scale, both distances are 1. It is only by using a log scale that you can visually compare the magnitudes of confidence intervals and standard errors in an odds ratio plot.
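The symmetry of reciprocal ratios on a log scale is easy to verify numerically; for instance, in Python:

```python
import math

# On a linear scale, the distances of 0.1 and 10 from 1 differ by a factor of 10:
print(abs(1 - 0.1), abs(10 - 1))              # 0.9 vs 9.0
# On a log10 scale, the two distances are identical:
print(abs(math.log10(0.1) - math.log10(1)),
      abs(math.log10(10) - math.log10(1)))    # 1.0 vs 1.0
```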

The following example is from a SAS Note about estimating odds ratios. In the following example, patients with one of two diagnoses (complicated or uncomplicated) are treated with one of three treatments (A, B, or C) and the result (cured or not cured) is observed:

data uti;
   input diagnosis : $13. treatment $ response $ count @@;
   datalines;
complicated    A  cured 78  complicated   A not 28
complicated    B  cured 101 complicated   B not 11
complicated    C  cured 68  complicated   C not 46
uncomplicated  A  cured 40  uncomplicated A not 5
uncomplicated  B  cured 54  uncomplicated B not 5
uncomplicated  C  cured 34  uncomplicated C not 6
;
run;

proc logistic data=uti plots(only)=oddsratio;
   freq count;
   class diagnosis treatment / param=glm;
   model response(event="cured") = diagnosis treatment diagnosis*treatment;
   oddsratio treatment;
   oddsratio diagnosis;
   ods output OddsRatiosWald=ORPlot;  /* save data for later use */
run;

The default odds ratio plot is shown. Five estimates are less than 1 and four are greater than 1. Four confidence intervals intersect 1, which indicates ratios that are not significantly different from 1.

These intervals are not adjusted for multiple comparisons, so you really shouldn't compare their lengths, but many people use the length of a confidence interval to visualize uncertainty in an estimate, and comparisons are inevitable. From looking at the graph, a casual reader might think that the uncertainty for the third ratio (treatment B vs C at diagnosis=complicated) is the biggest. However, this initial impression ignores the fact that the confidence intervals squished into the interval (0,1] might span several orders of magnitude!

Plotting the odds ratios on a log scale automatically

Several SAS procedures enable you to specify a log scale directly in the procedure syntax. For example, the LOGISTIC, GLIMMIX, and FREQ procedures support the LOGBASE=10 suboption of the PLOTS=ODDSRATIO option, which generates the plot automatically, as follows:

proc logistic plots=oddsratio(logbase=10); /* specify log scale */

The new odds ratio plot (click to enlarge) displays exactly the same data, but uses a log scale. In the second graph you can see that the confidence interval for the third item is no longer the widest. This plot presents a more faithful visual description of the uncertainty associated with each estimate, regardless of whether the estimate is less than 1 or greater than 1.

Comparing estimates is also much improved. In the first plot, the sixth ratio (treatment B vs C at diagnosis=uncomplicated), which has the value 1.9, seems to be the second most extreme estimate. In the second plot, you can see that the first and last estimates are more extreme (further from 1) than the sixth estimate. For example, the last estimate is about 0.26, which is equivalent to the inverse ratio 1/0.26 ≈ 3.8, which is much greater than 1.9. That fact was not evident in the first plot.

Plotting the odds ratios on a log scale manually

If you compute the odds ratio and confidence limits in a DATA step or in a procedure that does not support odds ratio plots, you can use the SGPLOT procedure to create the odds ratio plot with a logarithmic axis. You can use the SCATTER statement to plot the estimates and the XERRORLOWER= and XERRORUPPER= options to plot the confidence intervals. You can use the TYPE=LOG option on the XAXIS statement to change the scale of the axis. The following PROC SGPLOT statement plots the data in the ORPlot data set, which was created by the ODS OUTPUT statement during the first call to PROC LOGISTIC. The names of the data set variables are self-explanatory:

title "Odds Ratios with 95% Wald Confidence Limits";
proc sgplot data=ORPlot noautolegend;
   scatter y=Effect x=OddsRatioEst / xerrorlower=LowerCL xerrorupper=UpperCL;
   refline 1 / axis=x;
   xaxis grid type=log label="Odds Ratio (log scale)";  /* specify log scale */
   yaxis grid display=(nolabel) discreteorder=data reverse;
run;

The graph is essentially the same as the one produced by PROC LOGISTIC and is not shown. This same technique can be used to create forest plots in SAS.

Create an odds ratio plot with a log scale? You decide!

Should you use a log-scale axis for an odds ratio plot? It depends on your target audience. I recommend it when you are presenting results to a mathematically sophisticated audience.

For other audiences, it is less clear whether the advantages of a log scale outweigh the disadvantages. Odds ratio plots are used heavily in medical research, such as to report the results of clinical trials. Are the average medical practitioner and healthcare administrator comfortable enough with log scales to justify using them?

What do you think? If you use odds ratio plots in your work, which version do you prefer? SAS software can easily produce both kinds of odds ratio plots, so the choice is yours.


Convert a vector to a string

Base SAS contains many functions for processing strings, and you can call these functions from within a SAS/IML program. However, sometimes a SAS/IML programmer needs to process a vector of strings. No problem! You can call most Base SAS functions with a vector of parameters.

I have previously written about how to convert a string into a character vector, but sometimes it is useful to do the reverse operation. Specifically, it is sometimes useful to create a single string from the elements in a SAS/IML character vector or matrix. The following SAS/IML module concatenates all of the elements in a SAS/IML matrix into a single string, stripping out blank characters at the beginning or end of each element. The module contains an optional argument that you can use to specify a delimiter between words:

proc iml;
/* Given a matrix, this module returns a string with 
   the N elements separated by spaces or a user-specified delimiter. */
start MakeString(vv, delim=" ");
   v = rowvec(vv);
   if type(v)='N' then v = putn(v,"BEST6."); /* convert numbers to char */
   N = ncol(v);
   s = "";
   do i = 1 to N-1;
      s = s + strip(v[i]) + delim;
   end;
   return ( s + strip(v[N]) );
finish;

/* Test the module. You might want to use ODS LISTING for printing because 
   HTML does not display multiple blank characters */
varNames = "x1":"x5";
s1 = MakeString(varNames);
x = {"I" "think" "that" "I" "shall" "never" "see"};
s2 = MakeString(x, "/");
colors = {"Red" "Blue" "Green"};
s3 = MakeString(colors, ", ");
num = 1:5;
s4 = MakeString(num, ",");
print s1, s2, s3, s4;

It is difficult to eliminate the DO loop in the module and truly vectorize this process. The problem is that the SAS/IML language uses fixed-length arrays, so every time you try to trim a vector or extract substrings, the resulting IML vector is re-padded with blanks!

However, if the strings all have the same prefix, as in the varNames example, then you can write some special code that uses the prefix information to eliminate the loop. The ROWCATC function concatenates the elements into a single string without any blanks. The TRANWRD function in Base SAS replaces each prefix "x" with " x", which adds a space between the variable names in this special case:

/* ROWCATC concatenates into a single string without blanks;
   Use TRANWRD to put blank before each "x" character */
s5 = strip(tranwrd(rowcatc(VarNames), "x", " x"));
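For readers outside of SAS/IML, the effect of the MakeString module and the prefix trick can be sketched in a few lines of Python, where the built-in join method does the heavy lifting (the names here mirror the IML example):

```python
def make_string(elems, delim=" "):
    """Concatenate elements into one string, stripping blanks
       (a sketch of the SAS/IML MakeString module above)."""
    return delim.join(str(e).strip() for e in elems)

var_names = ["x1", "x2", "x3", "x4", "x5"]
print(make_string(var_names))                     # x1 x2 x3 x4 x5
print(make_string(["Red", "Blue", "Green"], ", "))  # Red, Blue, Green

# The prefix trick: concatenate without blanks, then insert a blank
# before each "x" prefix and strip the leading blank
s5 = "".join(var_names).replace("x", " x").lstrip()
print(s5)                                         # x1 x2 x3 x4 x5
```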

I've used this "vector to string" function several times, most recently when I was reading a parameter estimates table that was produced by a SAS procedure. I needed to print a single string that contained the name of the variables in the model, and this module made it easy.


Wealth and winning in NC high school athletics

The Raleigh News & Observer published a front-page article about the effect of wealth and poverty on high school athletics in North Carolina. In particular, the article concluded that "high schools with a high percentage of poor students rarely win titles in the so-called country club sports—tennis, golf and swimming—and the number of sports in which the more affluent dominate is growing."

The article analyzed state championships from 2001 through spring 2013 and concluded that "the percentage of students receiving free or reduced-price meals, which is an indication of students living in need, is a predictor of high school athletic success." The authors presented data on the number of state titles won in 10 sports, including baseball, men's basketball, football, volleyball, and the "country-club sports" for men and women. In percentage terms, the published results were as follows:


According to these statistics, almost 36% of state titles over 13 years were won by relatively affluent schools: those for which less than 20% of the students are eligible for free or reduced meals. Another 33% of titles were won by moderately affluent schools that had between 20% and 40% of "needy" students. In contrast, only 1% of titles were won by high schools for which 80% or more of the students are needy.

Upon first glance, these numbers seem to back up the author's claims. But do they? The story leaves out an important fact: what is the percentage of NC high schools in each of those five categories? After all, if 36% of NC schools are affluent and only 1% are impoverished, these numbers merely reflect the distribution of schools within the state.

The NC Department of Public Instruction provides data for the 2013-2014 school year on free and reduced meals at all NC public schools. I restricted the list to high schools and used SAS software to tabulate the percentage of public schools in each category. You can download the CSV file with the school-level data and the SAS program that I used to analyze the data. The following table summarizes the results.


In 2013, only 5% of public high schools in NC were "affluent," as determined by having 20% or fewer students qualify for free or reduced meals. In contrast, 8% of NC high schools had 80% or more of the students qualify.

The data indicate that the advantage of wealth (and disadvantage due to poverty) is even more pronounced than the article indicates. The affluent schools won 613% more state titles than might be predicted based on their relative numbers. The schools with the highest percentage of poor students won 87% fewer titles than their relative numbers would indicate.
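The arithmetic behind those percentages is straightforward: compare each group's share of titles with its share of schools. The Python sketch below uses the rounded figures quoted in this post, so the result for the affluent group comes out near, but not exactly equal to, the 613% computed from the unrounded data:

```python
def pct_over_expected(observed_pct, expected_pct):
    """Percent by which a group's observed share of titles exceeds
       (or falls short of) its share of schools."""
    return (observed_pct / expected_pct - 1) * 100

# Rounded figures from the text: affluent schools won ~36% of titles
# but are only ~5% of schools; the poorest schools won ~1% of titles
# but are ~8% of schools.
print(pct_over_expected(36, 5))   # about +620% (unrounded data give 613%)
print(pct_over_expected(1, 8))    # -87.5%, i.e., ~87% fewer titles
```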

In short, affluent schools win many more state titles in NC than their relative numbers might suggest, and the percentage of students receiving free or reduced-price meals is a highly significant factor in predicting state athletic titles in the 10 sports listed by the N&O.

Of course, the situation is more complex than merely counting the number of schools in each category. Larger schools have more students to draw from. Suburban schools might be located closer to training facilities that offer off-season clubs and clinics. The more affluent schools might be able to recruit and retain better coaches.

With more data and more statistics, you could build better statistical models, but I think the results are clear. For NC high school athletics, wealth and winning are highly correlated.


The relationship between toothlessness and income

My colleague Robert Allison finds the most interesting data sets to visualize! Yesterday he posted a visualization of toothless seniors in the US. More precisely, he created graphs that show the estimated prevalence of adults (65 years or older) who have had all their natural teeth extracted. The dental profession calls these people edentulous. According to the CDC, about 20% of seniors (more than 35 million Americans) are edentulous.

When I looked at his sorted bar chart, I noticed that the states that had the most toothless seniors seemed to be poorer states such as West Virginia, Kentucky, Tennessee, Mississippi, and Louisiana. In contrast, richer states such as Colorado, Connecticut, and Hawaii had a relatively small number of toothless seniors. I wondered whether there was a correlation between median income in a state and the number of edentulous individuals.

Rob always publishes his SAS programs, so I used his data and merged it with the state median household income (2-year-average medians) as reported by the US Census Bureau. Then I used the SGPLOT procedure to plot the incidence of toothlessness (with lower and upper confidence intervals) versus the median state incomes. I also added a loess curve as a visual smoother to the data, as follows:

title "All Teeth Extracted vs. Median Income";
proc sgplot data=all;
scatter x=income y=pct / yerrorlower=LCL yerrorupper=UCL datalabel=N;
loess x=income y=pct;

The resulting graph (click to enlarge) shows a strong negative linear correlation (ρ = -0.63) between toothlessness and median income, and the loess curve indicates that the relationship is stronger for states in which the median income is less than $50,000. The confidence intervals indicate that most of the data is well approximated by the loess fit, but there are a few outliers.

Two states in the upper left corner of the graph (West Virginia and Kentucky) have incidences of edentulism that are much higher than suggested by the model that uses only median household income. Several states—including Montana, Florida, and Hawaii—have a much lower incidence of tooth extraction. For easy identification of the states on the scatter plot, you can create a second scatter plot that does not contain the confidence limits and instead displays the state names as labels.


Like Rob, I always post the full SAS code that creates the analyses and graphs in my blog posts, so feel free to play with the data and create more visualizations.

And regardless of your income or state of residence, brush your teeth!


A new method to simulate the triangular distribution


The triangular distribution has applications in risk analysis and reliability analysis. It is also a useful theoretical tool because of its simplicity. Its density function is piecewise linear. The standardized distribution is defined on [0,1] and has one parameter, 0 ≤ c ≤ 1, which determines the peak of the density. The histogram at left shows 10,000 random draws from a triangular distribution with c=0.25, overlaid with the triangular density function.

Because the triangular distribution is so simple, you can write down an explicit formula for its cumulative and inverse cumulative distribution function. The density is piecewise linear, so the cumulative distribution function (CDF) is piecewise quadratic. The inverse CDF is expressed in terms of a piecewise function that contains square roots.
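As a concrete check of that claim, here is a Python sketch of the piecewise formulas for the standardized distribution. The CDF is quadratic on each piece, so the inverse CDF involves square roots, and the round trip F⁻¹ followed by F recovers the original probability:

```python
import math

def tri_cdf(x, c):
    """CDF of the standardized triangular distribution on [0,1] with mode c."""
    if x <= c:
        return x * x / c
    return 1 - (1 - x) ** 2 / (1 - c)

def tri_inv_cdf(u, c):
    """Inverse CDF: each quadratic piece inverts to a square root.
       Note that F(c) = c, which determines which piece applies."""
    if u <= c:
        return math.sqrt(c * u)
    return 1 - math.sqrt((1 - c) * (1 - u))

c = 0.25
for u in (0.05, 0.25, 0.5, 0.95):
    assert math.isclose(tri_cdf(tri_inv_cdf(u, c), c), u)
```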

I assumed that everything about this simple distribution has been known for 100 years. Therefore I was surprised to discover a 2009 paper titled "A new method to simulate the triangular distribution" (Stein and Keblis).

In the short paper, the authors note that the triangular density can be expressed as a linear combination of two other densities: a linearly decreasing density on [0,1] and a linearly increasing density on [0,1]. They then note that the decreasing density is the density of the random variable MIN(u, v), where u and v are independent uniform random variates. Similarly, the increasing density is the density of the random variable MAX(u, v). Consequently, they realized that you can simulate a random variate from the triangular distribution by generating u and v and setting x = (1-c)*MIN(u,v) + c*MAX(u,v). They call this the MINMAX method for generating x.
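The MINMAX method is just as short in any language. This Python sketch draws variates and checks the sample mean against the theoretical mean of the triangular distribution on [0,1] with mode c, which is (1+c)/3:

```python
import random

random.seed(12345)

def rand_triangle_minmax(c):
    """Stein-Keblis MINMAX method: a triangular variate with mode c
       from two independent uniform variates."""
    u, v = random.random(), random.random()
    return (1 - c) * min(u, v) + c * max(u, v)

c = 0.25
n = 200_000
mean = sum(rand_triangle_minmax(c) for _ in range(n)) / n
print(mean, (1 + c) / 3)   # sample mean should be close to 0.41667
```

The theoretical mean follows directly from the construction: E[MIN] = 1/3 and E[MAX] = 2/3 for two independent uniforms, so E[x] = (1-c)/3 + 2c/3 = (1+c)/3.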

The RAND function in the DATA step and the RANDGEN function in SAS/IML have built-in support for the triangular distribution. Nevertheless, it is fun to implement new algorithms, so let's compare the MINMAX algorithm by Stein and Keblis to the traditional simulation algorithm, which is the inverse CDF method.

You can use the DATA step to implement the MINMAX algorithm, but I will use SAS/IML because it is easier to compare the performance of algorithms. The following program computes one million random variates from the triangular distribution in three ways: by calling the RANDGEN subroutine, by using the SAS/IML language to implement the inverse CDF algorithm, and by implementing the MINMAX algorithm by Stein and Keblis.

proc iml;
call randseed(123456);
N = 1e6;
c = 0.25;
time = j(1,3,.);
/* Built-in method: Call RAND or RANDGEN directly */
t0 = time();
   x = j(N,1);
   call randgen(x, "Triangle", c);
time[1] = time()-t0;
/* Inverse CDF method */
t0 = time();
   u = j(N,1);
   call randgen(u, "Uniform");
   x = choose(u<=c, sqrt(c*u), 1 - sqrt((1-c)*(1-u)));
time[2] = time()-t0;
/* MINMAX method: Stein and Keblis (2009) */
t0 = time();
   u = j(N,2);
   call randgen(u, "Uniform");
   x = (1-c)*u[,><] + c*u[,<>];       /* (1-c)MIN + c*MAX */
time[3] = time()-t0;
print N[F=comma9.] time[c={"Rand" "InvCDF" "MinMax"} F=6.4 L=""];

The table shows that you can simulate one million random variates from the triangular distribution in 0.02 seconds by calling the RANDGEN subroutine. If you implement the inverse CDF method "by hand," it takes about 0.06 seconds. If you implement the MINMAX algorithm of Stein and Keblis, that algorithm also takes about 0.06 seconds.

You can draw a few conclusions from this exercise:

  • The SAS/IML language is fast. This is why researchers often use SAS/IML for simulation studies. In my book Simulating Data with SAS I show how to implement simulation studies efficiently in SAS/IML. Even the most complicated simulation in the book completes in less than 30 seconds.
  • SAS/IML is versatile. It is easy to implement new algorithms in the language.
  • SAS/IML is compact. As a matrix language, you can implement an entire algorithm with a few SAS/IML statements.
  • SAS/IML is convenient. Not only was I able to implement two different simulation algorithms, but I was able to easily compare the performance of the algorithms.

I guess another lesson from this exercise is that there is always something new to learn...even about a simple distribution that has been studied for hundreds of years!


Create a density curve with shaded tails

A SAS programmer wanted to plot the normal distribution and highlight the area under curve that corresponds to the tails of the distribution. For example, the following plot shows the lower decile shaded in blue and the upper decile shaded in red.


An easy way to do this in SAS is to use the BAND statement in PROC SGPLOT. The same technique enables you to shade the area under a curve between two values, such as the central 80% of a distribution or the domain of integration for an integral.

For a parametric distribution such as the standard normal distribution, you can use a DATA step to evaluate the probability density function (PDF) at evenly spaced X values. The plot of the PDF versus X is the density curve. To use the BAND statement, you need to compute two other variables. The left tail consists of the set of X and PDF values for which X is less than some cutoff. For example, the lower decile cutoff is -1.28155, which is the value xlow for which Φ(xlow) = 0.1, where Φ is the standard normal cumulative distribution function (CDF). The following DATA step uses the QUANTILE function to compute the quantile (inverse CDF) values. (To remind yourself what these probability functions do, see the article "Four essential functions for statistical programmers.")
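The quantile computation is the same in any language. For instance, in this Python sketch the standard library's statistics.NormalDist plays the role that the QUANTILE function plays in the SAS DATA step below:

```python
from statistics import NormalDist

nd = NormalDist()                # standard normal distribution
low  = nd.inv_cdf(0.10)          # lower decile cutoff
high = nd.inv_cdf(0.90)         # upper decile cutoff
print(round(low, 5), round(high, 5))   # -1.28155 1.28155
# Sanity check: the CDF at the cutoff recovers the probability
print(round(nd.cdf(low), 4))           # 0.1
```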

data density;
low  = quantile("Normal", 0.10);   /* 10th percentile = lower decile */
high = quantile("Normal", 0.90);   /* 90th percentile = upper decile */
drop low high;
/* if step size is dx, location of quantile can be off by dx/2 */
do x = -4 to 4 by 0.01;            /* for x in [xMin, xMax] */
   pdf = pdf("Normal", x);         /* call PDF to get density */
   if x<=low  then up1 = pdf; else up1 = .;
   if x>=high then up2 = pdf; else up2 = .;
   output;
end;
run;

title "Normal Density with Colored Deciles";
proc sgplot data=density;
   series x=x y=pdf / legendlabel="Normal Density";
   band x=x upper=up1 lower=0 / fillattrs=GraphData1 legendlabel="Lower Decile";
   band x=x upper=up2 lower=0 / fillattrs=GraphData2 legendlabel="Upper Decile";
run;

The density curve with shaded tails was shown previously. The LOWER= option in the BAND statement sets the lower part of the colored bands to 0.

One subtlety is worth mentioning. In the DATA step, a DO loop is used to evaluate the density at points separated by 0.01. That means that the shaded regions have some inaccuracy. For example, the left tail in the image appears at x = -1.28, rather than at the exact value x = -1.28155. If more accuracy is required, you can evaluate the PDF at the exact quantile value and then sort the data set prior to graphing.

Shading quantiles of a kernel density estimate

The same technique can be used to color the tails of a nonparametric curve, such as a kernel density estimate. For example, suppose you want to compute a kernel density estimate for the Systolic variable in the Sashelp.Heart data set. The variable records the systolic blood pressures of 5,209 patients. You can use the OUTKERNEL= option in the HISTOGRAM statement of PROC UNIVARIATE, which writes the kernel density estimate to a data set, as follows:

proc univariate data=Sashelp.Heart;
   var Systolic;
   histogram Systolic / kernel outkernel=KerDensity; 
   ods select Histogram Quantiles;
run;

The output from the procedure (not shown) shows that the 10th percentile of the systolic values is 112, and the 90th percentile is 168. The output data set (called KerDensity) contains six variables. The _VALUE_ variable represents the X locations for the density curve and the _DENSITY_ variable represents the Y locations. You can use the empirical estimate of quantiles to set the cutoff value and highlight the upper and lower tails as follows:

%let lowerX = 112;   /* empirical 10th percentile */
%let upperX = 168;   /* empirical 90th percentile */
data KerDensity;
   set KerDensity;
   if _VALUE_<&lowerX then up1 = _DENSITY_; else up1 = .;
   if _VALUE_>&upperX then up2 = _DENSITY_; else up2 = .;
run;

title "Kernel Density with Colored Deciles";
proc sgplot data=KerDensity;
   series x=_VALUE_ y=_DENSITY_ / legendlabel="Density";
   band x=_VALUE_ upper=up1 lower=0 / fillattrs=GraphData1 legendlabel="Lower Decile";
   band x=_VALUE_ upper=up2 lower=0 / fillattrs=GraphData2 legendlabel="Upper Decile";
   xaxis label="Systolic Blood Pressure" minorcount=4 minor;
run;

Notice that except for the names of the variables, the PROC SGPLOT statements are the same for the parametric and nonparametric densities.

Although this example uses deciles to set the cutoff values, you could also use medically relevant values. For example, you might want to use red to indicate hypertension: a systolic blood pressure that is greater than 140. On the low side, you might want to highlight hypotensive patients who have a systolic blood pressure less than 90.


Visualizing the distribution of ACT scores

My son is in high school and plans to take the ACT, a standardized test to assess college aptitude and readiness. My wife asked, "What is a good score for the ACT?" I didn't know, but I did a quick internet search and discovered a tabulation of scores for the 1.66 million students who took the ACT in 2012. The data were presented for four "content areas," namely English, mathematics, reading, and science. A score is a number between 1 and 36. The four content areas are combined to create a composite score.

To answer my wife's question, I used PROC MEANS in SAS to compute the following table, which contains summary statistics for ACT scores for the four content areas and for the composite scores. You can see that an average score for the ACT is about 20 or 21. A score of 25 or more will land you in the top quartile (25%) of students. To be in the top decile (10%), you need a score in the high 20s, or even 30.


The data were in a report that included a graph that tried to show the distribution of scores. However, I did not find the graph to be very effective. I thought I'd use SAS to visualize the distribution of ACT scores in a form that might be clearer. You can download the data and SAS program that I used to create the plots.

For a first attempt, I used PROC SGPANEL to create a panel with four bar graphs, one for each subject area. The published graph aggregated scores into bins, so I did the same. You can apply a custom SAS format and use PROC FREQ to aggregate the scores. The following graph contains the same information as the published graph, but shows the distribution of scores more clearly by using a single bar chart for each subject area:


The panel displays four graphs. The first row shows the distribution of aggregated scores for the English content area. About 25% of student scores are in the range 20–23. Only 14% of students scored 28 or higher. The distribution of reading scores is similar.

The distribution of math scores (the graph in the second row) is more interesting. This seems to be a bimodal distribution. About 45% of students have scores less than 20. About one quarter of students score in the 24–27 range. It is curious that almost no one scores lower than 13 in math, although low scores exist in the other content areas.

The bar charts for reading and science occupy the third and fourth rows, respectively. The most obvious feature is that the distribution of science scores is narrow, with only 9% of students scoring 28 or better.

An alternative visualization of ACT scores

The aggregation of scores into bins is not necessary. You can also plot the distribution of the raw scores, as shown below.


I like this variation. The 36 bins show the exact distribution of the scores. You can easily see the bimodal nature of the math scores. You can see that the reading and English scores have a relatively large standard deviation, with many students scoring very high or very low, especially when compared to the science scores.

By drawing a histogram, you can also add "benchmark scores" or other reference values to the distributions. According to the ACT report (p. 3): "A benchmark score is the minimum score needed on an ACT subject-area test to indicate a 50% chance of obtaining a B or higher or about a 75% chance of obtaining a C or higher in the corresponding credit-bearing college courses, which include English Composition, Algebra, Social Science and Biology."

By adding the benchmark scores, you can visually estimate what proportion of the ACT students are likely to get at least a B in each subject area, thus making a connection between ACT scores and college performance. For English, about two thirds of students have at least a 50% chance of getting a B or higher. The benchmark score for English is relatively low compared to the median. In contrast, the benchmarks for math and reading are about the same as the median scores in those subject areas, indicating that about half of the students have a 50% chance of getting a B or higher. Apparently college-level Biology is difficult for students because the benchmark score for science (24) is relatively high. Only 30% of students that take the ACT have a science score greater than or equal to the benchmark.

Which visualization do you prefer? Or maybe you have a different idea that you'd like to try? Leave a comment.


How to Winsorize data in SAS

Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location.

The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC UNIVARIATE. Both statistics require that you specify an integer k. For the trimmed mean, you exclude the smallest and largest k nonmissing values and take the mean of the remaining values. Thus for a variable with n observations, the trimmed mean is the mean of the central n – 2k values.

In contrast, when you Winsorize data you replace the k smallest values with the (k+1)st ordered value and you replace the k largest values with the (n–k)th ordered value (equivalently, the (k+1)st largest). You then take the mean of the new n observations.
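A small Python sketch makes the difference between the two statistics concrete (k = 1 on a toy sample that contains one outlier):

```python
def trimmed_mean(data, k):
    """Mean after deleting the k smallest and k largest values."""
    s = sorted(data)
    core = s[k:len(s) - k]
    return sum(core) / len(core)

def winsorized_mean(data, k):
    """Mean after replacing the k smallest values with the (k+1)st
       ordered value and the k largest with the (n-k)th ordered value."""
    s = sorted(data)
    n = len(s)
    w = [s[k]] * k + s[k:n - k] + [s[n - k - 1]] * k
    return sum(w) / n

data = [2, 4, 7, 9, 11, 13, 99]       # 99 is an outlier
print(trimmed_mean(data, 1))           # mean of [4,7,9,11,13] = 8.8
print(winsorized_mean(data, 1))        # mean of [4,4,7,9,11,13,13] ~ 8.714
```

Both statistics downweight the outlier 99; the Winsorized mean keeps n values by replacing the extremes rather than discarding them.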

Winsorize data in SAS

In a 2010 paper I described how to use SAS/IML software to trim data. Trimming is the act of truncating the upper and lower tails of the empirical distribution of the data.

Winsorizing is slightly more complicated, especially if the data contain missing values or repeated values. You can sort the data, but sorting puts missing values first, which makes some computations more challenging. Instead, the following code uses the RANK function to compute the rank of the data values. The values with ranks less than or equal to k are then replaced, and similarly for the values with the k largest ranks:

%let DSName = sashelp.heart;
proc iml;
/* SAS/IML module to Winsorize each column of a matrix. 
   Input proportion of observations to Winsorize: prop < 0.5. 
   Ex:  y = Winsorize(x, 0.1) computes the two-sided 10% Winsorized data */
start Winsorize(x, prop);
   p = ncol(x);            /* number of columns */
   w = x;                  /* copy of x */
   do i = 1 to p;
      z = x[,i];           /* copy i_th column */
      n = countn(z);       /* count nonmissing values */
      k = ceil(prop*n);    /* number of obs to trim from each tail */
      r = rank(z);         /* rank values in i_th column */
      /* find target values and obs with smaller/larger values */
      lowIdx = loc(r<=k & r^=.);
      lowVal = z[loc(r=k+1)]; 
      highIdx = loc(r>=n-k+1);
      highVal = z[loc(r=n-k)]; 
      /* Winsorize (replace) k smallest and k largest values */
      w[lowIdx,i] = lowVal;
      w[highIdx,i] = highVal;
   end;
   return ( w );
finish;

/* test the algorithm on numerical vars in a data set */
use &DSName;
read all var _NUM_ into X[colname=varNames];
close &DSName;
winX = Winsorize(X, 0.1);

The matrix winX contains the Winsorized data, where the extreme values in each column have been replaced by a less extreme value. (If you want to print the Winsorized data, use %let DSName = sashelp.class;, which is a small data set.) To verify that the data are Winsorized correctly, you can compute the Winsorized means in SAS/IML and compare them to the Winsorized means that are computed by PROC UNIVARIATE. The SAS/IML computation is simply the mean of the Winsorized data:

/* Compute Winsorized mean, which is  mean of the Winsorized data */
winMean = mean(winX);
print winMean[c=varNames f=8.4];

With this data you can compute many robust statistics, such as the Winsorized standard deviation or the Winsorized covariance or correlation matrix. You can even compute t tests for a Winsorized mean.

As validation, the following call to PROC UNIVARIATE computes the Winsorized means for each of the numeric variables in the &DSName data set. The results are not shown, but are equivalent to the SAS/IML computations:

/* Validation: compute Winsorized means by using UNIVARIATE */
ods exclude all;
proc univariate data=&DSName winsorized=0.1;
   ods output WinsorizedMeans=winMeans;
run;
ods exclude none;

proc print data=winMeans;
   var VarName Mean;
run;

Unsymmetric Winsorization

The symmetric Winsorization results in a Winsorized mean that has nice theoretical properties. In particular, John Tukey and colleagues derived standard errors, confidence intervals, and other distributional properties for the Winsorized mean. These inferential statistics are computed by PROC UNIVARIATE.

Some software enables you to "Winsorize" data in an unsymmetric manner. Specifically, you can specify quantiles α < 0.5 and β > 0.5 and the software will replace values x < x(α) with x(α) and values x > x(β) with x(β), where x(α) is the value of the αth quantile. You can use the QNTL subroutine in SAS/IML to carry out this computation, or you can use a SAS macro.
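The quantile-based replacement described above is easy to sketch in Python. There are several conventions for estimating a quantile from data; the sketch below uses linear interpolation of the sorted values (the QNTL subroutine supports several definitions, so treat this as one illustrative choice):

```python
def quantile(values, q):
    """q-th quantile by linear interpolation of the sorted data."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])
    return s[lo]

def winsorize_asym(x, alpha, beta):
    """Replace x < x(alpha) with x(alpha) and x > x(beta) with x(beta)."""
    lo, hi = quantile(x, alpha), quantile(x, beta)
    return [min(max(v, lo), hi) for v in x]

y = winsorize_asym(list(range(1, 101)), 0.05, 0.90)
```

For the data 1, 2, ..., 100, the values below the 5th percentile are pulled up to about 5.95 and the values above the 90th percentile are pulled down to about 90.1; interior values are unchanged.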

However, I do not know whether the distributions of the resulting statistics are known. The interested reader can use a search engine such as Google Scholar to search for "asymmetric Winsorized means." For symmetric distributions, I recommend the classic symmetric Winsorization.


Compare the performance of algorithms in SAS

As my colleague Margaret Crevar recently wrote, it is useful to know how long SAS programs take to run. Margaret and others have written about how to use the SAS FULLSTIMER option to monitor the performance of the SAS system. In fact, SAS distributes a macro that enables you to parse SAS logs to extract performance and timing information.

But for researchers who are developing custom algorithms in the SAS/IML language, there is a much simpler way to monitor performance. You can use the TIME function in Base SAS to find out (to within about 15 milliseconds) how long it takes for a SAS/IML function or set of statements to run. I use this function often to assess how an algorithm scales with the size of the input data. The TIME function returns the time of day, so if you call it twice and compute the difference in times, you get the time (in seconds) that elapsed between calls.
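The call-the-clock-twice pattern works the same way in any language. Here is a minimal Python sketch of the idea, in which time.perf_counter plays the role of the TIME function (the helper name elapsed is my own, for illustration):

```python
import time

def elapsed(func, *args):
    """Record the clock before and after a call; the difference is the elapsed time."""
    t0 = time.perf_counter()       # analogous to t0 = time(); in SAS/IML
    result = func(*args)
    return time.perf_counter() - t0, result

t, total = elapsed(sum, range(1_000_000))
print(f"Summing a million integers took {t:.4f} seconds")
```

Because both timers return a time stamp, calling twice and subtracting gives the elapsed wall-clock time for whatever computation sits between the calls.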

For example, suppose that you need to compute eigenvalues of a large symmetric matrix. You might be interested in knowing how the algorithm scales with the size of the (square) input matrix. The following SAS/IML program uses the SQRVECH function to create symmetric matrices of size 500, 750, ..., 2500. For each matrix, the TIME function is called immediately before and immediately after a call to the EIGVAL function, which computes the eigenvalues of the matrix. The elapsed time is plotted against the size of the matrix:

proc iml;
size = T(do(500, 2500, 250));    /* 500, 750, ..., 2500 */
time = j(nrow(size), 1);         /* allocate room for results */
call randseed(12345); 
do i = 1 to nrow(size);
   n = size[i];
   r = j(n*(n+1)/2, 1);
   call randgen(r, "uniform");   /* generate random elements */
   A = sqrvech(r);               /* form symmetric matrix */
   t0 = time();                  /* save the current time */
   val = eigval(A);              /* put computation here */
   time[i] = time() - t0;        /* compute elapsed time */
end;

title "Time versus Matrix Size";
call series(size, time) grid={x y};

The line plot (click to enlarge) shows the timing of the eigenvalue computation on square matrices of varying sizes. The computation is very fast when the matrices are less than 1000 x 1000, but takes longer as the matrix grows. For a 2500 x 2500 matrix, the computation takes about 15 seconds.

You can also use the TIME function to compare the performance of two or more different algorithms. For example, you can compare the performance of solving linear systems to help you write efficient programs.

You can also use this technique to time how long SAS procedures take to run: You can use the SUBMIT/ENDSUBMIT statements to call any SAS procedure, which means you can "drive" the performance analysis from the SAS/IML language. This technique is much easier than parsing SAS logs!

Incidentally, the distribution of the eigenvalues for a matrix with random elements that are drawn from a given distribution is a fascinating topic that is part of "random matrix theory." For a peek into this beautiful mathematical topic, see the article "The curious case of random eigenvalues", which discusses the symmetric matrices that I used in today's article. For unsymmetric matrices, the eigenvalues can be complex, and the distribution of the eigenvalues in the complex plane makes beautiful images.

For more details about timing computations and assessing the performance of algorithms, see Chapter 15 of Statistical Programming with SAS/IML Software.


Simulating a drunkard's walk in SAS

You've probably heard of a random walk, but have you heard about the drunkard's walk? I've previously written about how to simulate a one-dimensional random walk in SAS. In a random walk, you imagine a person who takes a series of steps in which the step size and direction are random draws from the normal distribution. The drunkard's walk is similar, but the drunkard takes unit steps in a random direction (for example, left or right in one dimension).

For both kinds of walks, you can create a simulation by drawing the step at random. To find the walker's path, you form the cumulative sum of the steps.
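The two step-generation rules and the cumulative-sum step can be sketched in Python (illustrative only; random.gauss and random.choice stand in for draws from the normal and discrete distributions):

```python
import random
random.seed(1234)

n = 20
gauss_steps = [random.gauss(0, 1) for _ in range(n)]      # random walk: normal steps
drunk_steps = [random.choice([-1, 1]) for _ in range(n)]  # drunkard's walk: unit steps

def path_from_steps(steps, start=0):
    """Cumulative sums of the steps give the walker's position over time."""
    path, pos = [start], start
    for s in steps:
        pos += s
        path.append(pos)
    return path

drunk_path = path_from_steps(drunk_steps)   # positions at t = 0, 1, ..., n
```

For the drunkard, consecutive positions always differ by exactly one unit; for the normal walker, the step sizes vary.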


In a one-dimensional drunkard's walk, the drunkard takes a step to the left or to the right with probability 0.5. In the SAS DATA step, you can use the RAND function to draw from the Bernoulli distribution. You can do the same in the SAS/IML language, but SAS/IML also supports the SAMPLE function, which you can use to directly sample from a finite set such as {+1, -1}. The following SAS/IML statements simulate a drunkard's walk that begins at 0 and proceeds for 20 steps. A line plot of the drunkard's path is shown.

proc iml;
call randseed(1234);
NumSteps = 20;
x = sample({-1 1}, NumSteps // 1);  /* NumSteps rows; 1 column */
position = x[+];                    /* final position of drunkard */
Y = cusum(0 // x);                  /* show initial position at t=0 */
title "A Single Drunkard's Walk";
call series(0:NumSteps, Y) other="refline 0/ axis=y;" option="markers";

For this choice of the random number seed, the drunkard's final position is two units to the left of his original position. If you change the random number seed and rerun the program, the final position might be -4 for some runs, +2 for others, and 0 for some others. However, the final position after 20 steps cannot be an odd number of units. When the drunkard takes an even number of steps, the difference between his final and initial positions is even; when he takes an odd number of steps, the difference is odd.
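A quick simulation (a Python sketch, not part of the original program) confirms the parity argument: after an even number of unit steps the final position is always even, and after an odd number it is always odd:

```python
import random
random.seed(0)

def final_position(num_steps):
    """Sum of num_steps random unit steps."""
    return sum(random.choice([-1, 1]) for _ in range(num_steps))

print(all(final_position(20) % 2 == 0 for _ in range(1000)))  # → True
print(all(final_position(21) % 2 == 1 for _ in range(1000)))  # → True
```

The reason is that each step changes the position by an odd number (±1), so the parity of the position flips at every step.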

You can easily simulate 5,000 drunkards who all start at position 0. In the following simulation, each drunkard staggers randomly for 100 steps. The histogram shows the distribution of the positions of the 5,000 drunkards:

proc iml;
call randseed(1234);
NumWalks = 5000;
NumSteps = 100;
x = sample({-1 1}, NumSteps // NumWalks); /* NumSteps rows; NumWalks columns */
position = x[+,];                     /* column sums: final position of each drunkard */
title "Final Position of 5,000 Drunkards";
call histogram(position) xvalues=do(-40,40,2) rebin={-38,2};

The final position follows a binomial distribution, but with a twist. When n is even (as in this case), all the probability mass is on the even positions; when n is odd (such as after the 99th step), the mass is on the odd positions.
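To make the "twist" precise: if B is the number of rightward steps, then B ~ Binomial(n, 1/2) and the final position is 2B − n, so the position always has the same parity as n. A short Python sketch computes the exact distribution:

```python
from math import comb

def position_pmf(n):
    """P(position = 2k - n) after n unit steps, where k = number of rightward steps."""
    return {2*k - n: comb(n, k) / 2**n for k in range(n + 1)}

pmf = position_pmf(100)
```

For n = 100, every position that carries probability is even, the probabilities sum to 1, and the variance of the final position is exactly n.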

You can create a time series for each drunkard by forming the cumulative sums of his steps. The simulation created a matrix in which each column contains the steps of a single drunkard. To plot all of the paths on a single graph, it is convenient to convert the data from wide form to long form. Then a call to the SCATTER routine plots all of the positions. The markers are semi-transparent so that overplotted points appear darker, which gives some indication of the density of the positions.

/* create time series for each drunkard */
Y = j(nrow(x), ncol(x));
do i = 1 to NumWalks;
   Y[,i] = cusum(x[,i]);            /* path of the i_th drunkard */
end;

/* http://blogs.sas.com/content/iml/2015/02/27/multiple-series-iml.html */
load module=WideToLong;         /* or insert the module definition here */
run WideToLong(Position, Step, group, Y);
title "Distribution of 5,000 Drunkard's Walks";
call Scatter(Step, Position) group=group 
        option="transparency=0.95 markerattrs=(symbol=SquareFilled color=blue)" 
        procopt="noautolegend" yvalues=do(-40,40,5) other="refline 0 / axis=y;";

In conclusion, the drunkard's walk is an interesting variation of the random walk. In the drunkard's walk, the step size is always unit length. The location of a drunkard after n steps follows a binomial distribution on even or odd positions.

Although I demonstrated this simulation by using the SAS/IML language, it is straightforward to reproduce these results by using the DATA step and the SGPLOT procedure. I leave that exercise for the motivated reader.
