A statistical analysis of Stephen Curry's shooting


Last week Robert Allison showed how to download NBA data into SAS and create graphs such as the location where Stephen Curry took shots in the 2015-16 season to date. The graph at left shows the kind of graphs that Robert created. I've reversed the colors from Robert's version, so that red indicates "good" (a basket was scored) and blue indicates "bad" (a missed shot). The location of the NBA three-point line is evident by the many markers that form an arc in the scatter plot.

When I saw the scatter plot, I knew that I wanted to add some statistical analysis. In particular, I wanted to use SAS to construct a statistical model that estimates the probability that Curry scores from any position on the basketball court.

This article focuses on the results of the analysis. You can download the SAS program that generates the analyses and graphics. Although this article analyzes Stephen Curry, you can modify the SAS program to analyze Kevin Durant, Lebron James, or any other basketball player.

Probability as a function of distance

The first analysis estimates the probability that Curry makes a basket solely as a function of his distance from the basket. Curry is known for his consistent ability to make three-point shots. A three-point shot in the NBA requires that a player shoot from at least 22 feet away (near the baseline) or 23 feet 9 inches (23.75 feet) away farther up the court.


You can use logistic regression to model the probability of making a shot as a function of the distance to the basket. The adjacent plot shows the result of a logistic regression analysis in SAS. The model predicts a probability of 0.7 that Curry will make a shot from under the basket, a probability of 0.5 from 20 feet away, and a probability of 0.46 from the three-point arc, indicated by the vertical gray line at 23.75 feet. Recall that a probability of 0.46 is equivalent to predicting that Curry will sink 46% of shots from the three-point arc.
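A distance-only logistic model has the form p(d) = 1 / (1 + exp(−(b0 + b1·d))). The following sketch illustrates that functional form. The coefficients below are hypothetical values chosen so that the curve passes through the probabilities quoted above (0.7 at the rim, 0.5 at 20 feet); they are not the coefficients that PROC LOGISTIC estimates from the shot data.

```python
import math

# Hypothetical coefficients, chosen so that p(0) = 0.7 and p(20) = 0.5.
# They are NOT fitted values; they only illustrate the functional form
# of a distance-only logistic model.
b0 = math.log(0.7 / 0.3)   # intercept = logit of the probability at 0 ft
b1 = -b0 / 20              # slope chosen so the curve crosses 0.5 at 20 ft

def prob_make(dist_ft):
    """Predicted probability of making a shot from dist_ft feet."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * dist_ft)))

print(round(prob_make(0), 2))      # probability at the rim
print(round(prob_make(23.75), 2))  # probability at the three-point arc
```

Notice that these illustrative coefficients already reproduce the 0.46 probability at the arc, which shows how tightly a two-parameter logistic curve is pinned down by two points on it.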

Almost all (98.3%) of Curry's shots were taken from 30 feet or closer, and the shots taken from beyond 30 feet were end-of-quarter "Hail Mary" heaves. Therefore, the remaining analyses are restricted to shots taken from 30 feet or closer.

Probability as a function of angle and distance

The previous analysis considers only the distance from the basket. It ignores the position of the shot relative to the basket. In general, the probability of scoring depends on the location from which the shot was launched.

For consistency, let's agree that "right" and "left" means the portion of the court as seen by a fan sitting behind the backboard. This is, of course, opposite of what Curry would see when coming down the court toward the basket. Our "right" is Curry's left.


One way to model the positional dependence is to incorporate the angle relative to the backboard. The diagram shows one way to assign an angle to each position on the court. In the diagram, 90 degrees indicates a shot taken directly in front of the basket, such as from the top of the key. An angle of 0 indicates a "baseline shot" from the right side of the court. Similarly, an angle of 180 degrees means a baseline shot from the left side of the court.
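Under this convention, converting a shot's Cartesian coordinates into (angle, distance) is an ordinary polar-coordinate transformation. The sketch below assumes the basket is at the origin with the x-axis along the baseline (positive toward the fan's right); the actual NBA data may use a different origin and scale, so treat this as an illustration of the transformation only.

```python
import math

def shot_polar(x, y):
    """Convert a shot location (x, y), in feet, into (angle, distance).
    Assumed convention: basket at the origin, x-axis along the baseline
    with the fan's right positive, y pointing up the court.  Then the
    right baseline maps to 0 degrees, straight-on shots to 90, and the
    left baseline to 180."""
    angle = math.degrees(math.atan2(y, x))
    dist = math.hypot(x, y)
    return angle, dist

print(shot_polar(22, 0))     # right-baseline three-pointer
print(shot_polar(0, 23.75))  # straight-on three-pointer
```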

The following panel of graphs is the result of a logistic regression analysis that includes the interaction between angle and distance. The vertical lines in some plots indicate the distance to the sideline at particular angles. For 0 and 180 degrees, the distance from the basket to the sideline is 25 feet.


The panel of plots shows that Curry is most accurate when he shoots from the left side of the court. (The left side corresponds to angles greater than 90 degrees, which appear on the left side of the panel.) Remarkably, the model estimates that Curry's probability of making a shot from the left side barely depends on the distance from the basket! He is a solid shooter (probability 0.5, which is 50%) from the left baseline (Angle = 180) and from a slight angle (Angle = 150). The previous scatter plot shows that he takes many shots from about the 120 degree angle. This analysis shows that he is uncannily accurate from 20 and even 30 feet away, although the probability of scoring decreases as the distance increases.

On the right side of the court (angles less than 90 degrees), Curry's probability of making a shot depends more strongly on the distance to the basket. Near the basket, the model predicts a scoring probability of 0.6 or more. However, the probability drops dramatically as the distance increases. On the right side of the court, Curry is less accurate from 20 or more feet than for the same distance on the other side. At three-point range, Curry's probability of making a shot on the right (his left) drops to "only" 0.4. The probability drops off most dramatically when Curry shoots from the baseline (Angle = 0).

Probability as a function of position

A logistic analysis is a parametric model, which means that the analyst must specify the explanatory variables in the model and also the way that those variables interact with each other. This often leads to simplistic models, such as a linear or quadratic model. A simple model is often not appropriate for modeling the scoring probability as a function of the Cartesian X and Y positions on the court because a simple model cannot capture local spatial variations in the data.

SAS provides several possibilities for nonparametric modeling of data, but let's stick with logistic regression for now. Many SAS regression procedures, including PROC LOGISTIC, support using an EFFECT statement to generate spline effects for a variable. A spline effect expands a variable into spline bases. Spline effects enable you to model complex nonlinear behavior without specifying an explicit form for the nonlinear effects. The following graph visualizes such a model.


The image shows a scatter plot of the location of shots overlaid on a heat map that shows the predicted probability of Curry sinking a basket from various locations on the court. To better show the shot locations, the image has been stretched vertically. As mentioned previously, the location with the highest predicted probability is under the basket. From farther away, the predicted probability varies according to direction: to the left of the basket the probability is about 0.5, whereas a 15-foot jumper in front of the basket has probability 0.6. Notice the relative abundance of blue color (low probability) for shots on the right side. The lowest probability (about 0.3) occurs just beyond the three-point line at about a 60 degree angle, which agrees with the previous analysis. The same distance on the left side of the court is a much lighter shade of whitish-blue that corresponds to a probability of almost 0.5.

Statisticians will wonder about how well the model fits the data. The Pearson goodness-of-fit test indicates that the spline fit is not great, which is not surprising for a parametric fit to this kind of spatial data. In a follow-up post I use SAS to create nonparametric predictive models for the same data.


SAS programmers will appreciate the fact that "effect plots" in this article were generated automatically by PROC LOGISTIC. By using the EFFECT statement and the EFFECTPLOT statement, it is simple to create graphs that visualize the predictions for a logistic regression model.
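To make the idea of a spline basis expansion concrete, here is a minimal sketch of a truncated power basis in pure Python. This is only an illustration of basis expansion in general: the EFFECT statement in SAS constructs B-spline bases by default, which differ from this basis but span similar spaces of piecewise polynomials. The knot locations below are hypothetical.

```python
def spline_basis(x, knots, degree=3):
    """Expand a scalar x into a truncated power basis of the given degree:
    [1, x, ..., x^degree, (x - k1)_+^degree, ..., (x - km)_+^degree].
    Fitting a linear model on these columns yields a piecewise polynomial
    that can bend at each knot, which is how spline effects capture
    nonlinear behavior without an explicit functional form."""
    basis = [x**d for d in range(degree + 1)]
    basis += [max(x - k, 0.0)**degree for k in knots]
    return basis

# A cubic basis for a distance of 15 ft with hypothetical knots at 10 and 20 ft
print(spline_basis(15.0, knots=[10, 20]))  # [1.0, 15.0, 225.0, 3375.0, 125.0, 0.0]
```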

These graphs show that in general Stephen Curry is a phenomenal shooter who has a high probability of scoring from even a long distance. Logistic regression was used to model the probability that Curry makes a shot from various angles and locations on the court. The analysis indicates that Curry shoots better from his right side (the left side of the court as seen by a fan behind the backboard), especially from three-point range.


Simulate from the multinomial distribution in the SAS DATA step

There are several ways to simulate multinomial data in SAS. In the SAS/IML matrix language, you can use the RANDMULTINOMIAL function to generate samples from the multinomial distribution. If you don't have a SAS/IML license, I have previously written about how to use the SAS DATA step or PROC SURVEYSELECT to generate multinomial data.

The DATA step method I used in that previous article was a brute force method called the "direct method." This article shows how to simulate multinomial data in the DATA step by using a more efficient algorithm.

The direct method for simulating multinomial data

The direct method constructs each multinomial observation by simulating the underlying process, which can be thought of as drawing N balls (with replacement) from an urn that contains balls of k different colors. The parameters to the multinomial distribution are N (the number of balls to draw) and (p1, p2, ..., pk), which is a vector of probabilities. Here pi is the probability of drawing a ball of the ith color and Σi pi = 1.

In the direct method, you simulate one multinomial draw by explicitly generating N balls and counting the number of each color. The distribution of counts (N1, N2, ..., Nk) follows a multinomial distribution, where N = Σi Ni. The direct method runs quickly if N is small and you simulate a relatively small multinomial sample. For example, when N=100 you can simulate thousands of multinomial observations almost instantly.
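The direct method is not specific to SAS. As a language-agnostic sketch, it amounts to sampling N category labels and tabulating them; the function and variable names here are my own, chosen for illustration.

```python
import random
from collections import Counter

def multinomial_direct(N, probs, rng=None):
    """Direct method: draw N 'balls' one at a time with the given
    category probabilities and count how many of each color appear.
    Cost is O(N) random draws per multinomial observation."""
    rng = rng or random
    k = len(probs)
    draws = rng.choices(range(k), weights=probs, k=N)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(k)]

x = multinomial_direct(100, [0.5, 0.2, 0.3], random.Random(0))
print(x, sum(x))  # k counts that sum to N = 100
```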

The conditional method for simulating multinomial data

If N is large or you plan to generate a large number of observations, there is a more efficient way to simulate multinomial data in the SAS DATA step. It is called the "conditional method" and it uses the fact that you can think of a multinomial draw as being a series of binomial draws (Gentle, 2003, pp. 198-199).

Think about generating the counts in sequence. The first count, N1, follows the binomial distribution Bin(N, p1). If that draw produces a specific value n1, then N - n1 items remain to be drawn, and the conditional probability that a remaining item belongs to the second category is p2 / (1 - p1). Therefore N2 ~ Bin(N - n1, p2 / (1 - p1)).

Continue this process. If the first i-1 counts have been drawn, then Ni ~ Bin(N − n1 − ... − ni-1, pi / (1 − p1 − ... − pi-1)). This leads to the following efficient simulation method for multinomial observations:

/* generate multinomial sample by using conditional method */
%let SampleSize = 1000;             /* number of observations in MN sample */
%let N = 100;                       /* number of trials in each MN draw */
data MN;
call streaminit(12435);
array probs{3} _temporary_ (0.5 0.2 0.3); /* prob of drawing item 1, 2, 3 */
array x{3};                         /* counts for each item */
do obs = 1 to &SampleSize; 
   ItemsLeft = &N;                  /* how many items remain? */
   cumProb = 0;                     /* cumulative probability */
   do i = 1 to dim(probs)-1;        /* loop over first k-1 categories */
      p = probs[i] / (1 - cumProb);
      x[i] = rand("binomial", p, ItemsLeft);     /* binomial draw */
      ItemsLeft = ItemsLeft - x[i]; /* decrement size by selection */
      cumProb = cumProb + probs[i]; /* adjust prob of next binomial draw */
   end;
   x[dim(probs)] = ItemsLeft;       /* remaining items go into last category */
   output;
end;
keep x:;
run;

title "Multinomial Distribution, p={0.5 0.2 0.3}, N=100";
proc kde data=MN;
   bivar x1 x2 / bwm=1.5 plots=ContourScatter;
run;
Simulate multinomial data with the SAS DATA step

Whereas the direct method requires an inner DO loop that performs N iterations, the conditional method only requires k iterations, where k is the number of categories being drawn. In the example, N=100, whereas k=3. The direct method must generate 100,000 values from the "Table" distribution, whereas the conditional method generates 3,000 values from the binomial distribution.
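The conditional algorithm itself is not specific to the DATA step. Here is a minimal Python sketch of the same logic; for portability it uses a naive Bernoulli-sum binomial sampler, whereas the DATA step delegates that step to the efficient RAND("binomial") call, so this sketch shows the structure of the algorithm rather than its speed.

```python
import random

def binomial(n, p, rng):
    """Naive binomial draw as a sum of n Bernoulli trials (fine for small n;
    a library binomial sampler would be used in practice)."""
    return sum(rng.random() < p for _ in range(n))

def multinomial_conditional(N, probs, rng):
    """Conditional method: k-1 binomial draws instead of N table draws."""
    x, items_left, cum_prob = [], N, 0.0
    for p_i in probs[:-1]:
        p = p_i / (1.0 - cum_prob)    # renormalize among remaining items
        n_i = binomial(items_left, p, rng)
        x.append(n_i)
        items_left -= n_i             # decrement size by selection
        cum_prob += p_i               # adjust prob of next binomial draw
    x.append(items_left)              # remaining items go into last category
    return x

rng = random.Random(12435)
x = multinomial_conditional(100, [0.5, 0.2, 0.3], rng)
print(x, sum(x))  # counts sum to N = 100
```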

The graph shows 1,000 observations from the multinomial distribution with N=100 and p={0.5, 0.2, 0.3}. Because of overplotting, you cannot see all 1,000 observations, so the scatterplot is overlaid on a kernel density estimate. The graph shows the counts for the first two categories; the third count is determined by the fact that the counts sum to 100. Notice that the distribution of counts is centered near the expected values x1=50 and x2=20.

In summary, if you want to simulate multinomial data by using the SAS DATA step, the algorithm in this article is more efficient than the brute-force direct computation. This algorithm simulates a multinomial vector conditionally as a series of binomial draws.


Monte Carlo estimates of pi and an important statistical lesson

Today is March 14th, which is annually celebrated as Pi Day. Today's date, written as 3/14/16, represents the best five-digit approximation of pi. On Pi Day, many people blog about how to approximate pi. This article uses a Monte Carlo simulation to estimate pi, in spite of the fact that "Monte Carlo methods are ... not a serious way to determine pi" (Ripley 2006, p. 197). However, this article demonstrates an important principle for statistical programmers that can be applied 365 days of the year. Namely, I describe two seemingly equivalent Monte Carlo methods that estimate pi and show that one method is better than the other "on average."

Monte Carlo estimates of pi

To compute Monte Carlo estimates of pi, you can use the function f(x) = sqrt(1 – x²). The graph of the function on the interval [0,1] is shown in the plot. The graph of the function forms a quarter circle of unit radius. The exact area under the curve is π / 4. There are dozens of ways to use Monte Carlo simulation to estimate pi. Two common Monte Carlo techniques are described in an easy-to-read article by David Neal (The College Mathematics Journal, 1993). The first is the "average value method," which uses random points in an interval to estimate the average value of a continuous function on the interval. The second is the "area method," which enables you to estimate areas by generating a uniform sample of points and counting how many fall into a planar region.

The average value method

In calculus you learn that the average value of a continuous function f on the interval [a, b] is given by the integral
favg = 1/(b−a) ∫ab f(x) dx.
In particular, for f(x) = sqrt(1 – x²) on [0, 1], the average value is π/4 because the integral is the area under the quarter circle. If you can estimate the left-hand side of the equation, you can multiply the estimate by 4 to estimate pi. Recall that if X is a uniformly distributed random variable on [0,1], then Y = f(X) is a random variable on [0,1] whose mean is favg. It is easy to estimate the mean of a random variable: draw a random sample and compute the sample mean. The following SAS/IML program generates N=10,000 uniform variates in [0,1] and uses those values to estimate favg = E(f(X)). Multiplying that estimate by 4 gives an estimate for pi.
proc iml;
call randseed(3141592);       /* use digits of pi as a seed! */
N = 10000;
u = randfun(N, "Uniform");    /* U ~ U(0, 1) */
Y = sqrt(1 - u##2);
piEst1 = 4*mean(Y);           /* average value of a function */
print piEst1;
In spite of generating a random sample of size 10,000, the average value of this sample is only within 0.01 of the true value of pi. This doesn't seem to be a great estimate. Maybe this particular sample was "unlucky" due to random variation. You could generate a larger sample size (like a million values) to improve the estimate, but instead let's see how the area method performs.

The area method

Consider the same quarter circle as before. If you generate a 2-D point (an ordered pair) uniformly at random within the unit square, then the probability that the point is inside the quarter circle is equal to the ratio of the area of the quarter circle divided by the area of the unit square. That is, P(point inside circle) = Area(quarter circle) / Area(unit square) = π/4. It is easy to use a Monte Carlo simulation to estimate the probability P: generate N random points inside the unit square and count the proportion that fall in the quarter circle. The following statements continue the previous SAS/IML program:
u2 = randfun(N, "Uniform");   /* U2 ~ U(0, 1) */
isBelow = (u2 < Y);           /* binary indicator variable */
piEst2 = 4*mean(isBelow);     /* proportion of (u, u2) under curve */
print piEst2;
The estimate is within 0.0008 of the true value, which is closer than the value from the average value method. Can we conclude from one simulation that the second method is better at estimating pi? Absolutely not! Longtime readers might remember the article "How to lie with a simulation" in which I intentionally chose a random number seed that produced a simulation that gave an uncharacteristic result. The article concluded by stating that when someone shows you the results of a simulation, you should ask to see several runs or to "determine the variance of the estimator so that you can compute the Monte Carlo standard error."

The variance of the Monte Carlo estimators

I confess: I experimented with many random number seeds before I found one that generated a sample for which the average value method produced a worse estimate than the area method. The truth is, the average values of the function usually give better estimates. To demonstrate this fact, the following statements generate 1,000 estimates for each method. For each set of estimates, the mean (called the Monte Carlo estimate) and the standard deviation (called the Monte Carlo standard error) are computed and displayed:
/* Use many Monte Carlo simulations to estimate the variance of each method */
NumSamples = 1000;
pi = j(NumSamples, 2);
do i = 1 to NumSamples;
   call randgen(u, "Uniform");         /*  U ~ U(0, 1) */
   call randgen(u2, "Uniform");        /* U2 ~ U(0, 1) */
   Y = sqrt(1 - u##2);
   pi[i,1] = 4*mean(Y);        /* Method 1: Avg function value */
   pi[i,2] = 4*mean(u2 < Y);   /* Method 2: proportion under curve */
end;
MCEst = mean(pi);              /* mean of estimates = MC est */
StdDev = std(pi);              /* std dev of estimates = MC std error */
print (MCEst//StdDev)[label="Comparison of Methods"
                      colname={"Avg Func" "Area"}
                      rowname={"MC Estimate" "MC StdErr"}];
Now the truth is revealed! Both estimators provide a reasonable approximation of pi, but the estimate from the average function method is better. More importantly, the standard error for the average function method is about half as large as for the area method.

You can visualize this result if you overlay the histograms of the 1,000 estimates for each method. The following graph shows the distribution of the two methods. The average function estimates are shown in red; that distribution is narrow and has a sharp peak at pi. In contrast, the area estimates are shown in blue; that distribution is wider and has a less pronounced peak at pi. The graph indicates that the average function method is more accurate because it has a smaller variance.

Estimating pi on Pi Day is fun, but this experiment teaches an important lesson that is relevant 365 days of the year: if you have a choice between two ways to estimate some quantity, choose the method that has the smaller variance. For Monte Carlo estimation, a smaller variance means that you can use fewer Monte Carlo iterations to estimate the quantity. For the two Monte Carlo estimates of pi that are shown in this article, the method that computes the average function value is more accurate than the method that estimates area. Consequently, "on average" it will provide better estimates.
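If you want to replicate the experiment outside of SAS/IML, the following Python sketch runs both estimators repeatedly and compares their Monte Carlo standard errors. It uses a smaller sample size and fewer repetitions than the SAS/IML program, to keep the run quick; the average-value method should still show the smaller standard error.

```python
import math, random

def std(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m)**2 for x in xs) / (len(xs) - 1))

def estimates(n, rng):
    """One Monte Carlo sample of size n: returns the (average-value, area)
    estimates of pi, mirroring the two methods in the SAS/IML program."""
    u  = [rng.random() for _ in range(n)]
    u2 = [rng.random() for _ in range(n)]
    y  = [math.sqrt(1 - x*x) for x in u]
    avg_est  = 4 * sum(y) / n                             # average value method
    area_est = 4 * sum(b < a for b, a in zip(u2, y)) / n  # area method
    return avg_est, area_est

rng = random.Random(3141592)
reps = [estimates(2000, rng) for _ in range(200)]
avg_se  = std([r[0] for r in reps])   # MC standard error, average value method
area_se = std([r[1] for r in reps])   # MC standard error, area method
print(f"avg-value SE = {avg_se:.4f}, area SE = {area_se:.4f}")
```

The theoretical variances explain the gap: Var(f(U)) = 2/3 − (π/4)² ≈ 0.05, whereas the area method's indicator variable has variance (π/4)(1 − π/4) ≈ 0.17, about 3.4 times larger.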

Comparative histograms: Panel and overlay histograms in SAS

You can use histograms to visualize the distribution of data. A comparative histogram enables you to compare two or more distributions, which usually represent subpopulations in the data. Common subpopulations include males versus females or a control group versus an experimental group. There are two common ways to construct a comparative histogram: you can create a panel of histograms, or you can overlay histograms in a single graph. This article shows how to create comparative histograms in SAS.

Sanjay Matange and I have each written multiple previous articles on this topic. This article collects many of the ideas in one place. In the SAS 9.2 and SAS 9.3 releases, the graph template language (GTL) was required to construct some of these graphs. However, thanks to recent features added to PROC SGPLOT, PROC SGPANEL, and PROC UNIVARIATE, you can now create comparative histograms in SAS without writing any GTL.

Panel of histograms

Comparative histogram of three groups in a panel of histograms

A panel of histograms enables you to compare the data distributions of different groups. You can create the histograms in a column (stacked vertically) or in a row. I usually prefer a column layout because it enables you to visualize the relative locations of modes and medians in the data.

In SAS, you can create a panel of histograms by using PROC UNIVARIATE or by using PROC SGPANEL. Both procedures require that the data be in "long form": one continuous variable that specifies the measurements and another categorical variable that indicates the group to which each measurement belongs. If your data are in "wide form," you can convert the data from wide form to long form.

To use PROC UNIVARIATE, specify the categorical variable on the CLASS statement and the continuous variable on the HISTOGRAM statement. For example, the following example compares the distribution of the SepalLength variable for each of the three values of the Species variable in the Sashelp.Iris data:

proc univariate data=sashelp.iris;
  class Species;
  var SepalLength;      /* computes descriptive statistics */
  histogram SepalLength / nrows=3 odstitle="PROC UNIVARIATE with CLASS statement";
  ods select histogram; /* display only the histograms */
run;

The result is shown at the beginning of this section. The graph suggests that the median value of the SepalLength variable differs between levels of the Species variable. Furthermore, the variance of the "Virginica" group is larger than for the other groups.

You can create similar graphs by using the SGPANEL procedure, which supports a wide range of options that control the layout. Specify the Species variable in the PANELBY statement and the SepalLength variable in the HISTOGRAM statement. The following call to PROC SGPANEL creates a comparative histogram:

title "PROC SGPANEL with PANELBY statement";
proc sgpanel data=sashelp.iris;
  panelby Species / rows=3 layout=rowlattice;
  histogram SepalLength;
run;

The graph produced by PROC SGPANEL is similar to the previous graph.

With the GTL you can create more complicated panel displays than are shown here. For example, Sanjay shows how to create mirrored histograms, which are sometimes used for population pyramids.

Overlay histograms

For comparing the distributions of three or more groups, I recommend a panel of histograms. However, for two groups you might want to overlay the histograms. You can use the TRANSPARENCY= option in PROC SGPLOT statements so that both histograms are visible, even when the bars overlap. The portion of bars that overlap are shown in a blended color.

In the HISTOGRAM statement of PROC SGPLOT, you can use the GROUP= option to specify the variable that indicates group membership. The GROUP= option overlays the histograms for each group, as the following example shows:

proc sgplot data=sashelp.iris;
  where Species in ("Setosa", "Versicolor");       /* restrict to two groups */
  histogram SepalLength / group=Species transparency=0.5;       /* SAS 9.4m2 */
  density SepalLength / type=kernel group=Species; /* overlay density estimates */
run;
Overlay histograms

In this graph I added density estimates to help the eye visualize the basic shape of the two histograms. The purple region shows the overlap between the two distributions. For more than two categories, you might want to omit the histograms and just overlay the density estimates. The graph combines the first two rows of the panel in the previous section. The overlay enables you to compare the two subpopulations without your eye bouncing back and forth between rows of a panel.

The GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4m2. You can create the same graph in PROC UNIVARIATE by using the OVERLAY option in the HISTOGRAM statement. The OVERLAY option requires SAS 9.4m3.

Overlay histograms of different variables

Because PROC SGPLOT enables you to use more than one HISTOGRAM statement, you can also overlay the histograms of different variables.

When comparing histograms it is best that both histograms use the same bin width and anchor locations. Prior to SAS 9.3, you could overlay histograms by using the graph template language (GTL). However, SAS 9.3 introduced support for the BINWIDTH= and BINSTART= options in the HISTOGRAM statement in PROC SGPLOT. Therefore you can force the histograms to have a common bin width, as shown in the following example:

title "Overlay Histograms with PROC SGPLOT";
proc sgplot data=Sashelp.Iris;
  histogram PetalLength / binwidth=5 transparency=0.5
               name="petal" legendlabel="Petal Length";
  histogram SepalLength / binwidth=5 transparency=0.5
               name="sepal" legendlabel="Sepal Length";
  density PetalLength / type=kernel lineattrs=GraphData1;  /* optional */
  density SepalLength / type=kernel lineattrs=GraphData2;  /* optional */
  xaxis label="Length (mm)" min=0;
  keylegend "petal" "sepal" / across=1 position=TopRight location=Inside;
run;
Overlay histograms
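For readers curious what aligning the bins by hand involves, the following Python sketch computes one shared set of bin edges for overlaid histograms. It treats the start value as the midpoint of a bin, which is the role BINSTART= plays in PROC SGPLOT, but it is an illustration of the idea rather than SAS's actual binning algorithm.

```python
import math

def bin_edges(lo, hi, width, start):
    """Edges of equal-width bins that cover [lo, hi], aligned so that
    `start` is the midpoint of one bin.  Using the same edges for two
    variables keeps their overlaid histograms directly comparable."""
    left = start - width / 2.0                 # a known bin edge
    n_down = math.ceil((left - lo) / width)    # bins needed to reach lo
    first = left - n_down * width              # first edge at or below lo
    edges, e = [], first
    while e < hi + width:                      # extend until hi is covered
        edges.append(round(e, 10))
        e += width
    return edges

# Shared edges for two variables whose combined range is [10, 69] mm,
# width 5, with a bin centered at 12.5 (hypothetical values)
print(bin_edges(10, 69, width=5, start=12.5))
```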


In summary, SAS provides multiple ways to use histograms to compare the distributions of data. To obtain a panel of histograms, the data must be in the "long" format. You can then:

  • Use the CLASS statement in PROC UNIVARIATE to specify the grouping variable. This is a good choice if you also want to compute descriptive statistics or fit a distribution to the data.
  • Use PROC SGPANEL, which provides you with complete control over the layout of the panel, axes, and other graphical options.

If you only have two groups and you want to overlay partially transparent histograms, you can do the following:

  • Use the GROUP= option in the HISTOGRAM statement of PROC SGPLOT (requires SAS 9.4m2).
  • Use the OVERLAY option in the HISTOGRAM statement of PROC UNIVARIATE (requires SAS 9.4m3).

Lastly, if you have two variables to compare, you can use two HISTOGRAM statements. Be sure to use the BINWIDTH= option (and optionally the BINSTART= option), which requires SAS 9.3.

The comparative histogram is not a perfect tool. You can also use spread plots and other techniques. However, for many situations a panel of histograms or an overlay of histograms provides an effective way to visually compare the distributions of data in several groups.


How to use COLLECTION effects to specify pairwise interactions in SAS

Most SAS regression procedures support the "stars and bars" operators, which enable you to create models that include main effects and all higher-order interaction effects. You can also easily create models that include all n-way interactions up to a specified value of n. However, it can be a challenge to specify models that include many—but not all!—higher-order interactions. This article describes a little-known trick: you can use COLLECTION effects to specify interaction terms.

Stars and Bars: Building models with interaction terms in SAS

Many of the regression procedures in SAS (such as GLM, GENMOD, LOGISTIC, MIXED,...) support the bar operator (|) to specify all interactions between effects. For example, the following MODEL statement specifies that the model should include all main effects and all higher-order interactions:

proc logistic;
   model Y = x1 | x2 | x3 | x4;   /* all main effects and interactions */

The previous MODEL statement includes all two-way, three-way, and four-way interaction effects. The statement is equivalent to the following statement that uses the star operator (*) to explicitly specify each interaction term:

model Y = x1 x2 x3 x4                         /* all main effects */
          x1*x2 x1*x3 x1*x4 x2*x3 x2*x4 x3*x4 /* all two-way interactions */
          x1*x2*x3 x1*x2*x4 x1*x3*x4 x2*x3*x4 /* all three-way interactions */
          x1*x2*x3*x4;                        /* all four-way interactions */

Fitting a model with so many effects will lead to overfitting, so in practice an analyst might restrict the model to two-way interactions. Again, SAS supplies an easy syntax. You can use the "at" operator (@) to specify the highest interaction terms in the model. For example, the following syntax specifies that the model contains only main effects and two-way interactions:

model Y = x1 | x2 | x3 | x4 @2;   /* main effects and two-way interactions */

Specifying many, but not all, interaction terms

Unfortunately, there is no simple syntax for constructing many, but not all, interaction effects. This can be frustrating when there is a structure to the interaction terms. A common structure is that there are two lists of variables and you want to build all interactions that involve one effect from the first list and one effect from the second list.

For example, suppose you want to create the following interaction effects:
c1*x1 c1*x2 c2*x1 c2*x2
The interaction terms are the pairwise combinations of the variables {c1 c2} with the variables {x1 x2}. Note, however, that within-list interactions are not desired: there are no terms for c1*c2 or x1*x2.

It would be great to have some kind of shorthand notation that tells SAS to "cross all elements in the first list with all elements in the second list." A natural syntax would be
(c1 c2) | (x1 x2)
but unfortunately that syntax is not supported.

Some SAS programmers might use the macro language to generate all pairwise interactions between two lists of variables, but COLLECTION effects offer an easier way.
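For comparison, the crossing that such a macro would generate is just the Cartesian product of the two name lists, as this small Python sketch shows (the variable names match the example above):

```python
from itertools import product

class_vars = ["c1", "c2"]   # first list of effects
cont_vars  = ["x1", "x2"]   # second list of effects

# Cross every effect in the first list with every effect in the second;
# within-list interactions (c1*c2, x1*x2) are deliberately absent.
interactions = [f"{a}*{b}" for a, b in product(class_vars, cont_vars)]
print(" ".join(interactions))  # c1*x1 c1*x2 c2*x1 c2*x2
```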


More than a dozen regression procedures in SAS support the EFFECT statement. According to the documentation, the EFFECT statement generates "special collections of columns for design matrices." In particular, the so-called COLLECTION effect enables you to specify multiple variables that are "considered as a unit."

As a colleague recently reminded me, you can use COLLECTION effects to specify interactions. If V and W are two collection effects, then V*W contains all pairwise interactions of the individual variables in V with the individual variables in W. Similarly, V | W contains all main effects and the pairwise interaction effects.

As an example of using COLLECTION effects, the following model uses two classification variables and four continuous variables in the Sashelp.Heart data. Here is the model specified in the usual way:

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   model Status = BP_Status Sex Cholesterol Height Weight MRW
         BP_Status*Cholesterol BP_Status*Height BP_Status*Weight BP_Status*MRW
               Sex*Cholesterol       Sex*Height       Sex*Weight       Sex*MRW;
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm1;
run;

Manually enumerating all those interaction terms requires a lot of typing. More importantly, the enumeration does not make it clear that the interaction terms are the pairwise interactions between the classification variables and the continuous variables. In contrast, the following statements use COLLECTION effects to define two sets of variables. The MODEL statement uses the familiar bar operator to form all main effects and pairwise interactions between the variables.

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   effect V = collection(BP_Status Sex);                     /* one list     */ 
   effect W = collection(Cholesterol Height Weight MRW);     /* another list */ 
   model Status = V | W;      /* vars and interactions between the var lists */
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm2;
run;

The second model statement is more concise. The two models produce equivalent predictions, but the second is much easier to type and to interpret.

You can use PROC COMPARE to show that the parameter estimates are the same (to eight decimal places), and therefore the predicted values will be the same. Because the order of the parameters differs between models, the parameter estimates are sorted before running the comparison.

proc sort data=Parm1; by Estimate; run;
proc sort data=Parm2; by Estimate; run;
proc compare brief method=absolute criterion=1e-8
             base   =Parm1(drop=Variable)
             compare=Parm2(drop=Variable ClassVal:);
run;
NOTE: All values compared are within the equality criterion used.

This use of the COLLECTION effect is somewhat nonstandard. SAS introduced COLLECTION effects for variable selection routines such as the "group LASSO" as a way to specify that all variables in the collection should be included in the model, or all should be excluded. The variables enter or leave the model "as a unit."

Although most tables and statistics from PROC LOGISTIC are the same for the two models, there are differences. One difference is the "Type 3 Analysis of Effects" table, which tests whether all the parameters associated with an effect are zero. The first call to PROC LOGISTIC analyzes 14 effects; the second call analyzes three (collection) effects.

In summary, the EFFECT statement provides a way to treat sets of variables "as a unit." This leads to a simple syntax for forming specific interaction effects. The example in this article showed how to create pairwise interactions, but the COLLECTION effects can also be used to specify higher-order interactions.
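As a sketch of a higher-order interaction (untested; the variable lists are chosen purely for illustration), a third COLLECTION effect can be combined with the others by using the bar operator, which generates all main effects plus all two-way and three-way interactions among the collections:

proc logistic data=Sashelp.Heart;
   class BP_Status Sex Smoking_Status;
   effect U = collection(BP_Status Sex);
   effect V = collection(Smoking_Status);
   effect W = collection(Cholesterol Height Weight);
   model Status = U | V | W;  /* main effects plus two- and three-way interactions */
run;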

Post a Comment

Dummy variables in SAS/IML

Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS statement. You can use other procedures to create design matrices for other parameterizations.

SAS/IML programmers can use two built-in functions to create dummy variables. The DESIGN function generates dummy variables for the GLM parameterization. The DESIGNF function generates dummy variables for the EFFECT encoding. You can use the HDIR function to create interaction effects from the main-effect dummy variables.

The following DATA step creates sample data. The PROC IML statements read the data into vectors and use the DESIGN and DESIGNF functions to create dummy variables. Note the use of the ODS LAYOUT GRIDDED statement to print SAS/IML matrices across the page.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
proc iml;
use Patients;  
read all var {Cholesterol Sex BP_Status};  
close Patients;
Dummy_GLM = design( BP_Status );      /* dummy vars, GLM encoding */
Dummy_Effect = designf( BP_Status );  /* dummy vars, EFFECT encoding */
ods layout gridded columns=3 advance=table; /* create gridded layout in HTML */
print BP_Status, Dummy_GLM, Dummy_Effect;
ods layout end;

You can see that the DESIGN function creates k binary dummy variables for a categorical variable that contains k levels. The first column represents the first level (in alphabetical order), which for these data is "High." The first column has the value 1 for each row for which BP_Status="High." Similarly, the second column contains a 1 for each row for which BP_Status="Normal." The third column contains a 1 for each row for which BP_Status="Optimal."

In contrast, the DESIGNF function creates a design matrix that has k–1 columns. The matrix has the EFFECT encoding, with the last category ("Optimal") serving as the reference level. The first column has the value 1 for rows for which BP_Status="High," the value –1 for rows for which BP_Status is the reference level, and 0 otherwise. The second column is similar, except that 1 indicates rows for which BP_Status="Normal."
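As a minimal sketch of the two encodings, consider a toy character vector (not the Patients data):

proc iml;
x = {"High", "Normal", "Optimal", "High"};
G = design(x);     /* 3 columns: one binary column per level (GLM encoding)      */
E = designf(x);    /* 2 columns: EFFECT encoding; "Optimal" is the reference     */
print G, E;

For this vector, the rows of G should be (1 0 0), (0 1 0), (0 0 1), and (1 0 0), while the rows of E should be (1 0), (0 1), (–1 –1), and (1 0).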

Linear regression with dummy variables

Dummy variables convert character variables (and other categorical variables) into numerical variables with a specified encoding. As such they enable you to use matrix computations to perform a statistical analysis such as linear regression.

For example, the following SAS/IML statements perform a regression analysis that models the Cholesterol variable as a linear function of the Sex and BP_Status variables. The statements use the DESIGNF function to form the dummy variables for each categorical variable. These columns (and an intercept column) are concatenated horizontally to form the design matrix. Because the EFFECT encoding is a nonsingular parameterization, you can use the SOLVE function to solve the normal equations and obtain the least squares solution, as follows:

Y = Cholesterol;                 /* response variable */
Intercept = j(nrow(Y), 1, 1);
X1 = designf( Sex );
X2 = designf( BP_Status );
X = Intercept || X1 || X2;      /* design matrix with EFFECT parameterization */
/* Matrix formulation of the linear model: Y = X*beta + epsilon
   See https://en.wikipedia.org/wiki/Design_matrix       */
b = solve( X`*X, X`*Y );       /* solve normal equations */
print b[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}];

The interpretation of the parameter estimates for this linear example is somewhat complicated; see Lewis (2007) if you are interested. However, for comparison, the following call to PROC GENMOD creates parameter estimates for the same linear model. The PARAM=EFFECT option is used so that the procedure uses the EFFECT parameterization.

proc genmod data=Patients;
   class Sex BP_Status / param=effect;
   model Cholesterol = Sex BP_Status / noscale;
   ods select ParameterEstimates;
run;

Strictly speaking, PROC GENMOD computes maximum likelihood estimates whereas the PROC IML code computes a least squares estimate, but you can see that the estimates are identical to four decimal places.

REFERENCE encoding and the GLM parameter estimates

Although SAS/IML does not provide a built-in function for generating a design matrix that uses the REFERENCE encoding, you can easily create such a function. The REFERENCE encoding is similar to the GLM encoding, but with the (redundant) last column dropped:

/* design matrix for reference encoding */
start designr(x); 
   A = design(x);             /* get design matrix with GLM encoding */
   return( A[,1:ncol(A)-1] ); /* drop last column */
finish;

If you use the REFERENCE encoding to create the X matrix as in the previous section, then the SOLVE function returns the same parameter estimates that are provided by the GLM procedure. (The GLM procedure sets the parameters for the last dummy columns to zero.)
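A sketch of that comparison, assuming the designr module above and the Y, Intercept, Sex, and BP_Status vectors from the earlier regression:

XR = Intercept || designr(Sex) || designr(BP_Status);  /* REFERENCE encoding */
bR = solve( XR`*XR, XR`*Y );   /* matches the nonzero estimates from PROC GLM */
print bR[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}];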

Interactions of dummy variables

You can use the HDIR function to create interaction effects. For example, the following statements create columns that indicate the interaction between the Sex and BP_Status variables. The printed output shows the results for the EFFECT parameterization, but the same SAS/IML statement will produce the interaction effects for other parameterizations:

X1X2 = hdir(X1, X2);   /* dummy variables for interaction term */
print X1X2[c={"Female High" "Female Normal"}];

By using the tips in this article, you can create design matrices for ANOVA and regression models that contain categorical variables. In this way, you can use SAS/IML to reproduce the parameter estimates in many SAS linear regression procedures.

Post a Comment

One-level data set names in SAS are not always stored in WORK

One of the first things SAS programmers learn is that SAS data sets can be specified in two ways. You can use a two-level name such as "sashelp.class" which uses a SAS libref (SASHELP) and a member name (CLASS) to specify the location of the data set. Alternatively, you can use a one-level name such as "TempData," and SAS searches for the data set in a default location.

In many SAS environments, one-level data set names are like the seven little dwarves: Heigh-Ho, heigh-ho, it's off to WORK they go!

In other words, the WORK directory is the default location for one-level names. Consequently, one-level names often imply "temporary data," because data sets in WORK are deleted when you exit SAS.

However, it is possible to use the OPTIONS statement to change the libref that SAS searches when you specify a one-level SAS data set name. The option name is USER. The following statements specify a new libref that SAS should use as the default location for one-level data set names:

libname DEFLIB "C:/Temp";    /* define any libref */
options user=DEFLIB;         /* set the default location for one-level names */

For example, the following DATA step uses a one-level name for the data set. Consequently, the data set is created in the USER directory and PROC DATASETS lists the data sets in USER rather than WORK:

data TempData; x=1; y=2; z=3; run;  /* create data set using one-level name */
proc datasets; run;                 /* note that it is in the USER libref! */
Default libref for one-level name

Personally, I never do this because data sets in USER are not deleted when SAS exits. However, this example shows that one-level names are not always stored in WORK.

Discover the default storage location

If a one-level data set name is not necessarily in WORK, can you programmatically discover the libref where the data set is? Yes! The GETOPTION function returns the value for any SAS option, so you can retrieve the value of the USER option. For example, the following DATA step discovers the libref and data set name for a specified data set. For a two-level name, the name contains a period, which you can find by using the FINDC function. You can then use the SUBSTR function to extract the name of the libref and data set. If the data set name is a one-level name, then the GETOPTION function obtains the default libref. (If the USER option is not set, GETOPTION returns a blank string.)

%let MyData = TempData;       /* specify one-level or two-level data set name */
data _null_;
dsName = "&MyData";
LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
   lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
   member = substr(dsName, LocDot+1); /*     get substring after the period  */
end;
else do;                              /*   No: it is a one-level name        */
   lib = getoption("user");           /*   Has this option been defined?     */
   if lib = ' ' then lib = "work";    /*     No: use WORK                    */
   member = dsName;
end;
put lib=;
put member=;
run;

In summary, although one-level data set names are usually stored in WORK, that is not always the case. However, a programmer can use the GETOPTION function to discover the libref where one-level data sets are stored.

An application to SAS/IML programming

The reason I was interested in the GETOPTION function is that I was trying to write a function in SAS/IML that would accept a one- or two-level data set name and return the names of the variables in the data. The CONTENTS function in SAS/IML almost does what I want, but the CONTENTS function has two different signatures, one for two-level names and one for one-level names:

  • For two-level names, use two arguments: varNames = contents(lib, name);
  • For one-level names, use one argument: varNames = contents(name);

I wanted to write a function that accepts a single string (a one-level or two-level data set name) and calls the appropriate signature of the CONTENTS function. The following SAS/IML function does the job:

proc iml;
/* new CONTENTS function that handles one- and two-level data set names */
start ContentsEx( dsName );              /* "Ex" means "extended" */
   LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
   if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
      lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
      member = substr(dsName, LocDot+1); /*     get substring after the period  */
      return( contents(lib, member) );
   end;
   return( contents(dsName) );           /*   No: it is a one-level name        */
finish;
dsName = "&MyData";
varNames =  ContentsEx( dsName );
print varNames;

Have you ever had the need to use the USER option to override the default storage location for one-level data set names? Leave a comment.

Post a Comment

Four ways to create a design matrix in SAS

SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification variable.

I previously wrote about how to create dummy variables in SAS by using the GLMMOD procedure to create binary indicator variables for each categorical variable. But PROC GLMMOD is not the only way to generate design matrices in SAS. This article demonstrates four SAS procedures that create design matrices: GLMMOD, LOGISTIC, TRANSREG, and GLIMMIX. (Others include PROC CATMOD and PROC GLMSELECT.) Of the four, the LOGISTIC procedure is my favorite because it provides an easy-to-use syntax and supports various parameterizations for creating design matrices.

How categorical variables are represented in a design matrix in SAS

The CLASS statement in a SAS procedure specifies categorical variables that should be replaced by dummy variables when forming the design matrix. The process of forming columns in a design matrix is called parameterization or encoding. The three most popular parameterizations are the GLM encoding, the EFFECT encoding, and the REFERENCE encoding. For a detailed explanation of these encodings, see the section "Parameterization of Model Effects" in the SAS/STAT documentation. For applications and interpretation of different parameterizations, see Pasta (2005).

The following DATA step creates an example data set with 10 observations. It contains three fixed effects: one continuous variable (Cholesterol) and two categorical variables, of which one (Sex) has two levels and the other (BP_Status) has three levels. The data set also contains a categorical variable (HospitalID) that will be used as a random effect.

data Patients;
   HospitalID = mod(_N_, 4);
   keep HospitalID Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
proc print; run;
Example data set for creating design matrices in SAS

PROC GLMMOD: Design matrices that use the GLM encoding

The simplest way to create dummy variables is by using the GLMMOD procedure, which can produce a basic design matrix with GLM encoding. The GLM encoding is a singular parameterization in which each categorical variable is represented by k binary variables, where k is the number of levels in the variable. There is also an intercept column that has all 1s. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is very easy to create interaction effects. See my previous article for an example of how to use PROC GLMMOD to create a design matrix and how the singular parameterization affects parameter estimates in regression.

PROC LOGISTIC: Design matrices for any parameterization

You can also create a design matrix in SAS by using the LOGISTIC procedure. The PROC LOGISTIC statement supports an OUTDESIGNONLY option, which prevents the procedure from running the analysis. Instead, the procedure only forms the design matrix and writes it to a data set. By default, PROC LOGISTIC uses the EFFECT encoding for classification variables, but you can use the PARAM= option on the CLASS statement to specify any parameterization.

A drawback of using PROC LOGISTIC is that you must supply a binary response variable on the MODEL statement, which might require you to run an extra DATA step. The following DATA step creates a view that contains a variable that has the constant value 0. This variable is used on the left-hand side of the MODEL statement in PROC LOGISTIC, but is dropped from the design matrix:

data Temp / view=Temp;
   set Patients;
   FakeY = 0;
run;
proc logistic data=Temp outdesign=EffectDesign(drop=FakeY) outdesignonly;
   class sex BP_Status / param=effect; /* also supports REFERENCE & GLM encoding */
   model FakeY = Cholesterol Sex BP_Status;
run;
proc print data=EffectDesign; run;
Design matrix in SAS with effect encoding

The design matrix shows the effect encoding, which uses –1 to indicate the reference level; by default the reference level is the last level in alphabetical order. The name of a dummy variable is the concatenation of the original variable name and a level. For example, the Sex variable is replaced by the dummy variable named SexFemale, which has the value 1 for females and –1 for the reference level ("Male"). The BP_Status variable is replaced by two variables. The BP_StatusHigh variable contains 1 for patients who have high blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise. Similarly, the BP_StatusNormal dummy variable has the value 1 for patients with normal blood pressure, –1 for the reference level, and 0 otherwise.

The effect encoding produces k–1 columns for a categorical variable that has k levels, which results in a nonsingular design matrix.

You can use the REF= option after each classification variable to specify the reference level. You can also use the PARAM= option on the CLASS statement to specify a different parameterization. For example, the following statements create a design matrix that uses the REFERENCE parameterization. The reference level for the Sex variable is set to "Female" and the reference level for the BP_Status variable is set to "Normal."

proc logistic data=Temp outdesign=RefDesign(drop=FakeY) outdesignonly;
   class sex(ref="Female") BP_Status(ref="Normal") / param=reference; 
   model FakeY = Sex BP_Status;
run;
proc print data=RefDesign; run;

Parameterizations affect the way that parameter estimates are interpreted in a regression analysis. For the reference encoding, parameter estimates of main effects indicate the difference of each level as compared to the effect of the reference level. For the effect encoding, the comparison is to the average effect over all levels.

PROC TRANSREG: Design matrices and a macro for variable names

Using PROC LOGISTIC is very flexible, but it has two drawbacks: You have to create a fake response variable, and you have to look at the output data set to discover the names of the dummy variables. In contrast, PROC TRANSREG does not require that you specify a response variable when you generate the design matrix. Furthermore, the procedure creates a macro variable (&_TRGIND, for "TRANSREG indicator" variables) that contains the names of the columns of the design matrix. Another nice feature is that the output data set contains the original variables, and you can use the ID variable to output additional variables.

However, the syntax for the TRANSREG procedure is different from most other SAS regression procedures. Instead of a CLASS statement, you specify classification effects in a CLASS() transformation list. By default, the procedure uses the REFERENCE parameterization; you can use the ZERO= option to control reference levels. The procedure also supports the GLM parameterization (via the ZERO=SUM option), the EFFECT parameterization (via the EFFECT option), and other options. The following statements show an example that generates a design matrix with the effect encoding:

proc transreg data=Patients design;
   model identity(Cholesterol) 
         class(Sex BP_Status / EFFECT zero="Female" "Normal");
   output out=B;
run;
proc print data=B; 
   var Intercept &_TrgInd; 
run;

The output is not shown because it is identical to the EffectDesign data set in the previous section. Notice that the output is displayed by using the &_TRGIND macro variable. For details about generating design matrices, see the TRANSREG documentation section "Using the DESIGN Output Option."

PROC GLIMMIX: Design matrices for fixed and random effects

PROC GLIMMIX enables you to construct two design matrices: one for the fixed effects and another for the random effects. The PROC GLIMMIX statement supports an OUTDESIGN= option that you can use to specify the output data set and a NOFIT option that ensures that the procedure will not try to fit the model.

The following statements create an output data set that contains two design matrices:

proc glimmix data=Patients outdesign(names novar)=MixedDesign nofit;
   class sex BP_Status HospitalID;
   model Cholesterol = Sex BP_Status;
   random HospitalID;
   ods select ColumnNames;
run;
proc print data=MixedDesign; run;
Design matrix in SAS for fixed and random effects

Dummy variables for the fixed effects are prefixed by "_X" and dummy variables for the random effects are prefixed by "_Z." Two additional tables (not shown) associate the levels of the original variables with the columns of the design matrices.

The GLIMMIX procedure uses only the GLM parameterization. Consequently, there is little advantage to using PROC GLIMMIX instead of PROC GLMMOD. You can generate the same designs by calling PROC GLMMOD twice, once for the fixed effects and once for the random effects.


In summary, SAS provides four procedures that you can use to generate design matrices for continuous variables, classification variables, and their interactions. The GLMMOD procedure is ideal for creating design matrices that use the GLM encoding. PROC LOGISTIC supports all encodings in SAS and provides an easy-to-use syntax for specifying interactions. PROC TRANSREG supports fewer parameterizations, but does not require that you manufacture a response variable. Lastly, the GLIMMIX procedure produces design matrices for both fixed and random effects.

If you need to use matrix computations, the SAS/IML procedure also supports creating design matrices.

Post a Comment

Create dummy variables in SAS

A dummy variable (also known as indicator variable) is a numeric variable that indicates the presence or absence of some level of a categorical variable. The word "dummy" does not imply that these variables are not smart. Rather, dummy variables serve as a substitute or a proxy for a categorical variable, just as a "crash-test dummy" is a substitute for a crash victim, or a "sewing dummy" is a dressmaker's proxy for the human body.

In regression and other statistical analyses, a categorical variable can be replaced by dummy variables. For example, a categorical variable with levels "Low," "Moderate," and "High" can be represented by using three binary dummy variables. The first dummy variable has the value 1 for observations that have the level "Low," and 0 for the other observations. The second dummy variable has the value 1 for observations that have the level "Moderate," and zero for the others. The third dummy variable encodes the "High" level.
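As a sketch, the encoding just described could be computed by hand in a DATA step (the data set and variable names here are hypothetical), although the rest of this article shows an easier way:

data Dummies;
   set Have;                        /* input data with a variable named Level */
   Low      = (Level = "Low");      /* 1 if Level="Low", else 0      */
   Moderate = (Level = "Moderate"); /* 1 if Level="Moderate", else 0 */
   High     = (Level = "High");     /* 1 if Level="High", else 0     */
run;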

There are many ways to construct dummy variables in SAS. Some programmers use the DATA step, but there is an easier way. This article discusses the GLMMOD procedure, which produces basic binary dummy variables. A follow-up article discusses other SAS procedures that create a design matrix for representing categorical variables.

Why generate dummy variables in SAS?

Many programmers never have to generate dummy variables in SAS because most SAS procedures that model categorical variables contain a CLASS statement. If a procedure contains a CLASS statement, then the procedure will automatically create and use dummy variables as part of the analysis.

However, it can be useful to create a SAS data set that explicitly contains a design matrix, which is a numerical matrix that uses dummy variables to represent categorical variables. A design matrix also includes columns for continuous variables, the intercept term, and interaction effects. A few reasons to generate a design matrix are:

  • Students might need to create a design matrix so that they can fully understand the connections between regression models and matrix computations.
  • If a SAS procedure does not support a CLASS statement, you can often use dummy variables in place of a classification variable. An example is PROC REG, which does not support the CLASS statement, although for most regression analyses you can use PROC GLM or PROC GLMSELECT. Another example is the MCMC procedure, whose documentation includes an example that creates a design matrix for a Bayesian regression model.
  • In simulation studies of regression models, it is easy to generate responses by using matrix computations with a numerical design matrix. It is harder to use classification variables directly.
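For example, the simulation idea in the last bullet can be sketched in SAS/IML. Here X is assumed to be a design matrix that you have already formed, and the coefficients and error standard deviation are arbitrary values for illustration:

proc iml;
/* assume X is an n x 4 design matrix that was created previously */
beta = {250, -10, 5, -3};                   /* hypothetical coefficients */
eps  = randfun( nrow(X), "Normal", 0, 20 ); /* random errors             */
Y    = X*beta + eps;                        /* simulated responses       */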

PROC GLMMOD: Design matrices that use the GLM parameterization

The following DATA step creates a data set with 10 observations. It has one continuous variable (Cholesterol) and two categorical variables. One categorical variable (Sex) has two levels and the other (BP_Status) has three levels.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
proc print;  var Cholesterol Sex BP_Status;  run;
Original data with categorical variables

The GLMMOD procedure can create dummy variables for each categorical variable. If a categorical variable contains k levels, the GLMMOD procedure creates k binary dummy variables. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is easy to create interaction effects.

The following call to PROC GLMMOD creates an output data set that contains the dummy variables. The output data set is named by using the OUTDESIGN= option. The OUTPARAM= option creates a second data set that associates each dummy variable to a level of a categorical variable:

proc glmmod data=Patients outdesign=GLMDesign outparm=GLMParm;
   class sex BP_Status;
   model Cholesterol = Sex BP_Status;
run;
proc print data=GLMDesign; run;
proc print data=GLMParm; run;
Dummy variables in SAS for each level of the categorical variables

The OUTDESIGN= data set contains the design matrix, which includes variables named COL1, COL2, COL3, and so forth. The OUTPARM= data set associates levels of the original variables to the dummy variables. For these data, the GLMMOD procedure creates six binary columns. The first is the intercept column. The next two encode the Sex variable. The last three encode the BP_Status variable. If you specify interactions between the original variables, additional dummy variables are created. Notice that the order of the columns is the sort order of the values of their levels. For example, the "Female" column appears before the "Male" column.

When you use this design matrix in a regression analysis, the parameter estimates of the main effects estimate the difference between the effect of each level and the effect of the last level (in alphabetical order). The following statements show that using the dummy variables in PROC REG gives the same parameter estimates as using the original classification variables in PROC GLM:

ods graphics off;
/* regression analysis by using dummy variables */
proc reg data=GLMDesign;
   DummyVars: model Cholesterol = COL2-COL6; /* dummy variables except intercept */
   ods select ParameterEstimates;
run; quit;
/* same analysis by using the CLASS statement */
proc glm data=Patients;
   class sex BP_Status;              /* generates dummy variables internally */
   model Cholesterol = Sex BP_Status / solution;
   ods select ParameterEstimates;
run; quit;
PROC REG output for dummy variables in SAS

The parameter estimates from PROC REG are shown. The parameter estimates from PROC GLM are identical. Notice that the parameter estimate for the last level of each categorical variable is set to zero and its standard error is a missing value. This occurs because the last dummy variable for each categorical variable is redundant. For example, the second dummy variable for the Sex variable ("Male") is a linear combination of the intercept column and the dummy variable for "Female." Similarly, the last dummy variable for the BP_Status variable ("Optimal") is a linear combination of the intercept column and the "High" and "Normal" dummy variables. By setting the parameter estimates to zero, the last column in each set of dummy variables does not contribute to the model.

For this reason, the GLM encoding is called a singular parameterization. In my next blog post I will present other ways to parameterize the levels of categorical variables. These different parameterizations lead to nonsingular design matrices.

Post a Comment

A simple trick to include (and order!) all categories in SGPLOT legends

Last week Sanjay Matange wrote about a new SAS 9.4m3 option that enables you to show all categories in a graph legend, even when the data do not contain all the categories. Sanjay's example was a chart that showed medical conditions classified according to the scale "Mild," "Moderate," and "Severe." He showed how to display all these categories in a legend, even if the data set does not contain any instances of "Severe." The technique requires SAS 9.4m3, in which the DATTRMAP= data set supports a special column named SHOW that tells the legend the complete list of categories.

If you are running an earlier version of SAS, you can use a trick that accomplishes the same result. The trick is to create a set of all categories (in the order you want them to appear in the legend) and prepend these "fake observations" to the top of your data set. All other variables for the fake observations are set to missing values. When PROC SGPLOT reads the data for the categorical variable, it encounters all categories. However, the missing values in the other variables prevent the fake observations from appearing in the graph. (The exception is a graph that shows ONLY the categorical variable, but you can handle that case, too.)

Data that excludes a valid category

Let's create a data set that shows the problem. The SasHelp.Heart data set contains observations about patients in a medical study. The data set includes variables for the height and weight of the patient and a categorical variable called Weight_Status that has the values "Underweight," "Normal," and "Overweight." The following DATA step extracts a subset of 200 observations in which no patient is "Underweight." The call to PROC SGPLOT creates a scatter plot of the heights and weights and uses the GROUP= option to color each marker according to the Weight_Status variable.

data Heart;
set Sashelp.Heart;
where Weight >= 125;
keep Height Weight Weight_Status;
if _N_ <= 200;
run;
/* This scatter plot shows three problems: 
   1) The order of GROUP= variable is unspecified (default is GROUPORDER=DATA)
   2) The colors are assigned to the wrong categories
   3) The "Underweight" category is missing from the legend */
title "This scatter plot has problems!";
proc sgplot data=Heart; 
  /* attempt to assign colors for underweight=green, normal=blue, overweight=red */
  styleattrs datacontrastcolors = (lightgreen blue lightred); 
  scatter x=height y=Weight / group=Weight_Status 
                              markerattrs=(size=9 symbol=CircleFilled);
run;
Legend does not show all categories

There are several problems with this scatter plot. I tried to use the STYLEATTRS statement to assign the colors green, blue, and red to the categories "Underweight," "Normal," and "Overweight," respectively. However, that effort was thwarted by the fact that the default order of the colors is determined by the order of the (two!) categories in the data set. How can I get the correct colors, and also get the legend to display the "Underweight" category?

A useful trick: Prepend fake data

The "fake data" trick is useful in many situations, not just for legends. I have used it to specify the order of a categorical variable in a graph or analysis. For example, it is useful for a PROC FREQ analysis because PROC FREQ supports an ORDER=DATA option.

The trick has three steps:

  1. Create a data set that contains only one categorical variable. Specify the complete set of possible values in the order in which you want the values to be displayed.
  2. Use the SET statement in the DATA step to append the real data after the fake data. The DATA step will automatically assign missing values to all unspecified variables for the fake observations. On the SET statement, use the IN= data set option to create a new indicator variable called FREQ. This new variable will have the value 0 for the fake observations and the value 1 for the real observations. (Or, if your data set already has a frequency variable, multiply the existing variable by 0 or 1.)
  3. Use the newly appended data set to plot the data as usual. When you use the GROUP= option, the legends, colors, and order of categories will appear correctly because the data now contains all categories. Missing values prevent the fake observations from appearing in your plots.

The following statements illustrate the three steps for the Weight_Status variable in the Heart data set:

/* Step 1: Create a data set that contains all categories, in order */
data AllCategories;
input Weight_Status $11.;
datalines;
Underweight
Normal
Overweight
;

/* Step 2: Append the fake and real data. Create indicator variable. */
data Heart2;
set AllCategories         /* the fake data, which contains all categories */
    Heart(in=IsRealData); /* the original data */
Freq = IsRealData;        /* 1 for the real data; 0 for the fake data */
run;

/* Step 3: Use appended data set and plot as usual */
title "Include All Categories in Legend";
proc sgplot data=Heart2; 
  styleattrs datacontrastcolors = (lightgreen blue lightred);
  scatter x=Height y=Weight / group=Weight_Status 
                              markerattrs=(size=9 symbol=CircleFilled);
run;
Legend shows all categories

In this graph, the legend displays all possible categories, and the categories appear in the correct order. The STYLEATTRS statement correctly assigns colors to categories because the order of the categories in the data is known.

Graphs of the categorical variable

Adding new observations can create problems if you aren't careful. For example, suppose you use the Heart2 data set and create a bar chart of the Weight_Status variable. Unless you correct for the fake data, the bar chart will show 203 observations and display a bar for the "Underweight" category, which is not part of the original data.

The solution to this dilemma is to use a FREQ= or WEIGHT= option when you create graphs of the modified variable. The DATA step that appended the fake data also added an indicator variable, which you can use to prevent SAS procedures from displaying or analyzing the fake data, as follows:

title "Restrict to Real Data";
proc sgplot data=Heart2; 
  vbar Weight_Status / freq=Freq;   /* do not use the fake data */
run;

Notice that the bar chart shows only the 200 original observations. The FREQ=Freq option uses the indicator variable (Freq) to omit the fake data.
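The same indicator variable can exclude the fake observations from analyses as well as graphs. As a sketch (assuming the Heart2 data set constructed earlier), PROC FREQ accepts the indicator on the WEIGHT statement; the ORDER=DATA option preserves the category order established by the fake data, and the ZEROS option keeps the zero-weight "Underweight" category in the table:

```sas
/* Sketch: analyze only the real data, but keep all categories */
proc freq data=Heart2 order=data;
   tables Weight_Status;
   weight Freq / zeros;   /* fake obs have Freq=0; ZEROS keeps zero-count levels */
run;
```

The resulting one-way table lists "Underweight" with a frequency of zero, followed by the observed counts for "Normal" and "Overweight."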

In summary, by prepending "fake data" to your data set, you can ensure that all categories appear in legends. As a bonus, the same trick enables you to specify the order of categories in a legend. In short, prepending fake data is a useful trick to add to your SAS toolbox of techniques.
