High school rankings of top NCAA wrestlers

Last weekend was the 2016 NCAA Division I wrestling tournament. In collegiate wrestling there are ten weight classes. The top eight wrestlers in each weight class are awarded the title "All-American" to acknowledge that they are the best wrestlers in the country.

I saw a blog post on the InterMat web site that lists each All-American and the wrestler's national ranking when he was a high school senior. The data are interesting, but I wanted a simple graph that visualized the data. I decided to use SAS to create graphs that show the high school ranking for each of this year's 80 All-American wrestlers.

I was also interested in the relationship between high school ranking and placement at the NCAA tournament. Were the top NCAA wrestlers already nationally recognized while still in high school? Or were there some "late bloomers" who perfected their skills after entering college?

Ranking wrestlers

Several web sites, magazines, and organizations try to rank the top 50 or 100 US high school wrestlers in each weight class. It can be a challenge to rank two individuals who have never wrestled each other. However, many of the top contenders in each weight class go to national tournaments, so there is often head-to-head data as well as data for common opponents.

An even more difficult challenge is attempting to rank the best wrestlers regardless of weight classes. How do you compare an undefeated heavyweight with an undefeated lightweight? Nevertheless, people do their best and you can find many internet lists that rank the "best pound-for-pound" wrestlers, boxers, MMA fighters, and so forth.

The high school rankings of the 2016 All-Americans

The InterMat article indicates whether each All-American was ranked in the Top 100 as a high school senior. If so, it gives the wrestler's rank (presumably based on InterMat's own ranking system). If the wrestler was not nationally ranked, it notes whether he was ranked in his weight class (a sort of honorable mention) or whether he was unranked.

After importing the data into SAS, I used a custom format and PROC FREQ to tabulate the high school rankings against the wrestler's place in the NCAA tournament. You can download the data and the SAS program that generates the analyses in this article. The tabular results follow.

[Table: High school rank category versus place at the 2016 NCAA tournament]

Of the wrestlers who finished first at the NCAA tournament, eight had Top 20 status as high school seniors. The results were similar for second-place finishers. However, if you look at fourth place or lower, you can see that a surprisingly large number of All-Americans were unranked in high school. Still, with the exception of the fifth-place finishers, ranked wrestlers account for more than half of the wrestlers at each place (1–8). Overall, 56 out of the 80 All-Americans were nationally ranked in high school.

PROC FREQ can automatically create a mosaic plot that visualizes the tabular results. Because there are exactly ten wrestlers for each place (1–8), the mosaic plot is actually a stacked bar chart for these data. (There is an alternative way to create a stacked bar chart in SAS.)
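The following statements sketch the kind of PROC FREQ call that produces the table and the mosaic plot. The data set name (AllAmer), the variable names (Place and HSRank), and the numeric codes for weight-class-ranked (110) and unranked (120) wrestlers are illustrative assumptions; the downloadable program contains the actual code.

proc format;                       /* collapse the numeric high school rank into categories */
value RankFmt
   1  - 20  = "Top 20"
   21 - 100 = "Top 100"
   110      = "Ranked in Weight Class"
   120      = "Unranked";
run;
 
ods graphics on;
proc freq data=AllAmer;            /* AllAmer and its variables are assumed names */
   format HSRank RankFmt.;
   tables Place * HSRank / norow nocol nopercent
                           plots=mosaicplot;    /* mosaic plot = stacked bar chart here */
run;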

[Figure: Stacked bar chart of high school ranking for each place (1–8)]

For each place, the brown rectangle represents the proportion of place winners who were ranked in the Top 20 in high school. The green rectangle represents the proportion who were not in the Top 20, but were in the Top 100. The pink rectangle shows wrestlers who were ranked in their weight classes. The blue rectangles show All-Americans who were unranked as high school seniors. Those formerly unranked wrestlers are the "late bloomers" who improved markedly and became top college wrestlers.

Association between NCAA place and high school rank

The previous graph shows that most wrestlers who placed first, second, or third were top-ranked high school wrestlers. The InterMat web site includes the exact high school ranking (1–100), so let's plot each wrestler's NCAA place against his high school ranking. To accommodate the wrestlers who were not ranked, I arbitrarily assign the rank "110" to the weight-class-ranked wrestlers. Instead of plotting the value 110 on a graph, I use the abbreviation "WC" for "weight class." I assign the rank "120" to the unranked wrestlers, and label that value by "NR" for "not ranked."

[Figure: Scatter plot of NCAA place versus high school ranking, with a loess smoother and a broken axis]

The scatter plot of place versus high school ranking is shown to the left, along with a loess smoother fit to the data. In order to separate these artificial ranks from the real ranks, I create a broken axis on the graph. The graph indicates that the All-Americans who were very highly ranked in high school placed very well at the NCAA tournament. For example, 14 wrestlers were ranked in the Top 10 in high school. Of those, 10 wrestled in the finals for first or second place, and another four wrestled for third or fourth place.

The association between place and ranking is noticeable until about ranking 25. After that, the loess smoother levels off, which indicates no relationship between high school ranking and placement at the tournament.

I want to emphasize that this sample is not a random selection of collegiate wrestlers, so you cannot conclude that high school ranking predicts success in college wrestling. The graphs show the relationship conditional on the fact that these men became All-Americans. It is a subtle but important distinction.

Feel free to download the SAS program that created these graphs. Although in general I am not a fan of broken axes, I think they are useful in this case because they make it clear that the ranks 1–100 are different from the ranks "WC" and "NR". See Sanjay Matange's blog for more conventional applications of broken axes.
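If you want to experiment with the graph, the following PROC SGPLOT statements sketch one way to create a broken-axis scatter plot with a loess smoother. The data set and variable names (AllAmer, HSRank, Place) and the tick values are assumptions; the downloadable program contains the actual code.

proc sgplot data=AllAmer noautolegend;
   loess x=HSRank y=Place;                      /* scatter plot overlaid with a loess smoother */
   xaxis label="High School Ranking"
         values=(0 20 40 60 80 100 110 120)
         valuesdisplay=("0" "20" "40" "60" "80" "100" "WC" "NR")
         ranges=(0-102 105-125);                /* broken axis separates real ranks from WC/NR */
   yaxis label="Place at NCAA Tournament" values=(1 to 8);
run;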

This analysis sends a clear message to high school wrestlers who are not nationally ranked: With hard work you can still become a premier collegiate athlete. At the same time, it clearly supports another truism: Many of the best athletes in college were also high school stars.


Nonparametric regression for binary response data in SAS

My previous blog post shows how to use PROC LOGISTIC and spline effects to predict the probability that an NBA player scores from various locations on a court. The LOGISTIC procedure fits parametric models, which means that the procedure estimates parameters for every explanatory effect in the model. Spline bases enable you to fit complex models, but it is easy to generate many spline effects, which means that you need to be careful not to overfit the data.

In contrast, modern nonparametric models enable you to balance the complexity of a model with the goodness of fit, thus reducing the likelihood of overfitting the data. SAS provides several procedures that fit nonparametric regression models for a binary response variable. Options include:

  • Use variable selection techniques in PROC LOGISTIC or PROC HPGENSELECT to allow the data to select the effects that best model the data. Variable selection creates a hybrid analysis that has properties of nonparametric models while preserving the interpretability of parametric models.
  • Use the GAMPL procedure in SAS/STAT 14.1 (SAS 9.4m3) to fit the data. The GAMPL procedure uses penalized likelihood (PL) methods to fit generalized additive models (GAM).

Other choices in SAS/STAT software include the ADAPTIVEREG procedure, which combines splines with variable selection techniques, and the HPSPLIT procedure, which is a tree-based classification procedure. Both procedures were introduced in SAS/STAT 12.1.

Generalized additive models in SAS

Generalized additive models use spline effects to model nonlinear relationships in data. A smoothing penalty is applied to each spline term in an attempt to model nonlinear features without overfitting the data. For details and examples, you can read the GAMPL documentation or watch a video about PROC GAMPL.

The syntax for the GAMPL procedure is similar to the familiar syntax for PROC LOGISTIC or PROC GENMOD. You can specify spline effects and the distribution of the response variable. The following statement uses a two-dimensional thin-plate spline to model the probability of Stephen Curry scoring from various shooting locations. The data are from Robert Allison's blog "How to graph NBA data with SAS." You can download the complete SAS program that produces the graphs in this post.

proc gampl data=Curry;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = Spline(X Y / maxdf=40) / dist=binary;
   id X Y Shot_Made;
   output out=GamPLOut;
run;

The OUTPUT statement saves the predicted probabilities to a data set. The option MAXDF=40 tells the procedure to consider up to 40 degrees of freedom for the spline effect and to choose the smoothing parameter that provides the best tradeoff between model complexity and goodness of fit. For the Stephen Curry data, the optimal smoothing parameter results in 14.7 degrees of freedom.

GAMPL Analysis of Curry Data

You can use the graph template language (GTL) to create a contour plot of the predicted probabilities. The contour map is qualitatively similar to the probabilities that were predicted by the PROC LOGISTIC analysis in my previous post. There is an area of high probability near the basket at (0,0). The probabilities on the right side of the graph are lower than on the left. There is a "hot spot" on the left side of the graph, which corresponds to a high probability that Curry will score from that region.
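The following statements show one way to define such a contour plot in GTL. This is only a sketch: it assumes that the predicted probability in the GamPLOut data set is named Pred and that the predictions lie on a regular grid of (X, Y) locations, which the CONTOURPLOTPARM statement expects. The downloadable program contains the actual template.

proc template;
define statgraph ShotContour;
   begingraph;
      entrytitle "Predicted Probability of a Made Shot (PROC GAMPL)";
      layout overlay;
         /* Pred is an assumed name; the data must lie on a regular (X,Y) grid */
         contourplotparm x=X y=Y z=Pred / contourtype=gradient;
      endlayout;
   endgraph;
end;
run;
 
proc sgrender data=GamPLOut template=ShotContour;
run;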

Verify: The fundamental principle of nonparametric analysis

I initially view the results of any nonparametric analyses with skepticism. I trust the mathematics behind the methods, but I need to be convinced that a qualitative feature in the predicted values is real and not merely an artifact of some complicated nonparametric witchcraft.

There are many statistical techniques that enable you to evaluate whether a model fits data well, but it is wise to perform a basic "sanity check" by using a different nonparametric procedure to analyze the same data. If the two analyses reveal the same qualitative features in the data, that is evidence that the features are truly present. Conversely, if two models produce different qualitative features, then I question whether either model is accurate. I call this sanity check the fundamental principle of nonparametric analysis: Trust, but verify.

Let's apply the fundamental principle to the NBA data by running PROC ADAPTIVEREG:

proc adaptivereg data=Curry plots;
   where Shot_Distance <= 30;
   model Shot_Made(event='Made') = X Y / dist=binary;
   output out=AdaptiveOut p(ilink);
run;
ADAPTIVEREG Analysis of Curry Data

The PROC ADAPTIVEREG analysis is shown to the left. The contour plot shows the same qualitative features that were apparent from the LOGISTIC and GAMPL analyses. Namely, the probability of scoring is high under the basket, low to the right, average up the middle, and high on the left. Seeing these features appear in several analyses gives me confidence that these features of the data are real. After verifying that the models are qualitatively similar, you can investigate which model is better, perhaps by splitting the data into subsets for model training, validation, and testing.

Summary

This article briefly introduced two nonparametric procedures in SAS that can analyze binary response variables and other response distributions. The two analyses produced qualitatively similar predictions on sample data. The fundamental principle of nonparametric analysis is a meta-theorem that says that you should verify the qualitative predictions of a nonparametric model. Reproducibility is a necessary (but not sufficient) condition to believe that a feature is real and not spurious. For this example, all analyses agree that Stephen Curry shoots better from one side of the court than from the other.


A statistical analysis of Stephen Curry's shooting

[Figure: Locations of Stephen Curry's shots: made shots in red, missed shots in blue]

Last week Robert Allison showed how to download NBA data into SAS and create graphs, such as a scatter plot of the locations where Stephen Curry took shots in the 2015-16 season to date. The graph at left shows the kind of graph that Robert created. I've reversed the colors from Robert's version, so that red indicates "good" (a basket was scored) and blue indicates "bad" (a missed shot). The location of the NBA three-point line is evident from the many markers that form an arc in the scatter plot.

When I saw the scatter plot, I knew that I wanted to add some statistical analysis. In particular, I wanted to use SAS to construct a statistical model that estimates the probability that Curry scores from any position on the basketball court.

This article focuses on the results of the analysis. You can download the SAS program that generates the analyses and graphics. Although this article analyzes Stephen Curry, you can modify the SAS program to analyze Kevin Durant, Lebron James, or any other basketball player.

Probability as a function of distance

The first analysis estimates the probability that Curry makes a basket solely as a function of his distance from the basket. Curry is known for his consistent ability to make three-point shots. A three-point shot in the NBA requires that a player shoot from at least 22 feet away (when near the baseline) or 23.9 feet away (when further up the court).

[Figure: Predicted probability of a made shot as a function of distance to the basket]

You can use logistic regression to model the probability of making a shot as a function of the distance to the basket. The adjacent plot shows the result of a logistic regression analysis in SAS. The model predicts a probability of 0.7 that Curry will make a shot from under the basket, a probability of 0.5 from 20 feet away, and a probability of 0.46 from the three-point arc, indicated by the vertical gray line at 23.9 feet. Recall that a probability of 0.46 is equivalent to predicting that Curry will sink 46% of shots from the three-point arc.
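A minimal sketch of this one-variable model is shown below. It assumes the same Curry data set and variable names (Shot_Made and Shot_Distance) that appear in the nonparametric analysis earlier on this page; the downloadable program might use additional options such as spline effects.

proc logistic data=Curry;
   model Shot_Made(event='Made') = Shot_Distance;  /* probability of a made shot vs. distance */
   effectplot fit(x=Shot_Distance);                /* plot the predicted probability curve */
run;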

Almost all (98.3%) of Curry's shots were taken from 30 feet or closer, and the shots taken from beyond 30 feet were end-of-quarter "Hail Mary" heaves. Therefore, the remaining analyses restrict to shots that were from 30 feet or closer.

Probability as a function of angle and distance

The previous analysis only considers the distance from the basket. It ignores the position of the shot relative to the basket. In general, the probability of scoring depends on the location from which the shot was launched.

For consistency, let's agree that "right" and "left" means the portion of the court as seen by a fan sitting behind the backboard. This is, of course, opposite of what Curry would see when coming down the court toward the basket. Our "right" is Curry's left.

[Figure: Diagram of the angle assigned to each position on the court]

One way to model the positional dependence is to incorporate the angle relative to the backboard. The diagram shows one way to assign an angle to each position on the court. In the diagram, 90 degrees indicates a perpendicular shot to the basket, such as from the top of the key. An angle of 0 indicates a "baseline shot" from the right side of the court. Similarly, an angle of 180 degrees means a baseline shot from the left side of the court.

The following panel of graphs is the result of a logistic regression analysis that includes the interaction between angle and distance. The vertical lines in some plots indicate the distance to the sideline at particular angles. For 0 and 180 degrees, the distance from the basket to the sideline is 25 feet.

[Figure: Panel of predicted probability versus distance for several shot angles]

The panel of plots shows that Curry is most accurate when he shoots from the left side of the court. (The left side corresponds to angles greater than 90 degrees, which are on the left side of the panel.) Remarkably, the model estimates that Curry's probability of making a shot from the left side barely depends on the distance from the basket! He is a solid shooter (probability 0.5, which is 50%) from the left baseline (Angle = 180) and from a slight angle (Angle = 150). The previous scatter plot shows that he shoots many shots from the 120 degree angle. This analysis shows that he is uncannily accurate from 20 and even 30 feet away, although the probability of scoring decreases as the distance increases.

On the right side of the court (angles less than 90 degrees), Curry's probability of making a shot depends more strongly on the distance to the basket. Near the basket, the model predicts a scoring probability of 0.6 or more. However, the probability drops dramatically as the distance increases. On the right side of the court, Curry is less accurate from 20 or more feet than for the same distance on the other side. At three-point range, Curry's probability of making a shot on the right (his left) drops to "only" 0.4. The probability drops off most dramatically when Curry shoots from the baseline (Angle = 0).
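The following statements sketch one way to fit and visualize an angle-by-distance model. The computation of the Angle variable assumes that the basket is at (0,0) and follows the convention in the diagram; the variable names and the slicing values are illustrative rather than the exact code that produced the panel.

data CurryAngle;
   set Curry;
   where Shot_Distance <= 30;
   /* assumed convention: 0 = right baseline, 90 = straight on, 180 = left baseline */
   Angle = atan2(Y, X) * (180 / constant('pi'));
run;
 
proc logistic data=CurryAngle;
   model Shot_Made(event='Made') = Angle | Shot_Distance;  /* main effects and interaction */
   effectplot slicefit(x=Shot_Distance sliceby=Angle=0 30 60 90 120 150 180);
run;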

Probability as a function of position

A logistic analysis is a parametric model, which means that the analyst must specify the explanatory variables in the model and also the way that those variables interact with each other. This often leads to simplistic models, such as a linear or quadratic model. A simple model is often not appropriate for modeling the scoring probability as a function of the Cartesian X and Y positions on the court because a simple model cannot capture local spatial variations in the data.

SAS provides several possibilities for nonparametric modeling of data, but let's stick with logistic regression for now. Many SAS regression procedures, including PROC LOGISTIC, support using an EFFECT statement to generate spline effects for a variable. A spline effect expands a variable into spline bases. Spline effects enable you to model complex nonlinear behavior without specifying an explicit form for the nonlinear effects. The following graph visualizes such a model.

[Figure: Heat map of the predicted probability of scoring, overlaid with shot locations]

The image shows a scatter plot of the location of shots overlaid on a heat map that shows the predicted probability of Curry sinking a basket from various locations on the court. To better show the shot locations, the image has been stretched vertically. As mentioned previously, the location with the highest predicted probability is under the basket. From farther away, the predicted probability varies according to direction: to the left of the basket the probability is about 0.5, whereas a 15-foot jumper in front of the basket has probability 0.6. Notice the relative abundance of blue color (low probability) for shots on the right side. The lowest probability (about 0.3) occurs just beyond the three-point line at about a 60 degree angle, which agrees with the previous analysis. The same distance on the left side of the court is a much lighter shade of whitish-blue that corresponds to almost 0.5 probability.
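A sketch of this kind of spline-effect model follows. The spline options and the interaction term are assumptions; the downloadable program contains the exact model that produced the heat map.

proc logistic data=Curry;
   where Shot_Distance <= 30;
   effect spl = spline(X Y);                  /* expand X and Y into spline bases */
   model Shot_Made(event='Made') = spl X*Y;   /* spline effects plus a simple interaction */
   output out=LogiOut p=PredProb;             /* predicted probabilities for a heat map */
run;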

Statisticians will wonder about how well the model fits the data. The Pearson goodness-of-fit test indicates that the spline fit is not great, which is not surprising for a parametric fit to this kind of spatial data. In a follow-up post I use SAS to create nonparametric predictive models for the same data.

Conclusions

SAS programmers will appreciate the fact that "effect plots" in this article were generated automatically by PROC LOGISTIC. By using the EFFECT statement and the EFFECTPLOT statement, it is simple to create graphs that visualize the predictions for a logistic regression model.

These graphs show that in general Stephen Curry is a phenomenal shooter who has a high probability of scoring from even a long distance. Logistic regression was used to model the probability that Curry makes a shot from various angles and locations on the court. The analysis indicates that Curry shoots better from his right side, especially from three-point range.


Simulate from the multinomial distribution in the SAS DATA step

There are several ways to simulate multinomial data in SAS. In the SAS/IML matrix language, you can use the RANDMULTINOMIAL function to generate samples from the multinomial distribution. If you don't have a SAS/IML license, I have previously written about how to use the SAS DATA step or PROC SURVEYSELECT to generate multinomial data.
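For completeness, here is a minimal SAS/IML sketch that uses the RANDMULTINOMIAL function with the same parameters that are used later in this article:

proc iml;
call randseed(12435);
prob = {0.5 0.2 0.3};                    /* probabilities for the k=3 categories */
X = randmultinomial(1000, 100, prob);    /* 1000 draws; each draw has N=100 trials */
print (X[1:5, ])[label="First 5 multinomial draws" colname={"x1" "x2" "x3"}];
quit;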

The DATA step method I used in that previous article was a brute force method called the "direct method." This article shows how to simulate multinomial data in the DATA step by using a more efficient algorithm.

The direct method for simulating multinomial data

The direct method constructs each multinomial observation by simulating the underlying process, which can be thought of as drawing N balls (with replacement) from an urn that contains balls of k different colors. The parameters to the multinomial distribution are N (the number of balls to draw) and (p1, p2, ..., pk), which is a vector of probabilities. Here pi is the probability of drawing a ball of the ith color and Σi pi = 1.

In the direct method, you simulate one multinomial draw by explicitly generating N balls and counting the number of each color. The distribution of counts (N1, N2, ..., Nk) follows a multinomial distribution, where N = Σi Ni. The direct method runs quickly if N is small and you simulate a relatively small multinomial sample. For example, when N=100 you can simulate thousands of multinomial observations almost instantly.
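For comparison with the conditional method that follows, here is a sketch of the direct method in the DATA step. It draws each of the N items from the "Table" distribution and tallies the counts; the variable names mirror the conditional-method program below.

data MN_Direct;
call streaminit(12435);
array probs{3} _temporary_ (0.5 0.2 0.3);  /* prob of drawing item 1, 2, 3 */
array x{3};                                /* counts for each item */
do obs = 1 to 1000;                        /* number of multinomial observations */
   do j = 1 to dim(x);  x[j] = 0;  end;    /* reset the counts */
   do i = 1 to 100;                        /* draw N=100 items, one at a time */
      k = rand("Table", of probs[*]);      /* category of the ith item */
      x[k] = x[k] + 1;                     /* increment the count for that category */
   end;
   output;
end;
keep x:;
run;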

The conditional method for simulating multinomial data

If N is large or you plan to generate a large number of observations, there is a more efficient way to simulate multinomial data in the SAS DATA step. It is called the "conditional method" and it uses the fact that you can think of a multinomial draw as being a series of binomial draws (Gentle, 2003, pp. 198-199).

Think about generating the Ni in sequence. The first count, N1, follows the binomial distribution Bin(p1, N). If you generate a specific value n1, then there are N - n1 items left to draw. Conditional on that value, the probability of drawing an item from the second category is p2 / (1 - p1), which implies that N2 ~ Bin(p2 / (1 - p1), N - n1).

Continue this process. If the first i-1 counts have been drawn, then Ni ~ Bin(pi / (1 - p1 - ... - pi-1), N - n1 - ... - ni-1). This leads to the following efficient simulation method for multinomial observations:

/* generate multinomial sample by using conditional method */
%let SampleSize = 1000;             /* number of observations in MN sample */
%let N = 100;                       /* number of trials in each MN draw */
data MN;
call streaminit(12435);
array probs{3} _temporary_ (0.5 0.2 0.3); /* prob of drawing item 1, 2, 3 */
array x{3};                         /* counts for each item */
do obs = 1 to &SampleSize; 
   ItemsLeft = &N;                  /* how many items remain? */
   cumProb = 0;                     /* cumulative probability */
   do i = 1 to dim(probs)-1;        /* loop over the first k-1 categories */
      p = probs[i] / (1 - cumProb);
      x[i] = rand("binomial", p, ItemsLeft);     /* binomial draw */
      ItemsLeft = ItemsLeft - x[i]; /* decrement size by selection */
      cumProb = cumProb + probs[i]; /* adjust prob of next binomial draw */
   end;
   x[dim(probs)] = ItemsLeft;       /* remaining items go into last category */
   output;
end;
keep x:;
run;
 
title "Multinomial Distribution, p={0.5 0.2 0.3}, N=100";
proc kde data=MN;
   bivar x1 x2 / bwm=1.5 plots=ContourScatter;
run;
Simulate Multinomial data with the SAS DATA step

Whereas the direct method requires an inner DO loop that performs N iterations, the conditional method requires only k-1 iterations, where k is the number of categories being drawn. In the example, N=100, whereas k=3. For 1,000 observations, the direct method must generate 100,000 values from the "Table" distribution, whereas the conditional method generates only 2,000 values from the binomial distribution.

The graph shows 1,000 observations from the multinomial distribution with N=100 and p={0.5, 0.2, 0.3}. Because of overplotting, you cannot see all 1,000 observations, so the scatterplot is overlaid on a kernel density estimate. The graph shows the counts for the first two categories; the third count is determined by the fact that the counts sum to 100. Notice that the distribution of counts is centered near the expected values x1=50 and x2=20.

In summary, if you want to simulate multinomial data by using the SAS DATA step, the algorithm in this article is more efficient than the brute-force direct computation. This algorithm simulates a multinomial vector conditionally as a series of binomial draws.


Monte Carlo estimates of pi and an important statistical lesson

Today is March 14th, which is annually celebrated as Pi Day. Today's date, written as 3/14/16, represents the best five-digit approximation of pi. On Pi Day, many people blog about how to approximate pi. This article uses a Monte Carlo simulation to estimate pi, in spite of the fact that "Monte Carlo methods are ... not a serious way to determine pi" (Ripley 2006, p. 197). However, this article demonstrates an important principle for statistical programmers that can be applied 365 days of the year. Namely, I describe two seemingly equivalent Monte Carlo methods that estimate pi and show that one method is better than the other "on average."

Monte Carlo estimates of pi

[Figure: The function f(x) = sqrt(1 – x²) on [0,1] forms a quarter circle]

To compute Monte Carlo estimates of pi, you can use the function f(x) = sqrt(1 – x²). The graph of the function on the interval [0,1] is shown in the plot. The graph of the function forms a quarter circle of unit radius. The exact area under the curve is π / 4.

There are dozens of ways to use Monte Carlo simulation to estimate pi. Two common Monte Carlo techniques are described in an easy-to-read article by David Neal (The College Mathematics Journal, 1993). The first is the "average value method," which uses random points in an interval to estimate the average value of a continuous function on the interval. The second is the "area method," which enables you to estimate areas by generating a uniform sample of points and counting how many fall into a planar region.

The average value method

In calculus you learn that the average value of a continuous function f on the interval [a, b] is given by the integral

    f_avg = 1/(b – a) ∫[a,b] f(x) dx

In particular, for f(x) = sqrt(1 – x²), the average value is π/4 because the integral is the area under the curve. In symbols,

    π/4 = ∫[0,1] sqrt(1 – x²) dx = f_avg

If you can estimate the left-hand side of the equation, you can multiply the estimate by 4 to estimate pi. Recall that if X is a uniformly distributed random variable on [0,1], then Y = f(X) is a random variable on [0,1] whose mean is f_avg. It is easy to estimate the mean of a random variable: you draw a random sample and compute the sample mean. The following SAS/IML program generates N=10,000 uniform variates in [0,1] and uses those values to estimate f_avg = E(f(X)). Multiplying that estimate by 4 gives an estimate for pi.
proc iml;
call randseed(3141592);       /* use digits of pi as a seed! */
N = 10000;
u = randfun(N, "Uniform");    /* U ~ U(0, 1) */
Y = sqrt(1 - u##2);
piEst1 = 4*mean(Y);           /* average value of a function */
print piEst1;
[Output: First Monte Carlo estimate of pi]

In spite of generating a random sample of size 10,000, the estimate is only within 0.01 of the true value of pi. This doesn't seem to be a great estimate. Maybe this particular sample was "unlucky" due to random variation. You could generate a larger sample size (like a million values) to improve the estimate, but instead let's see how the area method performs.

The area method

Consider the same quarter circle as before. If you generate a 2-D point (an ordered pair) uniformly at random within the unit square, then the probability that the point is inside the quarter circle is equal to the ratio of the area of the quarter circle divided by the area of the unit square. That is, P(point inside circle) = Area(quarter circle) / Area(unit square) = π/4. It is easy to use a Monte Carlo simulation to estimate the probability P: generate N random points inside the unit square and count the proportion that fall in the quarter circle. The following statements continue the previous SAS/IML program:
u2 = randfun(N, "Uniform");   /* U2 ~ U(0, 1) */
isBelow = (u2 < Y);           /* binary indicator variable */
piEst2 = 4*mean(isBelow);     /* proportion of (u, u2) under curve */
print piEst2;
[Output: Second Monte Carlo estimate of pi]

The estimate is within 0.0008 of the true value, which is closer than the estimate from the average value method. Can we conclude from one simulation that the second method is better at estimating pi? Absolutely not! Longtime readers might remember the article "How to lie with a simulation," in which I intentionally chose a random number seed that produced a simulation that gave an uncharacteristic result. The article concluded by stating that when someone shows you the results of a simulation, you should ask to see several runs or to "determine the variance of the estimator so that you can compute the Monte Carlo standard error."

The variance of the Monte Carlo estimators

I confess: I experimented with many random number seeds before I found one that generated a sample for which the average value method produced a worse estimate than the area method. The truth is, the average value method usually gives better estimates. To demonstrate this fact, the following statements generate 1,000 estimates for each method. For each set of estimates, the mean (called the Monte Carlo estimate) and the standard deviation (called the Monte Carlo standard error) are computed and displayed:
/* Use many Monte Carlo simulations to estimate the variance of each method */
NumSamples = 1000;
pi = j(NumSamples, 2);
do i = 1 to NumSamples;
   call randgen(u, "Uniform");         /*  U ~ U(0, 1) */
   call randgen(u2, "Uniform");        /* U2 ~ U(0, 1) */
   Y = sqrt(1 - u##2);
   pi[i,1] = 4*mean(Y);        /* Method 1: Avg function value */
   pi[i,2] = 4*mean(u2 < Y);   /* Method 2: proportion under curve */
end;
 
MCEst = mean(pi);              /* mean of estimates = MC est */
StdDev = std(pi);              /* std dev of estimates = MC std error */
print (MCEst//StdDev)[label="Comparison of Methods"
                      colname={"Avg Func" "Area"}
                      rowname={"MC Estimate" "MC StdErr"}];
[Output: Comparison of the two Monte Carlo methods]

Now the truth is revealed! Both estimators provide a reasonable approximation of pi, but the estimate from the average function method is better. More importantly, the standard error for the average function method is about half as large as for the area method. You can visualize this result if you overlay the histograms of the 1,000 estimates for each method. The following graph shows the distribution of the two methods. The average function estimates are in red. The distribution is narrow and has a sharp peak at pi. In contrast, the area estimates are shown in blue. The distribution is wider and has a less pronounced peak at pi. The graph indicates that the average function method is more accurate because it has a smaller variance.

[Figure: Overlaid distributions of the two Monte Carlo estimates of pi]

Estimating pi on Pi Day is fun, but this Pi Day experiment teaches an important lesson that is relevant 365 days of the year: If you have a choice between two ways to estimate some quantity, choose the method that has the smaller variance. For Monte Carlo estimation, a smaller variance means that you can use fewer Monte Carlo iterations to estimate the quantity. For the two Monte Carlo estimates of pi that are shown in this article, the method that computes the average function value is more accurate than the method that estimates area. Consequently, "on average" it will provide better estimates.

Comparative histograms: Panel and overlay histograms in SAS

You can use histograms to visualize the distribution of data. A comparative histogram enables you to compare two or more distributions, which usually represent subpopulations in the data. Common subpopulations include males versus females or a control group versus an experimental group. There are two common ways to construct a comparative histogram: you can create a panel of histograms, or you can overlay histograms in a single graph. This article shows how to create comparative histograms in SAS.

Sanjay Matange and I have each written multiple previous articles on this topic. This article collects many of the ideas in one place. In the SAS 9.2 and SAS 9.3 releases, the graph template language (GTL) was required to construct some of these graphs. However, thanks to recent features added to PROC SGPLOT, PROC SGPANEL, and PROC UNIVARIATE, you can now create comparative histograms in SAS without writing any GTL.

Panel of histograms

Comparative histogram of three groups in a panel of histograms

A panel of histograms enables you to compare the data distributions of different groups. You can create the histograms in a column (stacked vertically) or in a row. I usually prefer a column layout because it enables you to visualize the relative locations of modes and medians in the data.

In SAS, you can create a panel of histograms by using PROC UNIVARIATE or by using PROC SGPANEL. Both procedures require that the data be in "long form": one continuous variable that specifies the measurements and another categorical variable that indicates the group to which each measurement belongs. If your data are in "wide form," you can convert the data from wide form to long form.

To use PROC UNIVARIATE, specify the categorical variable on the CLASS statement and the continuous variable on the HISTOGRAM statement. For example, the following example compares the distribution of the SepalLength variable for each of the three values of the Species variable in the Sashelp.Iris data:

proc univariate data=sashelp.iris;
  class Species;
  var SepalLength;      /* computes descriptive statistics */
  histogram SepalLength / nrows=3 odstitle="PROC UNIVARIATE with CLASS statement";
  ods select histogram; /* display only the histograms */
run;

The result is shown at the beginning of this section. The graph suggests that the median value of the SepalLength variable differs between levels of the Species variable. Furthermore the variance of the "Virginica" group is larger than for the other groups.

You can create similar graphs by using the SGPANEL procedure, which supports a wide range of options that control the layout. Specify the Species variable in the PANELBY statement and the SepalLength variable in the HISTOGRAM statement. The following call to PROC SGPANEL creates a comparative histogram:

title "PROC SGPANEL with PANELBY statement";
proc sgpanel data=sashelp.iris;
  panelby Species / rows=3 layout=rowlattice;
  histogram SepalLength;
run;

The graph produced by PROC SGPANEL is similar to the previous graph.

With the GTL you can create more complicated panel displays than are shown here. For example, Sanjay shows how to create mirrored histograms, which are sometimes used for population pyramids.

Overlay histograms

For comparing the distributions of three or more groups, I recommend a panel of histograms. However, for two groups you might want to overlay the histograms. You can use the TRANSPARENCY= option in PROC SGPLOT statements so that both histograms are visible, even when the bars overlap. The portions of the bars that overlap are shown in a blended color.

In the HISTOGRAM statement of PROC SGPLOT, you can use the GROUP= option to specify the variable that indicates group membership. The GROUP= option overlays the histograms for each group, as the following example shows:

proc sgplot data=sashelp.iris;
  where Species in ("Setosa", "Versicolor");       /* restrict to two groups */
  histogram SepalLength / group=Species transparency=0.5;       /* SAS 9.4m2 */
  density SepalLength / type=kernel group=Species; /* overlay density estimates */
run;
Overlay histograms

In this graph I added density estimates to help the eye visualize the basic shape of the two histograms. The purple region shows the overlap between the two distributions. For more than two categories, you might want to omit the histograms and just overlay the density estimates. The graph combines the first two rows of the panel in the previous section. The overlay enables you to compare the two subpopulations without your eye bouncing back and forth between rows of a panel.

The GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4m2. You can create the same graph in PROC UNIVARIATE by using the OVERLAY option in the HISTOGRAM statement. The OVERLAY option requires SAS 9.4m3.
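For example, the following statements sketch the PROC UNIVARIATE approach for the same two species (SAS 9.4m3 or later):

proc univariate data=sashelp.iris;
  where Species in ("Setosa", "Versicolor");  /* restrict to two groups */
  class Species;
  histogram SepalLength / overlay odstitle="PROC UNIVARIATE with OVERLAY option";
  ods select histogram;                       /* display only the histogram */
run;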

Overlay histograms of different variables

Because PROC SGPLOT enables you to use more than one HISTOGRAM statement, you can also overlay the histograms of different variables.

When comparing histograms it is best that both histograms use the same bin width and anchor locations. Prior to SAS 9.3, you could overlay histograms by using the graph template language (GTL). However, SAS 9.3 introduced support for the BINWIDTH= and BINSTART= options in the HISTOGRAM statement in PROC SGPLOT. Therefore you can force the histograms to have a common bin width, as shown in the following example:

title "Overlay Histograms with PROC SGPLOT";
proc sgplot data=Sashelp.Iris;
  histogram PetalLength / binwidth=5 transparency=0.5
               name="petal" legendlabel="Petal Length";
  histogram SepalLength / binwidth=5 transparency=0.5
               name="sepal" legendlabel="Sepal Length";
  density PetalLength / type=kernel lineattrs=GraphData1;  /* optional */
  density SepalLength / type=kernel lineattrs=GraphData2;  /* optional */
  xaxis label="Length (mm)" min=0;
  keylegend "petal" "sepal" / across=1 position=TopRight location=Inside;
run;
Overlay histograms

Summary

In summary, SAS provides multiple ways to use histograms to compare the distributions of data. To obtain a panel of histograms, the data must be in the "long" format. You can then:

  • Use the CLASS statement in PROC UNIVARIATE to specify the grouping variable. This is a good choice if you also want to compute descriptive statistics or fit a distribution to the data.
  • Use PROC SGPANEL, which provides you with complete control over the layout of the panel, axes, and other graphical options.

If you only have two groups and you want to overlay partially transparent histograms, you can do the following:

  • Use the GROUP= option in the HISTOGRAM statement of PROC SGPLOT (requires SAS 9.4m2).
  • Use the OVERLAY option in the HISTOGRAM statement of PROC UNIVARIATE (requires SAS 9.4m3).

Lastly, if you have two variables to compare, you can use two HISTOGRAM statements. Be sure to use the BINWIDTH= option (and optionally the BINSTART= option), which requires SAS 9.3.

The comparative histogram is not a perfect tool. You can also use spread plots and other techniques. However, for many situations a panel of histograms or an overlay of histograms provides an effective way to visually compare the distributions of data in several groups.


How to use COLLECTION effects to specify pairwise interactions in SAS

Most SAS regression procedures support the "stars and bars" operators, which enable you to create models that include main effects and all higher-order interaction effects. You can also easily create models that include all n-way interactions up to a specified value of n. However, it can be a challenge to specify models that include many—but not all!—higher-order interactions. This article describes a little-known trick: you can use COLLECTION effects to specify interaction terms.

Stars and Bars: Building models with interaction terms in SAS

Many of the regression procedures in SAS (such as GLM, GENMOD, LOGISTIC, MIXED,...) support the bar operator (|) to specify all interactions between effects. For example, the following MODEL statement specifies that the model should include all main effects and all higher-order interactions:

proc logistic;
   model Y = x1 | x2 | x3 | x4;   /* all main effects and interactions */
run;

The previous MODEL statement includes all two-way, three-way, and four-way interaction effects. The statement is equivalent to the following statement that uses the star operator (*) to explicitly specify each interaction term:

model Y = x1 x2 x3 x4                         /* all main effects */
          x1*x2 x1*x3 x1*x4 x2*x3 x2*x4 x3*x4 /* all two-way interactions */
          x1*x2*x3 x1*x2*x4 x1*x3*x4 x2*x3*x4 /* all three-way interactions */
          x1*x2*x3*x4;                        /* all four-way interactions */

Fitting a model with so many effects will lead to overfitting, so in practice an analyst might restrict the model to two-way interactions. Again, SAS supplies an easy syntax. You can use the "at" operator (@) to specify the highest interaction terms in the model. For example, the following syntax specifies that the model contains only main effects and two-way interactions:

model Y = x1 | x2 | x3 | x4 @2;   /* main effects and two-way interactions */

Specifying many, but not all, interaction terms

Unfortunately, there is no simple syntax for constructing many, but not all, interaction effects. This can be frustrating when there is a structure to the interaction terms. A common structure is that there are two lists of variables and you want to build all interactions that involve one effect from the first list and one effect from the second list.

For example, suppose you want to create the following interaction effects:
c1*x1 c1*x2 c2*x1 c2*x2
The interaction terms are the pairwise combinations of the variables {c1 c2} with the variables {x1 x2}. Note, however, that within-list interactions are not desired: there are no terms for c1*c2 or x1*x2.

It would be great to have some kind of shorthand notation that tells SAS to "cross all elements in the first list with all elements in the second list." A natural syntax would be
(c1 c2) | (x1 x2)
but unfortunately that syntax is not supported.

Some SAS programmers might use the macro language to generate all pairwise interactions between two lists of variables, but COLLECTION effects offer an easier way.

COLLECTION effects

More than a dozen regression procedures in SAS support the EFFECT statement. According to the documentation, the EFFECT statement generates "special collections of columns for design matrices." In particular, the so-called COLLECTION effect enables you to specify multiple variables that are "considered as a unit."

As a colleague recently reminded me, you can use COLLECTION effects to specify interactions. If V and W are two collection effects, then V*W contains all pairwise interactions of the individual variables in V with the individual variables in W. Similarly, V | W contains all main effects and the pairwise interaction effects.

As an example of using COLLECTION effects, the following model uses two classification variables and four continuous variables in the Sashelp.Heart data. Here is the model specified in the usual way:

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   model Status = BP_Status Sex Cholesterol Height Weight MRW
         BP_Status*Cholesterol BP_Status*Height BP_Status*Weight BP_Status*MRW
               Sex*Cholesterol       Sex*Height       Sex*Weight       Sex*MRW;
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm1;
run;

Manually enumerating all those interaction terms requires a lot of typing. More importantly, the enumeration does not make it clear that the interaction terms are the pairwise interactions between the classification variables and the continuous variables. In contrast, the following statements use COLLECTION effects to define two sets of variables. The MODEL statement uses the familiar bar operator to form all main effects and pairwise interactions between the variables.

proc logistic data=Sashelp.Heart;
   class BP_Status Sex;
   effect V = collection(BP_Status Sex);                     /* one list     */ 
   effect W = collection(Cholesterol Height Weight MRW);     /* another list */ 
   model Status = V | W;      /* vars and interactions between the var lists */
   ods select ParameterEstimates;
   ods output ParameterEstimates = Parm2;
run;

The second model statement is more concise. The two models produce equivalent predictions, but the second is much easier to type and to interpret.

You can use PROC COMPARE to show that the parameter estimates are the same (to eight decimal places), and therefore the predicted values will be the same. Because the order of the parameters differs between models, the parameter estimates are sorted before running the comparison.

proc sort data=Parm1; by Estimate; run;
proc sort data=Parm2; by Estimate; run;
 
proc compare brief method=absolute criterion=1e-8
             base   =Parm1(drop=Variable)
             compare=Parm2(drop=Variable ClassVal:);
run;
NOTE: All values compared are within the equality criterion used.

This use of the COLLECTION effect is somewhat nonstandard. SAS introduced COLLECTION effects for variable selection routines such as the "group LASSO" as a way to specify that all variables in the collection should be included in the model, or all should be excluded. The variables enter or leave the model "as a unit."

Although most tables and statistics from PROC LOGISTIC are the same for the two models, there are differences. One difference is the "Type 3 Analysis of Effects," which tests whether all the parameters associated with an effect are zero. The first call to PROC LOGISTIC analyzes 14 effects; the second call analyzes three (collection) effects.

In summary, the EFFECT statement provides a way to treat sets of variables "as a unit." This leads to a simple syntax for forming specific interaction effects. The example in this article showed how to create pairwise interactions, but the COLLECTION effects can also be used to specify higher-order interactions.


Dummy variables in SAS/IML

Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS statement. You can use other procedures to create design matrices for other parameterizations.

SAS/IML programmers can use two built-in functions to create dummy variables. The DESIGN function generates dummy variables for the GLM parameterization. The DESIGNF function generates dummy variables for the EFFECT encoding. You can use the HDIR function to create interaction effects from the main-effect dummy variables.

The following DATA step creates sample data. The PROC IML statements read the data into vectors and use the DESIGN and DESIGNF functions to create dummy variables. Note the use of the ODS LAYOUT GRIDDED statement to print SAS/IML matrices across the page.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
 
proc iml;
use Patients;  
read all var {Cholesterol Sex BP_Status};  
close Patients;
 
Dummy_GLM = design( BP_Status );      /* dummy vars, GLM encoding */
Dummy_Effect = designf( BP_Status );  /* dummy vars, EFFECT encoding */
 
ods layout gridded columns=3 advance=table; /* create gridded layout in HTML */
print BP_Status, Dummy_GLM, Dummy_Effect;
ods layout end;
[Output: BP_Status with dummy variables for the GLM and EFFECT encodings]

You can see that the DESIGN function creates k binary dummy variables for a categorical variable that contains k levels. The first column represents the first level (in alphabetical order), which for this data is "High." The first column has the value 1 for each row for which BP_Status="High." Similarly, the second column contains a 1 for each row for which BP_Status="Normal." The third column contains a 1 for each row for which BP_Status="Optimal."

In contrast, the DESIGNF function creates a design matrix that has k–1 columns. The matrix has the EFFECT encoding, with the last category ("Optimal") serving as the reference level. The first column has the value 1 for rows for which BP_Status="High," the value –1 for rows for which BP_Status is the reference level, and 0 otherwise. The second column is similar, except that 1 indicates rows for which BP_Status="Normal."

Linear regression with dummy variables

Dummy variables convert character variables (and other categorical variables) into numerical variables with a specified encoding. As such they enable you to use matrix computations to perform a statistical analysis such as linear regression.

For example, the following SAS/IML statements perform a regression analysis that models the Cholesterol variable as a linear function of the Sex and BP_Status variables. The statements use the DESIGNF function to form the dummy variables for each categorical variable. These columns (and an intercept column) are concatenated horizontally to form the design matrix. Because the EFFECT encoding is a nonsingular parameterization, you can use the SOLVE function to solve the normal equations and obtain the least squares solution, as follows:

Y = Cholesterol;                 /* response variable */
Intercept = j(nrow(Y), 1, 1);
X1 = designf( Sex );
X2 = designf( BP_Status );
X = Intercept || X1 || X2;      /* design matrix with EFFECT parameterization */
 
/* Matrix formulation of the linear model: Y = X*beta + epsilon
   See https://en.wikipedia.org/wiki/Design_matrix       */
b = solve( X`*X, X`*Y );       /* solve normal equations */
print b[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}];
[Output: Parameter estimates from solving the normal equations in SAS/IML]

The interpretation of the parameter estimates for this linear example is somewhat complicated; see Lewis (2007) if you are interested. However, for comparison, the following call to PROC GENMOD creates parameter estimates for the same linear model. The PARAM=EFFECT option is used so that the procedure uses the EFFECT parameterization.

proc genmod data=Patients;
   class Sex BP_Status / param=effect;
   model Cholesterol = Sex BP_Status / noscale;
   ods select ParameterEstimates;
run;
[Output: Parameter estimates from PROC GENMOD with the EFFECT parameterization]

Strictly speaking, PROC GENMOD uses maximum likelihood estimation whereas the PROC IML code computes a least squares estimate, but you can see that the estimates are identical to four decimal places.

REFERENCE encoding and the GLM parameter estimates

Although SAS/IML does not provide a built-in function for generating a design matrix that uses the REFERENCE encoding, you can easily create such a function. The REFERENCE encoding is similar to the GLM encoding, but with the (redundant) last column dropped:

/* design matrix for reference encoding */
start designr(x); 
   A = design(x);             /* get design matrix with GLM encoding */
   return( A[,1:ncol(A)-1] ); /* drop last column */
finish;

If you use the REFERENCE encoding to create the X matrix as in the previous section, then the SOLVE function returns the same parameter estimates that are provided by the GLM procedure. (The GLM procedure sets the parameters for the last dummy columns to zero.)
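Continuing the SAS/IML session from the earlier example, the following statements sketch that computation:

XR = Intercept || designr(Sex) || designr(BP_Status);  /* design matrix, REFERENCE encoding */
bR = solve( XR`*XR, XR`*Y );                           /* least squares estimates */
print bR[rowname={"Intercept" "Sex:Female" "BP_Status:High" "BP_Status:Normal"}];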

Interactions of dummy variables

You can use the HDIR function to create interaction effects. For example, the following statements create columns that indicate the interaction between the Sex and BP_Status variables. The printed output shows the results for the EFFECT parameterization, but the same SAS/IML statement will produce the interaction effects for other parameterizations:

X1X2 = hdir(X1, X2);   /* dummy variables for interaction term */
print X1X2[c={"Female High" "Female Normal"}];
[Output: Dummy variables for the Sex*BP_Status interaction]

By using the tips in this article, you can create design matrices for ANOVA and regression models that contain categorical variables. In this way, you can use SAS/IML to reproduce the parameter estimates in many SAS linear regression procedures.


One-level data set names in SAS are not always stored in WORK

One of the first things SAS programmers learn is that SAS data sets can be specified in two ways. You can use a two-level name such as "sashelp.class" which uses a SAS libref (SASHELP) and a member name (CLASS) to specify the location of the data set. Alternatively, you can use a one-level name such as "TempData," and SAS searches for the data set in a default location.

In many SAS environments, one-level data set names are like the seven little dwarves: Heigh-Ho, heigh-ho, it's off to WORK they go!

In other words, the WORK directory is the default location for one-level names. Consequently, one-level names often imply "temporary data," because data sets in WORK are deleted when you exit SAS.

However, it is possible to use the OPTIONS statement to change the libref that SAS searches when you specify a one-level SAS data set name. The option name is USER. The following statements specify a new libref that SAS should use as the default location for one-level data set names:

libname DEFLIB "C:/Temp";    /* define any libref */
options user=DEFLIB;         /* set the default location for one-level names */

For example, the following DATA step uses a one-level name for the data set. Consequently, the data set is created in the USER directory and PROC DATASETS lists the data sets in USER rather than WORK:

data TempData; x=1; y=2; z=3; run;  /* create data set using one-level name */
proc datasets; run;                 /* note that it is in the USER libref! */
Default libref for one-level name

Personally, I never do this because data sets in USER are not deleted when SAS exits. However, this example shows that one-level names are not always stored in WORK.

Discover the default storage location

If a one-level data set name is not necessarily in WORK, can you programmatically discover the libref where the data set is? Yes! The GETOPTION function returns the value for any SAS option, so you can retrieve the value of the USER option. For example, the following DATA step discovers the libref and data set name for a specified data set. For a two-level name, the name contains a period, which you can find by using the FINDC function. You can then use the SUBSTR function to extract the name of the libref and data set. If the data set name is a one-level name, then the GETOPTION function obtains the default libref. (If the USER option is not set, GETOPTION returns a blank string.)

%let MyData = TempData;       /* specify one-level or two-level data set name */
 
data _null_;
dsName = "&MyData";
LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
   lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
   member = substr(dsName, LocDot+1); /*     get substring after the period  */
end;
else do;                              /*   No: it is a one-level name        */
   lib = getoption("user");           /*   Has this option been defined?     */
   if lib = ' ' then lib = "work";    /*     No: use WORK                    */
   member = dsName;
end;
put lib=;
put member=;
run;
lib=DEFLIB
member=TempData

In summary, although one-level data set names are usually stored in WORK, that is not always the case. However, a programmer can use the GETOPTION function to discover the libref where one-level data sets are stored.

An application to SAS/IML programming

The reason I was interested in the GETOPTION function is that I was trying to write a function in SAS/IML that would accept a one- or two-level data set name and return the names of the variables in the data. The CONTENTS function in SAS/IML almost does what I want, but the CONTENTS function has two different signatures, one for two-level names and one for one-level names:

  • For two-level names, use two arguments: varNames = contents(lib, name);
  • For one-level names, use one argument: varNames = contents(name);

I wanted to write a function that accepts a single string (a one-level or two-level data set name) and calls the appropriate signature of the CONTENTS function. The following SAS/IML function does the job:

proc iml;
/* new CONTENTS function that handles one- and two-level data set names */
start ContentsEx( dsName );              /* "Ex" means "extended" */
   LocDot = findc(dsName, ".");          /* Does name contain a period (.)?     */
   if LocDot > 0 then do;                /*   Yes: it is a two-level name       */
      lib = substr(dsName, 1, LocDot-1); /*     get substring before the period */
      member = substr(dsName, LocDot+1); /*     get substring after the period  */
      return( contents(lib, member) );
   end;
   return( contents(dsName) );           /*   No: it is a one-level name        */
finish;
 
dsName = "&MyData";
varNames =  ContentsEx( dsName );
print varNames;
[Output: Variable names returned by the ContentsEx function]

Have you ever had the need to use the USER option to override the default storage location for one-level data set names? Leave a comment.


Four ways to create a design matrix in SAS

SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification variable.

I previously wrote about how to create dummy variables in SAS by using the GLMMOD procedure to create binary indicator variables for each categorical variable. But PROC GLMMOD is not the only way to generate design matrices in SAS. This article demonstrates four SAS procedures that create design matrices: GLMMOD, LOGISTIC, TRANSREG, and GLIMMIX. (Others include PROC CATMOD and PROC GLMSELECT.) Of the four, the LOGISTIC procedure is my favorite because it provides an easy-to-use syntax and supports various parameterizations for creating design matrices.

How categorical variables are represented in a design matrix in SAS

The CLASS statement in a SAS procedure specifies categorical variables that should be replaced by dummy variables when forming the design matrix. The process of forming columns in a design matrix is called parameterization or encoding. The three most popular parameterizations are the GLM encoding, the EFFECT encoding, and the REFERENCE encoding. For a detailed explanation of these encodings, see the section "Parameterization of Model Effects" in the SAS/STAT documentation. For applications and interpretation of different parameterizations, see Pasta (2005).
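For example, suppose that a categorical variable such as BP_Status has the three levels High, Normal, and Optimal, and that Optimal (the last level in alphabetical order) is the reference level. The dummy variables for each level are as follows:

  • GLM encoding (three columns): High → (1, 0, 0), Normal → (0, 1, 0), Optimal → (0, 0, 1)
  • REFERENCE encoding (two columns): High → (1, 0), Normal → (0, 1), Optimal → (0, 0)
  • EFFECT encoding (two columns): High → (1, 0), Normal → (0, 1), Optimal → (–1, –1)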

The following DATA step creates an example data set with 10 observations. It contains one continuous variable (Cholesterol) and two categorical variables that will be used as fixed effects: Sex, which has two levels, and BP_Status, which has three levels. It also contains a categorical variable (HospitalID) that will be used as a random effect.

data Patients;
   HospitalID = mod(_N_, 4);
   keep HospitalID Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
 
proc print; run;
Example data set for creating design matrices in SAS

PROC GLMMOD: Design matrices that use the GLM encoding

The simplest way to create dummy variables is by using the GLMMOD procedure, which can produce a basic design matrix with GLM encoding. The GLM encoding is a singular parameterization in which each categorical variable is represented by k binary variables, where k is the number of levels in the variable. There is also an intercept column that has all 1s. The GLMMOD procedure uses a syntax that is identical to the MODEL statement in PROC GLM, so it is very easy to create interaction effects. See my previous article for an example of how to use PROC GLMMOD to create a design matrix and how the singular parameterization affects parameter estimates in regression.
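For convenience, a minimal call looks something like the following sketch, which uses the Patients data set that was created earlier. (The name of the output data set, GLMDesign, is arbitrary.) The OUTDESIGN= option writes the design matrix to a data set, and the NOPRINT option suppresses the displayed output:

proc glmmod data=Patients outdesign=GLMDesign noprint;
   class Sex BP_Status;
   model Cholesterol = Sex BP_Status;
run;
 
proc print data=GLMDesign; run;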

PROC LOGISTIC: Design matrices for any parameterization

You can also create a design matrix in SAS by using the LOGISTIC procedure. The PROC LOGISTIC statement supports the OUTDESIGNONLY option, which prevents the procedure from fitting the model. Instead, it only forms the design matrix and writes it to the data set that you specify with the OUTDESIGN= option. By default, PROC LOGISTIC uses the EFFECT encoding for classification variables, but you can use the PARAM= option on the CLASS statement to specify any parameterization.

A drawback of using PROC LOGISTIC is that you must supply a binary response variable on the MODEL statement, which might require you to run an extra DATA step. The following DATA step creates a view that contains a variable that has the constant value 0. This variable is used on the left-hand side of the MODEL statement in PROC LOGISTIC, but is dropped from the design matrix:

data Temp / view=Temp;
   set Patients;
   FakeY = 0;
run;
 
proc logistic data=Temp outdesign=EffectDesign(drop=FakeY) outdesignonly;
   class sex BP_Status / param=effect; /* also supports REFERENCE & GLM encoding */
   model FakeY = Cholesterol Sex BP_Status;
run;
 
proc print data=EffectDesign; run;
Design matrix in SAS with effect encoding

The design matrix shows the effect encoding, in which the value –1 indicates the reference level; by default, the reference level is the last level in alphabetical order. The name of a dummy variable is the concatenation of the original variable name and a level. For example, the Sex variable is replaced by the dummy variable named SexFemale, which has the value 1 to represent females and –1 to represent the reference level ("Male"). The BP_Status variable is replaced by two variables. The BP_StatusHigh variable contains 1 for patients who have high blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise. Similarly, the BP_StatusNormal dummy variable has the value 1 for patients with normal blood pressure, –1 for the reference level ("Optimal"), and 0 otherwise.

The effect encoding produces k-1 columns for a categorical variable that has k levels. This results in a nonsingular design matrix.

You can use the REF= option after each classification variable to specify the reference level. You can also use the PARAM= option on the CLASS statement to specify a different parameterization. For example, the following statements create a design matrix that uses the REFERENCE parameterization. The reference level for the Sex variable is set to "Female" and the reference level for the BP_Status variable is set to "Normal."

proc logistic data=Temp outdesign=RefDesign(drop=FakeY) outdesignonly;
   class sex(ref="Female") BP_Status(ref="Normal") / param=reference; 
   model FakeY = Sex BP_Status;
run;
 
proc print data=RefDesign; run;
t_design3

Parameterizations affect the way that parameter estimates are interpreted in a regression analysis. For the reference encoding, parameter estimates of main effects indicate the difference of each level as compared to the effect of the reference level. For the effect encoding, the comparison is to the average effect over all levels.
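A quick way to see this is to use the dummy columns directly in a regression. The following sketch (which uses the column names from the EffectDesign data set) models Cholesterol as a function of the Sex and BP_Status dummy variables; under the effect encoding, the estimate for SexFemale is the difference between the effect of the "Female" level and the average effect over both levels of Sex:

proc reg data=EffectDesign plots=none;
   model Cholesterol = SexFemale BP_StatusHigh BP_StatusNormal;
run; quit;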

PROC TRANSREG: Design matrices and a macro for variable names

PROC LOGISTIC is very flexible, but it has two drawbacks: you have to create a fake response variable, and you have to look at the output data set to discover the names of the dummy variables. In contrast, PROC TRANSREG does not require that you specify a response variable when you generate the design matrix. Furthermore, the procedure creates a macro variable (&_TRGIND, for "TRANSREG indicator" variables) that contains the names of the columns of the design matrix. Another nice feature is that the output data set contains the original variables, and you can use the ID statement to output additional variables.

However, the syntax for the TRANSREG procedure is different from most other SAS regression procedures. Instead of a CLASS statement, you specify classification effects in a CLASS() transformation list. By default, the procedure uses the REFERENCE parameterization; you can use the ZERO= option to control reference levels. The procedure also supports the GLM parameterization (via the ZERO=SUM option), the EFFECT parameterization (via the EFFECT option), and other options. The following statements show an example that generates a design matrix with the effect encoding:

proc transreg data=Patients design;
   model identity(Cholesterol) 
         class(Sex BP_Status / EFFECT zero="Female" "Normal");
   output out=B;
run;
 
proc print data=B; 
   var Intercept &_TrgInd; 
run;

The output is not shown because it is identical to the EffectDesign data set in the previous section. Notice that the output is displayed by using the &_TRGIND macro variable. For details about generating design matrices, see the TRANSREG documentation section "Using the DESIGN Output Option."
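For comparison, the following sketch generates the default REFERENCE encoding simply by omitting the EFFECT option. (The name of the output data set, RefB, is arbitrary.)

proc transreg data=Patients design;
   model identity(Cholesterol) 
         class(Sex BP_Status / zero="Female" "Normal");
   output out=RefB;
run;
 
proc print data=RefB; 
   var Intercept &_TrgInd; 
run;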

PROC GLIMMIX: Design matrices for fixed and random effects

PROC GLIMMIX enables you to construct two design matrices: one for the fixed effects and another for the random effects. The PROC GLIMMIX statement supports an OUTDESIGN= option that you can use to specify the output data set and a NOFIT option that ensures that the procedure will not try to fit the model.

The following statements create an output data set that contains two design matrices:

proc glimmix data=Patients outdesign(names novar)=MixedDesign nofit;
   class sex BP_Status HospitalID;
   model Cholesterol = Sex BP_Status;
   random HospitalID;
   ods select ColumnNames;
run;
 
proc print data=MixedDesign; run;
Design matrix in SAS for fixed and random effects

Dummy variables for the fixed effects are prefixed by "_X" and dummy variables for the random effects are prefixed by "_Z." Two additional tables (not shown) associate the levels of the original variables with the columns of the design matrices.

The GLIMMIX procedure uses only the GLM parameterization. Consequently, there is little advantage to using PROC GLIMMIX instead of PROC GLMMOD. You can generate the same designs by calling PROC GLMMOD twice, once for the fixed effects and once for the random effects.
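For example, a sketch of the two calls might look like the following. (The names FixedDesign and RandDesign are arbitrary. Note that the second design matrix includes an intercept column, which is not part of the Z matrix for the random effect and would need to be dropped.)

proc glmmod data=Patients outdesign=FixedDesign noprint;   /* design for the fixed effects (X) */
   class Sex BP_Status;
   model Cholesterol = Sex BP_Status;
run;
 
proc glmmod data=Patients outdesign=RandDesign noprint;    /* design for the random effect (Z) */
   class HospitalID;
   model Cholesterol = HospitalID;
run;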

Summary

In summary, SAS provides four procedures that you can use to generate design matrices for continuous variables, classification variables, and their interactions. The GLMMOD procedure is ideal for creating design matrices that use the GLM encoding. PROC LOGISTIC supports all encodings in SAS and provides an easy-to-use syntax for specifying interactions. PROC TRANSREG supports fewer parameterizations, but does not require that you manufacture a response variable. Lastly, the GLIMMIX procedure produces design matrices for both fixed and random effects.

If you need to use matrix computations, the SAS/IML procedure also supports creating design matrices.
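For example, the DESIGN function creates a matrix of 0/1 indicator columns (the GLM-style coding) from a vector of levels, and the DESIGNF function creates a full-rank coding. A minimal sketch that uses the numeric HospitalID variable follows:

proc iml;
use Patients;
read all var {HospitalID};
close Patients;
G = design(HospitalID);     /* one 0/1 column per level of HospitalID */
F = designf(HospitalID);    /* full-rank coding with k-1 columns      */
print G F;
quit;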
