Tabulate counts when there are unobserved categories

Suppose that you are tabulating the eye colors of students in a small class (following Friendly, 1992). Depending upon the ethnic groups of these students, you might not observe any green-eyed students. How do you put a 0 into the table that summarizes the number of students who have each eye color?

If you are using PROC FREQ to tabulate the counts, I have previously shown how to use the DATA step to create the set of all possible categories and run PROC FREQ on the resulting data. The trick is to use the ZEROS option on the WEIGHT statement to include categories with zero counts.

In the SAS/IML language you can do something similar to enable the TABULATE subroutine to report 0 counts for unobserved categories. However, I don't like the previous way that I implemented the idea, so I want to try again. The following SAS/IML module uses the LOC-ELEMENT technique to accomplish the task, but wraps the functionality with a simpler syntax that is similar to the syntax for the TABULATE subroutine. The only difference is a fourth argument that specifies the "reference set", which is the set of all possible categories.

proc iml;
/* output levels and frequencies for categories in x, including all 
   levels in the reference set */
start TabulateLevels(OutLevels, OutFreq, x, refSet);
   call tabulate(levels, freq, x);        /* compute the observed frequencies */
   OutLevels = union(refSet, levels);     /* union of data and reference set */ 
   idx = loc(element(OutLevels, levels)); /* find the observed categories */
   OutFreq = j(1, ncol(OutLevels), 0);    /* set all frequencies to 0 */
   OutFreq[idx] = freq;                   /* overwrite observed frequencies */
finish;

/* count faces for 10 rolls of six-sided die, using the reference set 1:6 */
rolls = {4 6 4 6 5 2 4 3 3 6};           /* only 5 faces observed */
call TabulateLevels(Levels, Freq, rolls, 1:6);
print Freq[c=(char(Levels))];

The table shows the output for a set of 10 rolls of a six-sided die. The "1" face was not observed during the experiment, but we want to report a count of 0 for that category.
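The union-and-overwrite idea is not specific to SAS/IML. As a language-neutral illustration (not part of the SAS program above), here is a minimal Python sketch of the same technique, using a `Counter` in place of the TABULATE subroutine:

```python
from collections import Counter

def tabulate_levels(x, ref_set):
    """Tabulate x, reporting a count of 0 for categories that are
    in the reference set but unobserved in the data."""
    freq = Counter(x)
    levels = sorted(set(ref_set) | set(x))    # union of data and reference set
    return levels, [freq[v] for v in levels]  # Counter returns 0 for missing keys

rolls = [4, 6, 4, 6, 5, 2, 4, 3, 3, 6]        # only 5 faces observed
levels, freq = tabulate_levels(rolls, range(1, 7))
print(levels)  # [1, 2, 3, 4, 5, 6]
print(freq)    # [0, 1, 2, 3, 1, 3]
```

As in the IML module, the "1" face gets a reported count of 0 because it belongs to the reference set.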

I like this module better because its syntax is similar to the TABULATE subroutine. The name even includes "Tabulate", which will make the function easier to find when I need to use it.


How to generate random integers in SAS

I was recently talking with some SAS customers and someone started talking about generating random numbers. I was asked "Why can't SAS create an easy way to generate random numbers? Excel has a simple way to generate random numbers between 1 and 100, and I use it all the time."

The speaker was talking about generating random integers from a discrete uniform distribution, where the numbers range between a specified minimum and maximum value. I have previously written about how to generate random numbers in SAS, but the section about random integers is buried in the middle. So let's simplify the process and see whether we can get SAS to generate random integers "as easily as Excel does."

A macro with a simple syntax

I looked up the Excel function that generates random integers. The syntax is RANDBETWEEN(min, max). For example, if you want a random integer between 1 and 10, the syntax is RANDBETWEEN(1, 10).

In SAS, random numbers are generated by using the RAND function, which is a sophisticated function that can sample from complex statistical distributions. However, the following SAS macro hides the complexity of the RAND function and creates a simple statement that has the same functionality as the Excel function:

/* SAS macro that duplicates the Excel RANDBETWEEN function */
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;

After you've defined the macro, you can just pretend that it is a function and use it in a DATA step. The following DATA step generates 100 random integers between 1 and 10:

data RandInt;
do i = 1 to 100;
   x = %RandBetween(1, 10);
   output;
end;
run;

proc print data=RandInt(obs=6);  var x;  run;

The adjacent table shows the first six random integers. Every time you run the program you will get a different sequence of random integers. Use the STREAMINIT subroutine if you want a reproducible stream of integers.
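The arithmetic inside the macro is easy to sanity-check outside of SAS. Here is a minimal Python sketch of the same transformation, which shows why `min + floor((1 + max - min)*U)` maps a uniform value U in [0, 1) onto the integers min, min+1, ..., max:

```python
import math
import random

def rand_between(lo, hi, rng=random.random):
    """Same arithmetic as the macro: lo + floor((1 + hi - lo)*U), U ~ U[0,1)."""
    return lo + math.floor((1 + hi - lo) * rng())

random.seed(123)
values = [rand_between(1, 10) for _ in range(1000)]
print(min(values), max(values))   # every value lies in 1..10
```

Because U is strictly less than 1, the scaled value never reaches `1 + hi - lo`, so the floor never exceeds `hi - lo` and the endpoints are hit with the same probability as the interior integers.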

A SAS/IML function for random integers

The previous macro call should appeal to DATA step programmers. For the SAS/IML programmers, you can define a function that takes an additional argument and returns a vector of random integers:

proc iml;
/* Generate n random integers drawn uniformly at random from 
   integers in [min, max] where min and max are integers. */
start RandBetween(n, min, max);
   u = j(n, 1);
   call randgen(u, "Uniform", min, max+1);
   return( floor(u) );
finish;

/* Test the function. Generate 10,000 values and tabulate the result. */
call randseed(1234);
x = RandBetween(10000, 1, 10);
call tabulate(val, freq, x);
print freq[c=(char(val))];

The output shows that each integer 1 through 10 appeared approximately 1,000 times in the simulated data.

In summary, you can define a SAS macro or a SAS/IML function that hides the complexity of the RAND function. The result is a syntax that is easy to use and duplicates the functionality of the Excel RANDBETWEEN function.


Balls and urns Part 2: Multi-colored balls

In a previous post I described how to simulate random samples from an urn that contains colored balls. The previous article described the case where the balls can be either of two colors. In that case, all the distributions are univariate. In this article I examine the case where the urn contains c > 2 colors, which leads to multivariate distributions.

Let K1, K2, ..., Kc be the number of balls for each of the c colors. Let N = Σ Ki be the total number of balls.

The table distribution

The simplest experiment is to reach into the urn and pull out a single ball. The probability of choosing a ball of color i is pi = Ki / N. In SAS you can use the table distribution to specify the probabilities of selecting each integer in the range [1, c]. Some people call the table distribution a categorical distribution or a generalized Bernoulli distribution.

You can use the RAND function in the DATA step to simulate from a categorical distribution. You can also use the RANDGEN function in SAS/IML software, or you can use the SAMPLE function.

The multinomial distribution

If you repeat the experiment n times (replacing the selected ball after recording its value), you obtain a c-tuple of counts. Let ki be the number of times that a ball of color i was selected. Then the result of n draws with replacement is the vector (k1, k2, ..., kc) where Σ ki = n.

By definition, this c-tuple is a single draw from the multinomial distribution. In SAS, you can use the RANDMULTINOMIAL function in SAS/IML to simulate from the multinomial distribution.
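The definition can be illustrated without any special multinomial routine: simply make n categorical draws with replacement and tabulate the counts. The following Python sketch (an illustration, not the SAS/IML program) does exactly that:

```python
import random
from collections import Counter

random.seed(0)
K = {"black": 100, "white": 40, "red": 60}   # balls of each color
n = 200                                      # number of draws

# n categorical ("table") draws with replacement; the tabulated counts
# form a single draw from the multinomial distribution
draws = random.choices(list(K), weights=list(K.values()), k=n)
counts = Counter(draws)
print(dict(counts), sum(counts.values()))    # the counts always sum to n
```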

The multivariate hypergeometric distribution

When sampling with replacement, the probability of choosing a ball of color i never changes. However, you can also sample without replacement. Without replacement, the probability of choosing a given color changes from draw to draw.

The distribution of the number of colored balls when you sample without replacement is called the multivariate hypergeometric distribution. SAS does not have a built-in function for simulating data from a multivariate hypergeometric distribution, but the following program shows that you can iteratively use the univariate hypergeometric distribution to construct a multivariate function.

A standard technique for drawing a sample from multiple categories is to reduce the problem to a series of two-category problems by combining colors (Gentle, 2003, p. 206). You can obtain a random number of balls drawn from the first color category by using the hypergeometric distribution with K1 balls of the first color and N - K1 balls in the "other" category. Suppose that k1 balls of the first color are selected. For the second step, there are N - K1 balls of the other colors in the urn and n - k1 balls left to draw. Repeat this process with the second color category. Use the hypergeometric distribution to draw k2 balls of the second color. That leaves (N - K1 - K2) balls of the remaining colors in the urn and (n - k1 - k2) balls left to draw. Continue this process to obtain counts for all the categories.
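Before looking at the SAS/IML implementation, the sequential reduction can be sketched in Python. The helper `hypergeom_draw` is a simple stand-in for a univariate hypergeometric generator (it samples indices without replacement and counts successes); it is an illustration of the algorithm, not an efficient implementation:

```python
import random

def hypergeom_draw(N, K, n, rng=random):
    """Draw from Hypergeometric(N, K, n): sample n of N items without
    replacement and count how many of the K 'success' items appear."""
    return sum(1 for i in rng.sample(range(N), n) if i < K)

def rand_mv_hyper(n_draw, K, rng=random):
    """Sequential-conditional method: reduce a multi-color urn to a
    series of two-category hypergeometric draws."""
    items_left = sum(K)          # balls remaining in the urn
    left_to_draw = n_draw        # balls still to be drawn
    counts = []
    for Ki in K[:-1]:
        k = hypergeom_draw(items_left, Ki, left_to_draw, rng)
        counts.append(k)
        items_left -= Ki         # lump this color out of the urn
        left_to_draw -= k        # fewer balls left to draw
    counts.append(left_to_draw)  # the last color gets the remainder
    return counts

random.seed(42)
counts = rand_mv_hyper(100, [100, 40, 60])
print(counts, sum(counts))       # the counts always sum to 100
```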

The following SAS/IML function can return multiple draws from the multivariate hypergeometric distribution. The function is efficient because it exploits the fact that you can pass vectors of parameters to the RANDGEN subroutine. The RandMVHyper function uses a vector of N's (ItemsLeft) and a vector of n's (nDraw) to generate an arbitrary number of independent draws from the multivariate hypergeometric distribution. There is no loop over the number of replications. The only loop is over the number of colors, which is necessary because that loop updates the selection probabilities. (If you study the code, you might want to mentally set nRep = 1 so that the vectors of parameters become scalar quantities.)

proc iml;
/* There are K[i] balls of color i in an urn.
   Draw N balls without replacement and report the number of 
   each color. 
   N[1] = sample size
   N[2] = number of replications (optional. Default is 1) 
   K = vector that gives the number balls of each color.
       The total number of balls is sum(K).  */
start RandMVHyper(N, _K);
   if nrow(N)*ncol(N)>1 then nRep = N[2];
   else nRep = 1;
   K = rowvec(_K);               /* K[i] is number of items for category i */
   nDraw = j(nRep, 1, N[1]);     /* run nRep sims at once */
   ItemsLeft = j(nRep, 1, sum(K));
   x = j(nRep, ncol(K), 0);     /* each row is draw from MV hyper */
   h = j(nRep, 1, 0);           /* vec for hypergeometric values  */
   do i = 1 to ncol(K)-1;
      Kvec = j(nRep, 1, K[i]);
      call randgen(h, "Hyper", ItemsLeft, Kvec, nDraw);
      x[,i] = h;
      ItemsLeft = ItemsLeft - K[i];         /* update parameters */
      nDraw = nDraw - h;                    /* update parameters */
   end;
   x[,ncol(K)] = nDraw;
   return (x);
finish;

call randseed(1234);
/* TEST:      nDraws nRep     K1 K2 K3 */
y = RandMVHyper({100 1000}, {100 40 60});
print (y[1:5,])[c={"black" "white" "red"} L="Draws w/o Replacement"];
mean = mean(y);
print mean;

The test code requests 1,000 replicates. Each replicate is a random vector from the multivariate hypergeometric distribution, namely the result of drawing 100 balls without replacement from an urn that contains 100 black balls, 40 white balls, and 60 red balls. The table shows the first five results. For example, the fourth experiment resulted in 53 black, 24 white, and 23 red balls. Each row sums to 100, which is the number of balls drawn for each experiment. The average counts from the 1,000 experiments were approximately 50 black balls, 20 white balls, and 30 red balls. These are the expected values.

The RandMVHyper function handles most cases, but it can break down when the sample size is small and when some categories (colors) have small marginal sums. You can download a more robust version of the function that handles these edge cases.

In summary:

  • Use the table (generalized Bernoulli) distribution to simulate drawing a colored ball from an urn that contains balls of different colors.
  • Use the multinomial distribution to generate the counts when you sample n balls with replacement. In other words, you repeat n independent and identical generalized Bernoulli trials and tabulate the number of balls that were selected for each color.
  • Use the multivariate hypergeometric distribution to generate the counts when you sample n balls without replacement. This article contains a SAS/IML function that samples from the multivariate hypergeometric distribution.

Balls and urns: Discrete probability functions in SAS

If not for probability theory, urns would appear only in funeral homes and anthologies of British poetry.

But in probability and statistics, urns are ever present and contain colored balls. The removal and inspection of colored balls from an urn is a classic way to demonstrate probability, sampling, variation, and other elementary concepts in probability and statistics.

This article describes some ball-and-urn experiments, and how you can use SAS software to simulate drawing balls from an urn under various assumptions. This article describes the case where the urn contains exactly two colors, often chosen to be black and white. A subsequent article will describe the case where the urn contains more than two colors.

Let K0 be the number of black balls, K1 be the number of white balls, and N be the total number of balls in the urn.

Bernoulli Trials

The simplest experiment is to reach into the urn and pull out a single ball. Although there are two kinds of balls, the act of drawing one ball depends only on one parameter, namely the probability of choosing a white ball, which is p = K1 / N. (Similarly, the ball will be black with probability 1–p.) This is called a Bernoulli trial. In SAS the Bernoulli distribution is supported by the four essential functions for probability and sampling: the PDF, CDF, QUANTILE, and RAND functions.

For example, the following DATA step repeats 100 independent Bernoulli trials and plots the results for an urn in which 75% of the balls are white. "Independence" means that after each trial, the selected ball is returned to the urn, and the urn is thoroughly mixed. The only parameter for the Bernoulli distribution is the probability of selecting a white ball:

data Bern;
call streaminit(54321);
do i = 1 to 100;
   color = rand("Bernoulli", 0.75);  /* P(color=1) = 0.75 */
   output;
end;
run;

proc freq data=Bern;
tables color / nocum plots=FreqPlot;
run;

Binomial Trials

You don't actually need to simulate 100 Bernoulli trials and call PROC FREQ to obtain the (random) number of white balls that were drawn. The number of white balls that are selected when you draw n balls with replacement from an urn can be obtained from the binomial distribution. The binomial distribution saves you computational time because you can simulate one binomial trial instead of 100 Bernoulli trials. The binomial distribution has two parameters: the size of the sample that you are drawing (n=100) and the probability of the event of interest (p = 0.75).

The following DATA step simulates five binomial trials. Each binomial trial represents 100 independent Bernoulli draws:

data Binom;
call streaminit(54321);
do i = 1 to 5;
   numWhite = rand("Binomial", 100, 0.75);  /* 100 draws; P(color=1) = 0.75 */
   output;
end;
run;

proc print; var numWhite; run;

Hypergeometric Trials

When sampling with replacement, the probability of choosing a white ball never changes. However, you can also sample without replacement. You start with N balls in the urn, of which K1 are white. If you choose a white ball and remove it, the probability of choosing a white ball on the next draw is decreased. If you choose a black ball, the probability of choosing white increases. In addition, K1 is an upper bound for the maximum number of white balls that can ever be drawn.

In spite of the changing probability, the distribution of the number of white balls when you sample without replacement is well understood. It is called the hypergeometric distribution. Like the binomial distribution, using the hypergeometric distribution saves time because you don't need to use the Bernoulli distribution (with a changing probability) to simulate drawing 100 balls without replacement.

The following DATA step uses the hypergeometric distribution to draw a sample of 100 balls from an urn that contains 120 balls, 90 of which are white. For this distribution it is necessary to specify three parameters: the total number of balls (N), the initial number of white balls (K1), and the sample size (n).

data Hyper;
call streaminit(54321);
do i = 1 to 5;
   /* sample without replacement from urn with 120 balls of which 90 are white */
   numWhite = rand("Hypergeometric", 120, 90, 100); /* N=120, K1=90; 100 draws */
   output;
end;
run;

proc print; var numWhite; run;

The values from the hypergeometric distribution have a smaller variance than the values from the binomial distribution. This is typical when the number of draws is close to the number of balls in the urn. When the number of balls is much greater than the number of draws, the hypergeometric distribution is similar to the binomial distribution.
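The difference in variances can be quantified with the textbook formulas (these are standard results, not output from the SAS programs above): the binomial variance is n·p·(1-p), while the hypergeometric variance multiplies it by the finite population correction (N-n)/(N-1). A quick Python calculation for the urn above:

```python
# Variance of the number of white balls in 100 draws:
# with replacement (binomial) vs. without replacement (hypergeometric)
N, K1, n = 120, 90, 100          # urn size, white balls, number of draws
p = K1 / N                       # probability of white on the first draw: 0.75

var_binom = n * p * (1 - p)      # binomial variance
fpc = (N - n) / (N - 1)          # finite population correction
var_hyper = var_binom * fpc      # hypergeometric variance

print(var_binom, var_hyper)      # 18.75 vs. about 3.15
```

Because n = 100 is close to N = 120, the correction factor is small (20/119), which explains the much tighter spread of the hypergeometric values.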

In summary:

  • Use the Bernoulli distribution to analyze the probability of pulling a colored ball from an urn with two colors.
  • The binomial distribution models the distribution of the number of white balls when you sample n balls with replacement. In other words, you repeat n independent and identical Bernoulli trials.
  • The hypergeometric distribution models the distribution of the number of white balls when you sample n balls without replacement.

Other related discrete probability distributions

There are other discrete probability functions that can model ball-and-urn experiments in which the question is "how many trials do you need until" some event occurs:

  • The geometric distribution describes the number of trials that are needed to obtain one white ball. SAS supports the geometric distribution in the RAND function and in probability functions.
  • The negative binomial distribution is the distribution of the number of black balls before a specified number of white balls are drawn in a sampling with replacement experiment. SAS supports the negative binomial distribution as well.
  • A lesser-known distribution is the negative hypergeometric distribution, which is the distribution of black balls when you sample without replacement. SAS does not support this distribution directly, but the hypergeometric distribution is useful for implementing it.

Ten "one-liners" that create test matrices for statistical programmers

You've had a long day. You've implemented a custom algorithm in the SAS/IML language. But before you go home, you want to generate some matrices and test your program.

If you are like me, you prefer a short statement—one line would be best. However, you also want the flexibility to create large matrices to test the performance of the algorithm. And you'd like the matrices to be simple so that you can determine if the algorithm is working and debug it if necessary.

Last week's article on how to generate a large correlation matrix caused me to think about other quick and easy ways to generate matrices of an arbitrary size. Over the years I've developed favorite "test matrices" that I use to test algorithms, to write examples for my blog, or to answer questions on a discussion forum.

Here are ten one-line statements that generate numerical and character matrices of an arbitrary size. Most matrices have n rows and p columns. However, some functions generate square matrices, which are n x n.

  1. Constant matrix: The J function enables you to create a matrix with constant values. The syntax is
    const = j(n, p, value);
  2. Constant by row or column: The ROW and COL functions enable you to create a matrix where the ith row or column has the value i. The syntax is
    rows = row(j(n, p));
  3. Sequential and nonrepeating: The SHAPE function in conjunction with the index creation operator (:) enables you to create a matrix that contains sequential values. (You can also use the SHAPECOL function.) The syntax is
    seq = shape(1:n*p, n);
  4. Periodic: The MOD function returns the remainder when a number is divided by some value. When combined with the previous example, you obtain matrices for which the values repeat in a periodic manner. The syntax is
    mod = mod(shape(1:n*p, n), value);
  5. Symmetric matrices: The SQRSYM and SQRVECH functions enable you to create square symmetric matrices from a vector of values. The syntax is
    sym = sqrsym(1:n*(n+1)/2);
  6. Diagonally banded: The TOEPLITZ function enables you to create square banded matrices. The syntax is
    band = toeplitz(n:1);
  7. Magic squares: The MAGIC function (which requires SAS/IML 12.3) creates square matrices for which the sums of the rows, columns, and diagonals are equal. The syntax is
    magic = magic(n);
  8. Random integers: The SAMPLE function enables you to generate a random sample from a finite set. The syntax is
    samp = sample(1:value, p//n);
  9. Random values from a standard distribution: The RANDFUN function enables you to generate a matrix of random values from a standard distribution. The syntax is
    rand = randfun(n//p, "Normal");
  10. Character matrices: Many of the previous matrices can be modified to create character matrices. For any integer matrix, you can apply the MOD function to create integers in the range 1–26 that can be mapped to letters. The index creation operator (:) supports letters, so you can create a vector of all uppercase letters by using the syntax "A":"Z". Also, the SAMPLE function can create a random sample directly from any set of character values. The syntax is
    sampC = sample("A":"Z", p//n);

The following SAS/IML program generates examples of each of these matrices. You can change the values of n and p to create matrices of any size.

proc iml;
call randseed(12345);
n = 4;
p = 6;
value = 3;
/* constant and sequential matrices */
const = j(n, p, value);
rows = row(j(n, p));
cols = col(j(n,p));
seq = shape(1:n*p, n);
seq2 = shapecol(1:n*p, n);
mod = mod(shape(1:n*p, n), value);
/* square matrices */
symU = sqrvech(1:n*(n+1)/2);
symL = sqrsym(1:n*(n+1)/2);
band = toeplitz(n:1);
magic = magic(n);
/* random matrices */
samp = sample(1:value, p//n);
rand = randfun(n//p, "Normal");
/* character matrices */
letters = "A":"Z";
sampC = sample(letters, p//n);
idx = 1 + mod(0:n*p-1, ncol(letters));  /* values 1-26 */
modC = shape(letters[idx],  n);
print const, rows, cols, seq, seq2, mod,
      symU, symL, band, magic, samp, rand, 
      sampC, modC;

Do you have a favorite matrix that you use to test your programs? Leave a comment.


A simple way to construct a large correlation matrix

Occasionally a SAS statistical programmer will ask me, "How can I construct a large correlation matrix?" Often they are simulating data with SAS or developing a matrix algorithm that involves a correlation matrix. Typically they want a correlation matrix that is too large to input by hand, such as a correlation matrix for 100 variables. The particular correlations are not usually important; they just want any valid correlation matrix.

Except for diagonal and diagonally dominant matrices, it is not easy to write down a valid correlation matrix. I've previously written about the fact that not every symmetric matrix with a unit diagonal is a correlation matrix. Correlation matrices are symmetric and positive definite (PD), which means that all the eigenvalues of the matrix are positive. (Technically, a correlation matrix can have a zero eigenvalue, but that is a degenerate case that I prefer to avoid.)

So here is a tip: you can generate a large correlation matrix by using a special Toeplitz matrix.

Recall that a Toeplitz matrix has a banded structure. The diagonals that are parallel to the main diagonal are constant. The SAS/IML language has a built-in TOEPLITZ function that returns a Toeplitz matrix, as shown:

proc iml;
T = toeplitz(5:1);    /* 5x5 Toeplitz matrix */
print T;

A Toeplitz matrix is obviously symmetric, and the following computation shows that the smallest eigenvalue is positive, which means that the matrix is positive definite:

smallestEigenvalue = min( eigval(T) );  /* compute smallest eigenvalue */
print smallestEigenvalue;

Generating positive definite Toeplitz matrices

In the previous example, the matrix was generated by the vector {5,4,3,2,1}. Because it is symmetric and PD, it is a valid covariance matrix. Equivalently, the scaled Toeplitz matrix that is generated by the vector {1,0.8,0.6,0.4,0.2} is a correlation matrix that is also PD. This example is a specific case of a very cool mathematical fact:

A Toeplitz matrix generated from any linearly decreasing sequence of nonnegative values is positive definite.

In other words, for every positive number R and increment h, the k-element vector {R, R-h, R-2h, ..., R-(k-1)h} generates a valid covariance matrix provided that R-(k-1)h > 0, which is equivalent to h < R/(k-1). A convenient choice is h = R / k. This is a useful fact because it enables you to construct arbitrarily large Toeplitz matrices from a decreasing sequence. For example, the following statements use R=1 and h=0.01 to construct a 100 x 100 correlation matrix. You can use the matrix to simulate data from a multivariate normal correlated distribution with 100 variables.

p = 100;
h = 1/p;                         /* stepsize for decreasing sequence */
v = do(1, h, -h);                /* {1, 0.99, 0.98, ..., 0.02, 0.01} */
Sigma = toeplitz(v);             /* PD correlation matrix */
mu = j(1, ncol(v), 0);           /* population mean vector: (0,0,0,...0) */
X = randnormal(500, mu, Sigma);  /* generate 500 obs from MVN(mu, Sigma) */

A proof of the fact that the Toeplitz matrix is PD is in a paper by Bogoya, Böttcher, and Grudsky (J. Spectral Theory, 2012), who prove the result on p. 273.
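You can also spot-check the fact numerically outside of SAS. The following Python sketch (assuming numpy is available) builds the same 100 x 100 Toeplitz correlation matrix and verifies that its smallest eigenvalue is positive:

```python
import numpy as np

p = 100
v = np.linspace(1, 1/p, p)      # 1, 0.99, ..., 0.01: linearly decreasing
# symmetric Toeplitz matrix whose (i,j) entry is v[|i-j|]
Sigma = np.array([[v[abs(i - j)] for j in range(p)] for i in range(p)])
min_eig = np.linalg.eigvalsh(Sigma).min()
print(min_eig > 0)              # True: positive definite, as the theorem claims
```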

A stronger result: Allowing negative correlations

In fact, Bogoya, Böttcher, and Grudsky prove a stronger result. The previous example assumes that all 100 variables are positively correlated with each other. The paper also proves that you can use a decreasing sequence of length n that contains negative values, as long as the sum of the values is positive. Thus to generate 100 variables with almost as many negative correlations as positive, you can choose h = 2/p, as follows:

h = 2/p;                   /* stepsize for decreasing sequence */
v = do(1, -1+h, -h);       /* {1, 0.98, 0.96, ..., -0.96, -0.98} */
Sigma = toeplitz(v);       /* PD correlation matrix */

Large covariance matrices

A Toeplitz matrix creates a covariance matrix that has a constant diagonal, which corresponds to having the same variance for all variables. However, you can use the CORR2COV function in SAS/IML to convert a correlation matrix to a covariance matrix. Algebraically, this transformation is accomplished by pre- and post-multiplying by a diagonal matrix, which will preserve positive definiteness. For example, the following statement converts Sigma into a covariance matrix for which each variable has a different variance:

sd = 1:p;                  
Cov = corr2cov(Sigma, sd);  /* make std(x_i) = i */
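The linear algebra behind CORR2COV is simply Cov = D·Sigma·D, where D = diag(sd). A tiny Python illustration (not the SAS/IML function itself) of that transformation on a 2 x 2 correlation matrix:

```python
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])            # a small correlation matrix
sd = np.array([2.0, 3.0])                 # desired standard deviations
Cov = np.diag(sd) @ Sigma @ np.diag(sd)   # pre- and post-multiply by diag(sd)
print(Cov)   # diagonal is sd**2; off-diagonal is 0.5*2*3 = 3
```

Congruence transformations of this form preserve positive definiteness, which is why the resulting covariance matrix is still valid.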

The next time you need to generate a large symmetric positive-definite matrix, remember the Toeplitz matrix.


Excluding variables: Read all but one variable into a matrix

Dear Rick,
I have a data set with 1,001 numerical variables. One variable is the response; the others are explanatory variables. How can I read the 1,000 explanatory variables into an IML matrix without typing every name?

That's a good question. You need to be able to perform two sub-tasks:

  1. Create a character vector that contains the names of all the variables. (If the data set contains both numeric and character variables, the character vector should contain the names of all numeric variables.)
  2. Exclude one or more elements from a character vector.

Discover the names of data set variables

Just as you can use PROC CONTENTS to discover the names of variables in a data set, SAS/IML has the CONTENTS function, which returns a character vector that contains the variable names. The argument to the CONTENTS function can be the name of a data set. If you have already opened a data set you can skip the argument to obtain the variable names of the open data set, as follows:

proc iml;
use Sashelp.Heart;                  /* open data set */
varNames = contents();              /* get all variable names */

However, most of the time (as above) we do not have a data set that has only numerical variables. To obtain a vector of only the numeric variables, read one observation of the data into a matrix, and use the COLNAME= option to obtain the variable names:

read next var _NUM_ into X[colname=varNames];    /* read only numeric vars */
print varNames;

To save space, only the first few columns of the output are displayed below:


Exclude elements from a character vector

After you create a vector that contains variable names, you can use the SETDIF function to exclude certain variables. The SETDIF function also sorts the list of variable names, which can be useful:

YVar = "Weight";                    /* variable to exclude from the matrix */
XVarNames = setdif(varNames, YVar); /* exclude Y, sort remaining X */

If you want to preserve the order of the variables, use the REMOVE function and specify the indices of the elements that you want to remove. The LOC function enables you to find the indices of the elements that you want to remove, as follows:

XVarNames = remove(varNames, loc(varNames=YVar));
print XVarNames;

Putting it all together

For the Sashelp.Heart data set, here is how to read the variable Weight into a vector, and read all other numeric variables into a matrix X:

proc iml;
YVar = "Weight";                    /* var to exclude from the matrix */
dsName = "Sashelp.Heart";
use (dsName);                       /* open data set */
read next var _NUM_ into X[colname=varNames];     /* read only numeric vars */
XVarNames = remove(varNames, loc(varNames=YVar)); /* exclude; preserve order */
read all var YVar into Y;           /* Y is vector */
read all var XVarNames into X;      /* X is matrix */

You can use the LOC-ELEMENT trick to exclude multiple variables. For example, you can use the following statements to exclude two variables:

YVar = {"Weight" "Height"};
XVarNames = remove(varNames, loc( element(varNames,YVar) ));
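The order-preserving exclusion has a natural analogue in other languages. Here is a Python sketch of the same idea (the variable names are hypothetical; the actual numeric variables of Sashelp.Heart differ):

```python
# hypothetical variable names for illustration
var_names = ["AgeAtStart", "Diastolic", "Height", "Systolic", "Weight"]
exclude = {"Weight", "Height"}

# order-preserving analogue of REMOVE + LOC(ELEMENT(...)):
# keep every name that is not in the exclusion set, in original order
x_var_names = [v for v in var_names if v not in exclude]
print(x_var_names)   # ['AgeAtStart', 'Diastolic', 'Systolic']
```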

Error distributions and exponential regression models


Last week I discussed ordinary least squares (OLS) regression models and showed how to illustrate the assumptions about the conditional distribution of the response variable. For a single continuous explanatory variable, the illustration is a scatter plot with a regression line and several normal probability distributions along the line.

The term "conditional distribution of the response" is a real mouthful. For brevity, I will say that the graph shows the assumed "error distributions."

A similar graph can illustrate regression models that involve a transformation of the response variable. A common transformation is to model the logarithm of the response variable, which means that the predicted curve is exponential.

There are two common ways to construct an exponential fit of a response variable, Y, with an explanatory variable, X. The two models are as follows:

  • A generalized linear model of Y that uses LOG as a link function. In SAS you can construct this model with PROC GENMOD by setting DIST=NORMAL and LINK=LOG.
  • An OLS model of log(Y), followed by exponentiation of the predicted values. In SAS you can construct this model with PROC GLM or REG, although for consistency I will use PROC GENMOD with an identity link function.

To illustrate the two models, I will use the same 'cars' data as last time. These data were used by Arthur Charpentier, whose blog post about GLMs inspired me to create my own graphs in SAS. Thanks to Randy Tobias and Stephen Mistler for commenting on an early draft of this post.

A generalized linear model with a log link

A generalized linear model of Y with a log link function assumes that the response is predicted by an exponential function of the form Y = exp(b0 + b1X) + ε and that the errors are normally distributed with a constant variance. In terms of the mean value of Y, it models the log of the mean: log(E(Y)) = b0 + b1X.


The graph to the left illustrates this model for the "cars" data used in my last post. The X variable is the speed of a car and the Y variable is the distance required to stop.

How can you create this graph in SAS? As shown in my last post, you can run a SAS procedure to get the parameter estimates, then obtain the predicted values by scoring the model on evenly spaced values of the explanatory variable. However, when you create the data for the probability distributions, be sure to apply the inverse link function, which in this case is the EXP function. This centers the error distributions on the prediction curve.

The call to PROC GENMOD is shown below. For the details of constructing the graph, download the SAS program used to create these graphs.

title "Generalized Linear Model of Y with log link";
proc genmod data=MyData;
   model y = x / dist=normal link=log;
   ods select ParameterEstimates;
   store work.GenYModel / label='Normal distribution for Y; log link';
run;
proc plm restore=GenYModel noprint;
   score data=ScoreX out=Model pred=Fit / ILINK;  /* apply inverse link to predictions */
run;

An OLS model of log(Y)

An alternative model is to fit an OLS model for log(Y). The data set already contains a variable called LogY = log(Y). The OLS model assumes that log(Y) is predicted by a model of the form b0 + b1X + ε. The model assumes that the errors are normally distributed and that the expected value of log(Y) is linear: E(log(Y)) = b0 + b1X. You can use PROC GLM to fit the model, but the following statements use PROC GENMOD and PROC PLM to provide an "apples-to-apples" comparison:

title "Linear Model of log(Y)";
proc genmod data=MyData;
   model logY = x / dist=normal link=identity;
   ods select ParameterEstimates;
   store work.LogYModel / label='Normal distribution for log(Y); identity link';
run;
proc plm restore=LogYModel noprint;
   score data=ScoreX out=Model pred=Fit;
run;

On the log scale, the regression line and the error distributions look like the graph in my previous post. However, transforming to the scale of the original data provides a better comparison with the generalized linear model from the previous section. When you exponentiate the log(Y) predictions and error distribution, you obtain the graph at the left.

Notice that the error distributions are NOT normal. In fact, by definition, the distributions are lognormal. Furthermore, on this scale the assumed error distribution is heteroscedastic, with smaller variances when X is small and larger variances when X is large. These are the consequences of exponentiating the OLS model.
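The heteroscedasticity has a closed-form explanation. The following sketch (Python, with a hypothetical σ and hypothetical values of the linear predictor) uses the standard lognormal variance formula to show that the conditional variance of Y grows with the conditional mean of log(Y):

```python
import math

# Sketch of why the exponentiated model is heteroscedastic: if
# log(Y) = mu(x) + eps with eps ~ N(0, sigma), then Y | x is lognormal
# with variance (exp(sigma^2) - 1) * exp(2*mu(x) + sigma^2), which
# grows as the conditional mean mu(x) grows.  (sigma and the mu values
# are hypothetical.)
sigma = 0.5

def lognormal_var(mu, sigma=sigma):
    return (math.exp(sigma**2) - 1) * math.exp(2*mu + sigma**2)

var_small = lognormal_var(1.0)   # small X  ->  small linear predictor
var_large = lognormal_var(3.0)   # large X  ->  large linear predictor
print(var_small, var_large)
```

Because the linear predictor increases with X in this example, the variance on the data scale is smallest at the left of the plot and largest at the right, exactly as the graph shows.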

Incidentally, you can obtain this same model by using the FMM procedure, which enables you to fit lognormal distributions directly.

A comparison of log-link versus log(Y) models

It is worth noting that the two models result in different predictions. The following graph displays both predicted curves. The first curve (the generalized linear model with log link) goes through the "middle" of the data points, which makes sense when you think about the assumed error distributions for that model. The second curve (the exponentiated OLS model of log(Y)) is higher for large values of X than you might expect, until you consider the assumed error distributions for that model.


Both models assume that the effect of X on the mean value of Y is multiplicative, rather than additive. However, as one of my colleagues pointed out, the second model also assumes that the effect of errors is multiplicative, whereas in the generalized linear model the effect of the errors is additive. Researchers must use domain-specific knowledge to determine which model makes sense for the data. For applications such as exponential growth or decay, the second model seems more reasonable.
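The multiplicative effect of X is easy to verify numerically. In this sketch (hypothetical coefficients), a one-unit increase in X multiplies a prediction of the form exp(b0 + b1X) by the constant factor exp(b1), no matter where on the X axis the increase occurs:

```python
import math

# Sketch of the "multiplicative effect" of X: under a prediction of the
# form exp(b0 + b1*x), adding 1 to x multiplies the prediction by the
# constant factor exp(b1), regardless of the value of x.
# (The coefficients are hypothetical.)
b0, b1 = 0.5, 0.3
pred = lambda x: math.exp(b0 + b1 * x)

ratio_low  = pred(3.0) / pred(2.0)
ratio_high = pred(10.0) / pred(9.0)
print(ratio_low, ratio_high)   # both equal exp(b1)
```

In an additive model the same comparison would give a constant *difference* of b1 rather than a constant ratio.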

So there you have it. The two exponential models make different assumptions and consequently lead to different predictions. I am not an expert in generalized linear models, so I found the graphs in this article helpful to visualize the differences between the two models. I hope you did, too. If you want to see how the graphs were created, download the SAS program.


Generate evenly spaced points in an interval

I've previously written about how to generate a sequence of evenly spaced points in an interval. Evenly spaced data is useful for scoring a regression model on an interval.

In the previous articles the endpoints of the interval were hard-coded. However, it is common to want to evaluate a function on the interval [min(x), max(x)], where x is an observed variable in a data set. That is easily done in the DATA step by first running a PROC SQL call that puts the minimum and maximum values into macro variables. The macro variables can then be used to generate the evenly spaced values. For example, the following statements generate 101 points within the range of the Weight variable in the Sashelp.Cars data set:

/* Put min and max into macro variables */
proc sql noprint;
  select min(Weight), max(Weight) into :min_x, :max_x 
  from Sashelp.Cars;
quit;
/* Create data set of evenly spaced points */
data ScoreX;
do Weight = &min_x to &max_x by (&max_x-&min_x) / 100;  /* min(X) to max(X) */
   output;
end;
run;
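For readers outside SAS, the same computation is a one-liner in most languages. The following sketch builds the points by index (the lo/hi values are hypothetical, not the actual range of Weight in Sashelp.Cars):

```python
# Language-agnostic sketch of the same computation: n evenly spaced
# points from lo to hi, inclusive of both endpoints.  (The lo/hi values
# are hypothetical.)
def evenly_spaced(lo, hi, n=101):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

pts = evenly_spaced(2000.0, 7000.0)
print(len(pts), pts[0], pts[-1])
```

One subtlety: a loop that repeatedly adds the increment (max-min)/100 can stop one step short of max(x) if floating-point rounding overshoots the upper bound, whereas constructing each point from its index, as above, always includes both endpoints.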

See the article "Techniques for scoring a regression model in SAS" for various SAS procedures and statements that can score a regression model on the points in the ScoreX data set.

You can also use this technique to compute four macro variables to use in generating a uniformly spaced grid of values.

If you are doing the computation in the SAS/IML language, no macro variables are required because you can easily compute the minimum and maximum values as part of the program:

proc iml;
use Sashelp.Cars;  read all var "Weight" into x; close;
w = do(min(x), max(x), (max(x)-min(x))/100);   /* min(X) to max(X) */

Plot the conditional distribution of the response in a linear regression model


A friend who teaches courses about statistical regression asked me how to create a graph in SAS that illustrates an important concept: the conditional distribution of the response variable. The basic idea is to draw a scatter plot with a regression line, then overlay several probability distributions along the line, thereby illustrating the model assumptions about the distribution of the response for each value of the explanatory variable.

For example, the image at left illustrates the theoretical assumptions for ordinary least squares (OLS) regression with homoscedastic errors. The central line represents the conditional mean of the response. The observed responses are assumed to be of the form yi = β0 + β1xi + εi where εi ∼ N(0, σ) for some value of σ. The normal distributions all have the same variance and are centered on the regression line. Pictures like this are used by instructors and textbook authors to illustrate the assumptions of linear regression. You can create similar pictures for variations such as heteroscedastic errors or generalized linear models.
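A tiny simulation makes the homoscedastic assumption concrete. The following sketch (Python; the parameter values are hypothetical, loosely based on the cars fit) draws responses at a small and a large value of x and confirms that the spread is the same at both:

```python
import math
import random

# Minimal simulation of the homoscedastic OLS assumption: at every value
# of x, the response is drawn from a normal distribution with the SAME
# standard deviation sigma.  (Parameter values are hypothetical.)
random.seed(2)
b0, b1, sigma = -17.58, 3.93, 15.07

def draw_y(x, n):
    return [b0 + b1*x + random.gauss(0, sigma) for _ in range(n)]

def sample_sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((t - m)**2 for t in v) / (len(v) - 1))

sd_low  = sample_sd(draw_y(5.0, 50_000))    # spread of Y at a small x
sd_high = sample_sd(draw_y(25.0, 50_000))   # spread of Y at a large x
print(sd_low, sd_high)                      # both are close to sigma
```

Only the center of the conditional distribution moves as x changes; its width does not, which is exactly what the overlaid normal curves in the graph depict.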

This article outlines how to create the graph in SAS by using the SGPLOT procedure. The data are from a blog post by Arthur Charpentier, "GLM, non-linearity and heteroscedasticity," as are some of the ideas. Charpentier also inspired a related post by Markus Gesmann, "Visualising theoretical distributions of GLMs." Thanks to Randy Tobias and Stephen Mistler for commenting on an early draft of this post.

Fitting and scoring a regression model

Charpentier uses a small data set that contains the stopping distance (in feet) when driving a car at a certain speed (in mph). The OLS regression model assumes that distance (the response) is a linear function of speed and that the observed distances follow the previous formula.

In order to make the program as clear as possible, I have used "X" as the name of the explanatory variable and "Y" as the name of the response variable. I have used macro variables in a few places in order to clarify the logic of the program. However, to prevent the program from appearing too complicated, I did not fully automate the process of generating the graph.

The following statements define the data and use PROC SQL to create macro variables that contain the minimum and maximum value of the explanatory variable. Then a DATA step creates 101 equally spaced points in the interval [min(x), max(x)], which will be used to score the regression model:

/* data and ideas from */
/* Stopping distance (ft) for a car traveling at certain speeds (mph) */
data MyData(label="cars data from R");
input x y @@;
logY = log(y);
label x = "Speed (mph)"  y = "Distance (feet)" logY = "log(Distance)";
datalines;
4   2  4 10  7  4  7  22  8 16  9 10 10 18 10 26 10  34 11 17 
11 28 12 14 12 20 12  24 12 28 13 26 13 34 13 34 13  46 14 26 
14 36 14 60 14 80 15  20 15 26 15 54 16 32 16 40 17  32 17 40 
17 50 18 42 18 56 18  76 18 84 19 36 19 46 19 68 20  32 20 48 
20 52 20 56 20 64 22  66 23 54 24 70 24 92 24 93 24 120 25 85 
;
run;
/* 1. Create scoring data set */
/* 1a. Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from MyData;
quit;
/* 1b. Create scoring data set */
data ScoreX;
do x = &min_x to &max_x by (&max_x-&min_x)/100;  /* min(X) to max(X) */
   output;
end;
run;

The next statements call PROC GENMOD to fit an OLS model. (I could have used PROC GLM or PROC REG for this model, but in a future blog post I want to explore generalized linear models.) The procedure outputs the parameter estimates, as shown, and uses the STORE statement to save the OLS model to a SAS item store. The PLM procedure is used to score the model at the points in the ScoreX data set. (For more about PROC PLM, see Tobias and Cai (2010).) The fitted values are saved to the SAS data set Model:

/* Model 1: linear regression of Y              */
title "Linear Regression of Y";
/* Create regression model. Save model to item store. Score with PROC PLM. */
proc genmod data=MyData;
  model y = x / dist=normal link=identity;
  ods select ParameterEstimates;
  store work.YModel / label='Normal distribution for Y; identity link';
run;
proc plm restore=YModel noprint;
  score data=ScoreX out=Model pred=Fit; /* score model; save fitted values */
run;

The estimate for the intercept is b0 = -17.58 and the estimate for the linear coefficient is b1 = 3.93. The Scale value is the maximum likelihood estimate of the standard deviation of the error distribution. These estimates will be used to plot a sequence of normal distributions along a fitted response curve that is overlaid on the scatter plot. If you use PROC GLM to fit the model, the RMSE statistic is a (slightly different) estimate of the scale parameter.
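A back-of-the-envelope check shows how the two scale estimates relate, assuming the usual formulas for each estimator:

```python
import math

# Sketch of how the two scale estimates relate: PROC GLM's RMSE is
# sqrt(SSE/(n - p)) while the maximum likelihood estimate is
# sqrt(SSE/n).  For the cars data (n = 50 observations, p = 2
# parameters), an RMSE of 15.3796 corresponds to an ML estimate of
# about 15.07.
n, p = 50, 2
rmse = 15.3796
mle_scale = rmse * math.sqrt((n - p) / n)
print(round(mle_scale, 2))
```

The two estimates differ only by the factor sqrt((n-p)/n), which is negligible for large samples.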

Plotting conditional distributions

The next step is to create data for a sequence of normal probability distributions that are spaced along the X axis and have standard deviation σ = 15.3796. Charpentier plots the distributions in three dimensions, but many authors simply let each distribution "fall on its side" and map the density scale into the coordinates of the data. This requires scaling the density to transform it into the data coordinate system. In the following DATA step, the maximum height of the density is half the distance between the distributions. The distributions are centered on the regression line and extend three standard deviations in each direction. The PDF function is used to compute the density, centered at the conditional mean, for the response variable.
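The scaling arithmetic can be checked directly. This sketch (Python) computes the same quantities as the DATA step below, using the peak of the normal density, 1/(σ√(2π)), and the speed range [4, 25] from the cars data:

```python
import math

# Numeric check of the scaling step: the peak of a N(mu, sigma) density
# is 1/(sigma*sqrt(2*pi)), so multiplying the density by
# ScaleFactor = maxHt / maxPdf makes its peak exactly maxHt in data
# coordinates.
sigma = 15.3796
min_x, max_x = 4.0, 25.0
ds = (max_x - min_x) / 6                       # spacing between distributions
maxHt = ds / 2                                 # desired peak height
maxPdf = 1 / (sigma * math.sqrt(2 * math.pi))  # peak of the normal density
ScaleFactor = maxHt / maxPdf

scaled_peak = ScaleFactor * maxPdf             # peak after rescaling
print(ds, maxHt, scaled_peak)
```

After rescaling, the tallest point of each density equals half the spacing between adjacent distributions, so neighboring curves do not overlap on the plot.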

The following code supports a link function, such as a log link. For this analysis, which uses the identity link, the ILINK macro variable is set to blank. However, for a generalized linear model with a log link, you can set the inverse link function to EXP.

/* specify parameters for model */
%let std = 15.3796;
%let b0 = -17.5791;
%let b1 =   3.9324;
%let ILINK = ;            /* blank for identity link; for log link use EXP */
/* create values to plot the assumed response distribution */
data NormalResponse(keep = s t distrib);
/* compute scale factor for height of density distributions */
ds = (&max_x - &min_x) / 6;          /* spacing for 5 distributions */
maxHt = ds / 2;                      /* max ht is 1/2 width */
maxPdf = pdf("Normal", 0, 0, &std);  /* max density for normal distrib */
ScaleFactor = maxHt / maxPDF;        /* scale density into data coords */
/* place distribs horizontally along X scale */
do s = (&min_x+ds) to (&max_x-ds) by ds;
   eta = &b0 + &b1*s;                /* linear predictor */
   pred = &ILINK(eta);               /* apply inverse link function */
   /* compute distribution of Y | X. Assume +/-3 std dev is appropriate */
   do t = pred - 3*&std to pred + 3*&std by 6*&std/50; 
      distrib = s + ScaleFactor * pdf("Normal", t, pred, &std); 
      output;
   end;
end;
run;

The last step is to combine the original data, the regression line, and the probability distributions into a single data set. You can then use PROC SGPLOT to create the graph shown at the top of the article. The data are plotted by using the SCATTER statement, the regression line is overlaid by using the SERIES statement, and the BAND statement is used to plot the sequence of semi-transparent probability distributions along the regression line:

/* merge data, fit, and error distribution information */
data All;  set MyData Model NormalResponse;  run;
title "Conditional Distribution of Response in Regression Model";
title2 "y = &b0 + &b1*x + eps, eps ~ N(0, &std)";
title3 "Normal Distribution, Identity Link";
proc sgplot data=All noautolegend;
  scatter x=x y=y;
  series x=x y=Fit;
  band y=t lower=s upper=distrib / group=s
       transparency=0.4 fillattrs=GraphData1;
  xaxis grid;
  yaxis grid;
run;

The program could be shorter if I use the SAS/IML language, but I wanted to make the program available to as many SAS users as possible, so I used only Base SAS and procedures in SAS/STAT. The graph at the beginning of this article illustrates the basic assumptions of ordinary linear regression. Namely, for a given value of X, the OLS model assumes that each observed Y value is an independent random draw from a normal distribution centered at the predicted mean. The homoscedastic assumption is that the width of those probability distributions is constant.

In a future blog post I will draw a similar picture for a generalized linear model that has a log link function.
