Techniques for scoring a regression model in SAS

My previous post described how to use the "missing response trick" to score a regression model. As I said in that article, there are other ways to score a regression model. This article describes using the SCORE procedure, a SCORE statement, the relatively new PLM procedure, and the CODE statement.

The following DATA step defines a small set of data. The goal of the analysis is to fit various regression models to Y as a function of X, and then evaluate each regression model on a second data set, which contains 200 evenly spaced X values.

/* the original data; fit model to these values */
data A;
input x y @@;
datalines;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
run;

/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;

The SCORE procedure

Some SAS/STAT procedures can output parameter estimates for a model to a SAS data set. The SCORE procedure can read those parameter estimates and use them to evaluate the model on new values of the explanatory variables. (For a regression model, the SCORE procedure performs matrix multiplication: you supply the scoring data X and the parameter estimates b and the procedure computes the predicted values p = Xb.)
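Although PROC SCORE does this computation for you, the underlying matrix multiplication is easy to demonstrate in a few lines of Python. The function name and parameter values below are mine, just for illustration; they are not part of the SCORE procedure:

```python
# Score a linear model p = Xb, where the design matrix X is [1, x]
# (a column of ones for the intercept, plus the data column)
def score_linear(xs, b0, b1):
    """Multiply the design matrix [1, x] by the parameter vector (b0, b1)."""
    return [b0 * 1 + b1 * x for x in xs]

# Hypothetical parameter estimates, chosen for illustration
preds = score_linear([1, 2, 3], b0=0.5, b1=2.0)
print(preds)  # [2.5, 4.5, 6.5]
```

For more variables, the same pattern applies: each predicted value is the dot product of a row of the design matrix with the parameter vector.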

The canonical example is fitting a linear regression by using PROC REG. You can use the OUTEST= option to write the parameter estimates to a data set. That data set, which is named RegOut in this example, becomes one of the two input data sets for PROC SCORE, as follows:

proc reg data=A outest=RegOut noprint;
   YHat: model y = x;    /* name of model is used by PROC SCORE */
quit;

proc score data=ScoreX score=RegOut type=parms predict out=Pred;
   var x;
run;

It is worth noting that the label for the MODEL statement in PROC REG is used by PROC SCORE to name the predicted variable. In this example, the YHat variable in the Pred data set contains the predicted values. If you do not specify a label on the MODEL statement, then a default name such as MODEL1 is used. For more information, see the documentation for the SCORE procedure.

The SCORE statement

Nonparametric regression procedures cannot output parameter estimates; after all, they are nonparametric! Instead, many nonparametric regression procedures support a SCORE statement, which enables you to specify the scoring data set. The following example shows the syntax of the SCORE statement for the TPSPLINE procedure, which fits a thin-plate spline to the data:

proc tpspline data=A;
   model y = (x);
   score data=ScoreX out=Pred;
run;

Other nonparametric procedures that support the SCORE statement include the ADAPTIVEREG procedure (new in SAS/STAT 12.1), the GAM procedure, and the LOESS procedure.

The STORE statement and the PLM procedure

Although the STORE statement and the PLM procedure were introduced in SAS/STAT 9.22 (way back in 2010), some SAS programmers are still not aware of these features. Briefly, the idea is that sometimes a scoring data set is not available when a model is fit, so the STORE statement saves all of the information needed to recreate and evaluate the model. The saved information can be read by the PLM procedure, which includes a SCORE statement, as well as many other capabilities. A good introduction to the PLM procedure is Tobias and Cai (2010), "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models."

For this example, the GLM procedure is used to fit the data. Because of the shape of the previous thin-plate spline curve, a cubic model is fit. The STORE statement is used to save the model information in an item store named WORK.ScoreExample. (I've used the WORK libref, but use a permanent libref if you want the item store to persist across SAS sessions.) Many hours or days later, you can use the PLM procedure to evaluate the model on a new set of data, as shown in the following statements:

proc glm data=A;
   model y = x | x | x;         /* cubic model in x */
   store work.ScoreExample;     /* store the model */
quit;

proc plm restore=work.ScoreExample;
   score data=ScoreX out=Pred;  /* evaluate the model on new data */
run;

The STORE statement is supported by many SAS/STAT regression procedures, including the GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG procedures. It also applies to the RELIABILITY procedure in SAS/QC software.

The CODE statement

In SAS/STAT 12.1 the CODE statement was added to several SAS/STAT regression procedures. It is also part of the PLM procedure. The CODE statement offers yet another option for scoring data: it writes DATA step statements into a text file. You can then use the %INCLUDE statement to insert those statements into a DATA step. In the following example, the scoring statements are written to a text file, which is then included into a DATA step in order to evaluate the model on the ScoreX data:

proc glm data=A noprint;
   model y = x | x | x;  
   code file='';               /* write scoring statements to a file */
quit;

data Pred;
set ScoreX;
%include '';                   /* insert the scoring statements */
run;

For this example, the predicted values are in a variable called P_y in the Pred data set. The CODE statement is supported by many predictive modeling procedures, such as the GENMOD, GLIMMIX, GLM, GLMSELECT, LOGISTIC, MIXED, PLM, and REG procedures in SAS/STAT software. In addition, the CODE statement is supported by the HPLOGISTIC and HPREG procedures in SAS High-Performance Analytics software.
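If you are curious what "scoring code as text" looks like outside of SAS, here is a hedged Python sketch of the same idea. The names P_y, b0, and b1 and the parameter values are illustrative; they are not the actual identifiers or statements that the CODE statement emits:

```python
# Mimic the CODE statement: write a scoring statement as text, then "include" it
b0, b1 = 1.5, -0.75                      # fitted parameters (made-up values)
scoring_code = f"P_y = {b0} + {b1} * x"  # generated assignment, like a CODE file

env = {"x": 4.0}                         # one "observation" to score
exec(scoring_code, env)                  # analogous to %INCLUDE in a DATA step
print(env["P_y"])                        # 1.5 + (-0.75)*4 = -1.5
```

The appeal of this approach is that the generated code has no dependency on the fitting procedure: any process that can run the generated statements can score the model.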

In summary, there are many ways to score SAS regression models. For PROC REG and linear models with an explicit design matrix, use the SCORE procedure. For nonparametric models, use the SCORE statement. For scoring data sets long after a model is fit, use the STORE statement and the PLM procedure. For scoring inside the DATA step, use the CODE statement. For regression procedures that do not support these options (such as PROC TRANSREG), use the missing value trick from my last post.

Did I leave anything out? What is your favorite technique to score a regression model? Leave a comment.


The missing value trick for scoring a regression model

A fundamental operation in statistical data analysis is to fit a statistical regression model on one set of data and then evaluate the model on another set of data. The act of evaluating the model on the second set of data is called scoring.

One of the first "tricks" that I learned when I started working at SAS was how to score regression data in procedures that do not support the SCORE statement. I think almost every SAS statistical programmer learns this trick early in their career, usually from a more experienced SAS programmer. The trick is used in examples in the SAS/STAT User's Guide and on discussion forums, but it is not often explicitly discussed, probably because it is so well known. However, I had to search the internet for a while before I found a SUGI paper that describes the trick. In an effort to assist new SAS programmers, here is an explanation of how to score regression models by using the "missing value trick," which is also called the "missing response trick" or the "missing dependent variable trick."

Suppose that you want to fit a regression model to some data. You also have a second set of explanatory values for which you want to score the regression model. For example, the following DATA step creates training data for the regression model. A subsequent DATA step creates evenly spaced values in the X variable. The goal is to evaluate the regression model on the second data set.

data A;       /* the original data; fit model to these values */
input x y @@;
datalines;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
run;

%let NumPts = 200;
data ScoreX(keep=x);  /* the scoring data; evaluate model on these values */
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;

The trick relies on two features of SAS software:

  • The first data set contains variables X and Y. The second contains only X. If the two data sets are concatenated, a missing value is assigned to Y for each value of X in the second data set.
  • When a regression procedure encounters an observation that has a missing value for the response variable, it does not use that observation to fit the model. However, provided that all of the explanatory variables are nonmissing, the procedure does compute the predicted value for that observation.

Consequently, the missing value trick is to concatenate the original data with the scoring data. If you call a regression procedure on the concatenated data, the original data are used to fit the model, but predicted values are generated for the scoring data, as follows:

/* The missing response trick.
   1. Concatenate the original data with the score data */
data B;
set A ScoreX;        /* y=. for all obs in ScoreX */
run;

/* 2. Run a regression. The model is fit to the original data. */ 
proc reg data=B plots=none;
model y = x;
output out=Pred p=P;   /* predicted values for the scoring data */
quit;

proc print data=Pred(obs=12); run;

The table shows that the scoring data, which begins in row 9, contains predicted values as desired.

The advantage of this technique is that it is easy to implement. A disadvantage is that this technique makes two extra copies of the scoring data, which might require a lot of disk space if the scoring data set is huge. A second disadvantage is that the trick increases the number of observations that the regression procedure must read. In the example, there are only eight observations in the original data, but the REG procedure has to read and write 208 observations.
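The trick is not specific to SAS: any fitting routine that skips rows with a missing response, but still scores rows that have complete predictors, behaves the same way. Here is a minimal Python sketch of the idea; the helper name and the data values are mine, and None plays the role of a missing response:

```python
def fit_and_score(xy):
    """Fit y = b0 + b1*x by least squares using rows where y is present;
    return a predicted value for EVERY row, including those with y missing."""
    obs = [(x, y) for x, y in xy if y is not None]   # rows used for fitting
    n = len(obs)
    mx = sum(x for x, _ in obs) / n
    my = sum(y for _, y in obs) / n
    b1 = (sum((x - mx) * (y - my) for x, y in obs)
          / sum((x - mx) ** 2 for x, _ in obs))
    b0 = my - b1 * mx
    return [b0 + b1 * x for x, _ in xy]              # score all rows

data  = [(1, 2), (2, 3), (3, 5)]      # "original" rows: x and y
score = [(1.5, None), (2.5, None)]    # "scoring" rows: missing response
preds = fit_and_score(data + score)   # concatenate, fit, and score
```

For these values the fitted line is y = 1/3 + 1.5x, so the two appended rows receive predictions even though they contributed nothing to the fit.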

There are other ways to evaluate a regression model, including using a SCORE statement, the SCORE procedure, and the relatively new PLM procedure. I will discuss these alternative methods in a future blog post.

I'm interested in hearing about when you first learned the missing response trick. Who did you learn it from? Do you still use it, or do you now use a more modern technique? Leave a comment.


Fundamental theorems of mathematics and statistics

Although I currently work as a statistician, my original training was in mathematics. In many mathematical fields there is a result that is so profound that it earns the name "The Fundamental Theorem of [Topic Area]." A fundamental theorem is a deep (often surprising) result that connects two or more seemingly unrelated mathematical ideas.

It is interesting that statistical textbooks do not usually highlight a "fundamental theorem of statistics." In this article I briefly and informally discuss some of my favorite fundamental theorems in mathematics and cast my vote for the fundamental theorem of statistics.

The fundamental theorem of arithmetic

The fundamental theorem of arithmetic connects the natural numbers with primes. The theorem states that every integer greater than one can be represented uniquely as a product of primes.

This theorem connects something ordinary and common (the natural numbers) with something rare and unusual (primes). It is trivial to enumerate the natural numbers, but each natural number is "built" from prime numbers, which defy enumeration. The natural numbers are regularly spaced, but the gap between consecutive prime numbers is extremely variable. If p is a prime number, sometimes p+2 is also prime (the so-called twin primes), but sometimes there is a huge gap before the next prime.
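If you want to see the theorem in action, a few lines of code suffice. The following Python sketch factors an integer by trial division; the theorem guarantees that the resulting list of primes is the same (up to order) no matter how you compute it:

```python
def prime_factors(n):
    """Return the prime factorization of an integer n > 1, with repetition."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:     # divide out each prime factor completely
            factors.append(d)
            n //= d
        d += 1
    if n > 1:                 # whatever remains is itself prime
        factors.append(n)
    return factors

print(prime_factors(360))     # [2, 2, 2, 3, 3, 5]
```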

The fundamental theorem of algebra

The fundamental theorem of algebra connects polynomials with their roots (or zeros). Along the way it informs us that the real numbers are not sufficient for solving algebraic equations, a fact known to every child who has pondered the solution to the equation x² = –1. The fundamental theorem of algebra tells us that we need complex numbers to be able to find all roots. The theorem states that every nonconstant polynomial of degree n has exactly n roots (counted with multiplicity) in the complex number system. Like the fundamental theorem of arithmetic, this is an "existence" theorem: it tells you the roots are there, but doesn't help you to find them.
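A quick numerical check (in Python, for illustration) confirms that the roots of x² + 1 live in the complex plane, not on the real line:

```python
import cmath

# The two complex roots of x^2 + 1 = 0 are i and -i
for root in (1j, -1j):
    value = root ** 2 + 1
    print(abs(value))      # 0.0 for both roots

# Complex arithmetic finds the square root of -1 directly
print(cmath.sqrt(-1))      # 1j
```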

The fundamental theorem of calculus

The fundamental theorem of calculus (FTC) connects derivatives and integrals. Derivatives tell us about the rate at which something changes; integrals tell us how to accumulate some quantity. That these should be related is not obvious, but the FTC says that the rate of change for a certain integral is given by the function whose values are being accumulated. Specifically, if f is any continuous function on the interval [a, b], then for every value of x in [a, b] you can compute the following function:

F(x) = ∫_a^x f(t) dt

The FTC states that F'(x) = f(x). That is, derivatives and integrals are inverse operations.

Unlike the previous theorems, the fundamental theorem of calculus provides a computational tool. It shows that you can solve integrals by constructing "antiderivatives."
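You can watch the FTC happen numerically: accumulate f with a trapezoid sum to get F, then difference F to recover f. In the following Python sketch the step sizes are arbitrary choices of mine, and f(t) = cos(t) serves as the example integrand:

```python
import math

def F(x, a=0.0, steps=10000):
    """Approximate the integral of f(t) = cos(t) from a to x (trapezoid rule)."""
    f = math.cos
    h = (x - a) / steps
    total = 0.5 * (f(a) + f(x))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

x = 1.2
deriv = (F(x + 1e-5) - F(x - 1e-5)) / 2e-5   # numerical derivative of F
print(deriv, math.cos(x))                     # both approximately cos(1.2)
```

The numerical derivative of the accumulated integral matches the integrand, which is exactly what the FTC asserts.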

The fundamental theorem of linear algebra

Not everyone knows about the fundamental theorem of linear algebra, but there is an excellent 1993 article by Gil Strang that describes its importance. For an m × n matrix A, the theorem relates the dimensions of the row space of A (R(A)) and the nullspace of A (N(A)). The result is that dim(R(A)) + dim(N(A)) = n.

The theorem also describes four important subspaces and describes the geometry of A and Aᵀ when thought of as linear transformations. The theorem shows that some subspaces are orthogonal to others. (Strang actually combines four theorems into his statement of the Fundamental Theorem, including a theorem that motivates the statistical practice of ordinary least squares.)
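The dimension formula is easy to verify for a small example: row-reduce the matrix, count the pivots, and subtract. Here is a pure-Python sketch (my own helper, not a library routine):

```python
def rank(A, tol=1e-12):
    """Row-reduce a copy of A (a list of row lists) and count the pivots."""
    A = [row[:] for row in A]
    m, n = len(A), len(A[0])
    r = 0
    for col in range(n):
        # find a pivot row for this column
        pivot = next((i for i in range(r, m) if abs(A[i][col]) > tol), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(m):
            if i != r and abs(A[i][col]) > tol:
                factor = A[i][col] / A[r][col]
                A[i] = [a - factor * b for a, b in zip(A[i], A[r])]
        r += 1
    return r

A = [[1, 2, 3],
     [2, 4, 6]]                  # the second row is twice the first
n = 3
print(rank(A), n - rank(A))      # dim(R(A)) = 1, so dim(N(A)) = 2
```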

The fundamental theorem of statistics

Although most statistical textbooks do not single out a result as THE fundamental theorem of statistics, I can think of two results that could make a claim to the title. These results are based in probability theory, so perhaps they are more aptly named fundamental theorems of probability.

  • The Law of Large Numbers (LLN) provides the mathematical basis for understanding random events. The LLN says that if you repeat a trial many times, then the average of the observed values tends to be close to the expected value. (In general, the more trials you run, the better the estimates.) For example, if you toss a fair die many times and compute the average of the numbers that appear, the average should converge to 3.5, which is the expected value of the roll because (1+2+3+4+5+6)/6 = 3.5. The same theorem ensures that about one-sixth of the faces are 1s, one-sixth are 2s, and so forth.
  • The Central Limit theorem (CLT) states that the mean of a sample of size n is approximately normally distributed when n is large. Perhaps more importantly, the CLT provides the mean and the standard deviation of the sampling distribution in terms of the sample size, the population mean μ, and the population variance σ². Specifically, the sampling distribution of the mean is approximately normally distributed with mean μ and standard deviation σ/sqrt(n).

Of these, the Central Limit theorem gets my vote for being the Fundamental Theorem of Statistics. The LLN is important, but hardly surprising. It is the basis for frequentist statistics and assures us that large random samples tend to reflect the population. In contrast, the CLT is surprising because the sampling distribution of the mean is approximately normal regardless of the distribution of the original data! As a bonus, the CLT can be used computationally: it forms the basis for many statistical tests because it quantifies the accuracy of a statistical estimate. Lastly, the CLT connects important concepts in statistics: means, variances, sample size, and accuracy of point estimates.
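A short simulation makes both theorems concrete. The following Python sketch (the sample size, number of trials, and seed are arbitrary choices of mine) averages dice rolls: the grand mean hovers near 3.5, as the LLN promises, and the spread of the sample means is close to σ/sqrt(n), as the CLT predicts:

```python
import random
import statistics

random.seed(1)
n, trials = 100, 2000
means = [statistics.mean(random.randint(1, 6) for _ in range(n))
         for _ in range(trials)]

pop_mean = 3.5
pop_sd = statistics.pstdev([1, 2, 3, 4, 5, 6])     # sqrt(35/12), about 1.708
print(statistics.mean(means))                      # close to 3.5 (LLN)
print(statistics.stdev(means), pop_sd / n ** 0.5)  # both about 0.171 (CLT)
```

If you histogram the means vector, it looks bell-shaped even though each individual die roll is uniformly distributed, which is the surprising part of the CLT.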

Do you have a favorite "Fundamental Theorem"? Do you marvel at an applied theorem such as the fundamental theorem of linear programming or chuckle at a pseudo-theorem such as the fundamental theorem of software engineering? Share your thoughts in the comments.


Define functions with default parameter values in SAS/IML

One of my favorite new features of SAS/IML 12.1 enables you to define functions that contain default values for parameters. This is extremely useful when you want to write a function that has optional arguments.

Example: Centering a data vector

It is simple to specify a SAS/IML module with a default parameter value. Suppose that you want to write a module that centers a vector. By default, the module should center the vector so that the new mean of the data is 0. However, the module should also support centering the vector at an arbitrary value. With default arguments in SAS/IML 12.1, you can write the following module:

proc iml;
start Center(x, a=0);
   return ( x - mean(x) + a );
finish;

The Center module has two arguments, but the equal sign in the second argument indicates that the second argument is optional and that it receives a default value of 0. (SAS programmers who are familiar with the macro language will find this syntax familiar.) Using default values enables you to call the function with fewer parameters in the usual (default) case, but also enables you to call the function with more generality.

In this example, if you call the function by using one argument, the local variable a is set to zero. You can also specify a value for a when you call the function, as follows:

x = {-1,0,1,4};      /* data */
c0 = Center(x);      /* center x at 0 */
c10 = Center(x, 10); /* center x at 10 */
print c0 c10;

The c0 vector is centered at 0, whereas the c10 vector is centered at 10. The Center module has a simpler calling syntax for the usual case of centering a vector at 0.

Example: Testing whether two vectors are equal to within a certain precision

I have previously blogged about the dangers of testing floating-point values for equality. In a numerical program, you should avoid testing whether a vector x is (exactly) equal to another vector y. Even though two vectors should be equal (if the computations were performed in exact arithmetic), finite-precision computations can lead to small differences, which some people call rounding errors. For example, in exact arithmetic the square of the square root of x is exactly equal to x for any nonnegative value. However, the following statements show that this equality might not hold in finite precision arithmetic:

x = {4 5 2000 12345 654321};  /* some numbers */
y = sqrt(x)#sqrt(x);          /* in exact arithmetic, y=x */
diff = x-y;                   /* the diff vector is not zero */
print diff;

As you can see, some elements of x and y are not equal, which leads to a nonzero difference vector. Consequently, it is a bad idea to use the equal operator (=) to test whether the vectors are equal:

eq = all(x=y);    /* Bad idea: Don't test for exact comparison */
print eq;

To understand more about floating-point arithmetic, see the paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic." A better way to compare numerical values in SAS/IML is to test whether two numbers are within some specified tolerance of each other, like you can do with PROC COMPARE. The following module computes whether the absolute value of the difference between corresponding elements is no greater than some criterion. By default, the criterion is set to 1e-6:

/* return 1 if x[i] is within eps of y[i] for all i */
start IsEqual(x, y, eps=1e-6);
   return( all( abs(x-y)<=eps ) );  
finish;
eq6  = IsEqual(x, y);        /* Default: compare within 1e-6 */
eq12 = IsEqual(x, y, 1e-12); /* compare within 1e-12 */
print eq6 eq12;

The vectors are judged to be equal when the default criterion is used. If you tighten the criterion, the vectors are judged to be unequal.
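The same defensive comparison is worth having in every language. For example, Python's standard library provides math.isclose, and a hand-rolled analog of the IsEqual module takes one line:

```python
import math

def is_equal(xs, ys, eps=1e-6):
    """Return True when corresponding elements differ by at most eps."""
    return all(abs(x - y) <= eps for x, y in zip(xs, ys))

# Exact comparison of floating-point results is fragile:
print(0.1 + 0.2 == 0.3)               # False!
print(is_equal([0.1 + 0.2], [0.3]))   # True
print(math.isclose(0.1 + 0.2, 0.3))   # True (stdlib equivalent)
```

Note that math.isclose uses a relative tolerance by default, whereas the sketch above (like the IsEqual module) uses an absolute tolerance; for values far from 1, the distinction matters.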

Being able to specify default parameter values is very useful to a programmer. Do you have an example for which this new feature of SAS/IML 12.1 will be useful? Describe your application in the comments.


A simple way to find the root of a function of one variable

Finding the root (or zero) of a function is an important computational task because it enables you to solve nonlinear equations. I have previously blogged about using Newton's method to find a root for a function of several variables. I have also blogged about how to use the bisection method to find the zeros of a univariate function.

As of SAS/IML 12.1, there is an easy way to find the roots of a function of one variable. The FROOT function solves for a root on a finite interval by using Brent's method, which uses a combination of bisection, linear interpolation, and quadratic interpolation to converge to a root. Unlike Newton's method, Brent's method does not require that you specify a derivative for the function. All you need to provide is a module that evaluates the function, f, and an interval [a, b] such that f(a) and f(b) have different signs.

If there is a root in the interval, Brent's method is guaranteed to find it.
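The bracketing guarantee is easiest to see with plain bisection, which is the slow-but-sure ingredient of Brent's method. Here is a Python sketch of the idea (not the FROOT implementation), applied to a simple function whose root we know:

```python
def bisect(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "endpoints must bracket a root"
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fa * fm <= 0:      # the sign change is in the left half
            b, fb = m, fm
        else:                 # the sign change is in the right half
            a, fa = m, fm
    return (a + b) / 2

root = bisect(lambda x: x ** 3 - 2, 0, 2)   # cube root of 2
print(root)                                  # about 1.2599
```

Because each step halves an interval that contains a sign change, the method cannot fail to converge; Brent's method keeps this guarantee while converging much faster on well-behaved functions.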

As an example, the following SAS/IML module defines a function that I investigated in a previous blog post:

proc iml;
start Func(x);
   return( exp(-x##2) - x##3 + 5#x +1 );

The image at the beginning of this article shows the graph of the function on the interval [–5, 5]. The function has three roots. You can use the graph to estimate intervals on which the function contains a root. The FROOT function enables you to find multiple zeros with a single call, as follows:

intervals = {-4   -1.5,         /* 1st interval [-4, -1.5] */
             -1.5  1  ,         /* 2nd interval [-1.5 1]   */
              1    4  };        /* 3rd interval [1, 4]     */
z = froot("Func", intervals);
print z;

The vector z contains three elements. The first element is the root in the first specified interval, the second element is the root in the second interval, and so forth. If you specify an interval on which the function does not have a root, then the FROOT function returns a missing value.

That's all there is to it. So next time you need to solve a nonlinear equation of one variable, remember that the FROOT function in the SAS/IML language makes the task simple.


Sample without replacement in SAS

Last week I showed three ways to sample with replacement in SAS. You can use the SAMPLE function in SAS/IML 12.1 to sample from a finite set or you can use the DATA step or PROC SURVEYSELECT to extract a random sample from a SAS data set. Sampling without replacement is similar. This article describes how to use the SAS/IML SAMPLE function or the SURVEYSELECT procedure.

Simulate dealing cards

When I wrote about how to generate permutations in SAS, I used the example of dealing cards from a standard 52-card deck. The following SAS/IML statements create a 52-element vector with values 2H, 3H, ..., KS, AS, where 'H' indicates the heart suit, 'D' indicates diamonds, 'C' indicates clubs, and 'S' indicates spades:

proc iml;
/* create a deck of 52 playing cards */
suits52 = rowvec( repeat({H, D, C, S},1,13) );
vals = char(2:10,2) || {J Q K A};
vals52 = repeat( right(vals), 1, 4 );
Cards = vals52 + suits52;
/* choose 20 cards without replacement from deck */
call randseed(293053001);
deal = sample(Cards, 20, "WOR");  /* sample 20 cards without replacement */

The third argument to the SAMPLE function is the value "WOR", which stands for "without replacement." With this option, the SAMPLE function returns 20 cards from the deck such that no card appears more than once in the sample. If there are four card players who are playing poker and each gets five cards, you can use the SHAPE function to reshape the 20 cards into a matrix such that each column indicates a player's poker hand:

PokerHand = shape(deal, 0, 4);   /* reshape vector into 4 column matrix */
print PokerHand[c=("Player1":"Player4")];

Let's see what poker hands these players were dealt. The first player has a pair of 2s. The second player has a pair of queens. The third player has three kings, and the fourth player has a flush! That was a heck of a deal! (I'll leave it to the sharp-eyed reader to figure out how I "cheated" in order to simulate such an improbable "random" sample. Extra credit if you link to a blog post of mine in which I explain the subterfuge.)

The SAMPLE function provides a second way to sample without replacement. If the third argument is "NoReplace", then a faster algorithm is used to extract a sample. However, the sample is in the same order as the original elements, which might not be acceptable. For the poker example, the "WOR" option enables you to simulate a deal. If you use the "NoReplace" option, then you should first use the RANPERM function to shuffle the deck. Of course, if you only care about the sample as a set rather than as a sequence, then using the faster algorithm makes sense.
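The shuffle-then-take idiom carries over directly to other languages. In Python, for example, random.sample draws without replacement in one call, and shuffling followed by slicing is the two-step equivalent (the deck construction and seed below are mine, for illustration):

```python
import random

# Build a 52-card deck: 13 values in each of 4 suits
deck = [v + s for s in "HDCS" for v in
        ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]]

random.seed(293053001)
deal = random.sample(deck, 20)           # 20 cards without replacement

# Equivalent two-step idiom: shuffle the whole deck, then take the first 20
random.shuffle(deck)
hand = deck[:20]
print(len(set(deal)), len(set(hand)))    # no card repeats in either sample
```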

One more awesome feature of the SAMPLE function: it enables you to sample with unequal probabilities by adding a fourth argument to the function call.

Sampling without replacement by using the SURVEYSELECT procedure

As mentioned above, some algorithms generate a sample whose elements are in the same order as the original data. This is the case with the SURVEYSELECT procedure when you use the METHOD=SRS option. Suppose that you write the 52 cards to a SAS data set. You can use the SURVEYSELECT procedure to extract 20 cards without replacement, as follows:

create Deck var {"Cards"}; append; close Deck; /* create data set */
quit;

proc surveyselect data=Deck out=Poker seed=1                   
     method=srs       /* sample w/o replacement */
     sampsize=20;     /* number of observations in sample */
run;

proc print data=Poker(obs=8); run;

The output shows the first eight observations in the sample. You can see that the hearts appear first, followed by the diamonds, followed by the clubs, and that within each suit the values of the cards are in their original order. If you want the data in a random order, imitate the DATA step code in the SAS Knowledge Base article "Simple Random Sample without Replacement."


Sample with replacement in SAS

Randomly choosing a subset of elements is a fundamental operation in statistics and probability. Simple random sampling with replacement is used in bootstrap methods (where the technique is called resampling), permutation tests and simulation.

Last week I showed how to use the SAMPLE function in SAS/IML software to sample with replacement from a finite set of data. Because not everyone is a SAS/IML programmer, I want to point out two other ways to sample (with replacement) observations in a SAS data set.

  • The DATA Step: You can use the POINT= option in the SET statement to randomly select observations from a SAS data set. For many data sets you can use the SASFILE statement to read the entire data set into memory, which improves the performance of random access.
  • The SURVEYSELECT Procedure: You can use the SURVEYSELECT procedure to randomly select observations according to several sampling schemes. Again, you can use the SASFILE statement to improve performance.
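Both approaches implement the same primitive: draw random row indices with replacement, then fetch those rows. In Python terms (a sketch with made-up data, not the SAS internals):

```python
import random

rows = ["Acura", "Audi", "BMW", "Buick", "Cadillac"]   # stand-in for a data set

random.seed(1)
n_obs = len(rows)
# draw 5 row indices WITH replacement, then fetch those rows
sample = [rows[random.randrange(n_obs)] for _ in range(5)]
print(sample)
```

The one-call equivalent is random.choices(rows, k=5), which also accepts weights for unequal-probability sampling.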

The material in this article is taken from Chapter 15, "Resampling and Bootstrap Methods," of Simulating Data with SAS.

The DATA step

The following DATA step randomly selects five observations from the Sashelp.Cars data set:

sasfile Sashelp.Cars load;              /* 1. Load data set into memory */
data Sample(drop=i);
call streaminit(1);
do i = 1 to 5;         
   p = ceil(NObs * rand("Uniform"));    /* random integer 1-NObs */
   set Sashelp.Cars nobs=NObs point=p;  /* 2. POINT= option; random access */
   output;                              /* write the randomly chosen obs */
end;
stop;                                   /* 3. Use the STOP stmt */
run;
sasfile Sashelp.Cars close;

A few statements in this DATA step require additional explanation. They are indicated by numbers inside of comments:

  1. Provided that the data set is not too large, use the SASFILE statement to load the data into memory, which speeds up random access.
  2. The NOBS= option stores the number of observations in the Sashelp.Cars data into the NObs variable before the DATA step runs. Consequently, the value of NObs is available throughout the DATA step, even on statements that execute prior to the SET statement.
  3. The POINT= option is used to read the (randomly chosen) observation.
  4. The STOP statement must be used to end the DATA step processing when you use the POINT= option, because SAS never encounters the end-of-file indicator during random access.

For the example, the selected observation numbers are 379, 417, 218, 380, and 296. You can print the random sample to see which observations were selected:

proc print data=Sample noobs;
   var Make Model MPG_City Length Weight;
run;

The SURVEYSELECT procedure

The main SAS procedure for (re)sampling is called the SURVEYSELECT procedure. The name is a bit unfortunate because statisticians and programmers who are new to SAS might browse the documentation and completely miss the relevance of this procedure. (To me, it is "PROC RESAMPLE.") The SURVEYSELECT procedure has many methods for sampling, but the method for sampling with replacement is known as unrestricted random sampling (URS). The following call creates an output data set that contains five observations that are sampled (with replacement) from the Sashelp.Cars data:

proc surveyselect data=Sashelp.Cars out=Sample2 NOPRINT 
     seed=1                                         /* 1 */
     method=urs sampsize=5                          /* 2 */
     outhits;                                       /* 3 */
run;

The call to PROC SURVEYSELECT has several options, which are indicated by numbers in the comments:

  1. The SEED= option specifies the seed value for random number generation. If you specify a zero seed, then omit the NOPRINT option so that the value of the chosen seed appears in the procedure output.
  2. The METHOD=URS option specifies unrestricted random sampling, which means sampling with replacement and with equal probability. The SAMPSIZE= option specifies the number of observations to select, or you can use the SAMPRATE= option to specify a proportion of observations.
  3. The OUTHITS option specifies that the output data set contains five observations, even if a record is selected multiple times. If you omit the OUTHITS option, then the output data set might have fewer observations, and the NumberHits variable contains the number of times that each record was selected.

If you are using SAS/STAT 12.1 or later, the output data set contains exactly the same observations as for the DATA step example, because PROC SURVEYSELECT uses the same random number generator (RNG) as the RAND function. Prior to SAS/STAT 12.1, PROC SURVEYSELECT used the older RNG that is used by the RANUNI function.

In summary, if you need to sample observations from a SAS data set, you can implement a simple sampling scheme in the DATA step or you can use PROC SURVEYSELECT. I recommend PROC SURVEYSELECT because the procedure makes it clearer what sampling method is being used and because the procedure supports other, more complex, sampling schemes that are also useful.


Ulam spirals: Visualizing properties of prime numbers with SAS

Prime numbers are strange beasts. They exhibit properties of both randomness and regularity. Recently I watched an excellent nine-minute video on the Numberphile video blog that shows that if you write the natural numbers in a spiral pattern (called the Ulam spiral), then there are certain lines in the pattern that are very likely to contain prime numbers.

The image to the left shows the first 22,500 natural numbers arranged in the Ulam spiral pattern within a 150 × 150 grid. Cells that contain prime numbers are colored black. Cells that contain composite numbers are colored white. You can see certain diagonal lines with slope ±1 that contain a large number of prime numbers. There are conjectures in number theory that explain why certain lines along the Ulam spiral have a greater density of prime numbers than other lines. (The diagonal lines in the spiral correspond to quadratic equations.)

I don't know enough about prime numbers to blog about the mathematical properties of the Ulam spiral, but as soon as I saw the Ulam spiral explained, I knew that I wanted to generate it in SAS. The Ulam spiral packs the natural numbers into a square matrix in a certain order. This post describes how to construct the Ulam spiral in the SAS/IML matrix language.

The Ulam spiral construction

The Ulam spiral is simple to describe. On a gridded piece of paper, write down the number 1. Then write successive integers as you spiral out in a counter-clockwise fashion, as shown in the accompanying figure.

Although you can stop writing numbers at any time, for the purpose of this post let's assume you want to fill an N x N array with numbers. Notice that when N is even, the last number in the array (N² = 4, 16, 36, ...) appears in the upper left corner of the array, whereas for odd N, the last number (9, 25, 49, ...) appears in the lower right corner. I found this asymmetry hard to deal with, so I created an algorithm that fills the N x N matrix so that the N² term is always in the upper left. For odd values of N, the algorithm rotates the matrix after inserting the elements. I've previously shown that it is easy to write SAS/IML statements that rotate elements in a matrix.

My algorithm constructs the spiral iteratively. Notice that if you have constructed the spiral correctly for an (N–2) x (N–2) array, you can construct the N x N array by adding two vertical columns (first and last) and two horizontal rows (first and last). I call the outer rows and columns a "frame" because they remind me of a picture frame. Given a value for N, you can figure out the starting and ending values of each row and column of the frame. You can use the SAS/IML index creation operator to create each row and column, as shown in the following program:

proc iml;
start SpiralFrame(n);
   if n=1 then return(1);
   if n=2 then return({4 3, 1 2});
   if n=3 then return({9 8 7, 2 1 6, 3 4 5});
   X = j(n,n,.);
   /* top of frame. 's' means 'start'. 'e' means 'end' */
   s = n##2; e = s - n + 2;  X[1,1:n-1] = s:e;
   /* right side of frame */
   s = e - 1; e = s - n + 2; X[1:n-1,n] = (s:e)`;
   /* bottom of frame */
   s = e - 1; e = s - n + 2; X[n,n:2] = s:e;
   /* left side of frame */
   s = e - 1; e = s - n + 2; X[n:2,1] = (s:e)`;
   return( X );
finish;

/* test the frame construction */
M2 = SpiralFrame(2);
M4 = SpiralFrame(4);
M6 = SpiralFrame(6);
print M2, M4, M6;

You can see that the M4 matrix fits inside the M6 frame. Similarly, the M2 matrix fits inside the M4 frame.

After writing the code that generates the frames, the rest of the construction is easy. Start by creating the frame of size N. Decrease N by 2 and iteratively create a frame of size N, taking care to insert the (N–2) x (N–2) array into the interior of the existing N x N array. After the array is filled, rotate the array by 180 degrees if N is odd. The following SAS/IML statements implement this algorithm:

start Rot180(m);
   return( m[nrow(m):1,ncol(m):1] );
finish;
/* Create N x N integer matrix with elements arranged in an Ulam spiral */
start UlamSpiral(n);
   X = SpiralFrame(n);             /* create outermost frame */
   k = 2;
   do i = n-2 to 2 by -2;          /* iteratively create smaller frames */
      r = k:n-k+1;                 /* rows (and cols) to insert smaller frame */
      X[r, r] = SpiralFrame(i);    /* insert smaller frame */
      k = k+1;
   end;
   if mod(n,2)=1 then return(Rot180(X));  /* rotate 180 degrees for odd N */
   return( X );
finish;
U10 = UlamSpiral(10);              /* test program by creating 10 x 10 matrix */
print U10[f=3.];
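The frame-by-frame algorithm is language-independent. As a cross-check (not part of the original post), here is a minimal Python sketch of the same construction; the function names are my own:

```python
def spiral_frame(n):
    """Return an n x n list of lists whose outer frame holds the values
    (n-2)**2 + 1, ..., n**2 in Ulam-spiral order (None in the interior)."""
    if n == 1: return [[1]]
    if n == 2: return [[4, 3], [1, 2]]
    if n == 3: return [[9, 8, 7], [2, 1, 6], [3, 4, 5]]
    X = [[None] * n for _ in range(n)]
    v = n * n
    for j in range(n - 1):               # top row, left to right
        X[0][j] = v; v -= 1
    for i in range(n - 1):               # right column, top to bottom
        X[i][n - 1] = v; v -= 1
    for j in range(n - 1, 0, -1):        # bottom row, right to left
        X[n - 1][j] = v; v -= 1
    for i in range(n - 1, 0, -1):        # left column, bottom to top
        X[i][0] = v; v -= 1
    return X

def ulam_spiral(n):
    """Fill an n x n matrix with 1..n**2 in an Ulam spiral."""
    X = spiral_frame(n)                  # create outermost frame
    k = 1
    for i in range(n - 2, 1, -2):        # iteratively insert smaller frames
        F = spiral_frame(i)
        for r in range(i):
            for c in range(i):
                if F[r][c] is not None:
                    X[k + r][k + c] = F[r][c]
        k += 1
    if n % 2 == 1:                       # rotate 180 degrees for odd n
        X = [row[::-1] for row in X[::-1]]
    return X

for row in ulam_spiral(5):
    print(row)
```

As in the SAS/IML version, the even-sized frames put N² in the upper left, and a 180-degree rotation handles the odd case so that 1 ends up in the center.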

You might not have much need to generate Ulam spirals in your work, but this exercise demonstrates several important principles of SAS/IML programming:

  • It is often convenient to calculate the first and last elements of an array and to use the index creation operator (:) to generate the vector of elements between them.
  • Notice that I generated the largest frame first and inserted the smaller frames inside of it. That is more efficient than generating the smaller frames and then using concatenation to grow the matrix.
  • In the same way, sometimes an algorithm is simpler or more efficient to implement if your DO loops run in reverse order.

The heat map at the beginning of this article was created by using the new HEATMAPDISC subroutine in SAS/IML 13.1. I will blog about heat maps in a future post. If you have SAS/IML 13.1 and want to play with Ulam spirals yourself, you can download the SAS code used in this blog post.


Sampling with replacement: Now easier than ever in the SAS/IML language

With each release of SAS/IML software, the language provides simple ways to carry out tasks that previously required more effort. In 2010 I blogged about a SAS/IML module that appeared in my book Statistical Programming with SAS/IML Software, which was written by using SAS/IML 9.2. The blog post showed how to sample with replacement (with equal probability) from a finite set of possibilities.

As of SAS/IML 12.1, there is a built-in function that returns random samples from a finite set. The SAMPLE function makes it easy to do the following:

  • Sample with replacement with equal or varying probabilities
  • Sample without replacement
  • Generate multiple samples with a single call

Sampling with replacement is a common task for bootstrap (resampling) methods, so let's start by discussing sampling with replacement.

Sample with replacement with equal probability

You can use the SAMPLE function in the SAS/IML language to sample with replacement from a finite set. In my 2010 article, I used the example of choosing five elements at random from the set {1, 2, ..., 8}. The following call shows how to use the built-in SAMPLE function for this task:

proc iml;
call randseed(1234);
s = sample(1:8, 5);  /* randomly choose 5 elements from the set 1:8 */
print s;

The default sampling scheme is to sample with replacement, which is why the element 3 appears twice in the random sample. Notice that the random number seed for the SAMPLE function is set by using the RANDSEED subroutine.

If you want to generate multiple samples, each of size five, you can specify a two-element vector for the second argument. The first element specifies the sample size. The second element specifies the number of samples, which is the number of rows in the output matrix. For example, the following statement generates six random samples. Each row is one random sample and contains five elements.

s6 = sample(1:8, {5 6});  /* sample size=5; number of samples=6 */
print s6;
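For readers who want to experiment outside of SAS, the same idea can be sketched in Python's standard library (the seed and the 6 x 5 arrangement below are my own illustration; Python's RNG will not reproduce the SAS/IML sample):

```python
import random

random.seed(1234)
# six samples, each of size 5, drawn with replacement from {1, ..., 8};
# each inner list plays the role of one row of the SAS/IML output matrix
s6 = [random.choices(range(1, 9), k=5) for _ in range(6)]
for row in s6:
    print(row)
```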

Sample with replacement with varying probabilities

The SAMPLE function also supports sampling with unequal probabilities. Since SAS is known for having free M&Ms® in the breakrooms, here's an M&M-inspired example. There is a large jar of plain M&Ms in my breakroom. The M&Ms are different colors: 30% are brown, 20% are yellow, 20% are red, 10% are green, 10% are orange, and 10% are blue. I'll use the SAMPLE function to simulate drawing 20 M&Ms from the jar. Although in real life I would never select an M&M and then replace it back into the jar (Yuck! Unsanitary!), the jar is so large that the probabilities are approximately constant during the sampling, so I can use the sampling with replacement method.

colors = {"Brown" "Yellow" "Red" "Green" "Orange" "Blue"};
prob =   {0.3 0.2 0.2 0.1 0.1 0.1};
snack = sample(colors, 20, "Replace", prob);  /* a 1x20 vector of colors */
call tabulate(category, freq, snack);         /* count how many of each color */
print freq[colname=category];

For this sample of 20 candies, more than half of the sample is brown; I did not draw any greens or oranges. The output of the SAMPLE function is a 1 x 20 vector of colors. If I only want the total number of each color—and not the sample itself—I could use the RANDMULTINOMIAL function to simulate the counts directly, rather than use the SAMPLE function and the TABULATE subroutine.
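If you do not have SAS/IML, the standard Python library supports the same weighted sampling-with-replacement scheme. Here is a hedged sketch under my own seed (the counts will differ from the SAS output):

```python
import random
from collections import Counter

colors = ["Brown", "Yellow", "Red", "Green", "Orange", "Blue"]
prob   = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]

random.seed(4321)
snack = random.choices(colors, weights=prob, k=20)  # sample with replacement
freq = Counter(snack)     # tabulate the counts, like CALL TABULATE
print(freq)
```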

Sampling observations in a data matrix

If you have data in a SAS/IML matrix, you can sample the observations by sampling from the integers 1, 2, ..., N, where N is the number of rows of the matrix. For example, the following statements read in 428 observations from the Sashelp.Cars data set. The SAMPLE function is used to draw a random sample that contains five observations. Each observation contains information about a random vehicle in the data set.

proc iml;
call randseed(1);
use Sashelp.Cars;
varNames = {"MPG_City" "Length" "Weight"};
read all var varNames into x[rowname=Model];
close Sashelp.Cars;
obsIdx = sample(1:nrow(x), 5);   /* sample size=5, rows chosen from 1:NumRows */
s5 = x[obsidx, ];                /* extract subset of rows */
print s5[rowname=(Model[obsIdx]) colname=varNames];
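The same index-based trick works in any matrix language: sample row indices, then subset the rows. Here is a minimal Python sketch (the data matrix is a made-up stand-in, since Sashelp.Cars is a SAS data set):

```python
import random

random.seed(1)
# a hypothetical 428 x 3 data matrix standing in for Sashelp.Cars
x = [[20 + i % 15, 170.0 + i % 40, 2500 + 5 * i] for i in range(428)]

obs_idx = random.choices(range(len(x)), k=5)  # 5 row indices, with replacement
s5 = [x[i] for i in obs_idx]                  # extract the sampled rows
print(s5)
```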

In summary, this article has shown how to use the SAMPLE function in SAS/IML 12.1 to sample with replacement from a finite set. In future posts I will show how to use other SAS tools to resample from a data set and how to sample without replacement.


The best articles of 2013: Twelve posts from The DO Loop that merit a second look

I began 2014 by compiling a list of 13 popular articles from my blog in 2013. Although this "People's Choice" list contains many articles that I am proud of, it did not include all of my favorites, so I decided to compile an "Editor's Choice" list. The blog posts on today's list were not widely popular, but they deserve a second look because they describe SAS computations that are elegant, surprising, or just plain useful.

I often write about four broad topics: the SAS/IML language, statistical programming, simulating data, and data analysis and visualization. My previous article included five articles on statistical graphics and data analysis, so here are a few of my favorite articles from the other categories.

The SAS/IML language and matrix programming

Here are a few articles that can help you get more out of the SAS/IML product:

Serious SAS/IML programmers should read my book Statistical Programming with SAS/IML Software.

Simulating data with SAS

Here are a few of my favorite articles from 2013 about efficient simulation of data:

Readers who want to learn more about simulating data might enjoy my book Simulating Data with SAS, which contains hundreds of examples and exercises.

Statistical programming and computing

Much of my formal training is in numerical analysis and matrix computations. Here are a few interesting computational articles that I wrote. Be warned: some of these articles have a high "geek factor"!

There you have it, my choice of 12 articles that I think are worth a second look. What is your favorite post from The DO Loop in 2013? Leave a comment.
