How to automatically select a smooth curve for a scatter plot in SAS

My last blog post described three ways to add a smoothing spline to a scatter plot in SAS. I ended the post with a cautionary note:

From a statistical point of view, the smoothing spline is less than ideal because the smoothing parameter must be chosen manually by the user.

This means that you can specify a smoothing parameter that produces a curve that appears to fit the data, but the selection is not based on an objective criterion. In contrast, SAS software supports several other methods for which the software can automatically choose a smoothing parameter that maximizes the goodness of fit while avoiding overfitting. This article describes penalized B-splines, loess curves, and thin-plate splines.

Penalized B-splines

You can fit penalized B-splines to data by using the TRANSREG procedure or by using the PBSPLINE statement in the SGPLOT procedure. The TRANSREG procedure enables you to specify many options that control the fit, including the criterion used to select the optimal smoothing parameter. The TRANSREG procedure supports the corrected Akaike criterion (AICC), the generalized cross-validation criterion (GCV), and Schwarz’s Bayesian criterion (SBC), among others. These criteria try to produce a curve that fits the data well, but does not interpolate or overfit the data. The default criterion is the AICC.

To illustrate fitting a smooth curve to a scatter plot, I'll use the SasHelp.enso data set, which contains data on the "southern oscillation," a cyclical phenomenon in atmospheric pressure that is linked to the El Niño and La Niña temperature oscillations in the Pacific Ocean. (ENSO is an acronym for El Niño-Southern Oscillation.) The ENSO data is analyzed in an example in the PROC TRANSREG documentation.

The following statements fit a penalized B-spline to the ENSO data:

proc transreg data=sashelp.enso;
   model identity(pressure) = pbspline(year); /* AICC criterion */
run;

The data shows an oscillation of pressure in a yearly cycle. There are 14 peaks and valleys in this 14-year time series, which correspond to 14 winters and 14 summers.

Loess curves

A loess curve is not a spline. A loess model at x uses a local neighborhood of x to compute a weighted least squares estimate. The smoothing parameter is the proportion of data in the local neighborhood: a value near 0 results in a curve that nearly interpolates the data whereas a value near 1 is nearly a straight line. You can fit loess curves to data by using the LOESS procedure or by using the LOESS statement in the SGPLOT procedure. The LOESS procedure enables you to specify many options that control the fit, including the criterion used to select the optimal smoothing parameter. The LOESS procedure supports the AICC and GCV criteria, among others. The default criterion is the AICC.

The default loess fit to the ENSO data is computed as follows:

ods select FitSummary CriterionPlot FitPlot;
proc loess data=sashelp.enso;
   model pressure = year;
run;

The loess curve looks different than the penalized B-spline curve. The loess curve seems to contain irregularly spaced peaks that are 4–7 years apart. This is the El Niño oscillation cycle, which is an irregular cycle. The selected smoothing parameter is about 0.2, which means that about one-fifth of the observations are used for each local neighborhood during the estimation.

Earlier I said that the LOESS and TRANSREG procedures automatically choose parameter values that optimize some criterion. So why the difference in these plots? The answer is that the criterion curves sometimes have local minima, and for the ENSO data the LOESS procedure found a local minimum that corresponds to the El Niño cycle. The curve that shows the AICC criterion as a function of the smoothing parameter is shown below:

The LOESS procedure supports several options that enable you to explore different ranges for the smoothing parameter or to search for a global minimum. I recommend using the PRESEARCH option in order to increase the likelihood of finding a global minimum. For example, for the ENSO data, the following call to PROC LOESS finds the global minimum at 0.057 and produces a scatter plot smoother that looks very similar to the penalized B-spline smoother.

proc loess data=sashelp.enso;
   model pressure = year / select=AICC(presearch);
run;

Smoothers with the SGPLOT procedure

Because the loess and penalized B-spline smoothers work well for most data, the SGPLOT procedure supports the PBSPLINE and LOESS statements for adding smoothers to a scatter plot. By default the statements also add the scatter plot markers, so use the NOMARKERS option if you are overlaying a smoother on an existing scatter plot:

proc sgplot data=sashelp.enso;
pbspline x=year y=pressure;           /* also draws scatter plot */
loess x=year y=pressure / nomarkers;  /* uses PRESEARCH option */
run;

As you can see, the two smoothers produce very similar results on the ENSO data. That is often the case, but not always.

Thin-plate splines

I'm not going to show examples of every nonparametric regression procedure in SAS, but I do want to mention one more. A thin-plate spline is yet another spline that can be used to smooth data. The TPSPLINE procedure supports the GCV criterion for the automatic selection of the smoothing parameter. For the ENSO data, the TPSPLINE procedure produces a smoother that reveals the annual pressure cycle and is very similar to the previous curves:

ods select FitStatistics CriterionPlot FitPlot;
proc tpspline data=sashelp.enso;
   model pressure = (year);
run;

In conclusion, SAS software provides several ways to fit a smoother to data. The procedures mentioned in this article can automatically select a smoothing parameter that satisfies some objective criterion which balances goodness of fit with model parsimony. Do you have a favorite smoother? Let me know in the comments.


Three ways to add a smoothing spline to a scatter plot in SAS

Like many SAS programmers, I use the Statistical Graphics (SG) procedures to graph my data in SAS. To me, the SGPLOT and SGRENDER procedures are powerful, easy to use, and produce fabulous ODS graphics. I was therefore surprised when a SAS customer told me that he continues to use the old GPLOT procedure because "the procedure makes it easy to add a smoothing spline curve to a scatter plot." The customer likes using smoothing splines, which are produced by using the INTERPOL=SMnn option on the SYMBOL statement. He added that smoothing splines are not available in PROC SGPLOT.

My response: If you like the GPLOT procedure, you are certainly welcome to continue using it. However, if the only reason you use it is to add a spline curve to a scatter plot, you should know that there are alternatives. This article shows how to fit a spline curve by using PROC TRANSREG or PROC IML.

Splines, splines, everywhere

A spline is one way to fit a smooth curve to two-dimensional data. There are actually many kinds of splines. The "smoothing spline" that the customer likes is a cubic spline, but SAS supports thin-plate splines and penalized B-splines, just to name two others. Let's see what a spline curve looks like as created by PROC GPLOT and the SYMBOL INTERPOL=SMnn option:

/* data set A must be sorted by x */
data A;
input x y @@;
datalines;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
 
title "sm45 spline smoother";
proc gplot data=A;
   plot y*x;
   symbol1 interpol=sm45 value=circle height=2; /* note that x is sorted */
run;

The graph displays a scatter plot overlaid with a spline smoother. In the SYMBOL statement, the last two digits of the INTERPOL=SM option specify the amount of smoothness in the spline curve. A value close to zero results in a curve with many wiggles that nearly passes through every data point. A value close to 100 results in a flat curve that is nearly the least-squares regression line for the data. Many examples that I have seen use smoothing values in the range 40–60.
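To see the effect of this value, here is a sketch that overlays two extreme choices on the same data; it reuses the data set A defined above and the GPLOT OVERLAY option:

title "Compare smoothing values";
proc gplot data=A;
   plot y*x=1 y*x=2 / overlay;
   symbol1 interpol=sm05 value=circle;   /* small value: wiggly, nearly interpolates */
   symbol2 interpol=sm90 value=none;     /* large value: nearly the regression line  */
run;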

Computing a spline curve by using the TRANSREG procedure

The easiest way to generate an ODS graph that shows the same curve is to use the TRANSREG procedure in SAS/STAT software. The TRANSREG procedure supports a SMOOTH transformation of an explanatory variable. The SM= option enables you to specify a smoothing value in the range [0, 100], which is the same range as for PROC GPLOT.
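For example, the following minimal call requests the same smoothness as the earlier GPLOT example:

proc transreg data=A;
   model identity(y) = smooth(x / sm=45);
run;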

Because there are only eight observations, a plot of the predicted values versus the explanatory variable will look jagged. To see a smooth curve of predicted values, you can evaluate the spline curve at a series of evenly spaced points within the range of X. The TRANSREG procedure does not support a SCORE or STORE statement, but you can use the missing response trick to visualize the spline curve. The following DATA step creates 200 evenly spaced points in the range [1, 8]. A second DATA step concatenates the original data and the evenly spaced points:

/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);  
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;
 
/* concatenate the original data and the scoring data */
data B;  set A ScoreX;  run;
 
/* use the missing response trick to evaluate the spline on the scoring data */
proc transreg data=B plots(interpolate);
model identity(y) = smooth(x / sm=45);
output out=Pred p;       /* optional: write spline curve to a data set */
run;

The spline curve produced by PROC TRANSREG is identical to the smoothing spline that is produced by PROC GPLOT. The OUTPUT statement in the TRANSREG procedure enables you to write the values from the spline curve to a SAS data set, which means that you can also use PROC SGPLOT to create this graph.
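For example, the following sketch plots the spline curve from the Pred data set. It assumes that the predicted values are in a variable named Py (PROC TRANSREG typically forms the name by prefixing the dependent variable with 'P'), so check the variable names in Pred:

proc sort data=Pred out=PredSort;  by x;  run;

proc sgplot data=PredSort noautolegend;
   scatter x=x y=y;    /* original points; y is missing for the scoring points */
   series x=x y=Py;    /* the spline curve evaluated on the scoring points    */
run;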

Computing a spline curve in SAS/IML

SAS/IML software computes smoothing splines by using the SPLINE subroutine. The routine enables you to compute interpolating splines, smoothing splines, and parametric splines, and to control the behavior of the curves at their endpoints. However, the SPLINE routine uses a different parameterization than the GPLOT and TRANSREG procedures.

In the SPLINE subroutine, the parameter that controls the smoothing can be any nonnegative value. The parameter is a geometric transformation of the SM= value in PROC TRANSREG, which means that you can specify the smoothing parameter from GPLOT and convert it into the corresponding parameter for the SPLINE routine. Specifically, the following SAS/IML statements compute a smoothing spline for the data:

proc iml;
  use A;  read all var {x y};  close A;   /* read data */
 
  /* supply the smoothing parameter value from the SYMBOL statement */
  sm = 45;                          /* value in the INTERPOL=SM option         */
  ssd = ssq(x-mean(x));             /* sum of squared deviations from the mean */
  t = ssd##1.5 / (10##(9-(sm*.1))); /* SPLINE parameter as a function of SM    */
 
  /* fit the spline and get interpolated points */
  call spline(fitted, x||y) smooth=t;
  create SplineFit from fitted[c={x p}];
  append from fitted;
quit;
 
/* concatenate with original data */
data C;
  set A SplineFit(rename=(x=copyX));
run;
 
proc sgplot data=C;
   title "Spline from SPLINEC routine in SAS/IML";
   scatter x=x y=y;
   series x=copyX y=p;
run;

The values produced by PROC IML are identical to the spline values that were computed by PROC TRANSREG. PROC SGPLOT produces a graph (not shown) that is nearly identical to the graph that was produced by PROC TRANSREG. In conclusion, there are two easy ways to compute smoothing splines outside of PROC GPLOT.

Are there other features of PROC GPLOT that are unsupported by PROC SGPLOT and that you can't live without? In the comments, tell me your favorite feature of PROC GPLOT that prevents you from switching to PROC SGPLOT.

Other ways to smooth a scatter plot

One last comment on splines. I've shown that you can create a smoothing spline with PROC GPLOT, PROC TRANSREG, or the SPLINE subroutine in PROC IML. However, from a statistical point of view, the smoothing spline is less than ideal because the smoothing parameter must be chosen manually by the user. My next blog post describes how to add smoothing curves whose parameters are chosen automatically to optimize a statistical criterion.


Techniques for scoring a regression model in SAS

My previous post described how to use the "missing response trick" to score a regression model. As I said in that article, there are other ways to score a regression model. This article describes using the SCORE procedure, a SCORE statement, the relatively new PLM procedure, and the CODE statement.

The following DATA step defines a small set of data. The goal of the analysis is to fit various regression models to Y as a function of X, and then evaluate each regression model on a second data set, which contains 200 evenly spaced X values.

/* the original data; fit model to these values */
data A;               
input x y @@;
datalines;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
 
/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;

The SCORE procedure

Some SAS/STAT procedures can output parameter estimates for a model to a SAS data set. The SCORE procedure can read those parameter estimates and use them to evaluate the model on new values of the explanatory variables. (For a regression model, the SCORE procedure performs matrix multiplication: you supply the scoring data X and the parameter estimates b and the procedure computes the predicted values p = Xb.)

The canonical example is fitting a linear regression by using PROC REG. You can use the OUTEST= option to write the parameter estimates to a data set. That data set, which is named RegOut in this example, becomes one of the two input data sets for PROC SCORE, as follows:

proc reg data=A outest=RegOut noprint;
   YHat: model y = x;    /* name of model is used by PROC SCORE */
quit;
 
proc score data=ScoreX score=RegOut type=parms predict out=Pred;
   var x;
run;

It is worth noting that the label for the MODEL statement in PROC REG is used by PROC SCORE to name the predicted variable. In this example, the YHat variable in the Pred data set contains the predicted values. If you do not specify a label on the MODEL statement, then a default name such as MODEL1 is used. For more information, see the documentation for the SCORE procedure.
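To make the matrix multiplication concrete, here is a short SAS/IML sketch that reproduces the computation by hand. It assumes the RegOut and ScoreX data sets from the previous steps:

proc iml;
use RegOut;
read all var {Intercept x} into b where(_TYPE_="PARMS"); /* b = (b0, b1) */
close RegOut;

use ScoreX;  read all var {x};  close ScoreX;
X = j(nrow(x), 1, 1) || x;     /* design matrix: intercept column and x */
p = X * b`;                    /* predicted values p = Xb */
print (p[1:5])[label="YHat"];
quit;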

The SCORE statement

Nonparametric regression procedures cannot output parameter estimates because...um...because they are nonparametric! Nonparametric regression procedures support a SCORE statement, which enables you to specify the scoring data set. The following example shows the syntax of the SCORE statement for the TPSPLINE procedure, which fits a thin-plate spline to the data:

proc tpspline data=A;
   model y = (x);
   score data=ScoreX out=Pred;
run;

Other nonparametric procedures that support the SCORE statement include the ADAPTIVEREG procedure (new in SAS/STAT 12.1), the GAM procedure, and the LOESS procedure.
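For example, here is a sketch that scores the loess model on the ScoreX data. The SCORE statement in PROC LOESS does not support an OUT= option, so the sketch captures the ScoreResults ODS table instead; the PRINT option and the table name reflect my reading of the documentation, so verify them before you rely on this:

proc loess data=A;
   model y = x;
   score data=ScoreX / print;      /* print (and thus produce) the scored values */
   ods output ScoreResults=Pred;   /* save the scored values to a data set */
run;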

The STORE statement and the PLM procedure

Although the STORE statement and the PLM procedure were introduced in SAS/STAT 9.22 (way back in 2010), some SAS programmers are still not aware of these features. Briefly, the idea is that sometimes a scoring data set is not available when a model is fit, so the STORE statement saves all of the information needed to recreate and evaluate the model. The saved information can be read by the PLM procedure, which includes a SCORE statement, as well as many other capabilities. A good introduction to the PLM procedure is Tobias and Cai (2010), "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models."

For this example, the GLM procedure is used to fit the data. Because of the shape of the previous thin-plate spline curve, a cubic model is fit. The STORE statement is used to save the model information in an item store named WORK.ScoreExample. (I've used the WORK libref, but use a permanent libref if you want the item store to persist across SAS sessions.) Many hours or days later, you can use the PLM procedure to evaluate the model on a new set of data, as shown in the following statements:

proc glm data=A;
   model y = x | x | x;    
   store work.ScoreExample;     /* store the model */
quit;
 
proc plm restore=work.ScoreExample;
   score data=ScoreX out=Pred;  /* evaluate the model on new data */
run;

The STORE statement is supported by many SAS/STAT regression procedures, including the GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG procedures. It also applies to the RELIABILITY procedure in SAS/QC software.

The CODE statement

In SAS/STAT 12.1 the CODE statement was added to several SAS/STAT regression procedures. It is also part of the PLM procedure. The CODE statement offers yet another option for scoring data. The CODE statement writes DATA step statements into a text file. You can then use the %INCLUDE statement to insert those statements into a DATA step. In the following example, DATA step statements are written to the file glmScore.sas. You can include that file into a DATA step in order to evaluate the model on the ScoreX data:

proc glm data=A noprint;
   model y = x | x | x;  
   code file='glmScore.sas';
quit;
 
data Pred;
set ScoreX;
%include 'glmScore.sas';
run;

For this example, the predicted values are in a variable called P_y in the Pred data set. The CODE statement is supported by many predictive modeling procedures, such as the GENMOD, GLIMMIX, GLM, GLMSELECT, LOGISTIC, MIXED, PLM, and REG procedures in SAS/STAT software. In addition, the CODE statement is supported by the HPLOGISTIC and HPREG procedures in SAS High-Performance Analytics software.

In summary, there are many ways to score SAS regression models. For PROC REG and linear models with an explicit design matrix, use the SCORE procedure. For nonparametric models, use the SCORE statement. For scoring data sets long after a model is fit, use the STORE statement and the PLM procedure. For scoring inside the DATA step, use the CODE statement. For regression procedures that do not support these options (such as PROC TRANSREG), use the missing value trick from my last post.

Did I leave anything out? What is your favorite technique to score a regression model? Leave a comment.


The missing value trick for scoring a regression model

A fundamental operation in statistical data analysis is to fit a statistical regression model on one set of data and then evaluate the model on another set of data. The act of evaluating the model on the second set of data is called scoring.

One of the first "tricks" that I learned when I started working at SAS was how to score regression data in procedures that do not support the SCORE statement. I think almost every SAS statistical programmer learns this trick early in their career, usually from a more experienced SAS programmer. The trick is used in examples in the SAS/STAT User's Guide and on discussion forums, but it is not often explicitly discussed, probably because it is so well known. However, I had to search the internet for a while before I found a SUGI paper that describes the trick. In an effort to assist new SAS programmers, here is an explanation of how to score regression models by using the "missing value trick," which is also called the "missing response trick" or the "missing dependent variable trick."

Suppose that you want to fit a regression model to some data. You also have a second set of explanatory values at which you want to score the regression model. For example, the following DATA step creates training data for the regression model. A subsequent DATA step creates evenly spaced values of the X variable. The goal is to evaluate the regression model on the second data set.

data A;       /* the original data; fit model to these values */
input x y @@;
cards;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
 
%let NumPts = 200;
data ScoreX(keep=x);  /* the scoring data; evaluate model on these values */
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;

The trick relies on two features of SAS software:

  • The first data set contains variables X and Y. The second contains only X. If the two data sets are concatenated, a missing value is assigned to Y for each value of X in the second data set.
  • When a regression procedure encounters an observation that has a missing value for the response variable, it does not use that observation to fit the model. However, provided that all of the explanatory variables are nonmissing, the procedure does compute the predicted value for that observation.

Consequently, the missing value trick is to concatenate the original data with the scoring data. If you call a regression procedure on the concatenated data, the original data are used to fit the model, but predicted values are generated for the scoring data, as follows:

/* The missing response trick.
   1. Concatenate the original data with the score data */
data B;
set A ScoreX;        /* y=. for all obs in ScoreX */
run;
 
/* 2. Run a regression. The model is fit to the original data. */ 
proc reg data=B plots=(NONE);
model y = x;
output out=Pred p=P;   /* predicted values for the scoring data */
quit;
 
proc print data=Pred(obs=12);
run;

The table shows that the scoring data, which begins in row 9, contains predicted values as desired.

The advantage of this technique is that it is easy to implement. A disadvantage is that this technique makes two extra copies of the scoring data, which might require a lot of disk space if the scoring data set is huge. A second disadvantage is that the trick increases the number of observations that the regression procedure must read. In the example, there are only eight observations in the original data, but the REG procedure has to read and write 208 observations.

There are other ways to evaluate a regression model, including using a SCORE statement, the SCORE procedure, and the relatively new PLM procedure. I will discuss these alternative methods in a future blog post.

I'm interested in hearing when you first learned the missing response trick. Who did you learn it from? Do you still use it, or do you now use a more modern technique? Leave a comment.


Fundamental theorems of mathematics and statistics

Although I currently work as a statistician, my original training was in mathematics. In many mathematical fields there is a result that is so profound that it earns the name "The Fundamental Theorem of [Topic Area]." A fundamental theorem is a deep (often surprising) result that connects two or more seemingly unrelated mathematical ideas.

It is interesting that statistical textbooks do not usually highlight a "fundamental theorem of statistics." In this article I briefly and informally discuss some of my favorite fundamental theorems in mathematics and cast my vote for the fundamental theorem of statistics.

The fundamental theorem of arithmetic

The fundamental theorem of arithmetic connects the natural numbers with primes. The theorem states that every integer greater than one can be represented uniquely as a product of primes.

This theorem connects something ordinary and common (the natural numbers) with something rare and unusual (primes). It is trivial to enumerate the natural numbers, but each natural number is "built" from prime numbers, which defy enumeration. The natural numbers are regularly spaced, but the gap between consecutive prime numbers is extremely variable. If p is a prime number, sometimes p+2 is also prime (the so-called twin primes), but sometimes there is a huge gap before the next prime.

The fundamental theorem of algebra

The fundamental theorem of algebra connects polynomials with their roots (or zeros). Along the way it informs us that the real numbers are not sufficient for solving algebraic equations, a fact known to every child who has pondered the solution to the equation x² = –1. The fundamental theorem of algebra tells us that we need complex numbers to be able to find all roots. The theorem states that every nonconstant polynomial of degree n has exactly n roots (counted with multiplicity) in the complex number system. Like the fundamental theorem of arithmetic, this is an "existence" theorem: it tells you the roots are there, but doesn't help you to find them.

The fundamental theorem of calculus

The fundamental theorem of calculus (FTC) connects derivatives and integrals. Derivatives tell us about the rate at which something changes; integrals tell us how to accumulate some quantity. That these should be related is not obvious, but the FTC says that the rate of change for a certain integral is given by the function whose values are being accumulated. Specifically, if f is any continuous function on the interval [a, b], then for every value of x in [a, b] you can compute the following function:

F(x) = ∫_a^x f(t) dt

The FTC states that F'(x) = f(x). That is, derivatives and integrals are inverse operations.

Unlike the previous theorems, the fundamental theorem of calculus provides a computational tool. It shows that you can solve integrals by constructing "antiderivatives."

The fundamental theorem of linear algebra

Not everyone knows about the fundamental theorem of linear algebra, but there is an excellent 1993 article by Gil Strang that describes its importance. For an m x n matrix A, the theorem relates the dimensions of the row space of A (R(A)) and the nullspace of A (N(A)). The result is that dim(R(A)) + dim(N(A)) = n.

The theorem also describes four important subspaces and describes the geometry of A and Aᵀ when thought of as linear transformations. The theorem shows that some subspaces are orthogonal to others. (Strang actually combines four theorems into his statement of the Fundamental Theorem, including a theorem that motivates the statistical practice of ordinary least squares.)

The fundamental theorem of statistics

Although most statistical textbooks do not single out a result as THE fundamental theorem of statistics, I can think of two results that could make a claim to the title. These results are based in probability theory, so perhaps they are more aptly named fundamental theorems of probability.

  • The Law of Large Numbers (LLN) provides the mathematical basis for understanding random events. The LLN says that if you repeat a trial many times, then the average of the observed values tends to be close to the expected value. (In general, the more trials you run, the better the estimates.) For example, if you toss a fair die many times and compute the average of the numbers that appear, the average should converge to 3.5, which is the expected value of a roll because (1+2+3+4+5+6)/6 = 3.5. (A quick simulation sketch follows this list.) The same theorem ensures that about one-sixth of the faces are 1s, one-sixth are 2s, and so forth.
  • The Central Limit theorem (CLT) states that the mean of a sample of size n is approximately normally distributed when n is large. Perhaps more importantly, the CLT provides the mean and the standard deviation of the sampling distribution in terms of the sample size, the population mean μ, and the population variance σ². Specifically, the sampling distribution of the mean is approximately normally distributed with mean μ and standard deviation σ/sqrt(n).
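Here is a quick DATA step sketch of the LLN in action: it simulates rolls of a fair die and tracks the running average, which settles near 3.5:

data LLN(keep=n avg);
call streaminit(123);
total = 0;
do n = 1 to 10000;
   roll = ceil(6*rand("Uniform"));    /* fair die roll: 1, 2, ..., 6 */
   total = total + roll;
   avg = total / n;                   /* running average of the rolls */
   if mod(n, 1000) = 0 then output;   /* watch avg approach 3.5 */
end;
run;

proc print data=LLN noobs;  run;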

Of these, the Central Limit theorem gets my vote for being the Fundamental Theorem of Statistics. The LLN is important, but hardly surprising. It is the basis for frequentist statistics and assures us that large random samples tend to reflect the population. In contrast, the CLT is surprising because the sampling distribution of the mean is approximately normal regardless of the distribution of the original data! As a bonus, the CLT can be used computationally. It forms the basis for many statistical tests by estimating the accuracy of a statistical estimate. Lastly, the CLT connects important concepts in statistics: means, variances, sample size, and accuracy of point estimates.

Do you have a favorite "Fundamental Theorem"? Do you marvel at an applied theorem such as the fundamental theorem of linear programming, or chuckle at a pseudo-theorem such as the fundamental theorem of software engineering? Share your thoughts in the comments.


Define functions with default parameter values in SAS/IML

One of my favorite new features of SAS/IML 12.1 enables you to define functions that contain default values for parameters. This is extremely useful when you want to write a function that has optional arguments.

Example: Centering a data vector

It is simple to specify a SAS/IML module with a default parameter value. Suppose that you want to write a module that centers a vector. By default, the module should center the vector so that the new mean of the data is 0. However, the module should also support centering the vector at an arbitrary value. With default arguments in SAS/IML 12.1, you can write the following module:

proc iml;
start Center(x, a=0);
   return ( x - mean(x) + a );
finish;

The Center module has two arguments, but the equal sign in the second argument indicates that the argument is optional and that it receives a default value of 0. (SAS programmers who are familiar with the macro language will find this syntax familiar.) Using default values enables you to call the function with fewer parameters in the usual (default) case, but also enables you to call the function with more generality.

In this example, if you call the function by using one argument, the local variable a is set to zero. You can also specify a value for a when you call the function, as follows:

x = {-1,0,1,4};      /* data */
c0 = Center(x);      /* center x at 0 */
c10 = Center(x, 10); /* center x at 10 */
print c0 c10;

The c0 vector is centered at 0, whereas the c10 vector is centered at 10. The Center module has a simpler calling syntax for the usual case of centering a vector at 0.

Example: Testing whether two vectors are equal to within a certain precision

I have previously blogged about the dangers of testing floating-point values for equality. In a numerical program, you should avoid testing whether a vector x is (exactly) equal to another vector y. Even though two vectors should be equal (if the computations were performed in exact arithmetic), finite-precision computations can lead to small differences, which some people call rounding errors. For example, in exact arithmetic the square of the square root of x is exactly equal to x for any nonnegative value. However, the following statements show that this equality might not hold in finite precision arithmetic:

x = {4 5 2000 12345 654321};  /* some numbers */
y = sqrt(x)#sqrt(x);          /* in exact arithmetic, y=x */
diff = x-y;                   /* the diff vector is not zero */
print diff;

As you can see, some elements of x and y are not equal, which leads to a nonzero difference vector. Consequently, it is a bad idea to use the equal operator (=) to test whether the vectors are equal:

eq = all(x=y);    /* Bad idea: Don't test for exact comparison */
print eq;

To understand more about floating-point arithmetic, see the paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic." A better way to compare numerical values in SAS/IML is to test whether two numbers are within some specified tolerance of each other, like you can do with PROC COMPARE. The following module computes whether the absolute value of the difference between corresponding elements is less than some criterion. By default, the criterion is set to 1e-6:

/* return 1 if x[i] is within eps of y[i] for all i */
start IsEqual(x, y, eps=1e-6);
   return( all( abs(x-y)<=eps) );  
finish;
 
eq6  = IsEqual(x, y);        /* Default: compare within 1e-6 */
eq12 = IsEqual(x, y, 1e-12); /* compare within 1e-12 */
print eq6 eq12;

The vectors are judged to be equal when the default criterion is used. If you tighten the criterion, the vectors are judged to be unequal.

Being able to specify default parameter values is very useful to a programmer. Do you have an example for which this new feature of SAS/IML 12.1 will be useful? Describe your application in the comments.


A simple way to find the root of a function of one variable

Finding the root (or zero) of a function is an important computational task because it enables you to solve nonlinear equations. I have previously blogged about using Newton's method to find a root for a function of several variables. I have also blogged about how to use the bisection method to find the zeros of a univariate function.

As of SAS/IML 12.1, there is an easy way to find the roots of a function of one variable. The FROOT function solves for a root on a finite interval by using Brent's method, which uses a combination of bisection, linear interpolation, and quadratic interpolation to converge to a root. Unlike Newton's method, Brent's method does not require that you specify a derivative for the function. All you need to provide is a module that evaluates the function, f, and an interval [a, b] such that f(a) and f(b) have different signs.

If there is a root in the interval, Brent's method is guaranteed to find it.

As an example, the following SAS/IML module defines a function that I investigated in a previous blog post:

proc iml;
start Func(x);
   return( exp(-x##2) - x##3 + 5#x +1 );
finish;

The image at the beginning of this article shows the graph of the function on the interval [–5, 5]. The function has three roots. You can use the graph to estimate intervals on which the function contains a root. The FROOT function enables you to find multiple zeros with a single call, as follows:

intervals = {-4   -1.5,         /* 1st interval [-4, -1.5] */
             -1.5  1  ,         /* 2nd interval [-1.5 1]   */
              1    4  };        /* 3rd interval [1, 4]     */
z = froot("Func", intervals);
print z;

The vector z contains three elements. The first element is the root in the first specified interval, the second element is the root in the second interval, and so forth. If you specify an interval on which the function does not have a root, then the FROOT function returns a missing value.
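For example, the Func module is negative at both x=5 and x=10, so the following statements (which continue the PROC IML session) return a missing value:

z2 = froot("Func", {5 10});   /* f(5) and f(10) are both negative: no sign change */
print z2;                     /* z2 is a missing value */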

That's all there is to it. So next time you need to solve a nonlinear equation of one variable, remember that the FROOT function in the SAS/IML language makes the task simple.


Sample without replacement in SAS

Last week I showed three ways to sample with replacement in SAS. You can use the SAMPLE function in SAS/IML 12.1 to sample from a finite set or you can use the DATA step or PROC SURVEYSELECT to extract a random sample from a SAS data set. Sampling without replacement is similar. This article describes how to use the SAS/IML SAMPLE function or the SURVEYSELECT procedure.

Simulate dealing cards

When I wrote about how to generate permutations in SAS, I used the example of dealing cards from a standard 52-card deck. The following SAS/IML statements create a 52-element vector with values 2H, 3H, ..., KS, AS, where 'H' indicates the heart suit, 'D' indicates diamonds, 'C' indicates clubs, and 'S' indicates spades:

proc iml;
/* create a deck of 52 playing cards */
suits52 = rowvec( repeat({H, D, C, S},1,13) );
vals = char(2:10,2) || {J Q K A};
vals52 = repeat( right(vals), 1, 4 );
Cards = vals52 + suits52;
 
/* choose 20 cards without replacement from deck */
call randseed(293053001);
deal = sample(Cards, 20, "WOR");  /* sample 20 cards without replacement */

The third argument to the SAMPLE function is the value "WOR", which stands for "without replacement." With this option, the SAMPLE function returns 20 cards from the deck such that no card appears more than once in the sample. If there are four card players who are playing poker and each gets five cards, you can use the SHAPE function to reshape the 20 cards into a matrix such that each column indicates a player's poker hand:

PokerHand = shape(deal, 0, 4);   /* reshape vector into 4 column matrix */
print PokerHand[c=("Player1":"Player4")];

Let's see what poker hands these players were dealt. The first player has a pair of 2s. The second player has a pair of queens. The third player has three kings, and the fourth player has a flush! That was a heck of a deal! (I'll leave it to the sharp-eyed reader to figure out how I "cheated" in order to simulate such an improbable "random" sample. Extra credit if you link to a blog post of mine in which I explain the subterfuge.)

The SAMPLE function provides a second way to sample without replacement. If the third argument is "NoReplace", then a faster algorithm is used to extract a sample. However, the sample is in the same order as the original elements, which might not be acceptable. For the poker example, the "WOR" option enables you to simulate a deal. If you use the "NoReplace" option, then you should first use the RANPERM function to shuffle the deck. Of course, if you only care about the sample as a set rather than as a sequence, then using the faster algorithm makes sense.
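For example, the following sketch (continuing the PROC IML session above) shuffles the deck and then uses the faster algorithm:

perm = ranperm(ncol(Cards));                /* random order for the 52 cards */
shuffled = Cards[ ,perm];                   /* shuffle the deck */
deal2 = sample(shuffled, 20, "NoReplace");  /* faster; keeps the shuffled order */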

One more awesome feature of the SAMPLE function: it enables you to sample with unequal probabilities by adding a fourth argument to the function call.
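For example, here is a small sketch with made-up probabilities; the fourth argument gives the probability of selecting each element of the first argument:

x = {A B C};                       /* candidate elements */
p = {0.6 0.3 0.1};                 /* made-up sampling probabilities */
s = sample(x, 10, "Replace", p);   /* 'A' is chosen about 60% of the time */
print s;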

Sampling without replacement by using the SURVEYSELECT procedure

As mentioned above, some algorithms generate a sample whose elements are in the same order as the original data. This is the case with the SURVEYSELECT procedure when you use the METHOD=SRS option. Suppose that you write the 52 cards to a SAS data set. You can use the SURVEYSELECT procedure to extract 20 cards without replacement, as follows:

create Deck var {"Cards"}; append; close Deck; /* create data set */
quit;
 
proc surveyselect data=Deck out=Poker seed=1                   
     method=srs       /* sample w/o replacement */
     sampsize=20;     /* number of observations in sample */
run;
 
proc print data=Poker(obs=8); run;

The output shows the first eight observations in the sample. You can see that the hearts appear first, followed by the diamonds, followed by the clubs, and that within each suit the values of the cards are in their original order. If you want the data in a random order, imitate the DATA step code in the SAS Knowledge Base article "Simple Random Sample without Replacement."


Sample with replacement in SAS

Randomly choosing a subset of elements is a fundamental operation in statistics and probability. Simple random sampling with replacement is used in bootstrap methods (where the technique is called resampling), permutation tests, and simulation.

Last week I showed how to use the SAMPLE function in SAS/IML software to sample with replacement from a finite set of data. Because not everyone is a SAS/IML programmer, I want to point out two other ways to sample (with replacement) observations in a SAS data set.

  • The DATA Step: You can use the POINT= option in the SET statement to randomly select observations from a SAS data set. For many data sets you can use the SASFILE statement to read the entire data set into memory, which improves the performance of random access.
  • The SURVEYSELECT Procedure: You can use the SURVEYSELECT procedure to randomly select observations according to several sampling schemes. Again, you can use the SASFILE statement to improve performance.

The material in this article is taken from Chapter 15, "Resampling and Bootstrap Methods," of Simulating Data with SAS.

The DATA step

The following DATA step randomly selects five observations from the Sashelp.Cars data set:

sasfile Sashelp.Cars load;              /* 1. Load data set into memory */
data Sample(drop=i);
call streaminit(1);
do i = 1 to 5;         
   p = ceil(NObs * rand("Uniform"));    /* random integer 1-NObs */
   set Sashelp.Cars nobs=NObs point=p;  /* 2, 3. NOBS= and POINT= options */
   output;
end;
STOP;                                   /* 4. Use the STOP stmt */
run;
sasfile Sashelp.Cars close;

A few statements in this DATA step require additional explanation. They are indicated by numbers inside of comments:

  1. Provided that the data set is not too large, use the SASFILE statement to load the data into memory, which speeds up random access.
  2. The NOBS= option stores the number of observations in the Sashelp.Cars data into the NObs variable before the DATA step runs. Consequently, the value of NObs is available throughout the DATA step, even on statements that execute prior to the SET statement.
  3. The POINT= option is used to read the (randomly chosen) observation.
  4. The STOP statement must be used to end the DATA step processing when you use the POINT= option, because SAS never encounters the end-of-file indicator during random access.

For the example, the selected observation numbers are 379, 417, 218, 380, and 296. You can print the random sample to see which observations were selected:

proc print data=Sample noobs;
   var Make Model MPG_City Length Weight;
run;

The SURVEYSELECT procedure

The main SAS procedure for (re)sampling is called the SURVEYSELECT procedure. The name is a bit unfortunate because statisticians and programmers who are new to SAS might browse the documentation and completely miss the relevance of this procedure. (To me, it is "PROC RESAMPLE.") The SURVEYSELECT procedure has many methods for sampling, but the method for sampling with replacement is known as unrestricted random sampling (URS). The following call creates an output data set that contains five observations that are sampled (with replacement) from the Sashelp.Cars data:

proc surveyselect data=Sashelp.Cars out=Sample2 NOPRINT 
     seed=1                                         /* 1 */
     method=urs sampsize=5                          /* 2 */
     outhits;                                       /* 3 */
run;

The call to PROC SURVEYSELECT has several options, which are indicated by numbers in the comments:

  1. The SEED= option specifies the seed value for random number generation. If you specify a zero seed, then omit the NOPRINT option so that the value of the chosen seed appears in the procedure output.
  2. The METHOD=URS option specifies unrestricted random sampling, which means sampling with replacement and with equal probability. The SAMPSIZE= option specifies the number of observations to select, or you can use the SAMPRATE= option to specify a proportion of observations.
  3. The OUTHITS option specifies that the output data set contains five observations, even if a record is selected multiple times. If you omit the OUTHITS option, then the output data set might have fewer observations, and the NumberHits variable contains the number of times that each record was selected.

If you are using SAS/STAT 12.1 or later, the output data set contains exactly the same observations as for the DATA step example, because PROC SURVEYSELECT uses the same random number generator (RNG) as the RAND function. Prior to SAS/STAT 12.1, PROC SURVEYSELECT used the older RNG that is used by the RANUNI function.

In summary, if you need to sample observations from a SAS data set, you can implement a simple sampling scheme in the DATA step or you can use PROC SURVEYSELECT. I recommend PROC SURVEYSELECT because the procedure makes it clearer what sampling method is being used and because the procedure supports other, more complex, sampling schemes that are also useful.


Ulam spirals: Visualizing properties of prime numbers with SAS

Prime numbers are strange beasts. They exhibit properties of both randomness and regularity. Recently I watched an excellent nine-minute video on the Numberphile video blog that shows that if you write the natural numbers in a spiral pattern (called the Ulam spiral), then there are certain lines in the pattern that are very likely to contain prime numbers.

The image to the left shows the first 22,500 natural numbers arranged in the Ulam spiral pattern within a 150 x 150 grid. Cells that contain prime numbers are colored black. Cells that contain composite numbers are colored white. You can see certain diagonal lines with slope ±1 that contain a large number of prime numbers. There are conjectures in number theory that explain why certain lines along the Ulam spiral have a greater density of prime numbers than other lines. (The diagonal lines in the spiral correspond to quadratic equations.)

I don't know enough about prime numbers to blog about the mathematical properties of the Ulam spiral, but as soon as I saw the Ulam spiral explained, I knew that I wanted to generate it in SAS. The Ulam spiral packs the natural numbers into a square matrix in a certain order. This post describes how to construct the Ulam spiral in the SAS/IML matrix language.

The Ulam spiral construction

The Ulam spiral is simple to describe. On a gridded piece of paper, write down the number 1. Then write successive integers as you spiral out in a counter-clockwise fashion, as shown in the accompanying figure.

Although you can stop writing numbers at any time, for the purpose of this post let's assume you want to fill an N x N array with numbers. Notice that when N is even, the last number in the array (N² = 4, 16, 36, ...) appears in the upper left corner of the array, whereas for odd N, the last number (9, 25, 49, ...) appears in the lower right corner. I found this asymmetry hard to deal with, so I created an algorithm that fills the N x N matrix so that the N² term is always in the upper left. For odd values of N, the algorithm rotates the matrix after inserting the elements. I've previously shown that it is easy to write SAS/IML statements that rotate elements in a matrix.

My algorithm constructs the spiral iteratively. Notice that if you have constructed the spiral correctly for an (N–2) x (N–2) array, you can construct the N x N array by adding two vertical columns (first and last) and two horizontal rows (first and last). I call the outer rows and columns a "frame" because they remind me of a picture frame. Given a value for N, you can figure out the starting and ending values of each row and column of the frame. You can use the SAS/IML index creation operator to create each row and column, as shown in the following program:

proc iml;
start SpiralFrame(n);
   if n=1 then return(1);
   if n=2 then return({4 3, 1 2});
   if n=3 then return({9 8 7, 2 1 6, 3 4 5});
   X = j(n,n,.);
   /* top of frame. 's' means 'start'. 'e' means 'end' */
   s = n##2; e = s - n + 2;  X[1,1:n-1] = s:e;
   /* right side of frame */
   s = e - 1; e = s - n + 2; X[1:n-1,n] = (s:e)`;
   /* bottom of frame */
   s = e - 1; e = s - n + 2; X[n,n:2] = s:e;
   /* left side of frame */
   s = e - 1; e = s - n + 2; X[n:2,1] = (s:e)`;
   return( X );
finish;
 
/* test the frame construction */
M2 = SpiralFrame(2);
M4 = SpiralFrame(4);
M6 = SpiralFrame(6);
print M2, M4, M6;

You can see that the M4 matrix fits inside the M6 frame. Similarly, the M2 matrix fits inside the M4 frame.

After writing the code that generates the frames, the rest of the construction is easy. Start by creating the frame of size N. Decrease N by 2 and iteratively create a frame of size N, taking care to insert the (N–2) x (N–2) array into the interior of the existing N x N array. After the array is filled, rotate the array by 180 degrees if N is odd. The following SAS/IML statements implement this algorithm:

start Rot180(m);
   return( m[nrow(m):1,ncol(m):1] );
finish;
 
/* Create N x N integer matrix with elements arranged in an Ulam spiral */
start UlamSpiral(n);
   X = SpiralFrame(n);             /* create outermost frame */
   k = 2;
   do i = n-2 to 2 by -2;          /* iteratively create smaller frames */
      r = k:n-k+1;                 /* rows (and cols) to insert smaller frame */
      X[r, r] = SpiralFrame(i);    /* insert smaller frame */
      k = k+1;
   end;
   if mod(n,2)=1 then return(Rot180(X));
   return(X);
finish;
 
U10 = UlamSpiral(10);              /* test program by creating 10 x 10 matrix */
print U10[f=3.];

You might not have much need to generate Ulam spirals in your work, but this exercise demonstrates several important principles of SAS/IML programming:

  • It is often convenient to calculate the first and last element of an array and to use the index creation operator (:) to generate the vector.
  • Notice that I generated the largest frame first and inserted the smaller frames inside of it. That is more efficient than generating the smaller frames and then using concatenation to grow the matrix.
  • In the same way, sometimes an algorithm is simpler or more efficient to implement if your DO loops run in reverse order.

The heat map at the beginning of this article is created by using the new HEATMAPDISC subroutine in SAS/IML 13.1. I will blog about heat maps in a future post. If you have SAS/IML 13.1 and want to play with Ulam spirals yourself, you can download the SAS code used in this blog post.
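In the meantime, here is a rough sketch of how the image might be reproduced; the crude trial-division prime test is my own addition, and the COLORRAMP= and TITLE= options reflect my reading of the HEATMAPDISC syntax, so adjust as needed:

U = UlamSpiral(50);                      /* a smaller version of the 150 x 150 image */
prime = j(nrow(U), ncol(U), 0);
do i = 1 to nrow(U);
   do j = 1 to ncol(U);
      n = U[i,j];
      isP = (n > 1);                     /* crude trial-division primality test */
      k = 2;
      do while (isP & k*k <= n);
         if mod(n,k) = 0 then isP = 0;
         k = k + 1;
      end;
      prime[i,j] = isP;
   end;
end;
call heatmapdisc(prime) colorramp={white black}
                        title="Primes in the Ulam spiral";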
