Define functions with optional parameters in SAS/IML

Last month I blogged about defining SAS/IML functions that have default parameter values. This language feature, which was introduced in SAS/IML 12.1, enables you to skip arguments when you call a user-defined function.

The same technique enables you to define optional parameters. Inside the function, you can determine whether the optional parameter was specified by using the ISSKIPPED function.

Print the first few rows of a matrix

For example, suppose you want to implement a module that prints the first few rows of a matrix. By default, you want to print five rows. However, you also want to support printing a specified number of rows. You can implement a module that does this task by using the new 12.1 syntax for specifying default parameter values. The UNIX operating system has a command that called head that prints the first few rows of a file, so I'll use the same name for this module. A first attempt might be as follows:

proc iml;
start head1(x, n=min(5,nrow(x)));  /* 1st attempt: one arg with default value */
   print (x[1:n,]);
x = shape(1:36, 9);    /* test matrix is 9 x 4 */
run head1(x);          /* print first 5 rows (default) */
run head1(x, 3);       /* print first 3 rows */

The HEAD1 module works as advertise: it prints five rows by default, but also enables you to specify the number of rows. Notice the way that I specified the default value for the second argument (n). In the START statement, I put an equal sign (=) after the name of the argument and then I specified an expression that determines the default value. For this example, the expression contains a call to the MIN function in order to support matrices that have fewer than five rows. In other words, the default value for n depends on the size of x, the first argument. The value for n is determined as follows:

  • If the module is called with two arguments, then n is assigned the value of the second specified argument. The expression that computes the default value is ignored.
  • If the module is called with one argument, then n is assigned a default value, which is determined by evaluating the expression min(5, nrow(x)). When x has five or more rows, then n = 5. When x has fewer than five rows, n is assigned to be the number of rows in x.

Specifying a complicated default value

In this example, the default value of the n argument was moderately complicated, but I could still specify it by using a single expression. If the default value requires executing several statements, then you can't specify the value on the START statement. Instead, use an equal sign to specify that the argument is optional, and use the ISSKIPPED function to check whether the argument was specified or whether it should receive a default value, as follows:

start head2(x, n=);  /* 2nd attempt: one optional arg */
   if isskipped(n) then do;
      /* use as many statements as necessary to assign a default value */
      n = min(5,nrow(x));
   print (x[1:n,]);

Optional arguments without default values

Suppose that you want to modify the HEAD1 module so that it supports printing a matrix by using a specified SAS format. By default, you want to print the first few matrix rows by using the default SAS/IML format. However, you want to optionally specify a format.

In this scenario, there is no appropriate default value for the format. To specify that an argument is optional (but does not receive a default value), place an equal sign (=) after the argument in the START statement. You can use the ISSKIPPED function to detect whether the optional argument was specified, as follows:

/* n has a default value; format is optional */
start head(x, n=min(5,nrow(x)), format=);
   labl = "head (rows=" + strip(char(n)) + ")";  /* construct label */
   if isskipped(format) then
      print (x[1:n,])[label=labl];  /* print with default format */
   else do;
      y = putn(x[1:n,], format);    /* apply specified format */
      print y[label=labl];          /* print formatted matrix */
run head(x) format="4.2";           /* specify a format */

Inside the HEAD module, the ISSKIPPED function determines at run time whether the third argument was specified. If not, then the PRINT statement prints the first few rows of the matrix. (The default SAS/IML format is used.) If the third argument is specified, then the PUTN statement applies the specified format to the first few rows and prints the result.

Optional arguments can be a powerful tool. Default arguments make it easier to call a function because the user needs to specify only a small number of the possible options to the function. As this example shows, default values can depend on run time values of other arguments. Default arguments are one of my favorite features in SAS/IML 12.1. I hope that you find them as useful as I do!

Post a Comment

Where's Rick at SAS Global Forum 2014?

Once again I'll be at SAS Global Forum this year. The 2014 location is Washington, D. C., so I am looking forward to greeting many friends in the government and consulting sectors. I always enjoy talking with SAS customers about statistics, simulations, matrix computations, and the SAS/IML product, so here's my schedule at SAS Global Forum. Please say hello!

Sunday, March 23, 2014

I have no scheduled presentations. I will be at the Opening Session from 7:00 – 8:30 p.m., followed by the Get-Acquainted Reception from 8:30 – 10:30 p.m.

Monday, March 24

I am at the SAS/IML booth in the SAS Support and Demo Area for most of the day, so this is a great time to talk about what you are doing with SAS software or features that you would like to see added to a future release. My hours in the Demo Area are:

  • 10:00 a.m. – 11:30 a.m.: At the SAS/IML booth
  • 11:30 a.m. – 11:50 a.m.: Super Demo: "Ten Tips for Simulating Data with SAS"
  • 2:00 p.m. – 4:00 p.m.: At the SAS/IML booth, except during the next Super Demo
  • 2:30 p.m. – 2:50 p.m.: Super Demo: "Using Heat Maps to Visualize Matrices"
  • 6:00 p.m. – 7:30 p.m.: Grab a drink and some hors d'oeuvres and chat with me at the SAS/IML booth during the evening mixer!

Tuesday, March 25

I will be attending talks during the morning, but send me an email if you want to schedule a time to meet. Or grab me at one of the coffee breaks!

My only scheduled activity on Tuesday is that I am appearing on the SAS Tech Talks show. Chris Hemedinger will interview me about calling R from SAS during the 2:00 p.m.. show. This interview can be watched live at, or you can watch the show after the broadcast.

In the evening, please join me and other SAS professionals for a meetup session 6:00 p.m. – 8:00 p.m. in Chesapeake Room A/B/C. This meetup will include leading members of the SAS online community, including representatives from SAS-L,, the SAS Support Communities (a.k.a., "Discussion Forums"), and other online networking sites for SAS professionals. Besides providing an opportunity to network, the meetup will include the presentation of the annual SAS-L awards.

Wednesday, March 26

I will be giving a Hands-On Workshop on "Getting Started with the SAS/IML Language" from 11:00 a.m. – 12:40 p.m. There will be a room full of computers and you will have an opportunity to learn some SAS/IML programming techniques. If you have always wanted to try SAS/IML, but it is not licensed at your site, this is your chance! It is free, so please join me before you head home.

So whether you want to discuss statistical programming, comment on a blog topic, or tell me about a cool project that you are working on, please introduce yourself at the conference. I look forward to meeting old friends... and making some new ones.

Post a Comment

Finding elements in one vector that are not in another vector

The SAS/IML language has several functions for finding the unions, intersections, and differences between sets. In fact, two of my favorite utility functions are the UNIQUE function, which returns the unique elements in a matrix, and the SETDIF function, which returns the elements that are in one vector and not in another. Both functions return the elements in an ordered vector.

However, a recent question on the SAS/IML Support Community asked how to find the elements in one vector that are not in another while preserving the original order of the elements. As an example, consider the following vectors:

proc iml;
A = {6 3 4 1 2 5 1 2 4};
B = {2 3};

The problem is how to find the elements in A that are not in B. If possible, we'd like to avoid using DO loops to iterate over the elements.

If you use the SETDIF function, the duplicate values in A are removed and the result is sorted:

C = setdif(A,B);
print C;

The output of the SETDIF function shows that the elements 1, 4, 5, and 6 are contained in A, but not in B. Although A contains two 1s and two 4s, that information is not included in the result.

If you need to preserve the original order of the data or you need to retain duplicate values, then you should use the ELEMENT function, which was introduced in SAS/IML 9.3. The ELEMENT function returns a matrix that is the same dimensions as A. The returned matrix is an "indicator matrix": a 1 in position (i, j) means that A[i, j] is an element of B, as shown by the following statements:

E = element(A,B);
print (A // E)[rowname={"A","in B"}];

The output shows that the second, fifth, and eighth elements of A are elements of B. If you want to preserve the original order of elements (and preserve duplicates), you can use the LOC function and the REMOVE function to eliminate those elements, as follows:

D = remove(A,loc(element(A,B)));
print D;
/* you can also use subscripting:  D = A[ loc(^element(A,B)) ]; */

The result lists the elements of A that are not in B, but preserves the order of the elements and any duplicate elements in the list.

Post a Comment

For pi day: A continued fraction expansion of pi

Many geeky mathematical people celebrate "pi day" on March 14, because the date is written 3/14 in the US, which is evocative of the decimal representation of π = 3.14....

Most people are familiar with the decimal representation of π. The media occasionally reports on a new computational tour-de-force that approximates π to unimaginable precision. The decimal value of π has been computed to more than 10 trillion digits!

Using the decimal approximation of π is ubiquitous in mathematics, but in elementary school I learned another approximation: π ≈ 22/7. As I got older, I discovered that there were other rational approximations of π, such as 355/113.

Continued fractions

To celebrate π day, I am going to use SAS to estimate π by using its simple continued fraction representation. Every real number x can be represented as a continued fraction:

In this continued fraction, the ai are positive integers. Because all of the fractions have 1 in the numerator, the continued fraction can be compactly represented by specifying only the integers: x = [a0; a1, a2, a3, ...]. Every rational number is represented by a finite sequence; every irrational number requires an infinite sequence. Notice that the decimal representation of a number does NOT have this nice property: 1/3 is rational but its decimal representation is infinite.

The continued fraction expansion of pi

The continued fraction expansion of π is known to more than 15 billion values. The first fifteen digits are
π ≈ [3; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1]
What does this mean? Well, the sequence [3; 7] corresponds to the fraction 3 + 1/7 = 22/7. Similarly, the sequence [3; 7, 15, 1] corresponds to the fraction 3 + 1/(7 + 1/(15 + 1/1)) = 355/113.

Trying to compute these fractions by hand is tedious. Why not write a SAS program to evaluate the continued fraction sequence? As the examples show, the computation starts at the end of the sequence. The computation begins by taking the reciprocal of the last number in the sequence. That values is added to the second-to-last number, and the sum is inverted. This process continues until you reach the first value. You can do this computation in the SAS DATA step (post your code in the comments!), but I have written a SAS/IML function to evaluate an arbitrary continued fraction representation. Notice that in SAS you can call the CONSTANT function to obtain a numerical approximation of π and other mathematical constants, which is useful for checking the results:

proc iml;
start ContinuedFraction(f);
   s = f[nrow(f)];              /* last value */
   do i = nrow(f)-1 to 1 by -1; /* traverse sequence in reverse order */
      s = f[i] + 1/s;           
   return( s );
/* test the function by using the continued fraction expansion for pi */
f = {3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1};
s = ContinuedFraction(f);
pi = constant("pi");    /* call the CONSTANT function to approximate pi */
print s[format=16.14], pi[format=16.14];

As you can see, the first fifteen terms of the continued fraction expansion of π are sufficient to reproduce the first 15 digits of the decimal representation.

Other continued fractions

Here are a few continued fraction expansions for some other irrational numbers:

  • The continued fraction expansion for e ≈ 2.718281828459 shows a fascinating regular pattern: two 1s followed by an even number.
    e = [2; 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10,....]
  • The continued fraction expansion for the square root of 2 ≈ 1.414213562373095 contains a repeating sequence of 2s:
    sqrt2 = [1; 2, 2, 2, 2, ...]
  • The continued fraction expansion for the golden ratio φ ≈ 1.61803398874989 is a repeating sequence of 1s:
    phi = [1; 1, 1, 1, 1, 1, ...]
You can use these sequences to estimate the corresponding decimal representation:
e     = ContinuedFraction({2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10});
sqrt2 = ContinuedFraction({1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2});
phi   = ContinuedFraction({1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1});
print e[format=16.14], sqrt2[format=16.14], phi[format=16.14];

The results contain less than 15-digit decimal precision. In order to obtain more precision, you would need to use more terms in the continued fraction representation.

By the way, next year's Pi Day should be special: celebrate on 3/14/15 at 9:26:53 for maximal enjoyment!

Post a Comment

Optimizing a function of an integral

Last week I showed how to find parameters that maximize the integral of a certain probability density function (PDF). Because the function was a PDF, I could evaluate the integral by calling the CDF function in SAS. (Recall that the cumulative distribution function (CDF) is the integral of a PDF.)

You can also use SAS to optimize an integral function when the integrand is not a PDF. This follow-up article shows how to optimize the integral of an arbitrary function by calling the QUAD subroutine in SAS/IML software to evaluate the integral.

Defining an integral

This example finds parameters (a,b) such that the integral of the f(x) = 3/4 – xx2 is the maximum on the interval [a, b]. That is, the problem is to maximize

subject to the constraint that a < b.

The graph of the integrand is shown at the left. The parameters determine the domain of integration. The graph shows the integral (area under the curve) for the values (a,b)=(–2,1). From the graph, you can see that the optimal parameters are (a,b) =(–1.5, 0.5), which integrates the quadratic function over the interval on which it is positive. (Because the function is negative outside of (–1.5, 0.5), we don't want to accumulate "negative area.")

To solve this problem in the SAS/IML language, you need to define the integrand and call the QUAD function to evaluate the integral, as follows:

proc iml;
/* Define the integrand, which will be called by QUAD subroutine.
   For this problem, the integrand does not depend on the parameters,
   but for other problems, it might. */
start Func(x);         /* use GLOBAL stmt if Func depends on parameters */
   return( 3/4 - x - x##2 );
/* Define a function that computes the integral. This will be called by an
   NLP routine, so the function must depend on the parameters. */
start Integral(parms); /* use GLOBAL stmt if Func depends on parameters */
   /* the parameters are:  a = parms[1]; b = parms[2]; */
   call quad(val, "Func", parms);    /* evaluate integral on [a, b] */
   return( val );

The module named Integral defines the objective function, which will be maximized by calling one of the NLP functions in SAS/IML software. However, you should NEVER start optimizing until you have verified that the objective function is defined correctly! The following statements test the objective function by evaluating the integral on three different domains. Because this quadratic function has a simple antiderivative, you can verify analytically that the Integral function is performing the integration correctly:

/* Check and debug objective function BEFORE using it in NLP! */
val1 = Integral({-2 1});
val2 = Integral({-1.5 0.5});
val3 = Integral({-0.5 1});
print val1 val2 val3;

Defining the gradient vector of derivatives

One of my favorite nonlinear optimization methods is the Newton-Raphson algorithm, which is implemented in SAS/IML as the NLPNRA subroutine. This algorithm uses derivatives of the objective function to help find the maximal value of the function. The algorithm will compute finite-difference derivatives if you do not supply an analytic derivative function, but in this example it is trivial to define the gradient vector, which is the vector of derivatives of the objective function with respect to the parameters. Because the parameters are the upper and lower limits of integration, the fundamental theorem of calculus tells us that the derivatives for this problem are found by evaluating the integrand at the parameter values:

/* Optional: Define derivative w.r.t. parameters. By the FTC, the derivative 
   is found by evaluating the integrand at the parameters. */
start Deriv(parms);
   dfda = -Func( parms[1] );  /* deriv at lower limit of integration */
   dfdb =  Func( parms[2] );  /* deriv at upper limit of integration */
   return( dfda || dfdb );

Maximizing the objective function

This maximization problem is subject to the constraint that a < b. You can specify boundary and linear constraints by using the BLC= option to the NLP subroutines. You can use the following matrix to specify the constraint to an NLP function:

/* Set up linear constraint a < b, which is also a - b < 0 */
/*     a  b OP val */
con = {.  .  .   .,  /* no constraint on lower bound of a or b */
       .  .  .   .,  /* no constraint on upper bound of a or b */
       1 -1 -1  0};  /* 1*a - 1*b <= 0      */

Finally, you can specify an initial guess and call the NLPNRA subroutine to find the parameter values that maximize the integral:

p = {-1 1};        /* initial guess for solution */
opt = {1,          /* find maximum of function   */
       2};         /* print a little bit of output */
call nlpnra(rc, result, "Integral", p, opt) blc=con jac="Deriv";
print result[c={"a" "b"}];

The result agrees with our intuition. The Newton-Raphson algorithm determined that the interval(a,b) = (–1.5, 0.5) maximizes the integral of the quadratic function. The maximal value of the integral is 4/3.

In summary, this article shows how to do the following:

  • Optimize a function that depends on one or more parameters. The "twist" is that the objective function requires evaluating an integral.
  • Use the QUAD subroutine to numerically compute the integral an arbitrary function.
  • Specify linear constraints in an optimization problem.

Post a Comment

How to get started with SAS: Free videos for beginners

On most Mondays I blog about a function, programming technique, or resource that is useful for programmers who are getting started with SAS software. Recently I learned that my colleagues in the SAS education division have been hard at work developing a series of short videos that explain basic tasks to beginners. They have developed more than 20 videos that teach the basics of SAS programming and statistics, in addition to dozens of other videos that describe frequently encountered tasks in SAS Enterprise Guide, Enterprise Miner, and other SAS products.

Topics for beginners include:

  • Creating simple statistical graphs
  • Analyzing frequencies of a categorical variable
  • Analyzing the distribution of a continuous variable
  • Analyzing correlations between variables
  • Fitting simple regression models

Do you know a friend or colleague that is new to using SAS and who learns best by watching videos? Direct them to these free SAS tutorials that teach the basics of SAS programming and analytics.

Post a Comment

Optimizing a function that evaluates an integral

SAS programmers use the SAS/IML language for many different tasks. One important task is computing an integral. Another is optimizing functions, such as maximizing a likelihood function to find parameters that best fit a set of data.

Last week I saw an interesting problem that combines these two important tasks. The problem involved optimizing a function, which in turn required evaluating an integral. There are two kinds of integrals to consider:

  1. The cumulative distribution function (CDF) is the integral of a probability density. In many applications you can numerically integrate a probability density by calling the CDF function in SAS.
  2. If the integral is not a CDF that is supported by SAS, you can use the QUAD subroutine to integrate an arbitrary user-defined function.
This post demonstrates how to optimize a function that evaluates a CDF to compute a probability. A future blog post will present an example of optimizing a function that calls the QUAD subroutine to evaluate an integral.

Computing probability by integrating a density function

You can compute a probability by integrating a probability density function (PDF). However, for most common probability distributions, you don't need to compute the integral yourself because SAS provides the CDF function. The CDF is the anti-derivative of the PDF. For example, to integrate the standard normal PDF on the interval [-1,1], you can call the CDF function as follows:

prob = cdf("Normal", 1) - cdf("Normal", -1);  /* integral of phi(x) on [-1, 1] */

The answer is 0.68, which is the probability that a random normal variable falls within one standard deviation of the mean.

Distributions often contain parameters, and a common task in statistical programming is finding parameter values that satisfy some condition (a root-finding problem) or that optimize some criterion. This article describes how to optimize an integral that depends on two parameters.

The Beta(a,b) distribution

As an example, consider the Beta(a,b) distribution, which is a function of two shape parameters. Suppose that you want to find the parameter values a > 0 and 1 ≤ b ≤ 3 that maximize the integral of the Beta(a,b) PDF over the interval [0.6, 0.8].

Several Beta(a,b) curves are shown in the figure to the left. (Click to enlarge.) The Beta(4,2) curve (teal color) and the Beta(6,2) curve (magenta color) both have large areas on the interval [0.6, 0.8], so those parameter values would be good starting points for a search that attempts to find an optimal parameter value.

Optimizing the parameters of the Beta(a,b) distribution

To maximize the Beta(a,b) density, I will call the SAS/IML NLPNRA subroutine, which performs Newton-Raphson optimization. The NLP subroutine requires a user-defined module that can evaluate the objective function. The following SAS/IML module evaluates the integral of the Beta(a,b) function on [0.6, 0.8]. The parameters (a,b) are the elements of the parameter vector.

proc iml;
/* integral of the Beta(a,b) density on [0.6, 0.8] */
start BetaProb(parm);
   a = parm[,1]; b = parm[,2];
   prob = cdf("beta", 0.8, a, b) - cdf("beta", 0.6, a, b);
   return( prob );
val = BetaProb( {4 2} );
print val;

The call to the BetaProb function for the parameters (a,b)=(4,2) results in an integral of 0.4, which looks correct from the figure. Because this is a constrained optimization problem, the next step is to specify the linear constraints for the parameters. You can then provide an initial guess and call the NLPNRA subroutine, as follows:

/* Set up the constraints a > 0 and 1 <= b <= 3 */
/*     a  b  */
con = {0  1,  /* lower bounds */
       .  3}; /* upper bounds */
p  = {4 2};   /* initial guess for solution */
opt = {1,     /* find maximum of function   */
       2};    /* print iteration history and optimal parameters */
call nlpnra(rc, OptParm, "BetaProb", p, opt) blc=con; /* OptParm is solution */

The NLPNRA subroutine computed that the parameters that maximize the integral of the Beta(a,b) density function over [0.6, 0.8] (subject to the constraints that 1 ≤ b ≤ 3) are (a,b)=(7.717, 3). The optimal Beta density is shown below.

For this problem, the fourth column of the "Optimization Results" table shows that the gradient is nonzero at the solution. This indicates that the solution is not a global maximum, but is only a maximum within the constrained parameter domain. The following contour plot shows contours of the integral as a function of the parameters. The optimal parameter value is displayed along the top edge of the graph.

In summary, this article demonstrates three things:

  • You can use SAS/IML software to optimize a function that is defined by an integral.
  • You can integrate a probability density by using the CDF function. There is no need to use the QUAD function.
  • You can constrain the parameter values. As this example demonstrates, sometimes the optimal parameter value occurs on the boundary of the parameter domain.

Next week I will show an example of optimizing a function that requires calling the QUAD function to compute an integral.

Post a Comment

Aborting a SAS/IML program upon encountering an error

A colleague sent me an interesting question: What is the best way to abort a SAS/IML program? For example, you might want to abort a program if the data is singular or does not contain a sufficient number of observations or variables.

As a first attempt would be to try to use the ABORT statement, which is discussed in the article "how to stop processing your SAS/IML program if a certain condition is met". The following program contains a subtle error in logic:

proc iml;
ErrorFlag = 1;
if ErrorFlag then ABORT;       /* does not work as intended */
print "Execute statement 1";  
print "Execute statement 2";

This approach does not work because the ABORT statement terminate the IML procedure, but then SAS continues to process the next statement, which happens to be a PRINT statement. Since a PRINT statement is not a valid global SAS statement, an error occurs.

ERROR 180-322: Statement is not valid or it is used out of proper order.

Use DO blocks or reverse the logic

The obvious solution is to use an ELSE statement to specify the conditional code to execute within PROC IML. At parse time, the PROC IML parser will process all DO blocks of an IF-THEN/ELSE statement. However, at run time only the DO block for the "true" condition will be executed. Consequently, the following statements parse and run without error:

if ErrorFlag then ABORT;
else do;
   print "Execute statement 1";  
   print "Execute statement 2";

An alternative solution is to forget about using the ABORT statement and reverse the conditional logic:

ErrorFlag = 1;
if ^ErrorFlag then do;
   print "Execute statement 1";  
   print "Execute statement 2";

In this alternative solution, the PRINT statements are executed if the ErrorFlag variable is zero; otherwise they are not. In either case, the QUIT statement is the first executable statement after the IF-THEN/ELSE statement, so the procedure ends.

Use modules

In a complex analysis, defining a module is a good way to encapsulate a task, be it validating data or computing with the validated data. For example, suppose your algorithm requires data that has at least three variables and no missing values. The following program divides the error checking task and the computational task between two modules:

start CheckData(x);
   if ncol(x) < 3 then return(1);       /* error */
   if countmiss(x) > 0 then return(2);  /* error */
   return(0);                          /* no error */
start MyComputation(x);                /* only called on validated data */
   print "Execute statement 1";  
   print "Execute statement 2";
/* read x; validate data properties */
use MyData;  read all var _num_ into x;  close MyData;
ErrorFlag = CheckData(x);
if ^ErrorFlag then 
  run MyComputation(x);

With regards to defining modules, I'll also mention that you can use the RETURN statement anywhere in a module. This means that if you encounter a situation in the MyComputation module that requires you to abort the computation, you can use the RETURN statement to leave the module. Consequently, modules offer a robust mechanism for aborting a computation if an error condition occurs.


There are other ways to organize your program so that statements can conditionally execute if there is an error. However, I will stop here. Do you have a favorite way to handle errors in PROC IML? Leave a comment.

Post a Comment

How to automatically select a smooth curve for a scatter plot in SAS

My last blog post described three ways to add a smoothing spline to a scatter plot in SAS. I ended the post with a cautionary note:

From a statistical point of view, the smoothing spline is less than ideal because the smoothing parameter must be chosen manually by the user.

This means that you can specify a smoothing spline parameter that appears to fit the data, but the selection is not based on an objective criterion. In contrast, SAS software supports several other methods for which the software can automatically choose a smoothing parameter that maximizes the goodness of fit while avoiding overfitting. This article describes penalized B-splines loess curves, and thin-plate splines.

Penalized B-splines

You can fit penalized B-splines to data by using the TRANSREG procedure or by using the PBSPLINE statement in the SGPLOT procedure. The TRANSREG procedure enables you to specify many options that control the fit, including the criterion used to select the optimal smoothing parameter. The TRANSREG procedure supports the corrected Akaike criterion (AICC), the generalized cross-validation criterion (GCV), and Schwarz’s Bayesian criterion (SBC), among others. These criteria try to produce a curve that fits the data well, but does not interpolate or overfit the data. The default criterion is the AICC.

To illustrate fitting a smooth curve to a scatter plot, I'll use the SasHelp.enso data set, which contains data on the "southern oscillation," a cyclical phenomenon in atmospheric pressure that is linked to the El Niño and La Niña temperature oscillations in the Pacific Ocean. (ENSO is an acronym for El Niño-Southern Oscillation.) The ENSO data is analyzed in an example in the PROC TRANSREG documentation.

The following statements fit a penalized B-spline to the ENSO data:

proc transreg data=sashelp.enso;
   model identity(pressure) = pbspline(year); /* AICC criterion */

The data shows an oscillation of pressure in a yearly cycle. There are 14 peaks and valleys in this 14-year time series, which correspond to 14 winters and 14 summers.

Loess curves

A loess curve is not a spline. A loess model at x uses a local neighborhood of x to compute a weighted least squares estimate. The smoothing parameter is the proportion of data in the local neighborhood: a value near 0 results in a curve that nearly interpolates the data whereas a value near 1 is nearly a straight line. You can fit loess curves to data by using the LOESS procedure or by using the LOESS statement in the SGPLOT procedure. The LOESS procedure enables you to specify many options that control the fit, including the criterion used to select the optimal smoothing parameter. The LOESS procedure supports the AICC and GCV criteria, among others. The default criterion is the AICC.

The default loess fit to the ENSO data is computed as follows:

ods select FitSummary CriterionPlot FitPlot;
proc loess data=sashelp.enso;
   model pressure = year;

The loess curve looks different than the penalized B-spline curve. The loess curve seems to contain irregularly spaced peaks that are 4–7 years apart. This is the El Niño oscillation cycle, which is an irregular cycle. The selected smoothing parameter is about 0.2, which means that about one-fifth of the observations are used for each local neighborhood during the estimation.

Earlier I said that the LOESS and TRANSREG procedures automatically choose parameter values that optimize some criterion. So why the difference in these plots? The answer is that the criterion curves sometimes have local minima, and for the ENSO data the LOESS procedure found a local minimum that corresponds to the El Niño cycle. The curve that shows the AICC criterion as a function of the smoothing parameter is shown below:

The LOESS procedure supports several options that enable you to explore different ranges for the smoothing parameter or to search for a global minimum. I recommend using the PRESEARCH option in order to increase the likelihood of finding a global minimum. For example, for the ENSO data, the following call to PROC LOESS finds the global minimum at 0.057 and produces a scatter plot smoother that looks very similar to the penalized B-spline smoother.

proc loess data=sashelp.enso;
   model pressure = year / select=AICC(presearch);

Smoothers with the SGPLOT procedure

Because the loess and penalized B-spline smoothers work well for most data, the SGPLOT procedure supports the PBSPLINE and LOESS statements for adding smoothers to a scatter plot. By default the statements also add the scatter plot markers, so use the NOMARKERS options if you are overlaying a smoother on an existing scatter plot:

proc sgplot data=sashelp.enso;
pbspline x=year y=pressure;           /* also draws scatter plot */
loess x=year y=pressure / nomarkers;  /* uses PRESEARCH option */

As you can see, the two smoothers produce very similar results on the ENSO data. That is often the case, but not always.

Thin-plate splines

I'm not going to shows examples of every nonparametric regression procedure in SAS, but I do want to mention one more. A thin-plate spline yet another spline that can be used to smooth data. The TPSPLINE procedure supports the GCV criterion for the automatic selection of the smoothing parameter. For the ENSO data, the TPSPLINE procedure produces a smoother the reveals the annual pressure cycle and is very similar to the previous curves:

ods select FitStatistics CriterionPlot FitPlot;
proc tpspline data=sashelp.enso;
   model pressure = (year);

In conclusion, SAS software provides several ways to fit a smoother to data. The procedures mentioned in this article can automatically select a smoothing parameter that satisfies some objective criterion which balances goodness of fit with model parsimony. Do you have a favorite smoother? Let me know in the comments.

Post a Comment

Three ways to add a smoothing spline to a scatter plot in SAS

Like many SAS programmers, I use the Statistical Graphics (SG) procedures to graph my data in SAS. To me, the SGPLOT and SGRENDER procedures are powerful, easy to use, and produce fabulous ODS graphics. I was therefore surprised when a SAS customer told me that he continues to use the old GPLOT procedure because "the procedure makes it easy to add a smoothing spline curve to a scatter plot." The customer likes using smoothing splines, which are produced by using the INTERPOL=SMnn option on the SYMBOL statement. He added that smoothing splines are not available in PROC SGPLOT.

My response: If you like the GPLOT procedure, you are certainly welcome to continue using it. However, if the only reason you use it is to add a spline curve to a scatter plot, you should know that there are alternatives. This article shows how to fit a spline curve by using by using PROC TRANSREG or PROC IML.

Splines, splines, everywhere

A spline is one way to fit a smooth curve to two-dimensional data. There are actually many kinds of splines. The "smoothing spline" that the customer likes is a cubic spline, but SAS supports thin-plate splines and penalized B-splines, just to name two others. Let's see what a spline curve looks like as created by PROC GPLOT and the SYMBOL INTERPOL=SMnn option:

/* data set A must be sorted by x */
data A;
input x y @@;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
title "sm45 spline smoother";
proc gplot data=A;
   plot y*x;
   symbol1 interpol=sm45 value=circle height=2; /* note that x is sorted */

The graph displays a scatter plot overlaid with a spline smoother. In the SYMBOL statement, the last two digits of the INTERPOL=SM option specify the amount of smoothness in the spline curve. A value close to zero results in a curve with many wiggles that nearly passes through every data point. A value close to 100 results in a flat curve that is nearly the least-squares regression line for the data. Many examples that I have seen use smoothing values in the range 40–60.

Computing a spline curve by using the TRANSREG procedure

The easiest way to generate an ODS graph that shows the same curve is to use the TRANSREG procedure in SAS/STAT software. The TRANSREG procedure supports a SMOOTH transformation of an explanatory variable. The SM= option enables you to specify a smoothing value in the range [0, 100], which is the same range as for PROC GPLOT.

Because there are only eight observations, a plot of the predicted values versus the explanatory variable will look jagged. To see a smooth curve of predicted values, you can evaluate the spline curve at a series of evenly spaced points within the range of X. The TRANSREG function does not support a SCORE or STORE statement, but you can use the missing response trick to visualize the spline curve. The following DATA step creates 200 evenly spaced points in the range [1, 8]. A second DATA step concatenates the original data and the evenly spaced points:

/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);  
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
/* concatenate the original data and the scoring data */
data B;  set A ScoreX;  run;
/* use the missing response trick to evaluate the spline on the scoring data */
proc transreg data=B plots(interpolate);
model identity(y) = smooth(x / sm=45);
output out=Pred p;       /* optional: write spline curve to a data set */

The spline curve produced by PROC TRANSREG is identical to the smoothing spline that is produced by PROC GPLOT. The OUTPUT statement in the TRANSREG procedure enables you to write the values from the spline curve to a SAS data set, which means that you can also use PROC SGPLOT to create this graph.

Computing a spline curve in SAS/IML

SAS/IML software computes smoothing splines by using the SPLINE subroutine. The routine enables you to compute interpolating splines, smoothing splines, and parametric splines, and to control the behavior of the curves at the endpoints of the curves. However, the SPLINE routine uses a different parameterization than the GPLOT and TRANSREG procedures.

In the SPLINE subroutine, the parameter that controls the smoothing can be any nonnegative value. The parameter is a geometric transformation of the SM= value in PROC TRANSREG, which means that you can specify the smoothing parameter from GPLOT and convert it into the corresponding parameter for the SPLINE routine. Specifically, the following SAS/IML statements compute a smoothing spline for the data:

proc iml;
  use A;  read all var {x y};  close A;   /* read data */
  /* supply the smoothing parameter value from the SYMBOL statement */
  sm = 45;                          /* value in the INTERPOL=SM option         */
  ssd = ssq(x-mean(x));             /* sum of squared deviations from the mean */
  t = ssd##1.5 / (10##(9-(sm*.1))); /* SPLINE parameter as a function of SM    */
  /* fit the spline and get interpolated points */
  call spline(fitted, x||y) smooth=t;
  create SplineFit from fitted[c={x p}];
  append from fitted;
/* concatenate with original data */
data C;
  set A SplineFit(rename=(x=copyX));
proc sgplot data=C;
   title "Spline from SPLINEC routine in SAS/IML";
   scatter x=x y=y;
   series x=copyX y=p;

The values produced by PROC IML are identical to the spline values that were computed by PROC TRANSREG. PROC SGPLOT produces a graph (not shown) that is nearly identical to the graph that was produced by PROC TRANSREG. In conclusion, there are two easy ways to compute smoothing splines outside of PROC GPLOT.

Are there other features of PROC GPLOT that are unsupported by PROC SGPLOT and that you can't live without? In the comments, tell me your favorite feature of PROC GPLOT that prevents you from switching to PROC SGPLOT.

Other ways to smooth a scatter plot

One last comment on splines. I've shown that you can create a smoothing spline with PROC GPLOT, PROC TRANSREG, or the SPLINE subroutine in PROC IML. However, from a statistical point of view, the smoothing spline is less than ideal because the smoothing parameter must be chosen manually by the user. My next blog post describes how to add smoothing curves whose parameters are chosen automatically to optimize a statistical criterion.

Post a Comment