Color scatter plot markers by values of a continuous variable in SAS

When I visualize three-dimensional data, I prefer to use interactive graphics. For example, I often use the rotating plot in SAS/IML Studio (shown at the left) to create a three-dimensional scatter plot. The interactive plot enables me to rotate the cloud of points and to use a pointer to select and query the values of interesting points.

However, in blog posts, conference proceedings, and slideshow presentations, I often need to display a static visualization of three-dimensional data. Of course I can display a static snapshot of the rotating plot, as I've done here, but there are other options, including using the G3D procedure in SAS/GRAPH software to create a static 3-D scatter plot of the data.

A third option is to draw a two-dimensional scatter plot and color the observations by the value of a third variable. This is a useful technique in many situations, such as visualizing the relationship between two variables while indicating the value of a third variable. The following 2-D scatter plot shows the same data as in the 3-D rotating plot at the top of this article:

The data are from the documentation for the GAM procedure in SAS/STAT software and depict an experiment in which the yield of a chemical reaction is plotted against two control variables. The temperature of the solution and the amount of catalyst added to the solution were both varied systematically and independently on a uniform grid of values. From this scatter plot you can quickly see that the yield tends to be high when the temperature is in the 120–103 range and the amount of catalyst is between 0.04 and 0.07.

Coloring markers by a continuous variable

It is easy to color markers according to the value of a discrete variable: use the GROUP= option on the SCATTER statement in PROC SGPLOT. But how can you create the previous scatter plot by using the SG procedures in SAS?

As of SAS 9.4, the SGPLOT procedure does not enable you to assign colors to markers based on a continuous variable. However, you can use the Graph Template Language (GTL) to create a template that creates the plot. The trick is to use the MARKERCOLORGRADIENT= and COLORMODEL= options on the SCATTERPLOT statement to associate colors with values of a continuous variable. The following template creates a scatter plot with markers that are colored according to a blue-red color ramp:

/* create a GTL template that displays a scatter plot with markers 
   colored according to values of a continuous variable */
proc template;
  define statgraph gradientplot;
  dynamic _X _Y _Z _T;
  mvar LEGENDTITLE "optional title for legend";
    begingraph; 
      entrytitle _T; 
      layout overlay; 	 
        scatterplot x=_X y=_Y / 
          markercolorgradient=_Z colormodel=(BLUE RED)
          markerattrs=(symbol=SquareFilled size=12) name="scatter";
        continuouslegend "scatter" / title=LEGENDTITLE;
      endlayout;	
    endgraph;
  end;
run;
 
%let LegendTitle = "Yield";
proc sgrender data=ExperimentA template=gradientplot;
   dynamic _X='Temperature' _Y='Catalyst' _Z='Yield' _T='Raw Data';
run;

A few comments on the GTL template:

  • The MVAR statement enables you to use macro variables in your graphs. When the SGRENDER procedure is called, the legend title will be set to the value of the LegendTitle macro, if the variable is defined.
  • The three variables in the graph are dynamic variables (_X, _Y, and _Z) that are specified when you call PROC SGRENDER. The title of the graph (_T ) is similarly specified.
  • The MARKERCOLORGRADIENT= option is used to assign marker colors according to values of the _Z variable.
  • The COLORMODEL= option is used to specify a color ramp. I've hard-coded a blue-red color ramp, but other options are possible.
  • The CONTINUOUSLEGEND statement is used to display the color ramp on the graph so that the reader can associate colors to values.

Tip: The plot will suffer from overplotting if there are two or more observations that have the same (x, y) coordinates but different z coordinates. You can still use this technique, but you might want to sort the data by the response variable. This will create a plot where the high values of the response variable are apparent because they are plotted on top of the lower values. For example, if the purpose of your plot is to demonstrate that light cars with small engines are more fuel efficient than larger vehicles, sort the Sashelp.Cars data set by the MPG_City variable before you create the scatter plot, as follows:

proc sort data=Sashelp.Cars out=Cars;   by MPG_City;  run;
 
%let LegendTitle = "MPG City";
proc sgrender data=Cars template=gradientplot;
   dynamic _X='Horsepower' _Y='Weight' _Z='MPG_City' _T='Fuel Efficiency';
run;

You can download the data and the program that creates these scatter plots. This GTL template is easily modified to support other color ramps, transparent markers, and othe options. Enjoy!

Post a Comment

Video: What's new in SAS/IML 13.1

SAS/IML 13.1 shipped a few months ago. I was asked to produce a video that highlights some of the new features in SAS/IML 13.1. In this video I describe several changes to the language before introducing the new built-in subroutines that create ODS statistical graphs.



If your browser does not support embedded video, you can go directly to the video on YouTube. For more details, see "What's New in SAS/IML 13.1" in the SAS/IML User's Guide.

By the way, did you know that there is a SAS Statistics and Operations channel on YouTube that contains dozens of videos? Many of the videos are of presentations given by SAS staff at conferences such as SAS Global Forum. Enjoy!

Post a Comment

Define functions with optional parameters in SAS/IML

Last month I blogged about defining SAS/IML functions that have default parameter values. This language feature, which was introduced in SAS/IML 12.1, enables you to skip arguments when you call a user-defined function.

The same technique enables you to define optional parameters. Inside the function, you can determine whether the optional parameter was specified by using the ISSKIPPED function.

Print the first few rows of a matrix

For example, suppose you want to implement a module that prints the first few rows of a matrix. By default, you want to print five rows. However, you also want to support printing a specified number of rows. You can implement a module that does this task by using the new 12.1 syntax for specifying default parameter values. The UNIX operating system has a command that called head that prints the first few rows of a file, so I'll use the same name for this module. A first attempt might be as follows:

proc iml;
start head1(x, n=min(5,nrow(x)));  /* 1st attempt: one arg with default value */
   print (x[1:n,]);
finish;
 
x = shape(1:36, 9);    /* test matrix is 9 x 4 */
run head1(x);          /* print first 5 rows (default) */
run head1(x, 3);       /* print first 3 rows */

The HEAD1 module works as advertise: it prints five rows by default, but also enables you to specify the number of rows. Notice the way that I specified the default value for the second argument (n). In the START statement, I put an equal sign (=) after the name of the argument and then I specified an expression that determines the default value. For this example, the expression contains a call to the MIN function in order to support matrices that have fewer than five rows. In other words, the default value for n depends on the size of x, the first argument. The value for n is determined as follows:

  • If the module is called with two arguments, then n is assigned the value of the second specified argument. The expression that computes the default value is ignored.
  • If the module is called with one argument, then n is assigned a default value, which is determined by evaluating the expression min(5, nrow(x)). When x has five or more rows, then n = 5. When x has fewer than five rows, n is assigned to be the number of rows in x.

Specifying a complicated default value

In this example, the default value of the n argument was moderately complicated, but I could still specify it by using a single expression. If the default value requires executing several statements, then you can't specify the value on the START statement. Instead, use an equal sign to specify that the argument is optional, and use the ISSKIPPED function to check whether the argument was specified or whether it should receive a default value, as follows:

start head2(x, n=);  /* 2nd attempt: one optional arg */
   if isskipped(n) then do;
      /* use as many statements as necessary to assign a default value */
      n = min(5,nrow(x));
   end;
   print (x[1:n,]);
finish;

Optional arguments without default values

Suppose that you want to modify the HEAD1 module so that it supports printing a matrix by using a specified SAS format. By default, you want to print the first few matrix rows by using the default SAS/IML format. However, you want to optionally specify a format.

In this scenario, there is no appropriate default value for the format. To specify that an argument is optional (but does not receive a default value), place an equal sign (=) after the argument in the START statement. You can use the ISSKIPPED function to detect whether the optional argument was specified, as follows:

/* n has a default value; format is optional */
start head(x, n=min(5,nrow(x)), format=);
   labl = "head (rows=" + strip(char(n)) + ")";  /* construct label */
   if isskipped(format) then
      print (x[1:n,])[label=labl];  /* print with default format */
   else do;
      y = putn(x[1:n,], format);    /* apply specified format */
      print y[label=labl];          /* print formatted matrix */
   end;
finish;
 
run head(x) format="4.2";           /* specify a format */

Inside the HEAD module, the ISSKIPPED function determines at run time whether the third argument was specified. If not, then the PRINT statement prints the first few rows of the matrix. (The default SAS/IML format is used.) If the third argument is specified, then the PUTN statement applies the specified format to the first few rows and prints the result.

Optional arguments can be a powerful tool. Default arguments make it easier to call a function because the user needs to specify only a small number of the possible options to the function. As this example shows, default values can depend on run time values of other arguments. Default arguments are one of my favorite features in SAS/IML 12.1. I hope that you find them as useful as I do!

Post a Comment

Where's Rick at SAS Global Forum 2014?

Once again I'll be at SAS Global Forum this year. The 2014 location is Washington, D. C., so I am looking forward to greeting many friends in the government and consulting sectors. I always enjoy talking with SAS customers about statistics, simulations, matrix computations, and the SAS/IML product, so here's my schedule at SAS Global Forum. Please say hello!

Sunday, March 23, 2014

I have no scheduled presentations. I will be at the Opening Session from 7:00 – 8:30 p.m., followed by the Get-Acquainted Reception from 8:30 – 10:30 p.m.

Monday, March 24

I am at the SAS/IML booth in the SAS Support and Demo Area for most of the day, so this is a great time to talk about what you are doing with SAS software or features that you would like to see added to a future release. My hours in the Demo Area are:

  • 10:00 a.m. – 11:30 a.m.: At the SAS/IML booth
  • 11:30 a.m. – 11:50 a.m.: Super Demo: "Ten Tips for Simulating Data with SAS"
  • 2:00 p.m. – 4:00 p.m.: At the SAS/IML booth, except during the next Super Demo
  • 2:30 p.m. – 2:50 p.m.: Super Demo: "Using Heat Maps to Visualize Matrices"
  • 6:00 p.m. – 7:30 p.m.: Grab a drink and some hors d'oeuvres and chat with me at the SAS/IML booth during the evening mixer!

Tuesday, March 25

I will be attending talks during the morning, but send me an email if you want to schedule a time to meet. Or grab me at one of the coffee breaks!

My only scheduled activity on Tuesday is that I am appearing on the SAS Tech Talks show. Chris Hemedinger will interview me about calling R from SAS during the 2:00 p.m.. show. This interview can be watched live at www.sasglobalforum.com, or you can watch the show after the broadcast.

In the evening, please join me and other SAS professionals for a meetup session 6:00 p.m. – 8:00 p.m. in Chesapeake Room A/B/C. This meetup will include leading members of the SAS online community, including representatives from SAS-L, sasCommunity.org, the SAS Support Communities (a.k.a., "Discussion Forums"), and other online networking sites for SAS professionals. Besides providing an opportunity to network, the meetup will include the presentation of the annual SAS-L awards.

Wednesday, March 26

I will be giving a Hands-On Workshop on "Getting Started with the SAS/IML Language" from 11:00 a.m. – 12:40 p.m. There will be a room full of computers and you will have an opportunity to learn some SAS/IML programming techniques. If you have always wanted to try SAS/IML, but it is not licensed at your site, this is your chance! It is free, so please join me before you head home.

So whether you want to discuss statistical programming, comment on a blog topic, or tell me about a cool project that you are working on, please introduce yourself at the conference. I look forward to meeting old friends... and making some new ones.

Post a Comment

Finding elements in one vector that are not in another vector

The SAS/IML language has several functions for finding the unions, intersections, and differences between sets. In fact, two of my favorite utility functions are the UNIQUE function, which returns the unique elements in a matrix, and the SETDIF function, which returns the elements that are in one vector and not in another. Both functions return the elements in an ordered vector.

However, a recent question on the SAS/IML Support Community asked how to find the elements in one vector that are not in another while preserving the original order of the elements. As an example, consider the following vectors:

proc iml;
A = {6 3 4 1 2 5 1 2 4};
B = {2 3};

The problem is how to find the elements in A that are not in B. If possible, we'd like to avoid using DO loops to iterate over the elements.

If you use the SETDIF function, the duplicate values in A are removed and the result is sorted:

C = setdif(A,B);
print C;

The output of the SETDIF function shows that the elements 1, 4, 5, and 6 are contained in A, but not in B. Although A contains two 1s and two 4s, that information is not included in the result.

If you need to preserve the original order of the data or you need to retain duplicate values, then you should use the ELEMENT function, which was introduced in SAS/IML 9.3. The ELEMENT function returns a matrix that is the same dimensions as A. The returned matrix is an "indicator matrix": a 1 in position (i, j) means that A[i, j] is an element of B, as shown by the following statements:

E = element(A,B);
print (A // E)[rowname={"A","in B"}];

The output shows that the second, fifth, and eighth elements of A are elements of B. If you want to preserve the original order of elements (and preserve duplicates), you can use the LOC function and the REMOVE function to eliminate those elements, as follows:

D = remove(A,loc(element(A,B)));
print D;
/* you can also use subscripting:  D = A[ loc(^element(A,B)) ]; */

The result lists the elements of A that are not in B, but preserves the order of the elements and any duplicate elements in the list.

Post a Comment

For pi day: A continued fraction expansion of pi

Many geeky mathematical people celebrate "pi day" on March 14, because the date is written 3/14 in the US, which is evocative of the decimal representation of π = 3.14....

Most people are familiar with the decimal representation of π. The media occasionally reports on a new computational tour-de-force that approximates π to unimaginable precision. The decimal value of π has been computed to more than 10 trillion digits!

Using the decimal approximation of π is ubiquitous in mathematics, but in elementary school I learned another approximation: π ≈ 22/7. As I got older, I discovered that there were other rational approximations of π, such as 355/113.

Continued fractions

To celebrate π day, I am going to use SAS to estimate π by using its simple continued fraction representation. Every real number x can be represented as a continued fraction:

In this continued fraction, the ai are positive integers. Because all of the fractions have 1 in the numerator, the continued fraction can be compactly represented by specifying only the integers: x = [a0; a1, a2, a3, ...]. Every rational number is represented by a finite sequence; every irrational number requires an infinite sequence. Notice that the decimal representation of a number does NOT have this nice property: 1/3 is rational but its decimal representation is infinite.

The continued fraction expansion of pi

The continued fraction expansion of π is known to more than 15 billion values. The first fifteen digits are
π ≈ [3; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1]
What does this mean? Well, the sequence [3; 7] corresponds to the fraction 3 + 1/7 = 22/7. Similarly, the sequence [3; 7, 15, 1] corresponds to the fraction 3 + 1/(7 + 1/(15 + 1/1)) = 355/113.

Trying to compute these fractions by hand is tedious. Why not write a SAS program to evaluate the continued fraction sequence? As the examples show, the computation starts at the end of the sequence. The computation begins by taking the reciprocal of the last number in the sequence. That values is added to the second-to-last number, and the sum is inverted. This process continues until you reach the first value. You can do this computation in the SAS DATA step (post your code in the comments!), but I have written a SAS/IML function to evaluate an arbitrary continued fraction representation. Notice that in SAS you can call the CONSTANT function to obtain a numerical approximation of π and other mathematical constants, which is useful for checking the results:

proc iml;
start ContinuedFraction(f);
   s = f[nrow(f)];              /* last value */
   do i = nrow(f)-1 to 1 by -1; /* traverse sequence in reverse order */
      s = f[i] + 1/s;           
   end;
   return( s );
finish;
 
/* test the function by using the continued fraction expansion for pi */
f = {3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1};
s = ContinuedFraction(f);
pi = constant("pi");    /* call the CONSTANT function to approximate pi */
print s[format=16.14], pi[format=16.14];

As you can see, the first fifteen terms of the continued fraction expansion of π are sufficient to reproduce the first 15 digits of the decimal representation.

Other continued fractions

Here are a few continued fraction expansions for some other irrational numbers:

  • The continued fraction expansion for e ≈ 2.718281828459 shows a fascinating regular pattern: two 1s followed by an even number.
    e = [2; 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10,....]
  • The continued fraction expansion for the square root of 2 ≈ 1.414213562373095 contains a repeating sequence of 2s:
    sqrt2 = [1; 2, 2, 2, 2, ...]
  • The continued fraction expansion for the golden ratio φ ≈ 1.61803398874989 is a repeating sequence of 1s:
    phi = [1; 1, 1, 1, 1, 1, ...]
You can use these sequences to estimate the corresponding decimal representation:
e     = ContinuedFraction({2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10});
sqrt2 = ContinuedFraction({1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2});
phi   = ContinuedFraction({1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1});
print e[format=16.14], sqrt2[format=16.14], phi[format=16.14];

The results contain less than 15-digit decimal precision. In order to obtain more precision, you would need to use more terms in the continued fraction representation.

By the way, next year's Pi Day should be special: celebrate on 3/14/15 at 9:26:53 for maximal enjoyment!

Post a Comment

Optimizing a function of an integral

Last week I showed how to find parameters that maximize the integral of a certain probability density function (PDF). Because the function was a PDF, I could evaluate the integral by calling the CDF function in SAS. (Recall that the cumulative distribution function (CDF) is the integral of a PDF.)

You can also use SAS to optimize an integral function when the integrand is not a PDF. This follow-up article shows how to optimize the integral of an arbitrary function by calling the QUAD subroutine in SAS/IML software to evaluate the integral.

Defining an integral

This example finds parameters (a,b) such that the integral of the f(x) = 3/4 – xx2 is the maximum on the interval [a, b]. That is, the problem is to maximize

subject to the constraint that a < b.

The graph of the integrand is shown at the left. The parameters determine the domain of integration. The graph shows the integral (area under the curve) for the values (a,b)=(–2,1). From the graph, you can see that the optimal parameters are (a,b) =(–1.5, 0.5), which integrates the quadratic function over the interval on which it is positive. (Because the function is negative outside of (–1.5, 0.5), we don't want to accumulate "negative area.")

To solve this problem in the SAS/IML language, you need to define the integrand and call the QUAD function to evaluate the integral, as follows:

proc iml;
/* Define the integrand, which will be called by QUAD subroutine.
   For this problem, the integrand does not depend on the parameters,
   but for other problems, it might. */
start Func(x);         /* use GLOBAL stmt if Func depends on parameters */
   return( 3/4 - x - x##2 );
finish;
 
/* Define a function that computes the integral. This will be called by an
   NLP routine, so the function must depend on the parameters. */
start Integral(parms); /* use GLOBAL stmt if Func depends on parameters */
   /* the parameters are:  a = parms[1]; b = parms[2]; */
   call quad(val, "Func", parms);    /* evaluate integral on [a, b] */
   return( val );
finish;

The module named Integral defines the objective function, which will be maximized by calling one of the NLP functions in SAS/IML software. However, you should NEVER start optimizing until you have verified that the objective function is defined correctly! The following statements test the objective function by evaluating the integral on three different domains. Because this quadratic function has a simple antiderivative, you can verify analytically that the Integral function is performing the integration correctly:

/* Check and debug objective function BEFORE using it in NLP! */
val1 = Integral({-2 1});
val2 = Integral({-1.5 0.5});
val3 = Integral({-0.5 1});
print val1 val2 val3;

Defining the gradient vector of derivatives

One of my favorite nonlinear optimization methods is the Newton-Raphson algorithm, which is implemented in SAS/IML as the NLPNRA subroutine. This algorithm uses derivatives of the objective function to help find the maximal value of the function. The algorithm will compute finite-difference derivatives if you do not supply an analytic derivative function, but in this example it is trivial to define the gradient vector, which is the vector of derivatives of the objective function with respect to the parameters. Because the parameters are the upper and lower limits of integration, the fundamental theorem of calculus tells us that the derivatives for this problem are found by evaluating the integrand at the parameter values:

/* Optional: Define derivative w.r.t. parameters. By the FTC, the derivative 
   is found by evaluating the integrand at the parameters. */
start Deriv(parms);
   dfda = -Func( parms[1] );  /* deriv at lower limit of integration */
   dfdb =  Func( parms[2] );  /* deriv at upper limit of integration */
   return( dfda || dfdb );
finish;

Maximizing the objective function

This maximization problem is subject to the constraint that a < b. You can specify boundary and linear constraints by using the BLC= option to the NLP subroutines. You can use the following matrix to specify the constraint to an NLP function:

/* Set up linear constraint a < b, which is also a - b < 0 */
/*     a  b OP val */
con = {.  .  .   .,  /* no constraint on lower bound of a or b */
       .  .  .   .,  /* no constraint on upper bound of a or b */
       1 -1 -1  0};  /* 1*a - 1*b <= 0      */

Finally, you can specify an initial guess and call the NLPNRA subroutine to find the parameter values that maximize the integral:

p = {-1 1};        /* initial guess for solution */
opt = {1,          /* find maximum of function   */
       2};         /* print a little bit of output */
call nlpnra(rc, result, "Integral", p, opt) blc=con jac="Deriv";
print result[c={"a" "b"}];

The result agrees with our intuition. The Newton-Raphson algorithm determined that the interval(a,b) = (–1.5, 0.5) maximizes the integral of the quadratic function. The maximal value of the integral is 4/3.

In summary, this article shows how to do the following:

  • Optimize a function that depends on one or more parameters. The "twist" is that the objective function requires evaluating an integral.
  • Use the QUAD subroutine to numerically compute the integral an arbitrary function.
  • Specify linear constraints in an optimization problem.

Post a Comment

How to get started with SAS: Free videos for beginners

On most Mondays I blog about a function, programming technique, or resource that is useful for programmers who are getting started with SAS software. Recently I learned that my colleagues in the SAS education division have been hard at work developing a series of short videos that explain basic tasks to beginners. They have developed more than 20 videos that teach the basics of SAS programming and statistics, in addition to dozens of other videos that describe frequently encountered tasks in SAS Enterprise Guide, Enterprise Miner, and other SAS products.

Topics for beginners include:

  • Creating simple statistical graphs
  • Analyzing frequencies of a categorical variable
  • Analyzing the distribution of a continuous variable
  • Analyzing correlations between variables
  • Fitting simple regression models

Do you know a friend or colleague that is new to using SAS and who learns best by watching videos? Direct them to these free SAS tutorials that teach the basics of SAS programming and analytics.

Post a Comment

Optimizing a function that evaluates an integral

SAS programmers use the SAS/IML language for many different tasks. One important task is computing an integral. Another is optimizing functions, such as maximizing a likelihood function to find parameters that best fit a set of data.

Last week I saw an interesting problem that combines these two important tasks. The problem involved optimizing a function, which in turn required evaluating an integral. There are two kinds of integrals to consider:

  1. The cumulative distribution function (CDF) is the integral of a probability density. In many applications you can numerically integrate a probability density by calling the CDF function in SAS.
  2. If the integral is not a CDF that is supported by SAS, you can use the QUAD subroutine to integrate an arbitrary user-defined function.
This post demonstrates how to optimize a function that evaluates a CDF to compute a probability. A future blog post will present an example of optimizing a function that calls the QUAD subroutine to evaluate an integral.

Computing probability by integrating a density function

You can compute a probability by integrating a probability density function (PDF). However, for most common probability distributions, you don't need to compute the integral yourself because SAS provides the CDF function. The CDF is the anti-derivative of the PDF. For example, to integrate the standard normal PDF on the interval [-1,1], you can call the CDF function as follows:

prob = cdf("Normal", 1) - cdf("Normal", -1);  /* integral of phi(x) on [-1, 1] */

The answer is 0.68, which is the probability that a random normal variable falls within one standard deviation of the mean.

Distributions often contain parameters, and a common task in statistical programming is finding parameter values that satisfy some condition (a root-finding problem) or that optimize some criterion. This article describes how to optimize an integral that depends on two parameters.

The Beta(a,b) distribution

As an example, consider the Beta(a,b) distribution, which is a function of two shape parameters. Suppose that you want to find the parameter values a > 0 and 1 ≤ b ≤ 3 that maximize the integral of the Beta(a,b) PDF over the interval [0.6, 0.8].

Several Beta(a,b) curves are shown in the figure to the left. (Click to enlarge.) The Beta(4,2) curve (teal color) and the Beta(6,2) curve (magenta color) both have large areas on the interval [0.6, 0.8], so those parameter values would be good starting points for a search that attempts to find an optimal parameter value.

Optimizing the parameters of the Beta(a,b) distribution

To maximize the Beta(a,b) density, I will call the SAS/IML NLPNRA subroutine, which performs Newton-Raphson optimization. The NLP subroutine requires a user-defined module that can evaluate the objective function. The following SAS/IML module evaluates the integral of the Beta(a,b) function on [0.6, 0.8]. The parameters (a,b) are the elements of the parameter vector.

proc iml;
/* integral of the Beta(a,b) density on [0.6, 0.8] */
start BetaProb(parm);
   a = parm[,1]; b = parm[,2];
   prob = cdf("beta", 0.8, a, b) - cdf("beta", 0.6, a, b);
   return( prob );
finish;
 
val = BetaProb( {4 2} );
print val;

The call to the BetaProb function for the parameters (a,b)=(4,2) results in an integral of 0.4, which looks correct from the figure. Because this is a constrained optimization problem, the next step is to specify the linear constraints for the parameters. You can then provide an initial guess and call the NLPNRA subroutine, as follows:

/* Set up the constraints a > 0 and 1 <= b <= 3 */
/*     a  b  */
con = {0  1,  /* lower bounds */
       .  3}; /* upper bounds */
p  = {4 2};   /* initial guess for solution */
opt = {1,     /* find maximum of function   */
       2};    /* print iteration history and optimal parameters */
call nlpnra(rc, OptParm, "BetaProb", p, opt) blc=con; /* OptParm is solution */

The NLPNRA subroutine computed that the parameters that maximize the integral of the Beta(a,b) density function over [0.6, 0.8] (subject to the constraints that 1 ≤ b ≤ 3) are (a,b)=(7.717, 3). The optimal Beta density is shown below.

For this problem, the fourth column of the "Optimization Results" table shows that the gradient is nonzero at the solution. This indicates that the solution is not a global maximum, but is only a maximum within the constrained parameter domain. The following contour plot shows contours of the integral as a function of the parameters. The optimal parameter value is displayed along the top edge of the graph.

In summary, this article demonstrates three things:

  • You can use SAS/IML software to optimize a function that is defined by an integral.
  • You can integrate a probability density by using the CDF function. There is no need to use the QUAD function.
  • You can constrain the parameter values. As this example demonstrates, sometimes the optimal parameter value occurs on the boundary of the parameter domain.

Next week I will show an example of optimizing a function that requires calling the QUAD function to compute an integral.

Post a Comment

Aborting a SAS/IML program upon encountering an error

A colleague sent me an interesting question: What is the best way to abort a SAS/IML program? For example, you might want to abort a program if the data is singular or does not contain a sufficient number of observations or variables.

As a first attempt would be to try to use the ABORT statement, which is discussed in the article "how to stop processing your SAS/IML program if a certain condition is met". The following program contains a subtle error in logic:

proc iml;
ErrorFlag = 1;
if ErrorFlag then ABORT;       /* does not work as intended */
print "Execute statement 1";  
print "Execute statement 2";
quit;

This approach does not work because the ABORT statement terminate the IML procedure, but then SAS continues to process the next statement, which happens to be a PRINT statement. Since a PRINT statement is not a valid global SAS statement, an error occurs.

ERROR 180-322: Statement is not valid or it is used out of proper order.

Use DO blocks or reverse the logic

The obvious solution is to use an ELSE statement to specify the conditional code to execute within PROC IML. At parse time, the PROC IML parser will process all DO blocks of an IF-THEN/ELSE statement. However, at run time only the DO block for the "true" condition will be executed. Consequently, the following statements parse and run without error:

if ErrorFlag then ABORT;
else do;
   print "Execute statement 1";  
   print "Execute statement 2";
end;

An alternative solution is to forget about using the ABORT statement and reverse the conditional logic:

ErrorFlag = 1;
if ^ErrorFlag then do;
   print "Execute statement 1";  
   print "Execute statement 2";
end;
quit;

In this alternative solution, the PRINT statements are executed if the ErrorFlag variable is zero; otherwise they are not. In either case, the QUIT statement is the first executable statement after the IF-THEN/ELSE statement, so the procedure ends.

Use modules

In a complex analysis, defining a module is a good way to encapsulate a task, be it validating data or computing with the validated data. For example, suppose your algorithm requires data that has at least three variables and no missing values. The following program divides the error checking task and the computational task between two modules:

start CheckData(x);
   if ncol(x) < 3 then return(1);       /* error */
   if countmiss(x) > 0 then return(2);  /* error */
   return(0);                          /* no error */
finish;
 
start MyComputation(x);                /* only called on validated data */
   print "Execute statement 1";  
   print "Execute statement 2";
finish;
 
/* read x; validate data properties */
use MyData;  read all var _num_ into x;  close MyData;
 
ErrorFlag = CheckData(x);
if ^ErrorFlag then 
  run MyComputation(x);
quit;

With regards to defining modules, I'll also mention that you can use the RETURN statement anywhere in a module. This means that if you encounter a situation in the MyComputation module that requires you to abort the computation, you can use the RETURN statement to leave the module. Consequently, modules offer a robust mechanism for aborting a computation if an error condition occurs.

Conclusions

There are other ways to organize your program so that statements can conditionally execute if there is an error. However, I will stop here. Do you have a favorite way to handle errors in PROC IML? Leave a comment.

Post a Comment