Creating a basic heat map in SAS

heatmapcubic

Heat maps have many uses. In a previous article, I showed how to use heat maps with a discrete color ramp to visualize matrices that have a small number of unique values, such as certain covariance matrices and sparse matrices. You can also use heat maps with a continuous color ramp to visualize correlation matrices or data matrices.

This article shows how to use a heat map with a continuous color ramp to visualize a function of two variables. A heat map of a simple cubic function is shown to the left. In statistical applications the function is often a quantity—such as the mean or variance—that varies as a function of two parameters. Heat maps are like contour plots without the contours; you can use a heat map to display smooth or nonsmooth quantities.

You can create a heat map by using the HEATMAPPARM statement in the graph template language (GTL). The following template defines a simple heat map:

proc template;              /* basic heat map with continuous color ramp */
define statgraph heatmap;
dynamic _X _Y _Z _T;        /* dynamic variables */
 begingraph;
 entrytitle _T;             /* specify title at run time (optional) */
  layout overlay;
    heatmapparm x=_X y=_Y colorresponse=_Z /  /* specify variables at run time */
       name="heatmap" primary=true
       xbinaxis=false ybinaxis=false;  /* tick marks based on range of X and Y */
    continuouslegend "heatmap";
  endlayout;
endgraph;
end;
run;

The template is rendered into a graphic when you call the SGRENDER procedure and provide a source of values (the data). The data should contain three variables: two coordinate variables that define a two-dimensional grid and one response variable that provides a value at each grid point. I have used dynamic variables in the template to specify the names of variables and the title of the graph. Dynamic variables are specified at run time by using PROC SGRENDER. If you use dynamic variables, you can use the template on many data sets.

The following DATA step creates X and Y variables on a uniform two-dimensional grid. The value of the cubic function at each grid point is stored in the Z variable. These data are visualized by calling the SGRENDER procedure and specifying the variable names by using the DYNAMIC statement. The image is shown at the top of this article.

/* sample data: a cubic function of two variables */
%let Step = 0.04;
data A;
do x = -1 to 1 by &Step;
   do y = -1 to 1 by &Step;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;
 
proc sgrender data=A template=Heatmap; 
   dynamic _X='X' _Y='Y' _Z='Z' _T="Basic Heat Map";
run;

The GTL template uses default values for most options. The default color ramp is a continuous three-color ramp. For the ODS style I am using, the default ramp uses blue to display low values, white to display middle values, and red to display high values. The XBINAXIS=false and YBINAXIS=false options override the default placement of tick marks on the axes. By default, the tick marks are be placed at data values of the X and Y variable, which are multiple of 0.04 for this data. The following image shows part of the heat map that results if I omit the XBINAXIS= and YBINAXIS= options. I prefer my tick labels to depend on the data range, rather than the particular value of &Step that I used to generate the data.

heatmapcubic2

The HEATMAPPARM statement has many options that you can use to control features of the heat map. Because the heat map is an important tool in visualizing matrices, SAS/IML 13.1 supports functions for creating heat maps directly from matrices. I will discuss one of these functions in my next blog post.

Post a Comment

Guiding numerical integration: The PEAK= option in the SAS/IML QUAD subroutine

One of the things I enjoy about blogging is that I often learn something new. Last week I wrote about how to optimize a function that is defined in terms of an integral. While developing the program in the article, I made some mistakes that generated SAS/IML error messages. By understanding my errors, I learned something new about the default options for the QUAD subroutine in SAS/IML software.

The QUAD subroutine supports the PEAK= option, but until last week I had never used the option. (Obviously the default value works well most of the time, since I have been using the QUAD routine successfully for many years!) I discovered that the PEAK= option can be important to ensuring that a numerical integration problem converges to an accurate answer.

The default value of the PEAK= option is 1. But what does that mean? Here's what I gleaned from the documentation:

  • The PEAK= option is used to scale the integrand and to determine whether the adaptive integration algorithm needs to perform additional iterations.
  • The option specifies a value of x where the magnitude of the integrand f(x) is relatively large. For integrations on infinite and semi-infinite intervals, the value of the PEAK= option specifies where the function has a lot of "mass." For example, the standard normal probability density function is centered at x=0, so you can specify PEAK=0 to integrate that function on (–∞, ∞). This example illustrates why the option is named "PEAK."
  • For best results, choose the PEAK= option to be in the interior of the interval of integration. Don't choose a value that is too close to the boundary. For example, on a finite domain [a, b], the midpoint of the interval is often a good choice: PEAK=(a + b)/2.
  • The PEAK= value should not be a location where the integrand is zero. That is, if PEAK=v, then make sure that f(v) ≠ 0.
  • In spite of its name, the value of the PEAK= option does not need to be the location of the maximum value of the integrand. For integration on a finite domain you can specify almost any value in the interval, subject to the previous rules.

The remainder of this article uses the quadratic function f(x) = 1 – x2 to illustrate how you can to control the QUAD subroutine by using the PEAK= option.

Avoid zeros of the integrand

The following SAS/IML statements define the function to integrate. The integrand has a zero at the value x = 1, which is the default value for the PEAK= option. If you attempt to integrate this function on the interval [0, 2], you see the following scary-looking error message:

proc iml;
start Func(x) global(a);
   return( 1 - x##2 );         /* note that Func(1) = 0 */
finish;
 
call quad(w, "Func", {0 2});  /* default: PEAK=1 */
Convergence could not be attained over the subinterval (0 , 2 )
 
The function might have one of the following:
     1) A slowly convergent or a divergent integral.
     2) Strong oscillations.
     3) A small scale in the integrand: in this case you can change the 
        independent variable in the integrand, or
        provide the optional vector describing roughly the mean and 
        the standard deviation of the integrand.
     4) Points of discontinuities in the interior:
        in this case, you can provide a vector containing
        the points of discontinuity and try again.

The message says that the integration did not succeed, and it identifies possible problems and offers some suggestions. For this example, the integral is not divergent, oscillatory, or discontinuous, so we should look at the third item in the list. The message seems to assume that the integrand is a statistical probability distribution, which is not true for our example. For general functions, the advice could be rewritten as "... provide values for the PEAK= and SCALE= options. The SCALE= value should provide a horizontal estimate of scale; the integrand evaluated at the PEAK= value should provide a vertical estimate of scale."

For the example, –3 ≤ f(x) ≤ 1 when x is in the interval [0, 2], so the integrand has unit scale in the vertical direction. You can choose almost any value of x for the PEAK= option, but you should avoid values where the integrand is very small. The following statement uses PEAK=0.5, but you could also use PEAK= 1.5 or PEAK=0.1234.

call quad(w, "Func", {0 2}) peak=0.5;  
print w;
t_quadpeak

In a similar way, you should make sure that the integrand is defined at the PEAK= value. Avoid singular points and points of discontinuity.

Avoid the boundaries of the interval

Here's another helpful tip: choose the PEAK= option so that it is sufficiently inside the interval of integration. Avoid choosing a value that is too close to the lower or upper limit of integration. The following statement chooses a value for the PEAK= option that is just barely inside the interval [0, 2]. A warning message results:

val = 2 - 1e-7;     /* value is barely inside interval [0,2] */
call quad(w, "Func", {0 2}) peak=val;
Due to the roundoff encountered in computing FUNC
      near the point     2.00000E+00
      convergence can be attained, but only
      with an estimated error     0.00000E+00

The warning message says that a problem occurs near the point x=2. Although the routine will continue and "convergence can be attained," the routine is letting you know that it cannot perform the integration as well as it would like to. If you move the PEAK= value farther away from the limits of integration, the warning goes away and the subroutine can integrate the function more efficiently.

So there you have it. Although the default value of the PEAK= option works well for most integrals, sometimes the programmer needs to provide the QUAD routine with additional knowledge of the integrand. When you provide the PEAK= option, be sure to avoid zeros and discontinuities of the integrand, and stay away from the boundaries of the integration interval.

Post a Comment

Ten tips for learning the SAS/IML language

A SAS customer wrote, "Now that I have access to PROC IML through the free SAS University Edition, what is the best way for me to learn to program in the SAS/IML language? How do I get started with PROC IML?"

That is an excellent question, and I'm happy to offer some suggestions. The following ideas are ordered according to your level of experience with SAS/IML programming. The first few resources will help beginners get started with PROC IML. The last few suggestions will help intermediate-level programmers develop and improve their SAS/IML programming skills.

  1. Get the SAS/IML Software. If your workplace does not license the SAS/IML product, you can download the free SAS University Edition onto your laptop for learning SAS/IML. Even if SAS/IML software is available at work, you might want to download the University Edition if you plan to learn and practice at night or on weekends.
  2. Work through the "Getting Started" chapter of the book Statistical Programming with SAS/IML Software. The chapter is available as a free excerpt from the book's Web page. Notice that I say "work through," not "read." Run the programs as you read. Change the numbers in the examples.
  3. Work through the first six chapters of the SAS/IML User's Guide. A few years ago I revised this documentation to make it more readable, especially the sections about reading and writing data.
  4. Download the SAS/IML tip sheets. By keeping a tip sheet on your desk, you can easily remind yourself of the syntax for common SAS/IML statements and functions.
  5. Subscribe to The DO Loop blog. Most Mondays I blog about elementary topics that do not require advanced programming skills. I also discuss DATA step programs, statistical graphics, and SAS/STAT procedures. I've written more than 100 blog posts that are tagged as "Getting Started."
  6. Program, program, program. The way to learn any programming language is to start writing programs in that language. When I was a university professor, I used to tell my students "Math is not a spectator sport." Programming is similar: In order to get better at programming, you need to practice programming. Many of the previous tips provided you with pre-written programs that you can modify and extend. The paper "Rediscovering SAS/IML Software: Modern Data Analysis for the Practicing Statistician" includes intermediate-level examples that demonstrate the power of the SAS/IML language.
  7. Use the SAS/IML Support Community. When you start writing programs, you will inevitably have questions. The SAS/IML Support Community is a discussion forum where you can post code and ask for help. As you gain experience, try answering questions posted by others!
  8. Think about efficiency. A difference between a novice programmer and an experienced programmer is that the experienced programmer can write efficient programs. In a matrix-vector language such as SAS/IML, that means vectorizing programs: using matrix operations instead of loops over variables and observations. Many programming tips and techniques in the first four chapters of Statistical Programming with SAS/IML Software deal with efficiency issues. As you gain experience, study the efficiency examples and vectorization examples in my blog.
  9. Use the SAS/IML Studio programming interface. I am more productive when I use SAS/IML Studio than when I use PROC IML in the SAS Windowing environment (display manager). I like the color-coded program editor and the ability to develop and run multiple SAS/IML programs simultaneously. I like the debugging features and the dynamically linked graphics are often useful for understanding relationships in data.
  10. Use the SAS/IML File Exchange. The SAS/IML File Exchange is a Web site where you can search for useful programs to use, study, or modify. The exchange is a little like those "Leave a penny; take a penny" bowls at cash registers. If you have written a cool program, contribute it so that others can use it. If you need a function that performs a particular analysis, download it from the site. The site launched in mid-2014, so we need contributions from many SAS/IML programmers before the site will become useful. Do you have a program that you can contribute?

Becoming a better SAS/IML programmer does not happen overnight. Merely reading books and blogs will not make you better. However, the ten tips in this article point out resources that you can use to improve your skills. So roll up your sleeves and start programming!

Post a Comment

Define an objective function that evaluates an integral in SAS

The SAS/IML language is used for many kinds of computations, but three important numerical tasks are integration, optimization, and root finding. Recently a SAS customer asked for help with a problem that involved all three tasks. The customer had an objective function that was defined in terms of an integral. The problem was to find the roots and the maximum value of the function.

An objective function that involves an integral

When constructing an example, it is always helpful to use a problem for which the exact answer is known. For this article, I will define the objective function as follows:

optintegral1

To make the example more realistic, I've included a parameter, a. For each value of x, this function requires evaluating a definite integral. (Integrals with an upper limit of x arise when integrating probability density functions.) The integrand is the function ax y2. The dummy variable is y, so a and x are both treated as constant parameters when evaluating the integrand. Here are some useful facts about this function:

  • The integrand is a polynomial in y, so the integral can be evaluated explicitly. The simplified expression for F is F(x; a) = axx4/3.
  • The two real roots of F can be found explicitly. They are 0 and the cube root of 3a.
  • The maximum value of F can be found explicitly. It is the cube root of 3a/4.

The remainder of this article demonstrates how to find the roots and maximum value of this function, which is defined in terms of an integral.

Defining the function in the SAS/IML language

When evaluating the integral, the values of x and a are constant. In order to integrate a function by using the QUAD subroutine, the module that defines the integrand must have exactly one argument, which supplies a value for the dummy variable, y. Consequently, list other parameters in the GLOBAL statement of the SAS/IML module that defines the integrand, as follows:

proc iml;
/* Define the integrand, which will be called by QUAD subroutine.
   The integrand is a function of ONE variable: the "dummy" variable y.
   All other parameters are listed in the GLOBAL clause. */
start Integrand(y) global(a, x);
   return( a - x # y##2 );
finish;

You can now define a SAS/IML module that evaluates the function F. I like to use param as the name of the argument to an objective function. The name helps me to remember that param is going to be manipulated by root-finding or nonlinear programming (NLP) algorithms in order to find the roots and extrema of the function. Here's how I would define the objective function that evaluates F:

/* Define a SAS/IML function that computes an integral. 
   This module will be called by a root-finding or NLP routine. */
start ObjFunc(param) global(a, x);
   x = param;                         /* copy to global symbol */
   limits = 0 || x;                   /* limits of integration: [0,x] */
   call quad(w, "Integrand", limits); 
   return( choose(x>=0, w, -w) );     /* trick to handle x<0 */
finish;

I used a few tricks and techniques to define this module:

  • The only argument to the function is param. This is the quantity that will be used to search for roots and extrema. It corresponds to the x in the mathematical expression F(x; a).
  • The argument to the module cannot be named x because I have reserved that name for the global variable that is used by the INTEGRAND module. The SAS/IML syntax does not permit you to use the same name in the argument list and in the GLOBAL list for the same module. Instead, the OBJFUNC module copies the value of param to the x variable to ensure that the global variable remains in synch.
  • For this function, the argument determines the limits of integration. If x is always positive, you don't need to do anything special. However, if you want to permit negative values for x, then you need to use the programming trick in my article about how to evaluate an integral where the limits of integration are reversed.

Visualize the function

When tasked with finding the roots or extrema of a function, I recommend that you visualize the function. This is useful for estimating the number of roots, the location of roots, and finding an initial guess for the optimum. Although visualization can be difficult when the objective function depends on a large number of variables, in this case the visualization requires merely graphing the function.

To draw the graph, you must specify values for any parameters. I will set a = 4/3. For that value of a, the function F has roots at 0 and the cube root of 4 (≈ 1.5874). The maximum of F occurs at x = 1.

/* visualize the objective function */
a = 4/3;
t = do(-1, 2, 0.05);          /* graph the function on [-1, 2] */
F = j(1, ncol(t));
do i = 1 to ncol(t);
   F[i] = ObjFunc(t[i]);      /* evaluate the objective function at each x */ 
end;
title "Objective Function for a = 4/3";
call Series(t, F) grid={x y} label={"x" "F(x)"}
                  other="refline 0 / axis=y"; /* add reference line at y=0 */
objintegral

The graph indicates that the function has two roots, one near x=0 and the other near x=1.6. The function appears to have one maximum, which is located near x=1. Of course, these values depend on the value of the parameter a.

Graphing the function is a "sanity check" that gives us confidence that there are no mistakes in the definition of the objective function. It is very important to test and debug the module that defines the objective function before you try to find a maximum! Many people post programs to the SAS/IML Support Community and report that "the optimization reports a strange error." Sometimes they ask whether there is a problem with the SAS/IML optimizing routine. To paraphrase Shakespeare, usually the fault lies not in our optimizers, but in our objective function. Always verify that the objective function is defined correctly before you call an NLP routine.

Find roots of the function

The objective function is ready to work with. The following statements use the FROOT function to find roots on the interval [–1, 1] and [1, 2]:

/* find roots on [-1, 1] and [1, 2] */
intervals = {-1 1,
              1 2};
roots = froot("ObjFunc", intervals);    /* SAS/IML 12.1 */
print roots[format=6.4];
t_objintegral

As stated earlier, the roots are at 0 and the cube root of 4 (≈ 1.5874).

Find a maximum of the function

To find the maximum, you can use one of the nonlinear programming functions in SAS/IML. In previous optimization problems I have used the NLPNRA subroutine, which uses the Newton-Raphson method to compute a maximum. However, Newton-Raphson is probably not the best choice for this problem because the function evaluation requires computing an integral, and therefore contains a small amount of numerical error. Furthermore, unless I "cheat" and use the exact solution to compute exact derivatives, the Newton-Raphson algorithm would have to use finite difference approximations to compute the derivative and second derivative at each point. For this problem these finite-difference computations are likely to be expensive and not very accurate.

Instead, I will use the Nelder-Mead simplex method, which in implemented in the NLPNMS subroutine. The simplex method solves an optimization problem without using any derivatives. This method can be a good choice for optimizing functions that have low precision, such as the result of evaluating an integral or the result of a simulation.

x0  = 0.5;      /* initial guess for maximum */
opt = {1,       /* find maximum of function  */
       0};      /* no printing */
call nlpnms(rc, Optimum, "ObjFunc", x0, opt); /* find maximum */
print Optimum[format=6.4];
t_objintegral2

As discussed previously, the maximum occurs at x=1.

This article has shown how to define a SAS/IML module that computes an integral in order to evaluate the function. The tips and techniques in this article should enable you to find the roots and extrema of complicated functions that arise in computational statistics.

Post a Comment

Stigler's seven pillars of statistical wisdom

Wisdom has built her house;
She has hewn out her seven pillars.
     – Proverbs 9:1

At the 2014 Joint Statistical Meetings in Boston, Stephen Stigler gave the ASA President's Invited Address. In forty short minutes, Stigler laid out his response to the age-old question "What is statistics?" His answer was not a pithy aphorism, but rather a presentation of seven principles that form the foundation of statistical thought. Here are Stigler's seven pillars, with a few of my own thoughts thrown in:

  1. Aggregation: It sounds like an oxymoron that you can gain knowledge by discarding information, yet that is what happens when you replace a long list of numbers by a sum or mean. Every day the news media reports a summary of billions of stock market transactions by reporting a single a weighted average of stock prices: the Dow Jones Industrial Average. Statisticians aggregate, and policy makers and business leaders use these aggregated values to make complex decisions.
  2. The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the "root n rule." The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.
  3. Likelihood: Some people say that statistics is "the science of uncertainty." One of the pillars of statistics is being able to confidently state how good a statistical estimate is. Hypothesis tests and p-values are examples of how statisticians use probability to carry out statistical inference.
  4. Intercomparisons: When analyzing data, statisticians usually make comparisons that are based on differences among the data. This is different than in some fields, where comparisons are made against some ideal "gold standard." Well-known analyses such as ANOVA and t-tests utilize this pillar.
  5. Regression and multivariate analysis: Children that are born to two extraordinarily tall parents tend to be shorter than their parents. Similarly, if both parents are shorter than average, the children tend to be taller than the parents. This is known as regression to the mean. Regression is the best known example of multivariate analysis, which also includes dimension-reduction techniques and latent factor models.
  6. Design: R. A. Fisher, in an address to the Indian Statistical Congress (1938) said "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." A pillar of statistics is the design of experiments, and—by extension—all data collection and planning that leads to good data. Included in this pillar is the idea that random assignment of subjects to design cells improves the analysis. This pillar is the basis for agricultural experiments and clinical trials, just to name two examples.
  7. Models and Residuals: This pillar enables you to examine shortcomings of a model by examining the difference between the observed data and the model. If the residuals have a systematic pattern, you can revise your model to explain the data better. You can continue this process until the residuals show no pattern. This pillar is used by statistical practitioners every time that they look at a diagnostic residual plot for a regression model.

I agree with Stigler's choice of the seven pillars. If someone asks, "What is statistics?" I sometimes replay "It is what statisticians do!" But what do statisticians do? They apply these seven pillars of thought to convert measurements to information. They aggregate. They glean information from small samples. They use probability to report confidence in their estimates. They create ways to quantify data differences. They analyze multivariate data. They design experiments. They build and refine models.

What do you think of Stigler's pillars? Are there others that you would add? Leave a comment.

Post a Comment

Reversing the limits of integration in SAS

In SAS software, you can use the QUAD subroutine in the SAS/IML language to evaluate definite integrals on an interval [a, b]. The integral is properly defined only for a < b, but mathematicians define the following convention, which enables you to make sense of reversing the limits of integration:

reverselimits

Last week I was investigating a question that a SAS customer had posed to SAS Technical Support. In answering the question, it became important to know how the QUAD subroutine behaves if the left limit of integration is greater than the right limit. The QUAD subroutine supports an argument that specifies the limits of integration. The documentation states:

The simplest form of [the argument] provides the limits of the integration on one interval. The first element... should contain the left limit. The second element should be the right limit.... The elements of the vector must be sorted in an ascending order.

That's pretty explicit, so I thought I might get an error message if I specify a left limit that is greater than the right limit. However, that is not what happens! The following program shows the result of integrating the function f(x) = 4 x3 on the interval [0 1]. In the first call, I specify the limits of integration in ascending order. In the second call, I reverse the limits:

proc iml;
start Integrand(y);
   return( 4*y##3 );
finish;
 
call quad(w1, "Integrand", {0 1});   /* integral on [0 1] equals 1 */
call quad(w2, "Integrand", {1 0});   /* reverse limits */
print w1[L="QUAD on [0,1]"]          /* should get 1 */
      w2[L="QUAD on [1,0]"];         /* what do we get here? */
t_reverselimits

The QUAD subroutine does not give an error when the limits are in descending order. Instead, the routine silently sorts the limits and correctly evaluates the integral on the interval [0 1].

This seems like reasonable behavior, but for the problem that I was working on last week, I wanted to define a function where reversing the limits of integration results in reversing the sign of the integral. Now that I understood how the QUAD subroutine behaves, it was easy to write the function:

/* Compute the integral on an interval where the lower
   limit of integration is a and the upper limit is b.
   If a <= b, return the integral on [a,b].
   If a > b, return the opposite of the integral on [b,a] */
start EvalIntegral(FuncName, Limits);
   call quad(w, FuncName, Limits); 
   a = Limits[1]; b = Limits[2];
   return( choose(a<=b, w, -w) );  /* trick to handle a>b */
finish;
 
w3 = EvalIntegral("Integrand", {0 1} );
w4 = EvalIntegral("Integrand", {1 0} );
print w3[L="EvalIntegral on [0,1]"] 
      w4[L="EvalIntegral on [1,0]"];
t_reverselimits2

The EvalIntegral function is very simple. It evaluates the integral with the specified limits. If the limits of integration are in the usual (ascending) order, it returns the value of the integral. If the left limit of integration is greater than the right limit, it returns the opposite of the value that was computed. In other words, it implements the mathematical convention for reversing the limits of integration.

Reversing the limits of integration is a theoretical tool that rarely comes up in practice. However, if you ever need to compute a definite integral where the limits of integration are reversed, this blog post shows how to construct a function that obeys the mathematical convention.

Post a Comment

Overview of new features in SAS/IML 12.3

Unless you diligently read the "What's New" chapter for each release of SAS software, it is easy to miss new features that appear in the language. People who have been writing SAS/IML programs for decades are sometimes surprised when I tell them about a useful new function or programming feature.

To help SAS/IML programmers keep up with the changes to the language, I occasionally highlight new features on my blog. The word cloud on the right sidebar of my blog contains certain numbers that relate to SAS or SAS/IML releases. For example, you can click on "12.1" to read about new features in SAS/IML 12.1.

This article links to relevant blog posts that discuss features that were new to SAS/IML 12.3, which was released in July, 2013, as part of SAS 9.4.

New functions and subroutines in SAS/IML 12.3

SAS/IML 12.3 introduced the following functions and subroutines for data analysis:

New ODS statistical graphs

In addition, SAS/IML 12.3 supports functions that enable you to create ODS statistical graphics as part of a SAS/IML program. You can create bar charts, box plots, histograms, scatter plots, and series plots (also known as line plots).

Enhancements to functionality

There were several enhancements and language improvements in SAS/IML 12.3. Here are the ones that I have blogged about:

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be a useful to SAS/IML users who are upgrading their version of SAS. Have you used any of these new features? Leave a comment and tell me which is your favorite.

Post a Comment

Lexicographic combinations in SAS

t_lexico1

In a previous blog post, I described how to generate combinations in SAS by using the ALLCOMB function in SAS/IML software. The ALLCOMB function in Base SAS is the equivalent function for DATA step programmers.

Recall that a combination is a unique arrangement of k elements chosen from a set of n elements. As such, the combination can always be written in a standard order where the elements increase from left to right and no element is larger than n. The output for the ALLCOMB function is shown to the left for n = 5 and k = 3.

As you scan down the rows, you might ask yourself, "In what order are these combinations generated?" The rows are not in the order that I would generate with pencil and paper. I would start with combination {1 2 3}, then proceed to increment the rightmost digit to obtain {1 2 4} and {1 2 5}. From there, I'd move over to the second-to-last column, increment that digit, reset the last column, and generate {1 3 4} and {1 3 5}. I would iteratively continue that process until all combinations were generated.

The pencil-and-paper method generates all of the combinations that begin with item 1, then generates the combinations that begin with item 2, and ends with the combination {3 4 5}. A list of combinations in "dictionary order," is called the lexicographic ordering. Clearly, the ALLCOMB function does not generate combinations in lexicographic order. Instead, the function generates the combinations in "minimal change order." Each combination is formed from the previous combination by removing one item and inserting another. Consequently, each row has k – 1 items in common with the preceding row.

SAS also provides a way to generate the combinations in lexicographic order. The following DATA step shows how to use the COMB function and LEXCOMBI function to generate the lexicographic ordering:

%let n=5;              /* total number of items */          
%let k=3;              /* number of items per row */
/* generate combinations of n items taken k at a time */
data Comb(keep=c1-c&k);
array c[&k] (&k.*0);   /* initialize array to 0 */
ncomb = comb(&n, &k);  /* number of combinations */
do j = 1 to ncomb;
   rc = lexcombi(&n, &k, of c[*]);
   output;
end;
run;
t_lexico2

The output is shown in the table on the right. Although it is not included in the output, the LEXCOMBI function returns the index of the column that was changed, which can be useful information for some applications.

One of the great things about SAS software is that there are multiple ways to accomplish most tasks. In addition to the preceding DATA step, you could also generate combinations in lexicographic order by doing any of the following:

Programming the lexicographic combination algorithm

Most SAS programmers enjoy calling pre-written functions or procedures in SAS. However, SAS software provides the DATA step and the SAS/IML language for those occasions when you want to program an algorithm yourself.

This fact is especially pertinent for students and adult learners. With the launch of the free SAS University Edition, the next generation of SAS programmers has free access to the DATA step and the SAS/IML language. With a working knowledge of SAS programming, you can prepare data for analysis and implement new analyses that are not built into SAS.

The algorithm to "generate all combinations of n items taken k at a time," is a great programming exercise. Programming the algorithm without calling the LEXCOMBI function enables a programmer to gain experience processing DATA step arrays with k elements. It requires some subtle programming techniques, such as looping forward and backwards through indices of an array.

Would you like to try your hand at writing such a DATA step program? If so, stop reading now. Spoiler ahead!


The following DATA step was sent to me by a senior executive at SAS who still enjoys finding time to program. I found it easy to mentally trace through the program's logic for n = 5 and k = 3 to see how the program works. It keeps track of the number of combinations that it generates and prints the value of "n choose k" when the program terminates.

/* Jim Goodnight's program to generate combinations 
   (in lexicographic order) of n items taken k at a time */
data Comb(keep=c1-c&k);
array c[&k];                /* array to hold combinations      */
n=&n; k=&k;
do i=1 to k; c[i]=i; end;   /* initialize array to 1--k        */
output;
 
count=1;                    /* enumerate combinations          */
do while(count);            /* find next combination           */
   if c[k]<n then c[k]+1;   /* increment value in last column  */
   else do; 
      n1=n; found=0;        /* search earlier in array         */ 
      do i=k-1 to 1 by -1 while(found=0); 
         n1=n1-1; 
         if c[i]<n1 then found=i; /* found column to increment */
      end;
      if found=0 then goto term;    /* none found ... finished */
      c[found]+1;           /* increment value in found column */
      do j=found+1 to k; 
         c[j]=c[j-1]+1;     /* reset values of subsequent cols */
      end; 
   end; 
   output;
   count+1;
end;
term: put count=;           /* at end, print count = comb(n,k) */
run;

Can you modify the program to generate permutations in lexicographic order (see the LEXPERM function)? How would you vectorize this algorithm for the SAS/IML language? Do SAS programmers at your company implement algorithms like this, or do they mostly use the DATA step to extract, merge, transform, and prepare data for analysis? Leave a comment.

Post a Comment

Computing prediction ellipses from a covariance matrix

In a previous blog post, I showed how to overlay a prediction ellipse on a scatter plot in SAS by using the ELLIPSE statement in PROC SGPLOT. The ELLIPSE statement draws the ellipse by using a standard technique that assumes the sample is bivariate normal. Today's article describes the technique and shows how to use SAS/IML to compute a prediction ellipse from a 2 x 2 covariance matrix and a mean vector. This can be useful for plotting ellipses for subgroups, ellipses that correspond to robust covariance estimates, or an ellipse for a population (rather than for a sample).

SAS/IML routines for computing prediction ellipses have been around for a long time. A module called the CONTOUR module was in the version 6 (1989) documentation for SAS/IML. In version 6.12, the module was used to compare prediction ellipses for robust and classical covariance matrices. A module appears in Michael Friendly's 1991 book The SAS System for Statistical Graphics, and in several of his papers, including a 1990 SUGI article. Friendly continues to distribute the %ELLIPSES macro for displaying ellipses on scatter plots. The SAS/IML function in this article is similar to these earlier modules.

Prediction ellipses: The main ideas

You can compute a prediction ellipse for sample data if you provide the following information:

  • m: A vector for the center of the ellipse.
  • S: A covariance matrix. This can be a classical covariance matrix or a robust covariance matrix.
  • n: The number of nonmissing observations in the sample.
  • p: The confidence level for the prediction ellipse. Equivalently, you could specify a significance level, α, which corresponds to a 1 – α confidence level.

The implicit formula for the prediction ellipse is given in the documentation for the CORR procedure as the set of points that satisfy a quadratic equation. However, to draw the ellipse, you should parameterize the ellipse explicitly. For example, when the axes of the ellipse are aligned with the coordinate axes, the equation of an ellipse with center (c,d) and with radii a and b is defined implicitly as the set of points (x,y) that satisfies the equation
      (x-c)2 / a2 + (y-d)2 / b2 = 1.
However, if you want to draw the ellipse, the parametric form is more useful:
      x(t) = c + a cos(t)
      y(t) = d + b sin(t)
for t in the interval [0, 2π].

An algorithm to draw a prediction ellipse

I can think of two ways to draw prediction ellipses. One way is to use the geometry of Mahalanobis distance. The second way, which is used by the classical SAS/IML functions, is to use ideas from principal components analysis to plot the ellipse based on the eigendecomposition of the covariance matrix:

  1. Find the eigenvalues (λ1 and λ2) and eigenvectors (e1 and e2) of the covariance matrix, S. The eigenvectors form an orthogonal basis for the coordinate system for plotting the ellipse. The first eigenvector (e1) points in the direction of the greatest variance and defines the major axis for the prediction ellipse. The second eigenvector (e2) points in the direction of the minor axis.
  2. As mentioned previously, sines and cosines parameterize an ellipse whose axes are aligned with the standard coordinate directions. It is just as simple to parameterize an ellipse in the coordinates defined by the eigenvectors:
    • The eigenvectors have unit length, so a circle is formed by the linear combination cos(t)*e1 + sin(t)*e2 for t in the interval [0, 2π].
    • The ellipse sqrt(λ1)cos(t)*e1 + sqrt(λ2)sin(t)*e2 is a "standardized ellipse" in the sense that the standard deviations of the variables are used to scale the semi-major axes.
  3. To get a prediction ellipse, scale the standardized ellipse by a factor that depends on quantiles of the F2,n-2 distribution, the confidence level, and an adjustment factor that depends on the sample size n. The SAS/IML module contains the details.
  4. Translate the prediction ellipse by adding the vector m.

A SAS/IML module for computing a prediction ellipse

The following module accepts a vector of k confidence levels. The module returns a matrix with three columns. The meaning of each column is described in the comments.

proc iml;
start PredEllipseFromCov(m, S, n, ConfLevel=0.95, nPts=128);
/* Compute prediction ellipses centered at m from the 2x2 covariance matrix S.
   The return matrix is a matrix with three columns.
   Col 1: The X coordinate of the ellipse for the confidence level.
   Col 2: The Y coordinate of the ellipse for the confidence level.
   Col 3: The confidence level. Use this column to extract a single ellipse.
 
   Input parameters:
   m           a 1x2 vector that specifies the center of the ellipse. 
               This can be the sample mean or median.
   S           a 2x2 symmetric positive definite matrix. This can be the 
               sample covariance matrix or a robust estimate of the covariance.
   n           The number of nonmissing observations in the data. 
   ConfLevel   a 1 x k vector of (1-alpha) confidence levels that determine the
               ellipses. Each entry is 0 < ConfLevel[i] < 1.  For example,
               0.95 produces the 95% confidence ellipse for alpha=0.05.
   nPts       the number of points used to draw the ellipse. The default is 0.95.
*/
 
   /* parameterize standard ellipse in coords of eigenvectors */
   call eigen(lambda, evec, S);   /* eigenvectors are columns of evec */
   t = 2*constant("Pi") * (0:nPts-1) / nPts;
   xy  = sqrt(lambda[1])*cos(t) // sqrt(lambda[2])*sin(t);
   stdEllipse = T( evec * xy );   /* transpose for convenience */
 
   /* scale the ellipse; see documentation for PROC CORR */
   c = 2 * (n-1)/n * (n+1)/(n-2);          /* adjust for finite sample size */
   p = rowvec(ConfLevel);                  /* row vector of confidence levels */
   F = sqrt(c * quantile("F", p, 2, n-2)); /* p = 1-alpha */
 
   ellipse = j(ncol(p)*nPts, 3);  /* 3 cols: x y p */
   startIdx = 1;                  /* starting index for next ellipse */
   do i = 1 to ncol(p);           /* scale and translate */
      idx = startIdx:startIdx+nPts-1;
      ellipse[idx, 1:2] = F[i] # stdEllipse + m; 
      ellipse[idx, 3] = p[i];     /* use confidence level as ID */
      startIdx = startIdx + nPts;
   end;           
   return( ellipse );
finish;

Compare a classical and robust prediction ellipse

The SAS/IML documentation includes an example in which a classical prediction ellipse is compared with a robust prediction ellipse for three-dimensional data that contain outliers. The following SAS/IML statements define the classical and robust estimates of location and scatter for two of the variables. The PredEllipseFromCov function is called twice: once for the classical estimates, which are based on all 21 observations, and once for the robust estimates, which are based on 17 observations:

/* Stackloss data example from SAS/IML doc */
/* classical mean and covariance for stackloss data */
/*      Rate      AcidConcentration */
N = 21;
mean = {60.428571 86.285714}; 
cov  = {84.057143 24.571429,
        24.571429 28.714286};
classicalEllipse = PredEllipseFromCov(mean, cov, N, 0.9);
 
/* robust estimates of location and scatter for stackloss data */
/*         Rate      AcidConcentration */
N = 17;
robMean = {56.7      85.5}; 
robCov  = {23.5     16.1,
           16.1      32.4};
robEllipse = PredEllipseFromCov(robMean, robCov, N, 0.9);
 
/* write the ellipse data to a SAS data set */
E = classicalEllipse || robEllipse[,1:2];
create Ellipse from E[c={"classX" "classY" "ConfLevel" "robX" "robY"}];
append from E;
close Ellipse;
quit;

The following SAS statements merge the data and the coordinates for the prediction ellipses. The POLYGON statement in the SGPLOT procedure is used to overlay the ellipses on a scatter plot of the data. The POLYGON statement is available in SAS 9.4M1. Notice that the PredEllipseFromCov function returns a matrix with three columns. The third column (the confidence level) is used as the ID= variable for the POLYGON statement:

data Stackloss;
input Rate Temperature AcidCon @@;
datalines;
80  27  89  80  27  88  75  25  90  62  24  87  62  22  87
62  23  87  62  24  93  62  24  93  58  23  87  58  18  80
58  18  89  58  17  88  58  18  82  58  19  93  50  18  89
50  18  86  50  19  72  50  19  79  50  20  80  56  20  82
70  20  91
;
 
data All;
set Stackloss Ellipse;
run;
 
title "Classical and Robust Prediction Ellipses";
proc sgplot data=All;
scatter x=Rate y=AcidCon / transparency=0.5 markerattrs=(symbol=CircleFilled size=12);
polygon x=classX y=classY id=ConfLevel / 
        lineattrs=GraphData1 name="classical" legendlabel="Classical 90% Ellipse";
polygon x=robX y=robY id=ConfLevel / 
        lineattrs=GraphData2 name="robust" legendlabel="Robust 90% Ellipse";
xaxis grid; yaxis grid; 
keylegend "classical" "robust";
run;
predellipse3

The classical prediction ellipse is based on all 21 observations. The robust estimation method classified four observations as outliers, so the robust ellipse is based on 17 observations. The four outliers are the markers that are outside of the robust ellipse.

Plotting prediction ellipses for subgroups

You can also use this module to overlay prediction ellipses for subgroups of the data. For example, you can compute the mean and covariance matrix for each of the three species of flower in the sashelp.iris data. You can download the complete program that computes the prediction ellipses and overlays them on a scatter plot of the data. The following graph shows the result:

predellipse4

In summary, by using the SAS/IML language, you can write a short function that computes prediction ellipses from four quantities: a center, a covariance matrix, the sample size, and the confidence level. You can use the function to compute prediction ellipses for classical estimates, robust estimates, and subgroups of the data. You can use the POLYGON statement in PROC SGPLOT to overlay these ellipses on a scatter plot of the data.

Post a Comment

Add a prediction ellipse to a scatter plot in SAS

It is common in statistical graphics to overlay a prediction ellipse on a scatter plot. This article describes two easy ways to overlay prediction ellipses on a scatter plot by using SAS software. It also describes how to overlay multiple prediction ellipses for subpopulations.

What is a prediction ellipse?

A prediction ellipse is a region for predicting the location of a new observation under the assumption that the population is bivariate normal. For example, an 80% prediction ellipse indicates a region that would contain about 80% of a new sample that is drawn from a bivariate normal population with mean and covariance matrices that are equal to the sample estimates.

A prediction ellipse is helpful for detecting deviation from normality. If you overlay a prediction ellipse on your sample and the sample does not "fill" the ellipse in the way that bivariate normal data would, then you have a graphical indication that the data are not bivariate normal.

Because the center of the ellipse is the sample mean, a prediction ellipse can give a visual indication of skewness and outliers in the data. A prediction ellipse also displays linear correlation: If you standardize both variables, a skinny ellipse indicates highly correlated variables, whereas an ellipse that is nearly circular indicates little correlation.

How to draw a prediction ellipse in SAS

SAS provides two easy ways to overlay a prediction ellipse on a scatter plot. The SGPLOT procedure supports the SCATTER statement, which creates a scatter plot, and the ELLIPSE statement, which overlays a bivariate normal prediction ellipse, computed by using the sample mean and covariance. The following statements overlay a 95% prediction ellipse on 50 measurements of the width and length of petals for a particular species of flower:

title "95% Prediction Ellipse";
title2 "iris Versicolor";
proc sgplot data=sashelp.iris noautolegend;
   where Species='Versicolor';
   scatter x=PetalLength y=PetalWidth / jitter;  /* JITTER is optional    */
   ellipse x=PetalLength y=PetalWidth;           /* default is ALPHA=0.05 */
run;
predellipse1

The JITTER option (which requires SAS 9.4) is used to slightly displace some of the observations. Without the option, some markers overlap because the data are rounded to the nearest millimeter. By default, the ELLIPSE statement computes and displays a 95% prediction ellipse, which tends to surround most of the data. However, you can use the ALPHA= option to display a 100(1-α)% prediction ellipse. Some researchers recommend overlaying a 68% prediction ellipse (ALPHA=0.32) because 68% is the probability of observing univariate normal data that is within one standard deviation of the mean. See the article by Michael Friendly for examples and a discussion.

Because a prediction ellipse gives a graphical indication of the linear correlation between variables, you can create a prediction ellipse as part of a correlation analysis in SAS. The following call to the CORR procedure uses the PLOTS=SCATTER option to overlay a 95% prediction ellipse. The output (not shown) is similar to the previous graph.

proc corr data=sashelp.iris plots=scatter(ellipse=prediction);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;

Draw prediction ellipses for each group

Occasionally, you might want to overlay prediction ellipses for subsets of the data that correspond to subpopulations. For example, if your data contain both male and female subjects, it might be that the means and covariance for the variables are different between males and females. In that case, overlaying a prediction ellipse for each subpopulation helps you to visualize how characteristics vary between the groups.

There are several ways to draw prediction ellipses for groups within the data. Michael Friendly provides a macro that uses SAS/GRAPH to overlay ellipses. It supports a grouping variable, as shown in his 2006 paper in J. Stat. Soft. (This macro has been used in some examples by Robert Allison.)

If you want to use ODS statistical graphics to display the multiple ellipses, you need to use a little trick. Because the ELLIPSE statement in PROC SGPLOT does not support a GROUP= option as of SAS 9.4m2, you have to reshape the data so that each group becomes a new variable. This is equivalent to transposing the data from a "long form" to a "wide form." From my previous blog post, here is one way to create six variables that represent the petal length and width variables for each of the three species of iris in the sashelp.iris data set:

data Wide;
/*   |-- PetalLength --| |--- PetalWidth ---|  */
keep L_Set L_Vers L_Virg W_Set W_Vers W_Virg;  /* names of new variables */
merge sashelp.iris(where=(Species="Setosa") 
              rename=(PetalLength=L_Set PetalWidth=W_Set))
      sashelp.iris(where=(Species="Versicolor") 
              rename=(PetalLength=L_Vers PetalWidth=W_Vers))
      sashelp.iris(where=(Species="Virginica") 
              rename=(PetalLength=L_Virg PetalWidth=W_Virg));
run;

Now that the data are in separate variables, call PROC SGPLOT to overlay three scatter plots and three 95% prediction ellipses:

title "95% Prediction Ellipses for Each Group";
proc sgplot data=Wide;
  scatter x=L_Set  y=W_Set  / jitter name="Set"  legendlabel="Setosa";
  scatter x=L_Virg y=W_Virg / jitter name="Virg" legendlabel="Virginica";
  scatter x=L_Vers y=W_Vers / jitter name="Vers" legendlabel="Versicolor";
  ellipse x=L_Set  y=W_Set  / lineattrs=GraphData1;
  ellipse x=L_Virg y=W_Virg / lineattrs=GraphData2;
  ellipse x=L_Vers y=W_Vers / lineattrs=GraphData3;
  keylegend "Set" "Virg" "Vers" / title="Species:";
  xaxis label="Petal Length (mm)";
  yaxis label="Petal Width (mm)";
run;
predellipse2

Notice that I used a trick to make the color of each prediction ellipse match the color of the data to which it is associated. The colors for markers in the scatter plots are assigned automatically according to the current ODS style. The style elements used for the markers are named GraphData1, GraphData2, and GraphData3. When I create the ellipses, I use the LINEATTRS= statement to set the style attributes of the ellipse to match the corresponding attributes of the data.

The graph visualizes the relationship between the petal length and the petal width variables in these three species. At a glance, three facts are evident:

  • The means of the variables (the centers of the ellipses) are different across the groups. This is a "graphical MANOVA test."
  • The Versicolor ellipse is "thinner" than the others, which indicates that the correlation between petal length and width is greater in that species.
  • The Virginica ellipse is larger than the others, which indicates greater variance within that species.

In summary, it is easy to use the ELLIPSE statement in PROC SGPLOT to add a prediction ellipse to a scatter plot. If you want to add an ellipse for subgroups of the data, use the trick from my previous article to reshape the data. Plotting ellipses for subgroups enables you to visually inspect covariance and correlation within groups and to compare those quantities across groups. In my next blog post, I will show how to create and plot prediction ellipses directly from a covariance matrix.

Post a Comment