Timing performance improvements due to vectorization

Last week I discussed a program that had three nested loops that used scalar operations in the innermost loop. I mentioned that this program was not vectorized, and would therefore be slow in a matrix language such as SAS/IML, MATLAB, or R. I then went through a series of steps in which I rewrote the program to be more efficient by using basic linear algebra operations such as dot products (level-1 BLAS), matrix-vector multiplication (level-2 BLAS), and matrix-matrix multiplication (level-3 BLAS). At each step, the number of loops decreases, and the efficiency increases.

Someone asked me what kind of speed-ups can be expected by vectorizing a SAS/IML program. In general, the answer depends on the size of your data and the particular operations that you are performing, but it is straightforward to run a series of examples that compare the performance at each step of the vectorization of the previous example.

In this post, I generate a square matrix of size N for N=100, 200, 400, 600, 800, and 1000. For each size, I time how long it takes to compute an operation that is equivalent to X`*X. The operations are as follows:

  • Level-0: The original program, which consists of three nested loops and scalar operations.
  • Level-1: A program that contains two nested loops and dot products of vectors.
  • Level-2: A program that contains one loops and matrix-vector multiplication.
  • Level-3: A program that contains one matrix-matrix multiplication.

You can download the program that performs the experiment.

On my desktop computer, the results are summarized by the following plots. The first plot shows all of the times. However, the graph is dominated by the slow times that are associated with the original program that had three nested loops and only scalar operations. The middle plot omits the times for the original program. In this view you can see that the level-1 operation is about 25 times faster than the level-0 operations for N=1000. Again, however, the graph is dominated by the slowest method, and so the last plot shows only the timings for the level-2 and level-3 operations. The level-2 operations are about 16 times faster than the level-3 operations for N=1000. The level-3 operations are almost three times faster than level-2.

In summary, the slowest method (scalar operations in three nested loops) is about 1,000 times slower than the equivalent level-3 operation, which does not require any loops. This is good motivation: the time you invest to fully vectorize your code can pay dividends by running 1,000 times faster.

If you are adept at interpreting logarithmic scales, you can also see the results plotted on a Log10 axis.

No matter how you visualize the results, the conclusion is the same: if you want your program to scale to handle large data, vectorize the program to take advantage of the SAS/IML matrix operations.

Post a Comment

Ways to multiply in the SAS/IML language

For programmers who are learning the SAS/IML language, it is sometimes confusing that there are two kinds of multiplication operators, whereas in the SAS DATA step there is only scalar multiplication. This article describes the multiplication operators in the SAS/IML language and how to use them to perform common tasks such as the elementwise product, the dot product, and the outer product of vectors.

Elementwise multiplication (#)

The elementwise multiplication operator (#) is used to perform element-by-element scalar multiplication. This operator is not part of the DATA step syntax. If you have two matrices of the same dimension, then u#v is the matrix whose ith element is the product of the ith elements of u and v. (This product is also known as the Hadamard product.) This is shown in the following PROC IML example:

proc iml;
u = { 1, 2, 3};         /* 3x1 column vector */
v = {-1, 0, 2};         /* 3x1 column vector */
 
elemProd = u#v;         /* elementwise product (Hadamard product) */
print elemProd;

The elementwise multiplication operator can also be used in some situations in which u is a vector that has the same row or column dimension as v. See my article on how the SAS/IML language "knows what you want."

True matrix multiplication (*)

The matrix multiplication operator (*) performs true matrix multiplication. Whereas the * operator is used for scalar multiplication in the DATA step, the operator is used for matrix multiplication in PROC IML. If u and v are any two matrices where the number of rows of u matches the number of columns of v, then the matrix product u*v is defined.

When u and v are vectors, matrix multiplication gets a special name. When a row vector is multiplied with a column vector, the result is a scalar and the operation is called the dot product (or inner product or scalar product). The following example uses the transpose operator (`) to create a row vector:

dotProd = u`*v;         /* dot product (scalar product, inner product) */
print dotProd;

There is an interesting connection between the elementwise product and the dot product of two vectors. The dot product of u and v is the same as the sum of the elements of the elementwise product: u`*v = sum(u#v).

Matrix multiplication is not commutative, so you get a different result if you multiply a column vector with a row vector. The result is a rank-1 matrix. This is called the outer product of two vectors. An example follows:

outerProd = u*v`;       /* outer product: column vec times row vec */
print outerProd;

Other matrix products

The SAS/IML language supports other kinds of multiplication, including the direct product (or Kronecker product) and the horizontal direct product of matrices:

dirProd = u`@v;        /* direct product */
hdirProd = hdir(u`,v); /* horizontal direct product */

There are many special-purpose products that are not covered in this short article, but remember that you can always define your own SAS/IML function that compute any conceivable product. For example, in physics classes students use the "cross product" (also called the skew-symmetric product) to compute quantities that arise in electromagnetism. The following SAS/IML function implements the cross product computation:

/* cross product (3D vectors only) */
start CrossProd(u, v);
   i231 = {2 3 1};
   i312 = {3 1 2};
   return( u[i231]#v[i312] - v[i231]#u[i312] );
finish;
 
uxv = CrossProd(u,v);

Be aware that in statistics, the "cross product" often refers to the multiplication X`*X, where X is a data matrix. In this matrix product, the (i,j)th element of X`*X is the dot product of the ith and jth columns of X.

Post a Comment

How to vectorize computations in a matrix language

Last week someone posted an interesting question to the SAS/IML Support Community. The problem involved four nested DO loops and took hours to run. By transforming several nested DO loops into an equivalent matrix operation, I was able to reduce the run time to about one second.

The process of converting loops into vector or matrix operations is called vectorization. The programmer who posted the question knew that vectorization would improve the program, but also stated that "vectorization is really difficult." I agree. However, life is full of difficult but necessary tasks. In statistical programming, vectorization is necessary (but rewarding!) because it can dramatically decrease the run time of a program.

Statistical programmers need to know how to vectorize programs. Whether you write your programs in SAS/IML language, MATLAB, or R, it is essential to know how to convert loops into equivalent vector operations. Therefore, I have decided to post a series of examples that show how to recognize certain patterns that arise in loops, and how to replace those loops with matrix operations. These examples will be tagged with the "vectorization" tag in the word cloud (in the right sidebar) on my blog. Today is the first example.

Here are some general tips and techniques:

  • When you have nested loops, vectorize the inner loop first.
  • Convert a loop with scalar operations into a single vector operation. For example, convert sums into vector dot products.
  • Convert a loop of vector operations into a single matrix-vector multiplication.
  • Convert a loop of matrix-vector multiplications into a single matrix-matrix multiplication.

In terms of matrix computations, follow this general rule:

Solve the problem by using the highest possible level of Basic Linear Algebra Subroutine (BLAS).

A level-0 BLAS is a scalar operation. A level-1 BLAS is a vector operation, such as a vector addition, a dot product, or a vector norm. A level-2 BLAS is a matrix-vector operation, such as a matrix-vector multiplication, an outer product, or solving a triangular linear system. A level-3 BLAS is a matrix-matrix operation, such as matrix multiplication, forming a cross-product matrix, or solving triangular linear systems with multiple right-hand sides (see the SAS/IML TRISOLV function).

The notions of "level" and BLAS are discussed in Chapter 1 of Golub and van Loan's Matrix Computations.

An example problem with three nested loops

The problem in the discussion forum involved four nested loops, but I'll extract the essence of the problem into a simpler example. Suppose that you have implemented a textbook formula and your SAS/IML program looks like the following:

proc iml;
X = {1 2 3,            /* example data */
     4 3 1,
     1 1 2,
     0 0 3};
n = nrow(X);
p = ncol(X);
 
S=j(p,p,.);            /* allocate matrix to hold results */
do j = 1 to p;
   do k = 1 to p;
      u=0;     
      do m = 1 to n;
         u = u +  X[m,j]*X[m,k];
      end;
      S[j,k] = u;      /* the (j,k)th result */
   end;
end;
print S;

At this stage, the program consists of three loops and scalar operations (level-0 BLAS).

When I study these loops, I notice the following:

  • The outer loop (the j loop) is a loop over the columns of the matrix X because the expression X[m,j] appears in the innermost loop.
  • The middle loop (the k loop) is also a loop over the columns of the matrix X because the expression X[m,k] appears in the innermost loop.
  • The inner loop (the m loop) is performing a summation. (Notice that u=0 before the loop and u=u+stuff inside the loop.) The terms that are being summed are the product of the jth and kth columns.

Step 1: Attack the innermost loop

Because the inner loop is a sum, replace the inner loop by the SUM function. The SUM function takes a vector. For this example, that vector will be the elementwise product of the jth and kth columns: X[ ,j]#X[ ,k]. The following statements show the simplified program, which now has only two loops:

/* Step 1: Replace inner loop with sum of the product of j_th col and k_th col. */
S=j(p,p,.);
do j = 1 to p;
   do k = 1 to p;
      S[j,k] = sum( X[ ,j] # X[,k] );
   end;
end;

Step 2: Replace summations by dot products

The inner loop now involves the sum of the elementwise product of two vectors. This sum has a more common name: the vector dot product. Consequently, replace the SUM function by a dot product. In order to get the dimensions to match, transpose the first vector, as follows:

/* Step 2: The sum of the product of two vectors is their dot product. */
S=j(p,p,.);
do j = 1 to p;
   do k = 1 to p;
      S[j,k] = X[ ,j]` * X[,k];
   end;
end;

At this stage, the program consists of two loops and vector operations (level-1 BLAS).

Step 3: Replace vector operations with matrix-vector operations

The next step involves converting vector operations into a matrix-vector operation. Notice that the X[,j] term does not change in the inner DO loop. Furthermore, the product is over all columns, so the inner loop is equivalent to a vector-matrix multiplication. The program is thereby reduced to one loop, as follows:

/* Step 3: Convert to vector-matrix product */
S=j(p,p,.);
do j=1 to p;
   S[j, ] = X[ ,j]` * X;
end;

At this stage, the program consists of one loop and matrix-vector operations (level-2 BLAS).

Step 4: Replace matrix-vector multiplications with matrix-matrix operations

The last step is to recognize that the only remaining loop iterates over the columns of X. But that is equivalent to iterating over the rows of X` (the transpose of X). The loop of vector-matrix multiplication is consequently equivalent to the cross-product operation, X`X, as follows:

/* Step 4: A loop over all rows is equivalent to multiplication by X` */
S = X` * X;
print S;

Notice that the three loops have been replaced by a single level-3 BLAS: matrix multiplication. As stated earlier, for large problems you can realize substantial savings of time if you eliminate loops in favor of high-level linear algebra operations.

One final comment. If you've struggled and labored, but still can't figure out how to vectorize your SAS/IML program, post it to the SAS/IML Support Community. That community is a helpful place to discuss issues related to efficiency and programming in the SAS/IML language.

Post a Comment

Use regression for a univariate analysis? Yes!

I've conducted a lot of univariate analyses in SAS, yet I'm always surprised when the best way to carry out the analysis uses a SAS regression procedure. I always think, "This is a univariate analysis! Why am I using a regression procedure? Doesn't a regression require at least two variables?"

Then it dawns on me. In the SAS regression procedures, a MODEL statement that does not contain explanatory variables simply performs a univariate analysis on the response variable. For example, when there are no explanatory variables, a classical regression analysis produces sample statistics, such as the mean, the variance, and the standard error of the mean, as shown in the following output:

/* estimates of mean, variance, and std err of mean */
ods select FitStatistics ParameterEstimates ;
proc reg data=sashelp.cars;
   model MPG_City= ;
quit;

The estimates in this PROC REG output are the same as are produced by the following call to PROC MEANS:

proc means data=sashelp.cars mean std stderr cv;
 var MPG_City;
run;

The ODS graphics that are produced by PROC REG also includes a histogram of the centered data and a normal Q-Q plot.

Here are some other instances in which a SAS regression procedure can be used to carry out a univariate analysis:

  • Robust estimates of scale and location. This univariate analysis is usually performed by using PROC UNIVARIATE with the ROBUSTSCALE option. However, you can also use the ROBUSTREG procedure to estimate robust statistics. The ROBUSTREG procedure provides four different estimation methods, which you can control by using the METHOD= option. For example, the following statements display robust estimates of location and scale:
    ods select ParameterEstimates;
    proc robustreg data=sashelp.cars method=M;
       model MPG_City= ;
    run;
  • Detection of univariate outliers. How can you detect univariate outliers in SAS? One way is to call the ROBUSTREG procedure! Again, there are four estimation methods that you can use.
    ods select DiagSummary;
    proc robustreg data=sashelp.cars method=LTS;
       model MPG_City= ;
       output out=out outlier=outliers;
    run;
     
    proc print data=out(where=(outliers=1));
       var make model type mpg_city;
    run;
  • Estimation of quantiles, with confidence intervals. Last week I showed how to use PROC UNIVARIATE to compute sample quantiles and confidence intervals. An alternative approach is to use the QUANTREG procedure, as follows:
    /* estimate univariate quantiles, CIs, and std errors */
    ods output ParameterEstimates=QntlCI;
    proc quantreg data=sashelp.cars;
       model MPG_City= / quantiles=0.025 0.2 0.8 0.975;
    run;
     
    proc print data=QntlCI; run;
  • Fit discrete parametric models to univariate data. I've previously shown how to use the GENMOD procedure to fit a Poisson model to data, and the same technique can be used to fit other discrete distributions, including the binomial, geometric, multinomial, negative binomial, and some zero-inflated distributions.

  • Fit parameters for a mixed density model to univariate data. I've previously demonstrated how to use the FMM procedure to fit a finite mixture distribution to data.

There are other examples, but I hope you see that the SAS regression procedures are useful for computing univariate statistics and analyses.

Do you have a favorite univariate analysis that can be accomplished by using a SAS regression procedure? Let me know about it by leaving a comment.

Post a Comment

A three-panel visualization of a distribution

At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article is to discuss some ideas related to this visualization and to solicit critiques of my implementation.

The customer wanted to create a paneled display in SAS that includes three graphs: a histogram, a box plot, and a normal quantile-quantile (Q-Q) plot. We sketched out an idea, and the plot at the left is my implementation of our sketch. (Click to enlarge.)

One question I asked was, "Do you want the usual Q-Q plot, or should we flip it?" The usual Q-Q plot is a scatter plot of the ordered data values (on the vertical axis) plotted against the corresponding quantiles of a normal distribution (on the horizontal axis). The purpose of a Q-Q plot is to see whether the points fall along a straight line, which would indicate that the data are normally distributed. I remarked that if we flip the plot so that the data values are displayed horizontally, then the histogram, boxplot, and Q-Q plot can all share a common horizontal axis. The customer said that this seemed like a good idea.

In SAS, you can create this kind of paneled layout by using the Graph Template Language (GTL) and the SGRENDER procedure.

A GTL template for a three-panel display

When I returned from the conference, I created a GTL template that defines a three-panel display. The top panel, which occupies 50% of the height of the display, is a histogram of the data overlaid with a normal curve and a kernel density estimate. The second panel, which occupies 10% of the height, is a horizontal box plot. The third panel is a normal Q-Q plot, but is flipped so that the normal quantiles are plotted on the vertical axis. A diagonal reference line is added to the Q-Q plot. Normally distributed data should fall near the reference line.

The threepanel template takes five dynamic variables. The data and the normal quantiles are referenced by the dynamic variables _X and _QUANTILE, respectively. The title is supplied by using the _Title variable. Lastly, the parameter estimates for the normal curve that best fits the data are supplied by using the _mu and _sigma dynamic variables. The template definition follows:

/* define 'threepanel' template that displays a histogram, box plot, and Q-Q plot */
proc template;
define statgraph threepanel;
dynamic _X _QUANTILE _Title _mu _sigma;
begingraph;
   entrytitle halign=center _Title;
   layout lattice / rowdatarange=data columndatarange=union 
      columns=1 rowgutter=5 rowweights=(0.4 0.10 0.5);
      layout overlay;
         histogram   _X / name='histogram' binaxis=false;
         densityplot _X / name='Normal' normal();
         densityplot _X / name='Kernel' kernel() lineattrs=GraphData2(thickness=2 );
         discretelegend 'Normal' 'Kernel' / border=true halign=right valign=top location=inside across=1;
      endlayout;
      layout overlay;
         boxplot y=_X / boxwidth=0.8 orient=horizontal;
      endlayout;
      layout overlay;
         scatterplot x=_X y=_QUANTILE;
         lineparm x=_mu y=0.0 slope=eval(1./_sigma) / extend=true clip=true;
      endlayout;
      columnaxes;
         columnaxis;
      endcolumnaxes;
   endlayout;
endgraph;
end;
run;

You can download the %ThreePanel macro, which creates a three-panel display for any variable in any data set. If you want to learn more about how to write GTL templates, I recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

The macro calls PROC UNIVARIATE to compute the normal parameter estimates and the quantiles. That information is then used by PROC SGRENDER to create the plot according to the specifications in the threepanel template. The macro uses a cool trick: I get the data for the Q-Q plot by using an ODS OUTPUT statement on a graph that is created by PROC UNIVARIATE.

The image at the top of this post shows how the template renders the MPG_City variable in the Sashelp.Cars data set. The image was created as follows:

ods graphics on;
%ThreePanel(Sashelp.Cars, MPG_City)

The MPG_City variable is not normally distributed, as is evident by looking at the poor fit of the data in the Q-Q plot (lower panel). In contrast, the distribution of the SepalLength variable in the Sashelp.Iris data set appears to be more normal, as shown below:

%ThreePanel(Sashelp.Iris, SepalLength)

Discussion

What do you think? Try it out on your own data and let me know if you have suggestions to improve it. Is this a useful display? Leave a comment.

Post a Comment

Compute confidence intervals for percentiles in SAS

PROC UNIVARIATE has provided confidence intervals for standard percentiles (quartiles) for eons. However, in SAS 9.3M2 (featuring the 12.1 analytical procedures) you can use a new feature in PROC UNIVARIATE to compute confidence intervals for a specified list of percentiles.

To be clear, percentiles and quantiles are essentially the same thing. For example, the median value of a set of data is the 0.5 quantile, which is also the 50th percentile. In general, the pth quantile is the (100 p)th percentile.

The CIPCTLDF option on the PROC UNIVARIATE statement produces distribution-free confidence intervals for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles as shown in the following example:

/* CI for standard percentiles: 1, 5, 10, 25, 50, 75, 90, 95, 99 */
ods select Quantiles;
proc univariate data=Sashelp.Cars cipctldf;
   var MPG_City;
run;

However, prior to the 12.1 releaase of the analytics procedures, there was not an easy way to obtain confidence intervals for arbitrary percentiles. (Recall that you can specify by nonstandard percentiles by using the PCTLPTS= option on the OUTPUT statement.)

I am happy to report that the OUTPUT statement in the UNIVARIATE procedure now supports the CIPCTLDF= option, which you can use as follows:

proc univariate data=sashelp.cars noprint;
   var MPG_City;
   output out=pctl pctlpts=2.5 20 80 97.5 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
run;
 
proc print noobs; run;

The CIPCTLDF= option computes distribution-free confidence intervals for the percentiles that are specified on the PCTLPTS= option. The LOWERPRE= option specifies the prefix to use for lower confidence limits; the UPPERPRE= option specifies the prefix to use for upper confidence limits.

If your data are normally distributed, you can use the CIPCTLNORMAL= option on the OUTPUT statement to compute confidence limits. However, if your data are not normally distributed, the CIPCTLNORMAL= option might produce inaccurate results. For example, on the MPG_City data, which is highly skewed, the confidence intervals for large percentiles (like the 99th percentile) do not contain the corresponding point estimate. For this reason, I prefer the distribution-free intervals for most analyses.

Post a Comment

Understanding local and global variables in the SAS/IML language

The TV show Cheers was set in a bar "where everybody knows your name." Global knowledge of a name is appealing for a neighborhood pub, but not for a programming language. Most programming languages enable you to define functions that have local variables: variables whose names are known only inside the function. This article describes local and global variables in the SAS/IML language.

Scope: the environment where everybody knows your name

One of the features of the SAS/IML language is that you can create your own user-defined functions or subroutines that extend the capabilities of SAS. For brevity, this article discusses functions, but the same ideas apply to user-defined subroutines.

In computer science, the scope of a variable is the "context... in which a variable name... is valid and can be used." A variable inside a function usually has local scope, which means that the variable’s name is known inside the function, but not outside. Furthermore, modifying a local variable does not affect any variable outside the function. For example, the following SAS/IML program defines a variable y inside a function. The local variable is created when the function executes and vanishes when the function exits. Although a variable outside the function is also named y, the outer variable is not affected by running the function:

proc iml; 
start F1(x); /* a function with local variables */
   y = 2*x;                       /* y is local */
   print y[label="y inside function (local)"]; 
   return(1); 
finish;
 
y = 0; t=1:5;
v = F1(t);
print y[label="y outside function"];

The scope of the variables is shown in the following diagram. Three variables are known to the main program: y, t, and v. Inside the function, two names are known: x and y. The local variable named y is not related to the variable y in the main scope. They have the same name, but their scope is different.

There are two ways to enable a variable inside a function to affect variables outside the function.

Sometimes SAS/IML users ask if there is a third alternative. Programmers sometimes ask whether it is possible to create a variable that is shared between several functions, but is not global to the entire program. The answer is no. The SAS/IML language does not support a namespace or variables that are global to a namespace.

Parent variables: sharing the memory, but not the name

Because the SAS/IML language passes values by reference, modifying one of the function's arguments changes the value of the matrix that was passed in. The following example illustrates this:

start F2(x); /* a function that modifies its argument */
   x = 2*x;                       /* x is an argument */
   print x[label="x inside module (argument)"]; 
   return(2); 
finish;
 
y = 0; t=1:5;
v = F2(t);
print t[label="t at main scope"];

Notice that the variable t at main scope has a different name from the parameter x inside the module, but both variables share the same memory. Because x is an argument, changing x inside the function also changes t. See my previous article for more details about passing by reference. The behavior of the F2 function is summarized in the following diagram.

Global variables: sharing the memory and the name

A variable has global scope if you include it in the GLOBAL clause of the START statement. A variable that has global scope can be read or modified inside the module. It corresponds to a variable of the same name that exists at main scope. The following example illustrates this:

start F3(x) GLOBAL(y);  /* a function that has a global variable */
   y = 2*x;                                       /* y is global */
   print y[label="y inside module (global)"]; 
   return(3); 
finish;
 
y = 0; t=1:5;
v = F3(t);
print y[label="y at main scope"];

In this example, the y vector is changed from within the F3 function because y is declared to be a global variable. The following diagram illustrates the behavior of the F3 function.

The role of global variables in SAS/IML programs

Although global variables are discouraged in computer science courses, they serve an important purpose in SAS/IML programming. Namely, when you write an optimization program in the SAS/IML language, the function that is optimized (called the objective function) must contain only one argument. The argument vector is modified until the objective function reaches an optimal value. Any parameters to the objective function must be specified as global variables. The global variables are parameters that are held constant during the optimization process.

Post a Comment

The new book on my desk

Rick Wicklin with Book; Click to enlarge

In my constant effort to keep pace with Chris Hemedinger, I am pleased to announce the availability of my new book, Simulating Data with SAS. Chris started a tradition for SAS Press authors to post a photo of themselves with their new book. Thanks to everyone who helped with the production of the book.

I have posted an overview of the book's purpose on the SAS Bookshelf blog.

One interesting aspect of the book is the cover, which is not discussed in the book. I have attached an image of the cover graphic below.

I have question for you, readers. What, if anything, does the book's cover graphic mean to you? Do you like it? Hate it? Not understand it? Think you have a better idea for a cover image? There are no wrong answers, so post whatever comes to mind. Funny, snarky, intellectual, critical, factual—all opinions are welcome!

But please remember: You can't judge a book by its cover!

Cover image
Post a Comment

How to overlay a custom density curve on a histogram in SAS

I've previously described how to overlay two or more density curves on a single plot. I've also written about how to use PROC SGPLOT to overlay custom curves on a graph. This article describes how to overlay a density curve on a histogram. For common distributions, you can overlay a density by using SAS procedures. However, this article shows how to use the Graphics Template Language (GTL) to overlay a custom density estimate. An example is shown in the figure to the left. The information in this article is adapted from Chapter 3 of my book Simulating Data with SAS.

Overlaying common densities

The UNIVARIATE procedure supports fitting about a dozen distributions to data, including the beta, exponential, gamma, lognormal, and Weibull distributions. There are several examples in the PROC UNIVARIATE documentation that demonstrate how to fit parametric densities to data.

For more complicated models, other SAS procedures are available. You can use the FMM procedure to fit finite mixtures of distributions. In SAS/ETS software, the SEVERITY procedure can fit many distributional models to data. For nonparametric models, the KDE procedure can fit one- and two-dimensional kernel density estimates.

These procedures enable you to overlay density curves for common distributions. For me, these procedures suffice 99% of the time. However, there are situations where you might want to do the following:

  1. Compute a probability density function (PDF) outside of any procedure. The density curve might come from a computation or from evaluating a formula.
  2. Add additional features to the graph. For example, you might want to add reference lines, modify the placement of tick marks, add legends, titles, and so forth.

The SGPLOT procedure does not enable you to use a HISTOGRAM statement and a SERIES statement in the same call. However, you can overlay a histogram and a curve by using the GTL.

Overlaying custom densities

The following template is a simplified version of a template that I used in my book. You can use it as is, or modify it to include additional features such as grid lines.

proc template;
define statgraph ContPDF;
dynamic _X _T _Y _Title _HAlign _binstart _binstop _binwidth;
begingraph;
   entrytitle halign=center _Title;
   layout overlay /xaxisopts=(linearopts=(viewmax=_binstop));
      histogram _X / name='hist' SCALE=DENSITY binaxis=true 
         endlabels=true xvalues=leftpoints binstart=_binstart binwidth=_binwidth;
      seriesplot x=_T y=_Y / name='PDF' legendlabel="PDF" lineattrs=(thickness=2);
      discretelegend 'PDF' / opaque=true border=true halign=_HAlign valign=top 
            across=1 location=inside;
   endlayout;
endgraph;
end;
run;

The template, which is called ContPDF, uses the LAYOUT OVERLAY statement to overlay a histogram and a series (a continuous curve). The name of the histogram variable is provided in the dynamic variable _X. The histogram is plotted on the density scale. The dynamic variables _binstart, _binstop, and _binstart are used to set the histogram bins to convenient values. The variables for the series plot are provided by the dynamic variables _T and _Y. The other two dynamic variables are _HAlign, which determines the location of an inset, and _Title, which specifies the title of the graph.

To see how this template might be used, recall that I have previously shown how to generate random values from the folded normal distribution. The following DATA step generates 1,000 values from the folded normal distribution:

data MyData(drop=i);
call streaminit(1);
do i = 1 to 1000;
   x = abs( rand("Normal") );
   output;
end;
run;

The PDF for a folded distribution is easy to compute: simply fold the distribution across zero and add the densities of each tail:

data PDFFolded;
do t = 0 to 4 by 0.1;
   y = pdf("Normal",  t) + pdf("Normal", -t);  /* =2*PDF for the normal dist */
   output;
end;
run;

The goal is to construct a histogram of the MyData data set and overlay the curve in the PDFFolded data set. There is a standard technique that you can use to overlay a curve on data: concatenate the two data sets and call an SG procedure on the combined data.

data All;
set MyData PDFFolded;
run;
 
proc sgrender data=All template=ContPDF;
dynamic _X="X" _T="T" _Y="Y" _HAlign="right"
        _binstart=0 _binstop=4 _binwidth=0.25
        _Title="Sample from Folded Normal Distribution";
run;

The graph that results is shown at the beginning of this article. Notice that this graph cannot be produced without using the GTL, because the folded normal distribution is not one of the distributions that is supported by PROC UNIVARIATE or PROC SEVERITY.

I try to avoid using the GTL, but it is necessary in this case. Perhaps one day the SGPLOT procedure will enable overlaying a custom curve on a histogram, but until then the GTL provides a way to overlay a custom density estimate on a histogram.

Post a Comment

How to overlay custom curves with PROC SGPLOT

I recently showed someone a trick to create a graph, and he was extremely pleased to learn it. The trick is well known to many SAS users, but I hope that this article will introduce it to even more SAS users.

At issue is how to use the SGPLOT procedure to overlay precomputed curves on a graph of the data. For example, how can you display a curve on a scatter plot when the curve is computed by a SAS/STAT procedure or is given by a formula? (Of course, the SGPLOT procedure supports the REG statement, the LOESS statement, and the PBSPLINE statement, which compute and overlay regression curves, but I am talking about curves that are computed outside of PROC SGPLOT.)

The trick is to create a SAS data set that contains all of the variables that need to be plotted, and that has a special structure. For example, suppose you have observations in a data set called OrigData. Suppose further that there are two other data sets, Fit1 and Fit2, that contain a series of (x,y) points on the curves that you want to overlay. In the following example, Fit1 contains points evaluated along on an exponential curve, whereas Fit2 contains points that define a piecewise linear curve:

data OrigData(keep=x y);          /* the original data */
set Sashelp.Cars;
rename weight=x  mpg_city=y;      /* scatter plot variables are X and Y */
run;
 
data Fit1;                        /* exponential curve to overlay */
do x1 = 1800 to 7200 by 200;      /* the variables for curve 1 are X1 and Y1 */
   y1 = exp( -0.25*x1/1000 + 3.86 );
   output;
end;
run;
 
data Fit2;                       /* piecewise linear curve to overlay */
input x2 y2 @@;                  /* the variables for curve 2 are X2 and Y2 */
datalines;
1800 50 3700 18 7000 12
;

Notice that the data sets are different sizes: The original data set has 428 observations, the exponential curve is evaluated at 28 points, and the piecewise-linear curve contains only three points. Nevertheless, you can concatenate these data sets into a single SAS data set, as follow.

data PlotData;                  /* concatenate data */
set OrigData Fit1 Fit2;
run;

For this trick to work, it is important that the variable names for the precomputed curves be unique. In particular, the variable names must be different from any variable name in the original data set. If necessary, use the RENAME= option to make sure that there is no conflict between variable names. The unique names ensure that the PlotData data set has a block structure, as shown in the following figure. The blue block in the upper-left corner represents the original data. Variables that are not in the original data are assigned missing values, which are colored white in the figure. Similarly, the blue "middle block" represents (x,y) coordinates for the first curve, and the blue lower-right block represents (x,y) coordinates for the second curve.

When a data set has this block structure, you can plot the original data and overlay the precomputed curves with a single call to PROC SGPLOT. You use the SERIES statement to overlay the precomputed curves. You can even use the REG, LOESS, or PBSPLINE statements to compute additional smoothers, as follows:

proc sgplot data=PlotData noautolegend cycleattrs;
  scatter x=x  y=y / markerattrs=(color=black) transparency=0.75; /* orig data */
  reg     x=x  y=y / nomarkers CurveLabel="Linear"; /* compute curve */
  series  x=x1 y=y1 / CurveLabel="Exponential";     /* curve 1   */
  series  x=x2 y=y2 / CurveLabel="Piecewise";       /* curve 2   */
run;

This example shows how to overlay precomputed curves on a scatter plot, but I have also used this technique to overlay a theoretical statistical distribution on a bar chart or on a histogram.

So there you have it. Remember this trick when you use a SAS procedure to compute a curve and you want to use PROC SGPLOT or PROC SGRENDER to overlay the curve on the data.

Note: A variation on this technique uses the MERGE statement, rather than the SET statement, to combine the curves and the data. The variation creates a data set (shown schematically below) that does not have a block structure, and therefore has fewer observations than for the concatenated data. However, some people object that the MERGE statement without a BY statement is confusing, and that a row in the merged data set is no longer a valid observation because a row of the X1, X2, Y1, and Y2 variables is unrelated to an observation in the original data. If you are bothered by such things, use the SET statement.

Post a Comment