Ways to multiply in the SAS/IML language

For programmers who are learning the SAS/IML language, it is sometimes confusing that there are two kinds of multiplication operators, whereas in the SAS DATA step there is only scalar multiplication. This article describes the multiplication operators in the SAS/IML language and how to use them to perform common tasks such as the elementwise product, the dot product, and the outer product of vectors.

Elementwise multiplication (#)

The elementwise multiplication operator (#) is used to perform element-by-element scalar multiplication. This operator is not part of the DATA step syntax. If you have two matrices of the same dimension, then u#v is the matrix whose ith element is the product of the ith elements of u and v. (This product is also known as the Hadamard product.) This is shown in the following PROC IML example:

proc iml;
u = { 1, 2, 3};         /* 3x1 column vector */
v = {-1, 0, 2};         /* 3x1 column vector */
 
elemProd = u#v;         /* elementwise product (Hadamard product) */
print elemProd;

The elementwise multiplication operator can also be used in some situations in which u is a vector that has the same row or column dimension as v. See my article on how the SAS/IML language "knows what you want."

True matrix multiplication (*)

The matrix multiplication operator (*) performs true matrix multiplication. Whereas the * operator is used for scalar multiplication in the DATA step, the operator is used for matrix multiplication in PROC IML. If u and v are any two matrices where the number of rows of u matches the number of columns of v, then the matrix product u*v is defined.

When u and v are vectors, matrix multiplication gets a special name. When a row vector is multiplied with a column vector, the result is a scalar and the operation is called the dot product (or inner product or scalar product). The following example uses the transpose operator (`) to create a row vector:

dotProd = u`*v;         /* dot product (scalar product, inner product) */
print dotProd;

There is an interesting connection between the elementwise product and the dot product of two vectors. The dot product of u and v is the same as the sum of the elements of the elementwise product: u`*v = sum(u#v).

Matrix multiplication is not commutative, so you get a different result if you multiply a column vector with a row vector. The result is a rank-1 matrix. This is called the outer product of two vectors. An example follows:

outerProd = u*v`;       /* outer product: column vec times row vec */
print outerProd;

Other matrix products

The SAS/IML language supports other kinds of multiplication, including the direct product (or Kronecker product) and the horizontal direct product of matrices:

dirProd = u`@v;        /* direct product */
hdirProd = hdir(u`,v); /* horizontal direct product */

There are many special-purpose products that are not covered in this short article, but remember that you can always define your own SAS/IML function that compute any conceivable product. For example, in physics classes students use the "cross product" (also called the skew-symmetric product) to compute quantities that arise in electromagnetism. The following SAS/IML function implements the cross product computation:

/* cross product (3D vectors only) */
start CrossProd(u, v);
   i231 = {2 3 1};
   i312 = {3 1 2};
   return( u[i231]#v[i312] - v[i231]#u[i312] );
finish;
 
uxv = CrossProd(u,v);

Be aware that in statistics, the "cross product" often refers to the multiplication X`*X, where X is a data matrix. In this matrix product, the (i,j)th element of X`*X is the dot product of the ith and jth columns of X.

Post a Comment

How to vectorize computations in a matrix language

Last week someone posted an interesting question to the SAS/IML Support Community. The problem involved four nested DO loops and took hours to run. By transforming several nested DO loops into an equivalent matrix operation, I was able to reduce the run time to about one second.

The process of converting loops into vector or matrix operations is called vectorization. The programmer who posted the question knew that vectorization would improve the program, but also stated that "vectorization is really difficult." I agree. However, life is full of difficult but necessary tasks. In statistical programming, vectorization is necessary (but rewarding!) because it can dramatically decrease the run time of a program.

Statistical programmers need to know how to vectorize programs. Whether you write your programs in SAS/IML language, MATLAB, or R, it is essential to know how to convert loops into equivalent vector operations. Therefore, I have decided to post a series of examples that show how to recognize certain patterns that arise in loops, and how to replace those loops with matrix operations. These examples will be tagged with the "vectorization" tag in the word cloud (in the right sidebar) on my blog. Today is the first example.

Here are some general tips and techniques:

  • When you have nested loops, vectorize the inner loop first.
  • Convert a loop with scalar operations into a single vector operation. For example, convert sums into vector dot products.
  • Convert a loop of vector operations into a single matrix-vector multiplication.
  • Convert a loop of matrix-vector multiplications into a single matrix-matrix multiplication.

In terms of matrix computations, follow this general rule:

Solve the problem by using the highest possible level of Basic Linear Algebra Subroutine (BLAS).

A level-0 BLAS is a scalar operation. A level-1 BLAS is a vector operation, such as a vector addition, a dot product, or a vector norm. A level-2 BLAS is a matrix-vector operation, such as a matrix-vector multiplication, an outer product, or solving a triangular linear system. A level-3 BLAS is a matrix-matrix operation, such as matrix multiplication, forming a cross-product matrix, or solving triangular linear systems with multiple right-hand sides (see the SAS/IML TRISOLV function).

The notions of "level" and BLAS are discussed in Chapter 1 of Golub and van Loan's Matrix Computations.

An example problem with three nested loops

The problem in the discussion forum involved four nested loops, but I'll extract the essence of the problem into a simpler example. Suppose that you have implemented a textbook formula and your SAS/IML program looks like the following:

proc iml;
X = {1 2 3,            /* example data */
     4 3 1,
     1 1 2,
     0 0 3};
n = nrow(X);
p = ncol(X);
 
S=j(p,p,.);            /* allocate matrix to hold results */
do j = 1 to p;
   do k = 1 to p;
      u=0;     
      do m = 1 to n;
         u = u +  X[m,j]*X[m,k];
      end;
      S[j,k] = u;      /* the (j,k)th result */
   end;
end;
print S;

At this stage, the program consists of three loops and scalar operations (level-0 BLAS).

When I study these loops, I notice the following:

  • The outer loop (the j loop) is a loop over the columns of the matrix X because the expression X[m,j] appears in the innermost loop.
  • The middle loop (the k loop) is also a loop over the columns of the matrix X because the expression X[m,k] appears in the innermost loop.
  • The inner loop (the m loop) is performing a summation. (Notice that u=0 before the loop and u=u+stuff inside the loop.) The terms that are being summed are the product of the jth and kth columns.

Step 1: Attack the innermost loop

Because the inner loop is a sum, replace the inner loop by the SUM function. The SUM function takes a vector. For this example, that vector will be the elementwise product of the jth and kth columns: X[ ,j]#X[ ,k]. The following statements show the simplified program, which now has only two loops:

/* Step 1: Replace inner loop with sum of the product of j_th col and k_th col. */
S=j(p,p,.);
do j = 1 to p;
   do k = 1 to p;
      S[j,k] = sum( X[ ,j] # X[,k] );
   end;
end;

Step 2: Replace summations by dot products

The inner loop now involves the sum of the elementwise product of two vectors. This sum has a more common name: the vector dot product. Consequently, replace the SUM function by a dot product. In order to get the dimensions to match, transpose the first vector, as follows:

/* Step 2: The sum of the product of two vectors is their dot product. */
S=j(p,p,.);
do j = 1 to p;
   do k = 1 to p;
      S[j,k] = X[ ,j]` * X[,k];
   end;
end;

At this stage, the program consists of two loops and vector operations (level-1 BLAS).

Step 3: Replace vector operations with matrix-vector operations

The next step involves converting vector operations into a matrix-vector operation. Notice that the X[,j] term does not change in the inner DO loop. Furthermore, the product is over all columns, so the inner loop is equivalent to a vector-matrix multiplication. The program is thereby reduced to one loop, as follows:

/* Step 3: Convert to vector-matrix product */
S=j(p,p,.);
do j=1 to p;
   S[j, ] = X[ ,j]` * X;
end;

At this stage, the program consists of one loop and matrix-vector operations (level-2 BLAS).

Step 4: Replace matrix-vector multiplications with matrix-matrix operations

The last step is to recognize that the only remaining loop iterates over the columns of X. But that is equivalent to iterating over the rows of X` (the transpose of X). The loop of vector-matrix multiplication is consequently equivalent to the cross-product operation, X`X, as follows:

/* Step 4: A loop over all rows is equivalent to multiplication by X` */
S = X` * X;
print S;

Notice that the three loops have been replaced by a single level-3 BLAS: matrix multiplication. As stated earlier, for large problems you can realize substantial savings of time if you eliminate loops in favor of high-level linear algebra operations.

One final comment. If you've struggled and labored, but still can't figure out how to vectorize your SAS/IML program, post it to the SAS/IML Support Community. That community is a helpful place to discuss issues related to efficiency and programming in the SAS/IML language.

Post a Comment

Use regression for a univariate analysis? Yes!

I've conducted a lot of univariate analyses in SAS, yet I'm always surprised when the best way to carry out the analysis uses a SAS regression procedure. I always think, "This is a univariate analysis! Why am I using a regression procedure? Doesn't a regression require at least two variables?"

Then it dawns on me. In the SAS regression procedures, a MODEL statement that does not contain explanatory variables simply performs a univariate analysis on the response variable. For example, when there are no explanatory variables, a classical regression analysis produces sample statistics, such as the mean, the variance, and the standard error of the mean, as shown in the following output:

/* estimates of mean, variance, and std err of mean */
ods select FitStatistics ParameterEstimates ;
proc reg data=sashelp.cars;
   model MPG_City= ;
quit;

The estimates in this PROC REG output are the same as are produced by the following call to PROC MEANS:

proc means data=sashelp.cars mean std stderr cv;
 var MPG_City;
run;

The ODS graphics that are produced by PROC REG also includes a histogram of the centered data and a normal Q-Q plot.

Here are some other instances in which a SAS regression procedure can be used to carry out a univariate analysis:

  • Robust estimates of scale and location. This univariate analysis is usually performed by using PROC UNIVARIATE with the ROBUSTSCALE option. However, you can also use the ROBUSTREG procedure to estimate robust statistics. The ROBUSTREG procedure provides four different estimation methods, which you can control by using the METHOD= option. For example, the following statements display robust estimates of location and scale:
    ods select ParameterEstimates;
    proc robustreg data=sashelp.cars method=M;
       model MPG_City= ;
    run;
  • Detection of univariate outliers. How can you detect univariate outliers in SAS? One way is to call the ROBUSTREG procedure! Again, there are four estimation methods that you can use.
    ods select DiagSummary;
    proc robustreg data=sashelp.cars method=LTS;
       model MPG_City= ;
       output out=out outlier=outliers;
    run;
     
    proc print data=out(where=(outliers=1));
       var make model type mpg_city;
    run;
  • Estimation of quantiles, with confidence intervals. Last week I showed how to use PROC UNIVARIATE to compute sample quantiles and confidence intervals. An alternative approach is to use the QUANTREG procedure, as follows:
    /* estimate univariate quantiles, CIs, and std errors */
    ods output ParameterEstimates=QntlCI;
    proc quantreg data=sashelp.cars;
       model MPG_City= / quantiles=0.025 0.2 0.8 0.975;
    run;
     
    proc print data=QntlCI; run;
  • Fit discrete parametric models to univariate data. I've previously shown how to use the GENMOD procedure to fit a Poisson model to data, and the same technique can be used to fit other discrete distributions, including the binomial, geometric, multinomial, negative binomial, and some zero-inflated distributions.

  • Fit parameters for a mixed density model to univariate data. I've previously demonstrated how to use the FMM procedure to fit a finite mixture distribution to data.

There are other examples, but I hope you see that the SAS regression procedures are useful for computing univariate statistics and analyses.

Do you have a favorite univariate analysis that can be accomplished by using a SAS regression procedure? Let me know about it by leaving a comment.

Post a Comment

A three-panel visualization of a distribution

At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article is to discuss some ideas related to this visualization and to solicit critiques of my implementation.

The customer wanted to create a paneled display in SAS that includes three graphs: a histogram, a box plot, and a normal quantile-quantile (Q-Q) plot. We sketched out an idea, and the plot at the left is my implementation of our sketch. (Click to enlarge.)

One question I asked was, "Do you want the usual Q-Q plot, or should we flip it?" The usual Q-Q plot is a scatter plot of the ordered data values (on the vertical axis) plotted against the corresponding quantiles of a normal distribution (on the horizontal axis). The purpose of a Q-Q plot is to see whether the points fall along a straight line, which would indicate that the data are normally distributed. I remarked that if we flip the plot so that the data values are displayed horizontally, then the histogram, boxplot, and Q-Q plot can all share a common horizontal axis. The customer said that this seemed like a good idea.

In SAS, you can create this kind of paneled layout by using the Graph Template Language (GTL) and the SGRENDER procedure.

A GTL template for a three-panel display

When I returned from the conference, I created a GTL template that defines a three-panel display. The top panel, which occupies 50% of the height of the display, is a histogram of the data overlaid with a normal curve and a kernel density estimate. The second panel, which occupies 10% of the height, is a horizontal box plot. The third panel is a normal Q-Q plot, but is flipped so that the normal quantiles are plotted on the vertical axis. A diagonal reference line is added to the Q-Q plot. Normally distributed data should fall near the reference line.

The threepanel template takes five dynamic variables. The data and the normal quantiles are referenced by the dynamic variables _X and _QUANTILE, respectively. The title is supplied by using the _Title variable. Lastly, the parameter estimates for the normal curve that best fits the data are supplied by using the _mu and _sigma dynamic variables. The template definition follows:

/* define 'threepanel' template that displays a histogram, box plot, and Q-Q plot */
proc template;
define statgraph threepanel;
dynamic _X _QUANTILE _Title _mu _sigma;
begingraph;
   entrytitle halign=center _Title;
   layout lattice / rowdatarange=data columndatarange=union 
      columns=1 rowgutter=5 rowweights=(0.4 0.10 0.5);
      layout overlay;
         histogram   _X / name='histogram' binaxis=false;
         densityplot _X / name='Normal' normal();
         densityplot _X / name='Kernel' kernel() lineattrs=GraphData2(thickness=2 );
         discretelegend 'Normal' 'Kernel' / border=true halign=right valign=top location=inside across=1;
      endlayout;
      layout overlay;
         boxplot y=_X / boxwidth=0.8 orient=horizontal;
      endlayout;
      layout overlay;
         scatterplot x=_X y=_QUANTILE;
         lineparm x=_mu y=0.0 slope=eval(1./_sigma) / extend=true clip=true;
      endlayout;
      columnaxes;
         columnaxis;
      endcolumnaxes;
   endlayout;
endgraph;
end;
run;

You can download the %ThreePanel macro, which creates a three-panel display for any variable in any data set. If you want to learn more about how to write GTL templates, I recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.

The macro calls PROC UNIVARIATE to compute the normal parameter estimates and the quantiles. That information is then used by PROC SGRENDER to create the plot according to the specifications in the threepanel template. The macro uses a cool trick: I get the data for the Q-Q plot by using an ODS OUTPUT statement on a graph that is created by PROC UNIVARIATE.

The image at the top of this post shows how the template renders the MPG_City variable in the Sashelp.Cars data set. The image was created as follows:

ods graphics on;
%ThreePanel(Sashelp.Cars, MPG_City)

The MPG_City variable is not normally distributed, as is evident by looking at the poor fit of the data in the Q-Q plot (lower panel). In contrast, the distribution of the SepalLength variable in the Sashelp.Iris data set appears to be more normal, as shown below:

%ThreePanel(Sashelp.Iris, SepalLength)

Discussion

What do you think? Try it out on your own data and let me know if you have suggestions to improve it. Is this a useful display? Leave a comment.

Post a Comment

Compute confidence intervals for percentiles in SAS

PROC UNIVARIATE has provided confidence intervals for standard percentiles (quartiles) for eons. However, in SAS 9.3M2 (featuring the 12.1 analytical procedures) you can use a new feature in PROC UNIVARIATE to compute confidence intervals for a specified list of percentiles.

To be clear, percentiles and quantiles are essentially the same thing. For example, the median value of a set of data is the 0.5 quantile, which is also the 50th percentile. In general, the pth quantile is the (100 p)th percentile.

The CIPCTLDF option on the PROC UNIVARIATE statement produces distribution-free confidence intervals for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles as shown in the following example:

/* CI for standard percentiles: 1, 5, 10, 25, 50, 75, 90, 95, 99 */
ods select Quantiles;
proc univariate data=Sashelp.Cars cipctldf;
   var MPG_City;
run;

However, prior to the 12.1 releaase of the analytics procedures, there was not an easy way to obtain confidence intervals for arbitrary percentiles. (Recall that you can specify by nonstandard percentiles by using the PCTLPTS= option on the OUTPUT statement.)

I am happy to report that the OUTPUT statement in the UNIVARIATE procedure now supports the CIPCTLDF= option, which you can use as follows:

proc univariate data=sashelp.cars noprint;
   var MPG_City;
   output out=pctl pctlpts=2.5 20 80 97.5 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
run;
 
proc print noobs; run;

The CIPCTLDF= option computes distribution-free confidence intervals for the percentiles that are specified on the PCTLPTS= option. The LOWERPRE= option specifies the prefix to use for lower confidence limits; the UPPERPRE= option specifies the prefix to use for upper confidence limits.

If your data are normally distributed, you can use the CIPCTLNORMAL= option on the OUTPUT statement to compute confidence limits. However, if your data are not normally distributed, the CIPCTLNORMAL= option might produce inaccurate results. For example, on the MPG_City data, which is highly skewed, the confidence intervals for large percentiles (like the 99th percentile) do not contain the corresponding point estimate. For this reason, I prefer the distribution-free intervals for most analyses.

Post a Comment

Understanding local and global variables in the SAS/IML language

The TV show Cheers was set in a bar "where everybody knows your name." Global knowledge of a name is appealing for a neighborhood pub, but not for a programming language. Most programming languages enable you to define functions that have local variables: variables whose names are known only inside the function. This article describes local and global variables in the SAS/IML language.

Scope: the environment where everybody knows your name

One of the features of the SAS/IML language is that you can create your own user-defined functions or subroutines that extend the capabilities of SAS. For brevity, this article discusses functions, but the same ideas apply to user-defined subroutines.

In computer science, the scope of a variable is the "context... in which a variable name... is valid and can be used." A variable inside a function usually has local scope, which means that the variable’s name is known inside the function, but not outside. Furthermore, modifying a local variable does not affect any variable outside the function. For example, the following SAS/IML program defines a variable y inside a function. The local variable is created when the function executes and vanishes when the function exits. Although a variable outside the function is also named y, the outer variable is not affected by running the function:

proc iml; 
start F1(x); /* a function with local variables */
   y = 2*x;                       /* y is local */
   print y[label="y inside function (local)"]; 
   return(1); 
finish;
 
y = 0; t=1:5;
v = F1(t);
print y[label="y outside function"];

The scope of the variables is shown in the following diagram. Three variables are known to the main program: y, t, and v. Inside the function, two names are known: x and y. The local variable named y is not related to the variable y in the main scope. They have the same name, but their scope is different.

There are two ways to enable a variable inside a function to affect variables outside the function.

Sometimes SAS/IML users ask if there is a third alternative. Programmers sometimes ask whether it is possible to create a variable that is shared between several functions, but is not global to the entire program. The answer is no. The SAS/IML language does not support a namespace or variables that are global to a namespace.

Parent variables: sharing the memory, but not the name

Because the SAS/IML language passes values by reference, modifying one of the function's arguments changes the value of the matrix that was passed in. The following example illustrates this:

start F2(x); /* a function that modifies its argument */
   x = 2*x;                       /* x is an argument */
   print x[label="x inside module (argument)"]; 
   return(2); 
finish;
 
y = 0; t=1:5;
v = F2(t);
print t[label="t at main scope"];

Notice that the variable t at main scope has a different name from the parameter x inside the module, but both variables share the same memory. Because x is an argument, changing x inside the function also changes t. See my previous article for more details about passing by reference. The behavior of the F2 function is summarized in the following diagram.

Global variables: sharing the memory and the name

A variable has global scope if you include it in the GLOBAL clause of the START statement. A variable that has global scope can be read or modified inside the module. It corresponds to a variable of the same name that exists at main scope. The following example illustrates this:

start F3(x) GLOBAL(y);  /* a function that has a global variable */
   y = 2*x;                                       /* y is global */
   print y[label="y inside module (global)"]; 
   return(3); 
finish;
 
y = 0; t=1:5;
v = F3(t);
print y[label="y at main scope"];

In this example, the y vector is changed from within the F3 function because y is declared to be a global variable. The following diagram illustrates the behavior of the F3 function.

The role of global variables in SAS/IML programs

Although global variables are discouraged in computer science courses, they serve an important purpose in SAS/IML programming. Namely, when you write an optimization program in the SAS/IML language, the function that is optimized (called the objective function) must contain only one argument. The argument vector is modified until the objective function reaches an optimal value. Any parameters to the objective function must be specified as global variables. The global variables are parameters that are held constant during the optimization process.

Post a Comment

The new book on my desk

Rick Wicklin with Book; Click to enlarge

In my constant effort to keep pace with Chris Hemedinger, I am pleased to announce the availability of my new book, Simulating Data with SAS. Chris started a tradition for SAS Press authors to post a photo of themselves with their new book. Thanks to everyone who helped with the production of the book.

I have posted an overview of the book's purpose on the SAS Bookshelf blog.

One interesting aspect of the book is the cover, which is not discussed in the book. I have attached an image of the cover graphic below.

I have question for you, readers. What, if anything, does the book's cover graphic mean to you? Do you like it? Hate it? Not understand it? Think you have a better idea for a cover image? There are no wrong answers, so post whatever comes to mind. Funny, snarky, intellectual, critical, factual—all opinions are welcome!

But please remember: You can't judge a book by its cover!

Cover image
Post a Comment

How to overlay a custom density curve on a histogram in SAS

I've previously described how to overlay two or more density curves on a single plot. I've also written about how to use PROC SGPLOT to overlay custom curves on a graph. This article describes how to overlay a density curve on a histogram. For common distributions, you can overlay a density by using SAS procedures. However, this article shows how to use the Graphics Template Language (GTL) to overlay a custom density estimate. An example is shown in the figure to the left. The information in this article is adapted from Chapter 3 of my book Simulating Data with SAS.

Overlaying common densities

The UNIVARIATE procedure supports fitting about a dozen distributions to data, including the beta, exponential, gamma, lognormal, and Weibull distributions. There are several examples in the PROC UNIVARIATE documentation that demonstrate how to fit parametric densities to data.

For more complicated models, other SAS procedures are available. You can use the FMM procedure to fit finite mixtures of distributions. In SAS/ETS software, the SEVERITY procedure can fit many distributional models to data. For nonparametric models, the KDE procedure can fit one- and two-dimensional kernel density estimates.

These procedures enable you to overlay density curves for common distributions. For me, these procedures suffice 99% of the time. However, there are situations where you might want to do the following:

  1. Compute a probability density function (PDF) outside of any procedure. The density curve might come from a computation or from evaluating a formula.
  2. Add additional features to the graph. For example, you might want to add reference lines, modify the placement of tick marks, add legends, titles, and so forth.

The SGPLOT procedure does not enable you to use a HISTOGRAM statement and a SERIES statement in the same call. However, you can overlay a histogram and a curve by using the GTL.

Overlaying custom densities

The following template is a simplified version of a template that I used in my book. You can use it as is, or modify it to include additional features such as grid lines.

proc template;
define statgraph ContPDF;
dynamic _X _T _Y _Title _HAlign _binstart _binstop _binwidth;
begingraph;
   entrytitle halign=center _Title;
   layout overlay /xaxisopts=(linearopts=(viewmax=_binstop));
      histogram _X / name='hist' SCALE=DENSITY binaxis=true 
         endlabels=true xvalues=leftpoints binstart=_binstart binwidth=_binwidth;
      seriesplot x=_T y=_Y / name='PDF' legendlabel="PDF" lineattrs=(thickness=2);
      discretelegend 'PDF' / opaque=true border=true halign=_HAlign valign=top 
            across=1 location=inside;
   endlayout;
endgraph;
end;
run;

The template, which is called ContPDF, uses the LAYOUT OVERLAY statement to overlay a histogram and a series (a continuous curve). The name of the histogram variable is provided in the dynamic variable _X. The histogram is plotted on the density scale. The dynamic variables _binstart, _binstop, and _binstart are used to set the histogram bins to convenient values. The variables for the series plot are provided by the dynamic variables _T and _Y. The other two dynamic variables are _HAlign, which determines the location of an inset, and _Title, which specifies the title of the graph.

To see how this template might be used, recall that I have previously shown how to generate random values from the folded normal distribution. The following DATA step generates 1,000 values from the folded normal distribution:

data MyData(drop=i);
call streaminit(1);
do i = 1 to 1000;
   x = abs( rand("Normal") );
   output;
end;
run;

The PDF for a folded distribution is easy to compute: simply fold the distribution across zero and add the densities of each tail:

data PDFFolded;
do t = 0 to 4 by 0.1;
   y = pdf("Normal",  t) + pdf("Normal", -t);  /* =2*PDF for the normal dist */
   output;
end;
run;

The goal is to construct a histogram of the MyData data set and overlay the curve in the PDFFolded data set. There is a standard technique that you can use to overlay a curve on data: concatenate the two data sets and call an SG procedure on the combined data.

data All;
set MyData PDFFolded;
run;
 
proc sgrender data=All template=ContPDF;
dynamic _X="X" _T="T" _Y="Y" _HAlign="right"
        _binstart=0 _binstop=4 _binwidth=0.25
        _Title="Sample from Folded Normal Distribution";
run;

The graph that results is shown at the beginning of this article. Notice that this graph cannot be produced without using the GTL, because the folded normal distribution is not one of the distributions that is supported by PROC UNIVARIATE or PROC SEVERITY.

I try to avoid using the GTL, but it is necessary in this case. Perhaps one day the SGPLOT procedure will enable overlaying a custom curve on a histogram, but until then the GTL provides a way to overlay a custom density estimate on a histogram.

Post a Comment

How to overlay custom curves with PROC SGPLOT

I recently showed someone a trick to create a graph, and he was extremely pleased to learn it. The trick is well known to many SAS users, but I hope that this article will introduce it to even more SAS users.

At issue is how to use the SGPLOT procedure to overlay precomputed curves on a graph of the data. For example, how can you display a curve on a scatter plot when the curve is computed by a SAS/STAT procedure or is given by a formula? (Of course, the SGPLOT procedure supports the REG statement, the LOESS statement, and the PBSPLINE statement, which compute and overlay regression curves, but I am talking about curves that are computed outside of PROC SGPLOT.)

The trick is to create a SAS data set that contains all of the variables that need to be plotted, and that has a special structure. For example, suppose you have observations in a data set called OrigData. Suppose further that there are two other data sets, Fit1 and Fit2, that contain a series of (x,y) points on the curves that you want to overlay. In the following example, Fit1 contains points evaluated along on an exponential curve, whereas Fit2 contains points that define a piecewise linear curve:

data OrigData(keep=x y);          /* the original data */
set Sashelp.Cars;
rename weight=x  mpg_city=y;      /* scatter plot variables are X and Y */
run;
 
data Fit1;                        /* exponential curve to overlay */
do x1 = 1800 to 7200 by 200;      /* the variables for curve 1 are X1 and Y1 */
   y1 = exp( -0.25*x1/1000 + 3.86 );
   output;
end;
run;
 
data Fit2;                       /* piecewise linear curve to overlay */
input x2 y2 @@;                  /* the variables for curve 2 are X2 and Y2 */
datalines;
1800 50 3700 18 7000 12
;

Notice that the data sets are different sizes: The original data set has 428 observations, the exponential curve is evaluated at 28 points, and the piecewise-linear curve contains only three points. Nevertheless, you can concatenate these data sets into a single SAS data set, as follow.

data PlotData;                  /* concatenate data */
set OrigData Fit1 Fit2;
run;

For this trick to work, it is important that the variable names for the precomputed curves be unique. In particular, the variable names must be different from any variable name in the original data set. If necessary, use the RENAME= option to make sure that there is no conflict between variable names. The unique names ensure that the PlotData data set has a block structure, as shown in the following figure. The blue block in the upper-left corner represents the original data. Variables that are not in the original data are assigned missing values, which are colored white in the figure. Similarly, the blue "middle block" represents (x,y) coordinates for the first curve, and the blue lower-right block represents (x,y) coordinates for the second curve.

When a data set has this block structure, you can plot the original data and overlay the precomputed curves with a single call to PROC SGPLOT. You use the SERIES statement to overlay the precomputed curves. You can even use the REG, LOESS, or PBSPLINE statements to compute additional smoothers, as follows:

proc sgplot data=PlotData noautolegend cycleattrs;
  scatter x=x  y=y / markerattrs=(color=black) transparency=0.75; /* orig data */
  reg     x=x  y=y / nomarkers CurveLabel="Linear"; /* compute curve */
  series  x=x1 y=y1 / CurveLabel="Exponential";     /* curve 1   */
  series  x=x2 y=y2 / CurveLabel="Piecewise";       /* curve 2   */
run;

This example shows how to overlay precomputed curves on a scatter plot, but I have also used this technique to overlay a theoretical statistical distribution on a bar chart or on a histogram.

So there you have it. Remember this trick when you use a SAS procedure to compute a curve and you want to use PROC SGPLOT or PROC SGRENDER to overlay the curve on the data.

Note: A variation on this technique uses the MERGE statement, rather than the SET statement, to combine the curves and the data. The variation creates a data set (shown schematically below) that does not have a block structure, and therefore has fewer observations than for the concatenated data. However, some people object that the MERGE statement without a BY statement is confusing, and that a row in the merged data set is no longer a valid observation because a row of the X1, X2, Y1, and Y2 variables is unrelated to an observation in the original data. If you are bothered by such things, use the SET statement.

Post a Comment

Quantile regression: Better than connecting the sample quantiles of binned data

I often see variations of the following question posted on statistical discussion forums:

I want to bin the X variable into a small number of values. For each bin, I want to draw the quartiles of the Y variable for that bin. Then I want to connect the corresponding quartile points from bin to bin.

In other words, the question asks how to create a plot like the following:

The plot attempts to show how the first quartile, second quartile (median), and third quartile of the response variable vary as the explanatory variable changes. However, I do not recommend using this plot when the explanatory variable is continuous. When I see a question like this, I sometimes respond by saying "you might want to look at quantile regression."

This article takes a quick look at quantile regression. What is quantile regression? How can you perform quantile regression in SAS? And how does it relate to the "binned quantile plot" that is shown above?

Start with the familiar: Standard regression

Before discussing quantile regression, let's introduce some data and think about a typical analysis. The data are salaries (in 1991) for 459 statistics professors in U.S. colleges and universities. A professor's salary (the response) is assumed to be related to his or her years of service (the explanatory variable). For these data, the explanatory variable is already binned into 25 discrete values. These data and this analysis are based on an example in the documentation for the QUANTREG procedure. You can download the data and the SAS statements that are used in this article.

A traditional regression analysis predicts the mean salary for a professor, given the years of service. In SAS, there are many regression procedures, such as the parametric GLM procedure and the nonparametric LOESS procedure. For these procedure, you can also call the regression directly from the SGPLOT procedure. For example, the following statements add a loess curve and a cubic regression curve to the data:

title "Salary by Experience";
proc sgplot data=salary;
   loess x=Year y=Salaries / smooth=0.5;       /* nonparametric regression */
   reg x=Year y=Salaries / degree=3 nomarkers legendLabel="Cubic Regression";
run;

For the loess model, salaries appear to increase with experience for the first six years and for years 12–18; the salaries appear to be flat for the other intervals. For the cubic regression model, the salaries appear to increase gradually, although they increase more quickly for the first 10 years.

These regression curves smooth the data. For a parametric model, you can sometimes interpret the parameters of the model in terms of physical quantities. Do you believe that the mean salary of a professor depends smoothly on the number of years of service? These models reflect that assumption.

Now think for a moment about what we did not do. We did not compute the sample mean for each year and "connect the dots." That would result in a jagged line, which does not smooth the data. "Connecting the dots" is not a model; it does not provide insight into the relationship between salary and years of service, not does it accomodate random errors in the data.

Moving from means to quantiles

Given the number of years of service, each of the previous curves predicts the mean salary. Statistically speaking, the regression curves model the "conditional mean" of the response variable. However, you can also design a regression procedure to model the conditional median of the response. In fact, you can model other quantiles, such as the upper quartile (75th percentile), the lower decile (10th percentile), or other values.

A model for a conditional quantile is known as quantile regression. In SAS, quantile regression is computed by using the QUANTREG procedure, the QUANTSELECT procedure (which supports variable selection), or the QUANTLIFE procedure (which support censored observations).

What is quantile regression?

There have been several introductory papers written on quantile regression:

I don't intend to duplicate these papers. Instead, let's use PROC QUANTREG to compute a quantile regression and compare it with the binned quantile plot:

title "Quantile Regression of Salary vs. Year";
ods graphics on;
proc quantreg data=salary ci=sparsity;
   model salaries = year year*year year*year*year /
                    quantile=0.25 0.5 0.75 plot=fitplot(showlimits);
run;

For these data, the model for the lower quartile increases for about 10 years, then levels off. The model for the conditional median is qualitatively similar to the cubic model for the conditional mean. The model for the upper quartile indicates a higher growth rate that for the median quartile. The quantile curves enable you to estimate how the inter-quartile range (the gap between the upper and lower quartiles) grows with time. For newly hired professors, only about $7,000 separates the relatively high salaries from the relatively low salaries. However, for professors with twenty or more years of experience, that gap has widened to more than $10,000.

When compared with the binned quantile plot at the beginning of this article, this quantile regression plot has several advantages:

  1. The curves smooth the data. Given the number of years of service, you can read off the predicted quartiles of salary from the plot.
  2. You can use the CI= option on the PROC QUANTREG statement to obtain confidence intervals (shown as shaded bands) for the predicted quartiles. See the PROC QUANTREG documentation for details.
  3. For parametric models (especially linear models), you might be able to interpret the parameters to gain insight into the process that generates the data. You can also compute nonparametric models by using the EFFECT statement to create spline effects.
  4. Quantile regression extends easily to multiple explanatory variables, whereas binning data gets harder as the dimension increases, and you often get bins for which there are no data.

So reach for quantile regression when you want to investigate how quartiles, quintiles, or deciles of the response variable change with covariates. The QUANTREG procedure is easy to run, and the results are superior to ad-hoc methods such as binning the data and connecting the sample quantiles.

Post a Comment