Pairwise comparisons of a data vector

A SAS customer showed me a SAS/IML program that he had obtained from a book. The program was taking a long time to run on his data, which was somewhat large. He was wondering if I could identify any inefficiencies in the program.

The first thing I did was to "profile" the program to identify trouble spots. (See Ch. 15, "Timing Computations and the Performance of Algorithms," in Statistical Programming with SAS/IML Software.) I discovered one function that was taking about 80% of the total time. The inefficient function (if I may oversimplify the problem) essentially took a vector x with N elements and returned an N x N indicator matrix, A, whose (i, j)th element is 1 if x[i] ≥ x[i] and is 0 otherwise. A small example of the x vector is used in the following statements. The inefficient code looked something like this:

proc iml;
x = {16, 15, 6, 6, 20, 5, 6, 12, 5, 12};  /* data for pairwise comparison */
N = nrow(x);
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] then 
         A[i,j] = 1;       /* A[i,j] = 1 if x[i] >= x[j]; 0 otherwise */

The 10 x 10 indicator matrix for this example is shown. When N is large, this double DO loop approach can be slow. I have written about similar computations before. For example, I've written about how to compute pairwise differences for a vector of values, which is almost the same problem. The key is to eliminate nested DO loops and replace them with vector or matrix operations, a process known as vectorization.

Pairwise comparisons: An exercise in vectorization

To implement a vector-based computation, think about the jth column of A. What is the jth column? It is the indicator vector obtained by comparing x to the jth value of x. The following statements rewrite the computation by using a single DO loop that iterates over each column of the resulting matrix:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]);

Equivalently, you could loop over rows of the indicator matrix by using B[i,] = (x` <= x[i]) (but notice the direction of the inequality!). Either method produces a big speedup in the computation.

A matrix approach is less intuitive, but can be more efficient unless the matrices are huge. The idea is to form a matrix with N columns, each column being a copy of the data vector x. If you compare the elements of this matrix to the elements of its transpose, you obtain the desired indicator matrix:

xx = repeat(x,  1, N);    /* the i_th row is x[i]     */
C = (xx >= xx`);          /* indicator matrix         */

The matrices A, B, and C are equal.

A slight twist

In the previous section, I oversimplified the problem. The real problem evaluated several complicated logical expressions (by using IF-THEN statements) within the doubly-nested DO loops. The result of those logical expressions affected the values of the indicator matrix. To simplify matters, I'll illustrate the situation by using a binary vector and a simple logical AND operator. Assume that the IF-THEN statements resolve to either 1 or 0 (true or false) for each value of the column index, j, as follows:

s = {0, 1, 1, 0, 0, 1, 0, 1, 1, 0};   /* result of logical expressions, j=1 to N */
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] & s[j]=1 then   /* more complicated IF-THEN logic: */
         A[i,j] = 1;                  /* A[i,j] = 1 if x[i] >= x[j] AND s[j]=1 */

Although this new situation is more complicated, the lessons of the previous section remain relevant. Here's how you can compute the indicator matrix by using vector operations:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]) & (s[j]=1);

Here's the equivalent matrix operations:

xx = repeat(x, 1, N);
ss = repeat(s, 1, N);
C = (xx >= xx`) & (ss`=1);

If the logic within the doubly-nested loops depends on both rows and columns (i and j), then the variable s becomes a matrix and the computations are modified accordingly.

In summary, many SAS/IML programmers know that looping over rows and columns of a matrix is inefficient and that vectorizing the computations leads to performance improvements. This article applies the same principles to pairwise comparisons of data in a vector. By avoiding scalar comparisons, you can improve the performance of the computation.

Post a Comment

Create standard statistical graphs from SAS/IML

Last week I showed how to use the SUBMIT and ENDSUBMIT statements in the SAS/IML language to call the SGPLOT procedure to create ODS graphs of data that are in SAS/IML vectors and matrices. I also showed how to create a SAS/IML module that hides the details and enables you to create a plot by using a single statement.

Wouldn't it be great if someone used these ideas to write SAS/IML modules that enable you to create frequently used statistical graphics like bar charts, histograms, and scatter plots?

Someone did! SAS/IML 12.3 (which shipped with SAS 9.4) includes five modules that enable you to create basic statistical graphs. The modules are named for a corresponding SGPLOT statement, as follow:

  • The BAR subroutine creates a bar chart of a categorical variable. If you specify a grouping variable, you can create stacked or clustered bar charts. You can control the ordering of the categories in the bar chart.
  • The BOX subroutine creates a box plot of a continuous variable or multiple box plots when you also specify a categorical variable. You can control the ordering of the categories in the box plot.
  • The HISTOGRAM subroutine creates a histogram of a continuous variable. You can overlay a normal density or kernel density estimate. You can control the arrangement of bins in the histogram.
  • The SCATTER subroutine creates a scatter plot of two continuous variables. You can color markers according to a third grouping variable, and you can specify labels for markers.
  • The SERIES subroutine creates a series plot, which is sometimes called a line plot. You can specify a grouping variable to overlay multiple line plots.

You can watch a seven-minute video that shows examples of statistical graphs that you can create by using these subroutines. The SAS/IML documentation also contains a new chapter on these statistical graphs.

Examples of creating statistical graphs

Because the video does not enable you to copy and paste, the rest of this article shows SAS/IML statements that create some of the graphs in the video. You can modify these statements to display new data, add or delete options, and so forth.

The video examples use data from the Sashelp.Cars data set, which contains information about 428 cars, trucks, and SUVs. The following statements read the data and create five SAS/IML vectors: three categorical variables (Model, Type, and Origin) and two continuous variables (MPG_City and Weight):

proc iml;
use Sashelp.Cars where(type ? {"SUV" "Truck" "Sedan"});
read all var {Model Type Origin MPG_City Weight};
close Sashelp.Cars;

The following statement creates a clustered bar chart of the Origin variable, grouped by the Type variable:

title "Clustered Bar Chart";
call Bar(Origin) group=Type groupopt="Cluster" grid="Y";

The graph shows the number of vehicles in the data that are manufactured in Asia, Europe, and the US. The graph shows that there are no European trucks in the data and that each region produces more than 75 sedans.

You can create a box plot that shows the relative fuel economy for vehicles that are manufactured in each of the three regions:

title "Box Plot with Data Labels";
call Box(MPG_City) category=Origin option="spread" datalabel=putc(Model,"$10.");

The box plots show that the average and median fuel economy of Asian vehicles is greater than for European and US-built vehicles. Outliers are labeled by using the first 10 characters of the Model variable. Outliers that would otherwise overlap are spread horizontally.

You can use a histogram to visualize the distribution of the MPG_City variable. The following statement creates a histogram and overlays a normal and kernel density estimate:

title "Histogram with Density Curves";
call Histogram(MPG_City) density={"Normal" "Kernel"} rebin={0 5};

The normal curve does not fit the data well. The kernel density estimate shows three peaks because many vehicles get about 20 mpg, about 25 mpg, and about 30 mpg.

You can create a scatter plot that shows the relationship between fuel economy and the weight of a vehicle. The following statement creates a scatter plot and colors each marker according to whether it represents a sedan, truck, or SUV.

title "Scatter Plot with Groups";
run Scatter(Weight, MPG_City) group=Type;

The scatter plot shows that lighter vehicles tend to have better fuel economy. Sedans, which tend to be relatively light, often have better fuel economy than trucks and SUVs.

The Sashelp.Cars data set is not appropriate for demonstrating a series plot. The following statements evaluate the standard normal probability density function on the range [-5, 5] and call the SERIES subroutine to visualize the density function. Two vertical reference lines are overlaid on the plot.

x = do(-5, 5, 0.1);
Density = pdf("Normal", x, 0, 1);
title "Series Plot with Reference Lines";
call Series(x, Density) other="refline -2 2 / axis=x" grid={X Y};

This article has briefly described five SAS/IML subroutines in SAS/IML 12.3 that enable you quickly to create ODS statistical graphs from data in vectors or matrices. The routines are not intended to be a complete interface to the SGPLOT procedure. Rather, the routines expose common options for frequently used statistical graphs. If you want to create more complicated graphs that use more esoteric options, you can use the techniques from my previous post to write your own SAS/IML modules that create the exact graph that you need.

Post a Comment

Simulate many samples from a logistic regression model

My last blog post showed how to simulate data for a logistic regression model with two continuous variables. To keep the discussion simple, I simulated a single sample with N observations. However, to obtain the sampling distribution of statistics, you need to generate many samples from the same logistic model. This article shows how to simulate many samples efficiently in the SAS/IML language. The example is from my book Simulating Data with SAS.

The program in my previous post consists of four steps. (Or five, if you count creating a data set as a separate step.) To simulate multiple samples, put a DO loop around Step 4, the step that generates a random binary response vector from the probabilities that were computed for each observation in the model. The following program writes a single data set that contains 100 samples. Each sample is identified by an ordinal variable named SampleID. The data set is in the form required for BY-group processing, which is the efficient way to simulate data with SAS.

%let N = 150;                                     /* N = sample size */
%let NumSamples = 100;                            /* number of samples */
proc iml;
call randseed(1);
X = j(&N, 3, 1);                               /* X[,1] is intercept */
/* 1. Read design matrix for X or assign X randomly.
   For this example, x1 ~ U(0,1) and x2 ~ N(0,2)  */
X[,2] = randfun(&N, "Uniform"); 
X[,3] = randfun(&N, "Normal", 0, 2);
/* Logistic model with parameters {2, -4, 1} */
beta = {2, -4, 1};
eta = X*beta;                       /* 2. linear model */
mu = logistic(eta);                 /* 3. transform by inverse logit */
/* create output data set. Output data during each loop iteration.     */
varNames = {"SampleID" "y"} || ("x1":"x2");           /* 1st col is ID */
Out = j(&N,1) || X;                /* matrix to store simulated data   */
create LogisticData from Out[c=varNames];
y = j(&N,1);                       /* allocate response vector         */
do i = 1 to &NumSamples;           /* the simulation loop              */
   Out[,1] = i;                         /* update value of SampleID    */
   call randgen(y, "Bernoulli", mu);    /* 4. simulate binary response */
   Out[,2] = y;
   append from Out;                     /* 5. output this sample       */
end;                               /* end loop                         */
close LogisticData; 

You can call PROC LOGISTIC and specify the SampleID variable on the BY statement. This will produce a set of 100 statistics. The union is an approximate sampling distribution for the statistics. The following call to PROC LOGISTIC computes 100 parameter estimates and saves them to the PE data set.

options nonotes;
proc logistic data=LogisticData noprint outest=PE;
   by SampleId;
   model y(Event='1') = x1 x2;
options notes;

Notice that the call to PROC LOGISTIC uses the NOPRINT option (and the NONOTES option) to suppress unnecessary output during the simulation.

You can use PROC UNIVARIATE to examine the sampling distribution of each parameter estimate. Of course, the parameter estimates also have correlation with each other, so you might also be interested in how the estimates are jointly distributed. You can use the CORR procedure to analyze and visualize the pairwise distributions of the estimates.

In the following statements, I do something slightly more complicated. I use PROC SGSCATTER to create a matrix of pairwise scatter plots of the parameter estimates, with the true parameter values overlaid on the estimates. It requires extra DATA steps to create and merge the parameter values, but I like visualizing the location of the estimates relative to the parameter values. Click on the image to enlarge it.

/* visualize sampling distribution of estimates; overlay parameter values */
data Param;   
   Intercept = 2; x1 = -4; x2 = 1;     /* parameter values */
data PE2;
   set PE Param(in=param);             /* append parameter values */
   if param then Beta = "True Value";  /* add ID variable */
   else          Beta = "Estimate  ";
ods graphics / reset ATTRPRIORITY=none;
title "Parameter Estimates from Logistic Regression";
title2 "&NumSamples Simulated Data Sets, N = &N";
proc sgscatter data=PE2 datasymbols=(Circle StarFilled);
   matrix Intercept x1 x2 / group=Beta markerattrs=(size=11);

The scatter plot matrix shows pairwise views of the sampling distribution of the three regression coefficients. The graph shows the variability of these estimates in a logistic model with the specified design matrix. It also shows how the coefficients are correlated.

By using the techniques in this post, you can simulate data for other generalized or mixed regression models. You can use similar techniques simulate data from other kinds of multivariate models.

Post a Comment

Simulating data for a logistic regression model

In my book Simulating Data with SAS, I show how to use the SAS DATA step to simulate data from a logistic regression model. Recently there have been discussions on the SAS/IML Support Community about simulating logistic data by using the SAS/IML language. This article describes how to efficiently simulate logistic data in SAS/IML, and is based on the DATA step example in my book.

The SAS documentation for the LOGISTIC procedure includes a brief discussion of the mathematics of logistic regression. To simulate logistic data, you need to do the following:

  1. Assign the design matrix (X) of the explanatory variables. This step is done once. It establishes the values of the explanatory variables in the (simulated) study.
  2. Compute the linear predictor, η = X β, where β is a vector of parameters. The parameters are the "true values" of the regression coefficients.
  3. Transform the linear predictor by the logistic (inverse logit) function. The transformed values are in the range (0,1) and represent probabilities for each observation of the explanatory variables.
  4. Simulate a binary response vector from the Bernoulli distribution, where each 0/1 response is randomly generated according to the specified probabilities from Step 3.

Presumably you are simulating the data so that you can call PROC LOGISTIC and obtain parameter estimates and other statistics for the simulated data. So, optionally, Step 5 is to write the data to a SAS data set so that PROC LOGISTIC can read it. Here's a SAS/IML program that generates a single data set of 150 simulated observations:

/* Example from _Simulating Data with SAS_, p. 226--229 */
%let N = 150;                                     /* N = sample size */
proc iml;
call randseed(1);     
X = j(&N, 3, 1);                               /* X[,1] is intercept */
/* 1. Read design matrix for X or assign X randomly.
   For this example, x1 ~ U(0,1) and x2 ~ N(0,2)  */
X[,2] = randfun(&N, "Uniform"); 
X[,3] = randfun(&N, "Normal", 0, 2);
/* Logistic model with parameters {2, -4, 1} */
beta = {2, -4, 1};
eta = X*beta;                       /* 2. linear model */
mu = logistic(eta);                 /* 3. transform by inverse logit */
/* 4. Simulate binary response. Notice that the 
      "probability of success" is a vector (SAS/IML 12.1)            */
y = j(&N,1);                             /* allocate response vector */
call randgen(y, "Bernoulli", mu);        /* simulate binary response */
/* 5. Write y and x1-x2 to data set*/
varNames = "y" || ("x1":"x2");
Out = X; Out[,1] = y;            /* simulated response in 1st column */
create LogisticData from Out[c=varNames];  /* no data is written yet */
append from Out;                               /* output this sample */
close LogisticData;

The SAS/IML statements for simulating logistic data are concise. A few comments on the program:

  • The design matrix contains two continuous variables. In this simulation, one is uniformly distributed; the other is normally distributed. This was done for convenience. You could also read in a data set that defines the observed values of the explanatory variables. See the discussion and examples in Simulating Data with SAS, p. 201–205. (Notice also that I used the RANDFUN function in SAS/IML 12.3 to generate these variables.)
  • The linear model does not contain an error term, as would be the case for linear regression. Instead, the binary response is random because we called the RANDGEN subroutine to generate a Bernoulli random variable.
  • The parameter to the Bernoulli distribution is a vector of probabilities. This feature requires SAS/IML 12.1.
  • So that the program can be extended to support an arbitrary number of explanatory variables, I copy all of the numerical data into the Out matrix, which is a copy of the X matrix except that the intercept column is replaced by the simulated binary responses. This matrix is written to a SAS data set.

What do the simulated data look like? The following call to SGPLOT produces a scatter plot of the X1 and X2 variables, with "events" colored red and "non-events" colored blue:

proc sgplot data=LogisticData;
scatter x=x1 y=x2 / group=y markerattrs=(symbol=CircleFilled);

The graph shows that the probability of an event decreases as you move from X1=0 (where there are many red markers) to X1=1 (where there are few red markers). This makes sense because the model coefficient for X1 is negative. In contrast, the probability of an event increases as you move from negative values of X2 to positive values. This makes sense because the model coefficient for X2 is positive. In general, most events (red markers) are in the upper left portion of the graph, whereas most non-events (blue markers) are in the lower right portion.

For small samples, there is a lot of uncertainty in the parameter estimates for a logistic regression. In this example, 150 observations were generated so that you can run PROC LOGISTIC against the simulated data and see that the parameter estimates are close to the parameter values. The following statement uses the CLMPARM=WALD option to compute parameter estimates and confidence intervals for the parameters:

proc logistic data=LogisticData plots(only)=Effect;
   model y(Event='1') = x1 x2 / clparm=wald;
   ods select CLParmWald EffectPlot;
t_LogisticSim LogisticSim

Notice that the parameter estimates are somewhat close to the parameter values of (2, -4, 1). The 95% confidence intervals include the parameter values. (Re-run the program with N = 1000 or N = 10000 to see whether the estimates get closer to the parameters!) The graph shows the predicted probability of an event as a function of X1, conditioned on X2=0.1, which is the mean value of X2. You could request a similar plot for the predicted probability as a function of X2 (at the mean of X1) by using the option
     plots(only)=Effect(showobs X=(X1 X2));

This article shows how to generate a single sample of data that fits a logistic regression model. In my next blog post, I will show you how to generalize this program to efficiently simulate and analyze many samples.

Post a Comment

Creating ODS graphics from the SAS/IML language

As you develop a program in the SAS/IML language, it is often useful to create graphs to visualize intermediate results. I do this all the time in my preferred development environment, which is SAS/IML Studio. In SAS/IML Studio, you can write a single statement to create a scatter plot, bar chart, histogram, and other standard statistical graphics. (For an overview of IMLPlus graphics, see the documentation chapter "Creating Dynamically Linked Graphs." For a longer discussion of IMLPlus programming, see Chapter 7 of Statistical Programming with SAS/IML Software.)

If you are writing a SAS/IML program by using a programming environment such as the SAS Windowing environment or SAS Enterprise Guide, then the IMLPlus graphics are not available. However, recall that you can call SAS procedures from the SAS/IML language. In particular, you can generate ODS graphics without leaving PROC IML by using the SUBMIT and ENDSUBMIT statements to call the SGPLOT procedure.

To give an example, suppose that you recently read my blog post on the new RANDFUN function and you decide to try it out. You call the RANDFUN function to simulate 1,000 observations from a normal distribution with mean 10 and standard deviation 2. Because you have never used that function before, you might want to create a histogram of the simulated data to make sure that it appears to be correct. No problem: just use the CREATE statement to create a SAS data set from a SAS/IML matrix, then use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT:

proc iml;
call randseed(1);
x = randfun(1000, "Normal", 10, 2);          /* X ~ N(10,2) */
/* graph the simulated data */
create Normal var {x}; append; close Normal; /* write data */
  proc sgplot data=Normal;
    histogram x;                             /* histogram */
    density x / type=normal(mu=10 sigma=2);  /* overlay normal */

The graph overlays a normal density curve on a histogram of the data. The graph indicates that the data are normally distributed with the specified parameters.

This technique is very powerful: from within your SAS/IML program you can create any graph by calling PROC SGPLOT or some other SG procedure!

In fact, you can do even more. You can write a module that encapsulates the creation of the data set and the call to PROC SGPLOT. The module can support optional parameters that invoke various options for PROC SGPLOT. For example, when called with one argument the following module displays a histogram. If you specify the optional second parameter, you can overlay a kernel density estimate on the histogram, as shown below:

/* call PROC SGPLOT to create a histogram; optionally overlay a KDE */
start MakeHistogram(x, addKernel=0);
   create _Temp var {x}; append; close _Temp; /* write data */
   densityStmt = choose(addKernel=0,  " ",  "density x / type=kernel");
   submit densityStmt;
     proc sgplot data=_Temp;
       histogram x;      /* create histogram */ 
       &densityStmt;     /* optionally add kernel density */
   call delete("_Temp"); /* delete temp data set */
run MakeHistogram(x);              /* just create a histogram         */
run MakeHistogram(x) addKernel=1;  /* overlay kernel density estimate */

The histogram with the kernel density estimate is shown above. The module uses an optional parameter with a default value to determine whether to overlay the kernel density estimate. Notice that the variable densityStmt is a character matrix that is either blank or contains a valid DENSITY statement for the SGPLOT procedure. The densityStmt variable is listed on the SUBMIT statement, which is the way to pass a string value to a SAS procedure. Inside the SUBMIT block, the token &densityStmt resolves to either a blank string or to a valid DENSITY statement. Notice that the temporary data set (_Temp) is created and deleted within the module.

This article shows how you can write a module to create any graph of data in SAS/IML vectors by calling the SGPLOT procedure. The possibilities are endless. Do you have ideas about how you might use this technique in your own work? Leave a comment.

Post a Comment

Approximating a distribution from published quantiles


A colleague asked me an interesting question:

I have a journal article that includes sample quantiles for a variable. Given a new data value, I want to approximate its quantile. I also want to simulate data from the distribution of the published data. Is that possible?

This situation is common. You want to run an algorithm on published data, but you don't have the data. Even if you attempt to contact the original investigators, the data are not always available. The investigators might be dead or retired, or the data are confidential or proprietary. If you cannot obtain the original data, one alternative is to simulate data based on the published descriptive statistics.

Simulating data based on descriptive statistics often requires making some assumptions about the distribution of the published data. My colleague had access to quantile data, such as might be produced by the UNIVARIATE procedure in SAS. To focus the discussion, let's look at an example. The table to the left shows some sample quantiles of the total cholesterol in 35 million men aged 40-59 in the years 1999–2002. The data were published as part of the NHANES study.

With 35 million observations, the empirical cumulative distribution function (ECDF) is a good approximation to the distribution of cholesterol in the entire male middle-aged population in the US. The published quantiles only give the ECDF at 11 points, but if you graph these quantiles and connect them with straight lines, you obtain an approximation to the empirical cumulative distribution function, as shown below:


My colleague's first question is answered by looking at the plot. If a man's total cholesterol is 200, you can approximate the quantile by using linear interpolation between the known values. For 200, the quantile is about 0.41. If you want a more exact computation, you can use a linear interpolation program.

The same graph also suggest a way to simulate data based on published quantiles of the data. The graph shows an approximate ECDF by using interpolation to "connect the dots" of the quantiles. You can use the inverse CDF method for simulation to simulate data from this approximate ECDF. Here are the main steps for the simulation:

  • Use interpolation (linear, cubic, or spline) to model the ECDF. For a large sample of a continuous variable, the approximate ECDF will be a monotonic increasing function from the possible range of the variable X onto the interval [0,1].
  • Generate N random uniform values in [0,1]. Call these values u1, u2, ..., uN.
  • Because the approximate ECDF is monotonic, for each value ui in [0,1] there is a unique value of x (call it xi) that is mapped onto ui. For cubic interpolation or spline interpolation, you need to use a root-finding algorithm to obtain each xi. For linear interpolation, the inverse mapping is also piecewise linear, so you can use a linear interpolation program to solve directly for xi.
  • The N values x1, x2, ..., xN are distributed according to the approximate ECDF.

The following graph shows the distribution of 10,000 simulated values from the piecewise linear approximation of the ECDF. The following table compares the quantiles of the simulated data (the SimEst column) to the quantiles of the NHANES data. The quantiles are similar except in the tails. This is typical behavior because the extreme quantiles are more variable than the inner quantiles. You can download the SAS/IML program that simulates the data and that creates all the images in this article.

t_empdistrib2 empdistrib2


A few comments on this method:

  • The more quantiles there are, the better the ECDF is approximated by the function that interpolate the quantiles.
  • Extreme quantiles are important for capturing the tail behavior.
  • The simulated data will never exceed the range of the original data. In particular, the minimum and maximum value of the sample determine the range of the simulated data.
  • Sometimes the minimum and maximum values of the data are not published. For example, the table might end begin with the 0.05 quantile and end with the 0.95 quantile. In this case, you have three choices:
    • Assign missing values to the 10% of the simulated values that fall outside the published range. They simulated data will be from a truncated distribution of the data. The simulated quantiles will be biased.
    • Use interpolation to estimate the minimum and maximum values of the data. For example, you could use the 0.90 and 0.95 quantiles to extrapolate a value for the maximum data value.
    • Use domain-specific knowledge to supply reasonable values for the minimum and maximum data values. For example, if the data are positive, you could use 0 as a lower bound for the minimum data value.

Have you ever had to simulate data from a published table of summary statistics? What technique did you use? Leave a comment.

Post a Comment

Convenient functions vs. efficient subroutines: Your choice

I've pointed out in the past that in the SAS/IML language matrices are passed to modules "by reference." This means that large matrices are not copied in and out of modules but are updated "in place." As a result, the SAS/IML language can be very efficient when it computes with large matrices. For example, when running large simulations, the SAS/IML language is often much faster than competing software that supports only function calls.

However, subroutines (which are called by using the CALL or RUN statements) are sometimes less convenient for a programmer to use. Each call to a subroutine is a separate statement. In contrast, you can nest function calls and use several function calls within a single complicated expression.

The convenience of functions is evident in some simulation studies. For example, you can regard the negative binomial distribution as a Poisson(λ) distribution, where λ is a gamma-distributed random variable. A distribution that has a random parameter is called a compound distribution. In a previous blog post I discussed the following program, which calls the RANDGEN subroutine to simulate data from a compound distribution:

proc iml;
call randseed(12345);
N = 1000; a = 5; b = 2;
/* Simulate X from a compound distribution: 
   X ~ Poisson(lambda), where lambda ~ Gamma(a,b) */
lambda = j(N,1); 
call randgen(lambda, "Gamma", a, b); /* lambda ~ Gamma(a,b) */
x = j(N,1);
call randgen(x, "Poisson", lambda);  /* pass vector of parameters */

After setting up the parameters, it takes four statements to simulate the data: two to allocate vectors to hold the results and two calls to the RANDGEN subroutine to fill those vectors with random values. Fewer statements are needed if a language supports a function that returns N random values.

Well, as of SAS/IML 12.3 (shipped with SAS 9.4), you can use the RANDFUN function when you want the convenience of calling a function that returns random values from a distribution. The syntax is
   x = RANDFUN(N, "Distribution" <,param1> <,param2> <,param3>);
The following statements call the RANDFUN function to simulate data from a compound distribution:

lambda = randfun(N, "Gamma", a, b);    /* lambda ~ Gamma(a,b) */
x = randfun(N, "Poisson", lambda);     /* X ~ Poisson(lambda) */

If you want to nest the calls to the RANDFUN function, you can simulate the data with a single statement:

x = randfun(N, "Poisson", randfun(N, "Gamma", a, b)); /* compound distribution */

If you are simulating data inside a DO loop, I strongly suggest that you continue to use the more efficient RANDGEN subroutine. However, in applications where performance is not a factor, you can enjoy the convenience of simulating data with a function call by using the new RANDFUN function.

Post a Comment

Geometry, sensitivity, and parameters of the lognormal distribution

Today is my 500th blog post for The DO Loop. I decided to celebrate by doing what I always do: discuss a statistical problem and show how to solve it by writing a program in SAS.

Two ways to parameterize the lognormal distribution

I recently blogged about the relationship between the parameters in the lognormal family and the underlying normal family. This inspired me to look closer into how the mean and standard deviation of the normal distribution are related to the mean and standard deviation of the lognormal distribution.

Recall that if X ~ N(μ, σ) is normally distributed with parameters μ and σ, then Y = exp(X) is lognormally distributed with the same parameters. The mean of Y is m = exp(μ + σ2/2) and the variance is v = (exp(σ2) -1) exp(2μ + σ2). As usual, the standard deviation of Y is sqrt(v).

These formulas establish a one-to-one (and therefore invertible) relationship between the central moments (means and standard deviations) of the normal and lognormal distributions. If you have the lognormal parameters (m, sqrt(v)), you can compute the corresponding normal parameters, and vice versa. The formulas for μ and σ as functions of m and v are μ = ln(m2 / φ) and σ = sqrt( ln(φ2 / m2) ) where φ = sqrt(v + m2). You can think of these formulas as defining a function, F, that maps (m, sqrt(v)) values into (μ, σ) values.


What does this mapping tell us about the relationship between the central moments of the normal and lognormal distributions? What is the geometry of the transformation that maps one set of means and standard deviations to the other?

If F is the function that gives (μ, σ) as a function of (m, sqrt(v), you can visualize F by using the graphic to the left. (Click to enlarge.) The grid in the upper part of the picture represents a uniform mesh [1,30] x [1,30] in the domain of F. These are means and standard deviations for lognormal distributions. Vertical lines (m is constant) are drawn in red; horizontal lines (v is constant) are drawn in blue. The corresponding (μ, σ) values (the image of the grid under F) are shown in the bottom part of the picture. The bottom picture shows the mean and standard deviation of the corresponding normal distributions.

By looking at the scale of the graphs, you can see that the transformation is highly contractive on this domain. The function maps a huge rectangle (841 square units) into a little wing-shaped region whose area is about 11 square units. The contraction is especially strong when the lognormal parameters m and v are large. For example the entire right half of the grid in the domain of F (where m ≥ 15) is mapped into a tiny area near (μ,σ)=(3,0.5) in the range.

This fact is not too surprising because the inverse mapping (F–1) is exponential, and therefore highly expansive. Nevertheless, it is illuminating to see the strength of the contraction. For example, the following graphic "zooms in" on an area of the domain that is centered near the point (m, sqrt(v)) = (22.5, 11.5), which is the approximate pre-image of (μ, σ) = (3, 0.5). The parallelogram in the domain of F is mapped to a tiny square that is centered near (μ, σ) = (3, 0.5). On this scale, the mapping is well-approximated by its linearization, which is the Jacobian matrix of first derivatives. The determinant of the Jacobian matrix at the point (m, sqrt(v) = (22.5, 11.5) is about 1/600, which means that the big parallelogram on the left has about 600 times the area of the tiny square on the right. (In the illustration, the square is not drawn to scale.)


Implications for data analysis

These pictures and computations tell us a lot about the relationship between the central moments of the normal and the lognormal distribution. The main result is that small changes in the parameters of the normal distribution cause large changes in the mean and standard deviation of the lognormal distribution. This leads to the following corollaries:

  • Small variations in normal data can lead to big difference in the lognormal data. If you simulate a small sample of data from an N(3, 0.5) distribution, you will get a sample moments that vary slightly from the parameter values. For example, the sample mean might range between 2.9 and 3.1, and the sample standard deviation might range between 0.4 and 0.6. However, if you exponentiate the simulated values to form a lognormal distribution, the mean and standard deviation of the lognormal data will vary widely from sample to sample. In short, the mean and standard deviation of lognormal data are very sensitive to variation in the normal samples.
  • For lognormal data with a large mean, the parameter estimates are not sensitive to variation in the data. Suppose you have a small data set with mean 22.5 and the standard deviation 11.5, and you think it lognormally distributed. If you use PROC UNIVARIATE to estimate the parameters for a lognormal fit, you will get estimates that are close to μ=3 and σ=0.5. Because of the small sample size, you might be worried that the parameter estimates aren't very good. But look at it this way: if you draw a different sample whose mean and standard deviation are several units away from those values, the parameter estimates will still be close to μ=3 and σ=0.5! Variations in the lognormal data are not very important when you estimate the parameters of the underlying normal data.

For simplicity, this discussion focused on parameter values near (μ, σ)=(3, 0.5), but you can use the same ideas and pictures to study other regions of parameter space. Moreover, the same ideas apply to other data transformations.

Computational details

The SAS/IML language is an excellent way to compute the quantities and to create the pictures in this article. In particular, I used the dynamically linked graphics in SAS/IML Studio to explore this problem because of the highly flexible graphics in IMLPlus. You can download the complete SAS/IML Studio program that I used to create the images.

The grids in the graphs were created by using the EXPANDGRID function that I discussed in the article how to generate a grid of points in SAS. It is easy to apply the transformation to the grid:

proc iml;
/* lognormal to normal transformation (m, sqrt(v)) --> (mu, sigma) */
start LN2N(p);
   m = p[,1]; v = p[,2]##2;
   phi = sqrt(v + m##2);
   mu = log(m##2 / phi);
   sigma = sqrt(log(phi##2/m##2));
   return( mu || sigma );
m = 1:30;                      /* horizontal grid points */
sqrtV = 1:30;                  /* vertical grid points */
grid = expandgrid(m, sqrtV);   /* make grid (SAS/IML 12.3) */
Fgrid = LN2N(grid);            /* compute image of grid */

You can easily compute a numerical Jacobian matrix (the matrix of first derivatives) by using the NLPFDD function, which computes finite-difference derivatives. You can compute the determinant of the Jacobian matrix by using the DET function. The following statements compute the Jacobian matrix and determinant at the point (22.5, 11.5). The determinant is approximately 1/600.

/* NLPFDD wants a function that returns a column vector */
start LN2Ncol(x);  
   return( T(LN2N(x)) );
/* linearization of lognormal-->normal transformation */
x0 = {22.5 11.5};    /* approximate pre-image of (3, 0.5) */
call nlpfdd(f, J, JpJ, "LN2Ncol", x0, {2 . .});
detJ = det(J);       /* det(J) shows strong local contraction */
print J[c={"dx1" "dx2"} r={"dF1" "dF2"}], detJ;

This article—my 500th blog post—provides a good example of what I try to accomplish with my blog. The problem was inspired by a question from a SAS customer. I analyzed it by writing a SAS/IML program. The analysis requires applications of statistics, multivariate analysis, geometry, and matrix computations. I created some cool graphs. Thanks for reading.

Post a Comment

How to find an initial guess for an optimization


Nonlinear optimization routines enable you to find the values of variables that optimize an objective function of those variables. When you use a numerical optimization routine, you need to provide an initial guess, often called a "starting point" for the algorithm. Optimization routines iteratively improve the initial guess in an attempt to converge to an optimal solution. Consequently, the choice of a starting point determines how quickly the algorithm converges to a solution and—for functions with multiple local extrema—to which optimum the algorithm converges.

But how can you produce a "good" starting point for multivariate functions? Some analysts use a constant vector, such as a vector of all 0s or a vector of all 1s. Although this practice is widespread, this default initial guess might not converge to a global extremum. You can usually do better by choosing a starting point that is based on a few easy evaluations of the objective function.

I recently described how to use the EXPANDGRID function in SAS/IML software to construct a set of ordered k-tuples that are arranged on a uniformly spaced grid. By evaluating a function on a grid, you can visualize the function and determine a starting point for an optimization algorithm. This "evaluate on a grid" technique is sometimes called a "pre-search" technique. This article shows how to choose an initial guess when there are no constraints on the parameters of the objective function ("unconstrained optimization").

Defining objective functions in SAS/IML

To use the pre-search technique, you should write your objective function so that you can pass in a matrix. The columns of the matrix are the variables in the equation for the objective function. Each row of the matrix is a point at which to evaluate the function.

The NLP routines in SAS/IML will work even if your objective function evaluates only scalar values, but the pre-search technique is much more efficient if you write a vectorized function. To vectorize a function, use the elementwise multiplication operator (#) in your formulas. For example, if your objective function is g(x,y) = x2 + x*y + y2, define the following SAS/IML function:

proc iml;
start g(m);
   x = m[,1]; y = m[,2];        /* variables are columns of m */
   return( x##2 + x#y + y##2 ); /* use elementwise multiplication */
m = {0 0,     /* test it: define three points in the domain of g */
     0 1,
     1 2};
z = g(m);     /* one call evaluates g(x,y) at three points! */

By writing the function in a vectorized fashion, you can evaluate arbitrarily many points (rows of a matrix) with a single call.

Visualizing a function of two variables

There are several "test suites" of functions that are used to test optimization algorithms. One test function is called the "four-hump camel function" because it has four local maxima (from Dixon and Szego (1978), "The global optimization problem: an introduction"). Some people call it the "six-hump camel function" because it also has two local minima, but that's a misnomer. The camel function is a polynomial equation in two variables:
   f(x,y) = (4 - 2.1 x2 + x4 /3) x2 + xy + (-4 + 4 y2) y2

The following SAS/IML function defines the four-hump camel function, generates a uniform grid of points on the rectangle [–3, 3] x [–2, 2], and creates a rotating surface plot by using the RotatingPlot graph in the SAS/IML Studio application:

proc iml;
/* four-hump camel has two global maxima:
   p1 = {-0.08984201  0.71265640 };
   p2 = { 0.08984201 -0.71265640 };   */
start Camel(m);
   x = m[,1];  y = m[,2];
   f = ( 4 - 2.1*x##2 + x##4 /3) # x##2 + x#y + 
       (-4 + 4  *y##2) # y##2;
   return( -f );   /* return -f so that extrema are maxima */
m = ExpandGrid( do(-3, 3, 0.1), do(-2, 2, 0.1) );   /* SAS/IML 12.3 */
z = Camel(m);
/* In SAS/IML Studio, visualize surface with 3D rotating plot */
RotatingPlot.Create("Camel", m[,1], m[,2], z);

The surface plot is shown at the top of this article. It indicates that the function is fairly flat on top, but has small bumps that contain local maxima. By using the interactive graphics in SAS/IML Studio, it is easy to find the absolute maxima (there are two), or you can find the maxima programmatically by using the following statements:

optIndex = loc(z=max(z));    /* find maximum values of z   */
optX = m[optIndex,];         /* corresponding (x,y) values */ 
print optX;

Prior to these computations, I had never seen the "camel function" before. I did not know where the maxima might be. However, by evaluating the function on a coarse grid of 2,500 points, I now know that if I want to find global maxima, I should look in a neighborhood of (–0.1, 0.7) and (0.1, –0.7). These would be good choices to use as starting points for an optimization algorithm.

Visualizing a function of three variables


Although it is harder to visualize a function of three variables (the graph is in four-dimensional space!), you can still get a feel for high and low values of a function. You can evaluate a function of three variables on a uniform grid and then color each point in the grid according to the functional value at that point. This generalizes the two-dimensional example that I wrote about in the article "Color scatter plot markers by values of a continuous variable." As an example, define the following quadratic function:
   Q(x,y,z) = –(x - 1.1)2 – (y - 2.2)2 – (z - 3.3)2

By inspection, the function Q(x,y,z) has a single global maximum at the point (1.1, 2.2, 3.3). The following statements evaluate the function on a 3-D grid of half integers:

start Q(x);
   return( -(x[,1] - 1.1)##2 - (x[,2] - 2.2)##2 - (x[,3] - 3.3)##2 );
m = ExpandGrid( do(0,5,0.5), do(0,5,0.5), do(0,5,0.5) );
f = Q(m);
optIndex = f[<:>];           /* find maximum values of Q     */
optX = m[optIndex,];         /* corresponding (x,y,z) values */ 
print optX;

The result shows that (1, 2, 3.5) is the point on the grid for which the function is the largest. If you start a numerical optimization from that point, you will converge very quickly to the optimal solution. The 3-D scatter plot at the beginning of this section is a scatter plot of the grid, where each point has been colored according to the value of Q. The big blob of red indicates where the function is the largest. This graph crudely visualizes the function by showing where it is large and where it is small.

Finding an initial guess for an extremum

In theory, the pre-search technique works for any multivariate function. However, the size of the grid grows geometrically as the number of variables increases, so you will need to reduce the coarseness of the grid for functions of many variables. For example, if you use k grid points for each variable of a 10-dimensional function, then the grid will contain k10 points! Clearly, you should choose k to be small. If k = 5, then the grid contains 510 = 9.7 million points, which is large, but still computationally manageable. For functions with more than 15 variables, the pre-search technique on a "full factorial" grid becomes impractical.

Nevertheless, a coarse grid gives you more information than no grid at all, and a starting point based on the pre-search technique is usually superior to an arbitrary starting point. It is easy to implement the pre-search technique in the SAS/IML language, and you can often use it to provide a good initial guess to a nonlinear optimization routine.

Post a Comment

How to generate a grid of points in SAS

In many areas of statistics, it is convenient to be able to easily construct a uniform grid of points. You can use a grid of parameter values to visualize functions and to get a rough feel for how an objective function in an optimization problem depends on the parameters. And of course, grids are used to construct a full factorial design, which is an experimental design in which every setting of every factor appears with every setting of every other factor. I have previously blogged about how to generate two-dimensional grids in SAS, but this article describes how to generate grids with more dimensions.

In the SAS DATA step, you can create a grid of points by writing nested DO loops. For example, suppose that there are three factors: X with discrete values {–2, –1, 0, 1, 2}, Y with values {–1, 0, 1}, and Z with values {–1, 1}. The following DATA step creates data for each combination of the three factors:

data grid;
do x = -2 to 2;
   do y = -1 to 1;
      do z = -1, 1;
proc print data=grid; run;

In this data set, the Z variable changes the fastest and the X variable changes the slowest as you read through the observations. This corresponds to the fact that the Z variable is the innermost of the nested DO loops in the DATA step. By changing the order of the DO loops, you can change which variable varies the fastest. The data set contains 5*3*2 = 30 observations, and each observation is a unique combination of the levels of the three variables.

In the DATA step, a grid that involves k variables requires k nested DO loops. In a matrix language such as SAS/IML, vector operations and calls to built-in functions are more efficient than DO loops. In SAS/IML 12.3 (released with SAS 9.4), you can use a single call to the EXPANDGRID function to generate a grid of points for up to 15 variables. The following SAS/IML program creates a matrix with three columns and 30 rows. The values of the matrix are identical to the values that are generated by the DATA step. In particular, the first variable changes the slowest and the last variable changes the fastest as you move down the rows of the matrix.

proc iml;
/* varies  SLOW <-----------------> FAST */
/* values for    x       y       z      */
m = expandgrid(-2:2,   -1:1,  {-1 1});

The EXPANDGRID function is very useful. If you have a function of k variables, each row of m is a valid ordered k-tuple for the function. (k = 3 for this example.) If you evaluate the function at each point of the grid, you obtain a grid of function values. This enables you to obtain a feeling for where the function is large and where it is small within the range of the grid. For example, the following statements define the mulitvariate quadratic function f(x) = (x-1.8)2 + (y-1.1)2 + z2 and evaluates the function at each of the points that make up the 30 rows of m:

start Func(x);
   return( (x[,1]-1.8)##2 + (x[,2]-1.1)##2 + x[,3]##2 );
f = Func(m);
print m[c={"x" "y" "z"}] f;

You can use the EXPANDGRID function in many ways. When k equals 2 or 3, you can visualize a function by evaluating it on a uniform grid of points and plotting the result. When performing nonlinear optimization, you can use the EXPANDGRID function to obtain an initial guess for the optimal parameter value. I will demonstrate these techniques in my next blog post.

Post a Comment