A simple way to construct a large correlation matrix

Occasionally a SAS statistical programmer will ask me, "How can I construct a large correlation matrix?" Often they are simulating data with SAS or developing a matrix algorithm that involves a correlation matrix. Typically they want a correlation matrix that is too large to input by hand, such as a correlation matrix for 100 variables. The particular correlations are not usually important; they just want any valid correlation matrix.

Except for diagonal and diagonally dominant matrices, it is not easy to write down a valid correlation matrix. I've previously written about the fact that not every symmetric matrix with a unit diagonal is a correlation matrix. Correlation matrices are symmetric and positive definite (PD), which means that all the eigenvalues of the matrix are positive. (Technically, a correlation matrix can have a zero eigenvalue, but that is a degenerate case that I prefer to avoid.)
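If you want to see a concrete counterexample outside of SAS, here is a quick sketch in Python with NumPy (an illustration, not part of the SAS workflow). The matrix below is symmetric with a unit diagonal, yet it has a negative eigenvalue, so it is not a valid correlation matrix:

```python
import numpy as np

# Symmetric with 1s on the diagonal, but the pairwise "correlations"
# are jointly inconsistent, so the matrix is not positive definite.
A = np.array([[ 1.0,  0.9, -0.9],
              [ 0.9,  1.0,  0.9],
              [-0.9,  0.9,  1.0]])
min_eig = np.linalg.eigvalsh(A).min()
print(min_eig)   # negative => not a correlation matrix
```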

So here is a tip: you can generate a large correlation matrix by using a special Toeplitz matrix.

Recall that a Toeplitz matrix has a banded structure. The diagonals that are parallel to the main diagonal are constant. The SAS/IML language has a built-in TOEPLITZ function that returns a Toeplitz matrix, as shown:

proc iml;
T = toeplitz(5:1);    /* 5x5 Toeplitz matrix */
print T;

A Toeplitz matrix is obviously symmetric, and the following computation shows that the smallest eigenvalue is positive, which means that the matrix is positive definite:

smallestEigenvalue = min( eigval(T) );  /* compute smallest eigenvalue */
print smallestEigenvalue;

Generating positive definite Toeplitz matrices

In the previous example, the matrix was generated by the vector {5,4,3,2,1}. Because it is symmetric and PD, it is a valid covariance matrix. Equivalently, the scaled Toeplitz matrix that is generated by the vector {1,0.8,0.6,0.4,0.2} is a correlation matrix that is also PD. This example is a specific case of a very cool mathematical fact:

A Toeplitz matrix generated from any linearly decreasing sequence of nonnegative values is positive definite.

In other words, for every positive number R and increment h, the k-element vector {R, R-h, R-2h, ..., R-(k-1)h} generates a valid covariance matrix provided that R-(k-1)h > 0, which is equivalent to h < R/(k-1). A convenient choice is h = R / k. This is a useful fact because it enables you to construct arbitrarily large Toeplitz matrices from a decreasing sequence. For example, the following statements use R=1 and h=0.01 to construct a 100 x 100 correlation matrix. You can use the matrix to simulate data from a multivariate normal distribution with 100 correlated variables.

p = 100;
h = 1/p;                         /* stepsize for decreasing sequence */
v = do(1, h, -h);                /* {1, 0.99, 0.98, ..., 0.01} */
Sigma = toeplitz(v);             /* PD correlation matrix */
mu = j(1, ncol(v), 0);           /* population mean vector: (0,0,0,...0) */
X = randnormal(500, mu, Sigma);  /* generate 500 obs from MVN(mu, Sigma) */
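If you do not have SAS/IML at hand, you can check the same construction numerically. The following Python/NumPy sketch (an illustrative check, not the SAS code above) builds the 100 x 100 Toeplitz matrix from the decreasing sequence and confirms that its smallest eigenvalue is positive:

```python
import numpy as np

p = 100
v = np.linspace(1.0, 1.0 / p, p)      # {1, 0.99, 0.98, ..., 0.01}
idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sigma = v[idx]                        # Toeplitz matrix: Sigma[i,j] = v[|i-j|]
min_eig = np.linalg.eigvalsh(Sigma).min()
print(min_eig > 0)                    # True: the matrix is positive definite
```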

A proof of the fact that the Toeplitz matrix is PD is in a paper by Bogoya, Böttcher, and Grudsky (J. Spectral Theory, 2012), who prove the result on p. 273.

A stronger result: Allowing negative correlations

In fact, Bogoya, Böttcher, and Grudsky prove a stronger result. The previous example assumes that all 100 variables are positively correlated with each other. The paper also proves that you can use a decreasing sequence of length n that contains negative values, as long as the sum of the values is positive. Thus to generate 100 variables with almost as many negative correlations as positive, you can choose h = 2/p, as follows:

h = 2/p;                   /* stepsize for decreasing sequence */
v = do(1, -1+h, -h);       /* {1, 0.98, 0.96, ..., -0.96, -0.98} */
Sigma = toeplitz(v);       /* PD correlation matrix */
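Again, you can verify numerically (here in Python/NumPy, as an illustrative check) that the Toeplitz matrix remains positive definite even though many of its entries are negative, because the generating sequence sums to a positive value:

```python
import numpy as np

p = 100
h = 2.0 / p
v = 1.0 - h * np.arange(p)            # {1, 0.98, ..., -0.96, -0.98}
idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sigma = v[idx]                        # Toeplitz matrix with negative entries
print(v.sum())                        # positive, so the theorem applies
print(np.linalg.eigvalsh(Sigma).min() > 0)   # True
```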

Large covariance matrices

A Toeplitz covariance matrix has a constant diagonal, which corresponds to having the same variance for all variables. However, you can use the CORR2COV function in SAS/IML to convert a correlation matrix to a covariance matrix. Algebraically, this transformation is accomplished by pre- and post-multiplying by a diagonal matrix, which preserves positive definiteness. For example, the following statement converts Sigma into a covariance matrix for which each variable has a different variance:

sd = 1:p;                  
Cov = corr2cov(Sigma, sd);  /* make std(x_i) = i */
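Algebraically, CORR2COV computes Cov = D*Corr*D, where D is the diagonal matrix of standard deviations. Here is a minimal sketch in Python/NumPy; the function name merely mirrors the SAS routine and is not a library call:

```python
import numpy as np

def corr2cov(corr, sd):
    """Scale a correlation matrix into a covariance matrix: D*Corr*D."""
    D = np.diag(sd)
    return D @ corr @ D

corr = np.array([[1.0, 0.5],
                 [0.5, 1.0]])
sd = np.array([2.0, 3.0])
cov = corr2cov(corr, sd)
print(np.diag(cov))      # variances are sd**2: [4., 9.]
print(cov[0, 1])         # covariance: 0.5 * 2 * 3 = 3.0
```

Because the transformation is a congruence by an invertible diagonal matrix, a PD correlation matrix always yields a PD covariance matrix.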

The next time you need to generate a large symmetric positive-definite matrix, remember the Toeplitz matrix.


Excluding variables: Read all but one variable into a matrix

Dear Rick,
I have a data set with 1,001 numerical variables. One variable is the response; the others are explanatory variables. How can I read the 1,000 explanatory variables into an IML matrix without typing every name?

That's a good question. You need to be able to perform two sub-tasks:

  1. Create a character vector that contains the names of all the variables. (If the data set contains both numeric and character variables, the character vector should contain the names of all numeric variables.)
  2. Exclude one or more elements from a character vector.

Discover the names of the data set variables

Just as you can use PROC CONTENTS to discover the names of variables in a data set, SAS/IML has the CONTENTS function, which returns a character vector that contains the variable names. The argument to the CONTENTS function can be the name of a data set. If you have already opened a data set, you can omit the argument to obtain the variable names of the open data set, as follows:

proc iml;
use Sashelp.Heart;                  /* open data set */
varNames = contents();              /* get all variable names */

However, most of the time (as above) we do not have a data set that has only numerical variables. To obtain a vector of only the numeric variables, read one observation of the data into a matrix, and use the COLNAME= option to obtain the variable names:

read next var _NUM_ into X[colname=varNames];    /* read only numeric vars */
print varNames;

To save space, only the first few columns of the output are displayed below:


Exclude elements from a character vector

After you create a vector that contains variable names, you can use the SETDIF function to exclude certain variables. The SETDIF function also sorts the list of variable names, which can be useful:

YVar = "Weight";                    /* variable to exclude from the matrix */
XVarNames = setdif(varNames, YVar); /* exclude Y, sort remaining X */

If you want to preserve the order of the variables, use the REMOVE function and specify the indices of the elements that you want to remove. The LOC function enables you to find the indices of the elements that you want to remove, as follows:

XVarNames = remove(varNames, loc(varNames=YVar));
print XVarNames;
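The same two behaviors are easy to mimic in other languages. Here is a Python sketch (illustrative, with made-up variable names) that contrasts a sorted set difference, like SETDIF, with an order-preserving removal, like REMOVE plus LOC:

```python
var_names = ["Status", "Weight", "Height", "Age"]   # hypothetical names
y_var = "Weight"

# Like SETDIF: excludes y_var but sorts the remaining names
sorted_x = sorted(set(var_names) - {y_var})

# Like REMOVE + LOC: excludes y_var and preserves the original order
ordered_x = [name for name in var_names if name != y_var]

print(sorted_x)    # ['Age', 'Height', 'Status']
print(ordered_x)   # ['Status', 'Height', 'Age']
```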

Putting it all together

For the Sashelp.Heart data set, here is how to read the variable Weight into a vector, and read all other numeric variables into a matrix X:

proc iml;
YVar = "Weight";                    /* var to exclude from the matrix */
dsName = "Sashelp.Heart";           /* data set name must be a string */
use (dsName);                       /* open data set */
read next var _NUM_ into X[colname=varNames];     /* read only numeric vars */
XVarNames = remove(varNames, loc(varNames=YVar)); /* exclude; preserve order */
read all var YVar into Y;           /* Y is vector */
read all var XVarNames into X;      /* X is matrix */

You can use the LOC-ELEMENT trick to exclude multiple variables. For example, you can use the following statements to exclude two variables:

YVar = {"Weight" "Height"};
XVarNames = remove(varNames, loc( element(varNames,YVar) ));

Error distributions and exponential regression models


Last week I discussed ordinary least squares (OLS) regression models and showed how to illustrate the assumptions about the conditional distribution of the response variable. For a single continuous explanatory variable, the illustration is a scatter plot with a regression line and several normal probability distributions along the line.

The term "conditional distribution of the response" is a real mouthful. For brevity, I will say that the graph shows the assumed "error distributions."

A similar graph can illustrate regression models that involve a transformation of the response variable. A common transformation is to model the logarithm of the response variable, which means that the predicted curve is exponential.

There are two common ways to construct an exponential fit of a response variable, Y, with an explanatory variable, X. The two models are as follows:

  • A generalized linear model of Y that uses LOG as a link function. In SAS you can construct this model with PROC GENMOD by setting DIST=NORMAL and LINK=LOG.
  • An OLS model of log(Y), followed by exponentiation of the predicted values. In SAS you can construct this model with PROC GLM or REG, although for consistency I will use PROC GENMOD with an identity link function.

To illustrate the two models, I will use the same 'cars' data as last time. These data were used by Arthur Charpentier, whose blog post about GLMs inspired me to create my own graphs in SAS. Thanks to Randy Tobias and Stephen Mistler for commenting on an early draft of this post.

A generalized linear model with a log link

A generalized linear model of Y with a log link function assumes that the response is predicted by an exponential function of the form Y = exp(b0 + b1X) + ε and that the errors are normally distributed with a constant variance. In terms of the mean value of Y, it models the log of the mean: log(E(Y)) = b0 + b1X.


The graph to the left illustrates this model for the "cars" data used in my last post. The X variable is the speed of a car and the Y variable is the distance required to stop.

How can you create this graph in SAS? As shown in my last post, you can run a SAS procedure to get the parameter estimates, then obtain the predicted values by scoring the model on evenly spaced values of the explanatory variable. However, when you create the data for the probability distributions, be sure to apply the inverse link function, which in this case is the EXP function. This centers the error distributions on the prediction curve.

The call to PROC GENMOD is shown below. For the details of constructing the graph, download the SAS program used to create these graphs.

title "Generalized Linear Model of Y with log link";
proc genmod data=MyData;
   model y = x / dist=normal link=log;
   ods select ParameterEstimates;
   store work.GenYModel / label='Normal distribution for Y; log link';
proc plm restore=GenYModel noprint;
   score data=ScoreX out=Model pred=Fit / ILINK;  /* apply inverse link to predictions */

An OLS model of log(Y)

An alternative model is to fit an OLS model for log(Y). The data set already contains a variable called LogY = log(Y). The OLS model assumes that log(Y) is predicted by a model of the form b0 + b1X + ε. The model assumes that the errors are normally distributed and that the expected value of log(Y) is linear: E(log(Y)) = b0 + b1X. You can use PROC GLM to fit the model, but the following statements use PROC GENMOD and PROC PLM to provide an "apples-to-apples" comparison:

title "Linear Model of log(Y)";
proc genmod data=MyData;
   model logY = x / dist=normal link=identity;
   ods select ParameterEstimates;
   store work.LogYModel / label='Normal distribution for log(Y); identity link';
proc plm restore=LogYModel noprint;
   score data=ScoreX out=Model pred=Fit;

On the log scale, the regression line and the error distributions look like the graph in my previous post. However, transforming to the scale of the original data provides a better comparison with the generalized linear model from the previous section. When you exponentiate the log(Y) predictions and error distribution, you obtain the graph at the left.

Notice that the error distributions are NOT normal. In fact, by definition, the distributions are lognormal. Furthermore, on this scale the assumed error distribution is heteroscedastic, with smaller variances when X is small and larger variances when X is large. These are the consequences of exponentiating the OLS model.
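A small simulation makes these consequences concrete. The following Python/NumPy sketch (with arbitrary, hypothetical parameter values) generates data from the log(Y) model and shows that, on the original scale, the spread of Y grows with X, and the mean of Y exceeds the naive exponentiated prediction exp(b0 + b1*x):

```python
import numpy as np

rng = np.random.default_rng(42)
b0, b1, sigma = -0.1, 0.15, 0.3          # hypothetical parameters
x = np.repeat([5.0, 25.0], 100_000)      # a small X value and a large one
log_y = b0 + b1 * x + rng.normal(0.0, sigma, x.size)
y = np.exp(log_y)                        # errors act multiplicatively on Y

sd_small, sd_large = y[x == 5.0].std(), y[x == 25.0].std()
print(sd_small < sd_large)               # True: heteroscedastic on the Y scale

# The lognormal mean exceeds exp of the log-scale prediction (Jensen's inequality)
print(y[x == 25.0].mean() > np.exp(b0 + b1 * 25.0))   # True
```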

Incidentally, you can obtain this same model by using the FMM procedure, which enables you to fit lognormal distributions directly.

A comparison of log-link versus log(Y) models

It is worth noting that the two models result in different predictions. The following graph displays both predicted curves. The first curve (the generalized linear model with a log link) goes through the "middle" of the data points, which makes sense when you think about the assumed error distributions for that model. The second curve (the exponentiated OLS model of log(Y)) is higher for large values of X than you might expect, until you consider the assumed error distributions for that model.


Both models assume that the effect of X on the mean value of Y is multiplicative, rather than additive. However, as one of my colleagues pointed out, the second model also assumes that the effect of errors is multiplicative, whereas in the generalized linear model the effect of the errors is additive. Researchers must use domain-specific knowledge to determine which model makes sense for the data. For applications such as exponential growth or decay, the second model seems more reasonable.

So there you have it. The two exponential models make different assumptions and consequently lead to different predictions. I am not an expert in generalized linear models, so I found the graphs in this article helpful to visualize the differences between the two models. I hope you did, too. If you want to see how the graphs were created, download the SAS program.


Generate evenly spaced points in an interval

I've previously written about how to generate a sequence of evenly spaced points in an interval. Evenly spaced data is useful for scoring a regression model on an interval.

In the previous articles the endpoints of the interval were hard-coded. However, it is common to want to evaluate a function on the interval [min(x), max(x)], where x is an observed variable in a data set. That is easily done in the DATA step by first running a PROC SQL query that puts the minimum and maximum values into macro variables. The macro variables can then be used to generate the evenly spaced values. For example, the following statements generate 101 points within the range of the Weight variable in the Sashelp.Cars data set:

/* Put min and max into macro variables */
proc sql noprint;
  select min(Weight), max(Weight) into :min_x, :max_x 
  from Sashelp.Cars;
/* Create data set of evenly spaced points */
data ScoreX;
do Weight = &min_x to &max_x by (&max_x-&min_x) / 100;  /* min(X) to max(X) */
   output;
end;
run;

See the article "Techniques for scoring a regression model in SAS" for various SAS procedures and statements that can score a regression model on the points in the ScoreX data set.

You can also use this technique to compute four macro variables to use in generating a uniformly spaced grid of values.

If you are doing the computation in the SAS/IML language, no macro variables are required because you can easily compute the minimum and maximum values as part of the program:

proc iml;
use Sashelp.Cars;  read all var "Weight" into x; close;
w = do(min(x), max(x), (max(x)-min(x))/100);   /* min(X) to max(X) */
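For comparison, the same computation is a one-liner in Python with NumPy's linspace (an illustration with made-up data, not the Sashelp.Cars variable):

```python
import numpy as np

x = np.array([2.3, 5.1, 4.4, 9.8, 7.0])     # observed data
grid = np.linspace(x.min(), x.max(), 101)   # 101 evenly spaced points

print(grid[0], grid[-1])   # endpoints equal min(x) and max(x)
```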

Plot the conditional distribution of the response in a linear regression model


A friend who teaches courses about statistical regression asked me how to create a graph in SAS that illustrates an important concept: the conditional distribution of the response variable. The basic idea is to draw a scatter plot with a regression line, then overlay several probability distributions along the line, thereby illustrating the model assumptions about the distribution of the response for each value of the explanatory variable.

For example, the image at left illustrates the theoretical assumptions for ordinary least squares (OLS) regression with homoscedastic errors. The central line represents the conditional mean of the response. The observed responses are assumed to be of the form yi = β0 + β1xi + εi where εi ∼ N(0, σ) for some value of σ. The normal distributions all have the same variance and are centered on the regression line. Pictures like this are used by instructors and textbook authors to illustrate the assumptions of linear regression. You can create similar pictures for variations such as heteroscedastic errors or generalized linear models.

This article outlines how to create the graph in SAS by using the SGPLOT procedure. The data are from a blog post by Arthur Charpentier, "GLM, non-linearity and heteroscedasticity," as are some of the ideas. Charpentier also inspired a related post by Markus Gesmann, "Visualising theoretical distributions of GLMs." Thanks to Randy Tobias and Stephen Mistler for commenting on an early draft of this post.

Fitting and scoring a regression model

Charpentier uses a small data set that contains the stopping distance (in feet) when driving a car at a certain speed (in mph). The OLS regression model assumes that distance (the response) is a linear function of speed and that the observed distances follow the previous formula.

In order to make the program as clear as possible, I have used "X" as the name of the explanatory variable and "Y" as the name of the response variable. I have used macro variables in a few places in order to clarify the logic of the program. However, to prevent the program from appearing too complicated, I did not fully automate the process of generating the graph.

The following statements define the data and use PROC SQL to create macro variables that contain the minimum and maximum value of the explanatory variable. Then a DATA step creates 101 equally spaced points in the interval [min(x), max(x)], which will be used to score the regression model:

/* data and ideas from http://freakonometrics.hypotheses.org/9593 */
/* Stopping distance (ft) for a car traveling at certain speeds (mph) */
data MyData(label="cars data from R");
input x y @@;
logY = log(y);
label x = "Speed (mph)"  y = "Distance (feet)" logY = "log(Distance)";
datalines;
 4   2  4 10  7  4  7  22  8 16  9 10 10 18 10 26 10  34 11 17 
11 28 12 14 12 20 12  24 12 28 13 26 13 34 13 34 13  46 14 26 
14 36 14 60 14 80 15  20 15 26 15 54 16 32 16 40 17  32 17 40 
17 50 18 42 18 56 18  76 18 84 19 36 19 46 19 68 20  32 20 48 
20 52 20 56 20 64 22  66 23 54 24 70 24 92 24 93 24 120 25 85 
;
run;
/* 1. Create scoring data set */
/* 1a. Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from MyData;
/* 1b. Create scoring data set */
data ScoreX;
do x = &min_x to &max_x by (&max_x-&min_x)/100;  /* min(X) to max(X) */
   output;
end;
run;

The next statements call PROC GENMOD to fit an OLS model. (I could have used PROC GLM or PROC REG for this model, but in a future blog post I want to explore generalized linear models.) The procedure outputs the parameter estimates, as shown, and uses the STORE statement to save the OLS model to a SAS item store. The PLM procedure then scores the model at the points in the ScoreX data set. (For more about PROC PLM, see Tobias and Cai (2010).) The fitted values are saved to the SAS data set Model:

/* Model 1: linear regression of Y              */
title "Linear Regression of Y";
/* Create regression model. Save model to item store. Score with PROC PLM. */
proc genmod data=MyData;
  model y = x / dist=normal link=identity;
  ods select ParameterEstimates;
  store work.YModel / label='Normal distribution for Y; identity link';
proc plm restore=YModel noprint;
  score data=ScoreX out=Model pred=Fit; /* score model; save fitted values */

The estimate for the intercept is b0 = -17.58 and the estimate for the linear coefficient is b1 = 3.93. The Scale value is an MLE estimate for the standard deviation of the error distribution. These estimates will be used to plot a sequence of normal distributions along a fitted response curve that is overlaid on the scatter plot. If you use PROC GLM to fit the model, the RMSE statistic is a (slightly different) estimate for the scale parameter.

Plotting conditional distributions

The next step is to create data for a sequence of normal probability distributions that are spaced along the X axis and have standard deviation σ ≈ 15.38. Charpentier plots the distributions in three dimensions, but many authors simply let each distribution "fall on its side" and map the density scale into the coordinates of the data. This requires scaling the density to transform it into the data coordinate system. In the following DATA step, the maximum height of each density curve is half the distance between the distributions. The distributions are centered on the regression line and extend three standard deviations in either direction. The PDF function computes the density, centered at the conditional mean, for the response variable.

The following code supports a link function, such as a log link. For this analysis, which uses the identity link, the ILINK macro variable is set to blank. However, for a generalized linear model with a log link, you can set the inverse link function to EXP.

/* specify parameters for model */
%let std = 15.3796;
%let b0 = -17.5791;
%let b1 =   3.9324;
%let ILINK = ;            /* blank for identity link; for log link use EXP */
/* create values to plot the assumed response distribution */
data NormalResponse(keep = s t distrib);
/* compute scale factor for height of density distributions */
ds = (&max_x - &min_x) / 6;          /* spacing for 5 distributions */
maxHt = ds / 2;                      /* max ht is 1/2 width */
maxPdf = pdf("Normal", 0, 0, &std);  /* max density for normal distrib */
ScaleFactor = maxHt / maxPDF;        /* scale density into data coords */
/* place distribs horizontally along X scale */
do s = (&min_x+ds) to (&max_x-ds) by ds;
   eta = &b0 + &b1*s;                /* linear predictor */
   pred = &ILINK(eta);               /* apply inverse link function */
   /* compute distribution of Y | X. Assume +/-3 std dev is appropriate */
   do t = pred - 3*&std to pred + 3*&std by 6*&std/50; 
      distrib = s + ScaleFactor * pdf("Normal", t, pred, &std); 
      output;
   end;
end;
run;
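The same scaling logic translates directly to other languages. Here is a Python/NumPy sketch of the computation (using the parameter estimates above and an identity link) that generates the (t, distrib) coordinates for five sideways density curves:

```python
import numpy as np

def norm_pdf(t, mu, sigma):
    """Normal density, evaluated elementwise."""
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

b0, b1, std = -17.5791, 3.9324, 15.3796
x_min, x_max = 4.0, 25.0                 # range of the speed variable
ds = (x_max - x_min) / 6.0               # spacing for 5 distributions
scale = (ds / 2.0) / norm_pdf(0.0, 0.0, std)   # max height = half the spacing

curves = []
for s in np.arange(x_min + ds, x_max - ds + ds / 2.0, ds):
    pred = b0 + b1 * s                   # conditional mean at X = s (identity link)
    t = np.linspace(pred - 3.0 * std, pred + 3.0 * std, 51)
    distrib = s + scale * norm_pdf(t, pred, std)   # density in data coordinates
    curves.append((s, t, distrib))

print(len(curves))   # 5 density curves along the X axis
```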

The last step is to combine the original data, the regression line, and the probability distributions into a single data set. You can then use PROC SGPLOT to create the graph shown at the top of the article. The data are plotted by using the SCATTER statement, the regression line is overlaid by using the SERIES statement, and the BAND statement is used to plot the sequence of semi-transparent probability distributions along the regression line:

/* merge data, fit, and error distribution information */
data All;  set MyData Model NormalResponse;  run;
title "Conditional Distribution of Response in Regression Model";
title2 "y = &b0 + &b1*x + eps, eps ~ N(0, &std)";
title3 "Normal Distribution, Identity Link";
proc sgplot data=All noautolegend;
  scatter x=x y=y;
  series x=x y=Fit;
  band y=t lower=s upper=distrib / group=s
       transparency=0.4 fillattrs=GraphData1;
  xaxis grid;
  yaxis grid;
run;

The program could be shorter if I use the SAS/IML language, but I wanted to make the program available to as many SAS users as possible, so I used only Base SAS and procedures in SAS/STAT. The graph at the beginning of this article illustrates the basic assumptions of ordinary linear regression. Namely, for a given value of X, the OLS model assumes that each observed Y value is an independent random draw from a normal distribution centered at the predicted mean. The homoscedastic assumption is that the width of those probability distributions is constant.

In a future blog post I will draw a similar picture for a generalized linear model that has a log link function.


Find the ODS table names produced by any SAS procedure

Statistical programmers often have to use the results from one SAS procedure as the input to another SAS procedure. Because ODS enables you to create a SAS data set from any ODS table or graph, it is easy to obtain a data set that contains the value of any statistic that is produced by any SAS procedure. All you need to do is submit the ODS OUTPUT statement before calling the procedure, like this:

ODS OUTPUT tablename;  /* replace with ODS table name */

There's just one problem: You might not know the name of the ODS table!

There are two ways to discover the name. The "experimental way" is to submit ODS TRACE ON, run the procedure, and then check the SAS log, which reports the name of every table and graph that the procedure produced. Here's an example:

ods trace on;
proc logistic data=Sashelp.Class;
model sex = height weight;
run;
ods trace off;

The names of the ODS objects appear in the log in the same order as the procedure output. Consequently, if you want the fifth table in the output, you can count down to the fifth name in the log and discover that the fifth table is named FitStatistics.

There are two problems with this experimental approach. First, it is inefficient because it requires that you actually run a SAS procedure before you can discover the name of an ODS object. For a long-running procedure, this isn't optimal. Second, it assumes that you know which option to specify in order to produce the table. If a colleague says to you, "run PROC DISCRIM and save the PostResub table to a SAS data set," you might not know which option produces the PostResub table.

The second way to discover the name of ODS tables and graphs is to look them up in the documentation. This method requires several mouse clicks to locate the right book and chapter, and then you can click on the "Details" section and the "ODS Table Names" subsection. However, I recently discovered a page of the ODS User's Guide that simplifies the process. Here is a fast way to find the name of ODS tables:

  1. Go to the first Appendix of the ODS User's Guide.
  2. Select the product, such as the ODS table names for SAS/STAT procedures.
  3. Select the procedure.

You are taken directly to the "ODS Table Names" section of the documentation, which lists the ODS tables and the options that produce them.

Thanks to the SAS documentation team who put together these pages that link to the analytics documentation from the ODS documentation. It's a real timesaver! Unfortunately, there are a few SAS/STAT procedures that are missing from the list (GENMOD and GLIMMIX among them), but I'm sure these small omissions will be corrected in subsequent documentation.

I'll leave you with a research question: which Base SAS or SAS/STAT procedure is capable of producing the most ODS tables? My bet is on PROC FREQ, which produces about 100 tables, but I don't know for sure. Does anyone know a procedure that can produce more?


Mathematical art (part 2): Unweaving matrices


In my previous blog post, I showed how you can use SAS to program a "weaving" algorithm that takes an image, cuts it into strips, and weaves the strips together to create mathematical art. I used matrices and heat maps for the computations and visualization.

At the end of the post, I presented an image of woven strips (shown at left) and challenged the reader to "deduce the original image from the weave." This article shows that you can "unweave" the image, reassemble the strips, and use interpolation to obtain an approximation to the original image. You can download the SAS/IML program that computes and visualizes the preimage.

Unweaving a weave

The first step in reconstructing the original image is to "pull apart" the horizontal and vertical strips, rotate the horizontal ones, put them next to the vertical strips, and close all the gaps. This requires writing the inverse of the weave algorithm. When you do this, you obtain the following image:


Notice that there are gaps in our knowledge about the original image. Due to the weaving process, there are places where a vertical strip was covered by a horizontal strip, or the other way around. These missing values are shown in white. Mathematically, the weaving algorithm represents a many-to-one function, so there is no inverse function that can restore the preimage from the image.

Nevertheless, the unweaving reveals the main features of the original image. And we can do more. If you know that the original image was created from a continuous mathematical function, you can use interpolation to try to fill in the gaps. For this image, you can use the information from nearby cells (north, south, east, and west) to estimate a value for each missing cell. The estimation scheme is simple: assign each missing cell the average value of its neighbors. You can proceed in five steps:

  1. Approximate the corners. For the lower left corner, use the average of the cell above and the cell to the right. For the lower right corner, use the average of the cell above and the cell to the left.
  2. Approximate the left edge. For each missing cell, average the cells above, below, and to the right.
  3. Approximate the right edge. For each missing cell, average the cells above, below, and to the left.
  4. Approximate the bottom edge. For each missing cell, average the cells above, to the left, and to the right.
  5. Approximate the interior cells. For each missing cell, average the cells above, below, to the left, and to the right.
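To make the idea concrete, here is a simplified Python/NumPy sketch that collapses the five steps into a single pass: every missing cell gets the average of whichever north/south/east/west neighbors exist. (This illustrates the interpolation idea; it is not the exact SAS/IML implementation.)

```python
import numpy as np

def fill_missing(M):
    """Fill each NaN cell with the mean of its non-missing N/S/E/W neighbors."""
    out = M.copy()
    n_rows, n_cols = M.shape
    for i in range(n_rows):
        for j in range(n_cols):
            if np.isnan(M[i, j]):
                nbrs = [M[i + di, j + dj]
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= i + di < n_rows and 0 <= j + dj < n_cols
                        and not np.isnan(M[i + di, j + dj])]
                out[i, j] = np.mean(nbrs)
    return out

M = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
print(fill_missing(M)[1, 1])   # (2 + 4 + 6 + 8) / 4 = 5.0
```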

After implementing this process, you obtain the following image, which approximates the original image:


How well does this image approximate the original image? Very well! The mean square error is very small, and the maximum error values are within 20% of the true values. For comparison, here is the original image. You can see that it is very similar to the reconstructed image.


The approximation could be improved by using a more sophisticated interpolation scheme, such as bivariate linear interpolation.

Feel free to download the SAS/IML program and play with other weaves. How well can you recover an original image from its weave?

Mathematical art: Weaving matrices


An artist friend of mine recently created a beautiful abstract image and described the process on her blog. She says that "after painting my initial square, I cut it into strips and split them down the middle, then wove them together.... I had no idea when I started piecing these together again what they would look like."

I marveled at the beauty of the finished result, but I also started thinking about the process of constructing the artwork. You cut an initial picture into n vertical strips. (For simplicity, assume that n is even.) Then you take the first n/2 strips and space them out so that there are gaps between them. You take the last n/2 strips and turn them sideways, weaving them in and out of the vertical strips.

As my friend said in a private communication, the final image "depends entirely on the arrangement of the elements.... There are so many versions of the final outcome." By this she means that the process is deterministic, although the artist can make choices regarding the number of strips, the orientation of the strips, and the weaving pattern. Yes, it is difficult to predict the final image while staring at the initial image, but this is the sort of algorithm that you could ask a computer to carry out.

So, of course, I had to learn to make images like this with SAS!

Weaving matrices

To make sure that I understood the algorithm, I started with a 10 x 10 numeric matrix X that contains the values 1–100 in row-major order. That is, the first row contains 1–10, the second row contains 11–20, and so forth. I then created a 10 x 10 matrix W (for "weave") that contained all missing values. I copied the first five columns of X into the even columns of W. I then copied the last five columns of X into the even rows of W. I then went back and "wove" the columns and rows by re-copying certain elements from the first five columns of X. You can view this construction process for an example matrix.

The process can be encapsulated into a SAS/IML function, which is shown below:

proc iml;
start weave(x);
   n = nrow(x);             /* n must be even */
   y = j(n, n, .);
   j = 1:n/2;               /* vertical strips */
   y[, 2*j] = x[, j];       /* copy to every other column */
   i = (n/2+1):n;           /* horizontal strips */
   y[2*i-n,] = x[, i]`;     /* copy to every other row */
   k = do(1, n/2, 2);       /* weaving: raise vertical strips at alternating crossings */
   y[2*k, 2*k] = x[2*k, k]; /* crossings where both strip indices are odd */
   k = do(2, n/2, 2);
   y[2*k, 2*k] = x[2*k, k]; /* crossings where both strip indices are even */
   return( y );
finish;

x = shape(1:100, 10);       /* original 10x10 matrix */
w = weave(x);
print w;
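To check my understanding of the indexing, I also translated the algorithm into Python/NumPy. This is my own sketch (0-based indices; it is not part of the original SAS program): even-numbered SAS rows and columns become odd indices in Python, and NaN plays the role of the SAS missing value.

```python
import numpy as np

def weave(x):
    """Sketch of the weave algorithm for an n x n matrix; n must be even."""
    n = x.shape[0]
    y = np.full((n, n), np.nan)
    y[:, 1::2] = x[:, :n//2]        # vertical strips -> every other column
    y[1::2, :] = x[:, n//2:].T      # horizontal strips (turned sideways) -> every other row
    # weave: re-raise the vertical strips at alternating crossings (checkerboard)
    for a in range(n // 2):
        for b in range(n // 2):
            if a % 2 == b % 2:
                y[2*a + 1, 2*b + 1] = x[2*a + 1, b]
    return y

x = np.arange(1, 101, dtype=float).reshape(10, 10)  # values 1-100 in row-major order
w = weave(x)
```

The NaN entries in the result correspond to the gaps between the strips.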

Weaving heat maps

Now the fun can begin. Recall that a heat map is essentially a visualization of a matrix of numbers. A heat map can be used to visualize mathematical functions of two variables, and you can use the HEATMAPCONT function in SAS/IML 13.1 to easily create a heat map.

You can use the EXPANDGRID function to define a grid of evenly spaced points in the square [-1, 1] x [-1, 1]. You can evaluate a mathematical function on the grid and reshape the results into a matrix. You can then create a heat map of the original function and the "weave" of the function. For details, you can download the complete SAS/IML program that I used to construct the images in this article.

For example, in the following image, the function on the left is f(x,y) = x. The heat map that visualizes the function shows that the function is constant on vertical strips. You can see that each strip is a constant color, and the image on the right has taken those strips and woven them together.


The woven image is beautiful, but you can make more interesting images by choosing more interesting functions to use as the original image. The left portion of the following image shows a heat map for the quadratic function f(x,y) = x^2 + y^2. The image on the right is formed by weaving the vertical strips from the left image.


You get the idea: You can visualize any mathematical function and then form "mathematical art" by cutting the function along strips and weaving the strips back together.

If you are adventurous, you can play with different color palettes for the images. I chose a muted rainbow palette, but you can experiment with other color ramps. The program for these examples includes other color ramps that change the aesthetics of the images.


I'll leave you with one more image. It is the weave for a common function from multivariable calculus. Can you deduce the original image from the weave? If not, look in the program that creates these examples.

These images show the flexibility of the SAS/IML language for implementing a custom algorithm. These simple images lack the depth and complexity of "real" art, but they show how matrices can be used to illustrate the main ideas behind a simple artistic process.

Try it yourself! What images can you make? If you create an interesting picture, leave a comment that includes the mathematical formula.


Compute the number of digits in an integer

The title of this blog post might seem strange, but I occasionally need to compute the number of digits in a number, usually because I am trying to stuff an integer value into a string. Each time, I have to derive the formula from scratch, so I am writing this article so that I and others can look up the formula next time we need it.

The problem: You have a positive integer, n, which has k digits. How can you efficiently find k as a function of n?

The solution is to use logarithms. Because n has k digits,
10^(k-1) ≤ n < 10^k, or equivalently,
10^(k-1) < n+1 ≤ 10^k.

Applying the base-10 logarithm produces
k-1 < log10(n+1) ≤ k.

The CEIL function in Base SAS computes the smallest integer greater than or equal to its argument, which proves that
k = ceil( log10(n+1) ).

A canonical example is n=1234, which has k=4 digits. How can you find k from n? The following program computes k:

data _null_;
n = 1234;      /* positive integer */
k = ceil(log10(n+1));
put "The integer " n= "has " k= "digits.";
run;

The integer n=1234 has k=4 digits.

If you have a negative integer m ≤ -1 and the absolute value has k digits, then m requires a string with k+1 elements, because one element is needed for the negative sign.

Incidentally, there is nothing special about base 10. The proof generalizes to compute the number of digits required to represent a number in any base. For example, the number n=1234 written in base 2 requires k = ceil(log2(n+1)) = 11 digits. You can check that the 11-digit binary number 10011010010 equals the decimal value 1234.
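For readers who want to verify the formula outside of SAS, here is a small Python sketch of the same computation; the function names are my own. (For base 10, math.log10 is used because it is more accurate than computing a ratio of natural logarithms.)

```python
import math

def num_digits(n):
    """Number of decimal digits of the positive integer n: k = ceil(log10(n+1))."""
    return math.ceil(math.log10(n + 1))

def num_digits_base(n, base):
    """Generalization: number of digits of n written in an integer base > 1."""
    return math.ceil(math.log(n + 1, base))

print(num_digits(1234))          # 4 decimal digits
print(num_digits_base(1234, 2))  # 11 binary digits
```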

You can also use these mathematical ideas to compute the next power of 2 (or power of 10) that is larger than a given number.
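For instance, here is one way to sketch that idea in Python (my own function name, not from the original post); math.log2 is exact on powers of 2, so the ceiling is computed reliably:

```python
import math

def next_power_of_2(n):
    """Smallest power of 2 strictly greater than the positive integer n."""
    return 2 ** math.ceil(math.log2(n + 1))

print(next_power_of_2(1234))  # 2048
```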


Monitor convergence during simulation studies in SAS

Ugh! Your favorite regression procedure just printed a warning to the SAS log. Something is wrong, and your attempt to fit a model to the data has not succeeded. A typical message is "WARNING: The validity of the model fit is questionable," perhaps followed by some additional diagnostic messages about "quasi-separation" or "lack of convergence."

If your modeling toolkit includes procedures such as LOGISTIC, GENMOD, MIXED, NLIN, or PHREG, you might have experienced convergence problems. A small sample size or a misspecified model are among the reasons for lack of convergence. There are many papers that discuss how to handle convergence issues. Paul Allison (2008) wrote a paper on some reasons that a logistic model might fail to converge, including an explanation of quasi-complete separation. The documentation for the MIXED procedure includes a long list of potential reasons that a mixed model might fail to converge. Several of my friends from SAS Technical Support give advice on convergence in their SAS Global Forum paper (Kiernan, Tao, and Gibbs, 2012).

Although it can be frustrating to deal with convergence issues during a data analysis, lack of convergence during a simulation study can be maddening. Recall that the efficient way to implement a simulation study in SAS is to use BY-group processing to analyze thousands of independent samples. A simulation study is designed to run in batch mode without any human intervention. How, then, can the programmer deal with lack of convergence during a simulation?

Some SAS users (such as Chen and Dong, 2009) have suggested parsing the SAS log for notes and warning messages, but that approach is cumbersome. Furthermore, if you have turned off SAS notes during the simulation, then there are no notes in the log to parse!

The _STATUS_ variable in the OUTEST= data set

Fortunately, SAS procedures that perform nonlinear optimization provide diagnostic variables as part of their output. These variables are informally known as "status variables." You can monitor the status variables to determine the BY groups for which the optimization converged.

The easiest way to generate a status variable is to use the OUTEST= option to generate an output data set that contains parameter estimates. Not every procedure supports the OUTEST= option, but many do. Let's see how it works. In my article "Simulate many samples from a logistic regression model," I showed how to generate 100 samples of data that follow a logistic regression model. If you make the sample size very small (like N=20), PROC LOGISTIC will report convergence problems for some of the random samples, as follows:

%let N = 20;                                     /* N = sample size */
%let NumSamples = 100;                           /* number of samples */
/* Generate logistic data: See http://blogs.sas.com/content/iml/?p=11735 */
options nonotes;    /* turn off notes; use OUTEST= option */
proc logistic data=LogisticData noprint outest=PE;
   by SampleId;
   model y(Event='1') = x1 x2;
run;
options notes;      /* turn notes back on */

proc freq data=PE;
tables _STATUS_ / nocum;
run;

The output from PROC FREQ shows that 10% of the models did not converge. The OUTEST= data set has a variable named _STATUS_ that indicates whether the logistic regression algorithm converged. You can use the _STATUS_ variable to analyze only those parameter estimates from converged optimizations. For example, the following call to PROC MEANS produces summary statistics for the three parameter estimates in the model, but only for the converged models:

proc means data=PE nolabels;
where _STATUS_ = "0 Converged";    /* analyze estimates for converged models */
var Intercept x1 x2;
run;

The ConvergenceStatus table

Notice that the _STATUS_ variable does not contain details about why the algorithm failed to converge. For that information, you can use the ConvergenceStatus ODS table, which is produced by all SAS/STAT regression procedures that involve nonlinear optimization. To save the ODS table to an output data set, you cannot use the NOPRINT option. Instead, use the %ODSOff and %ODSOn macros to suppress ODS output. Use the ODS OUTPUT statement to write the ConvergenceStatus tables to a SAS data set, as follows:

/* define %ODSOff and %ODSOn macros */
%ODSOff                  /* suppress ODS output */
proc logistic data=LogisticData;
   by SampleId;
   model y(Event='1') = x1 x2;
   ods output ParameterEstimates=PE2 ConvergenceStatus=CS;
run;
%ODSOn                   /* resume ODS output */

proc freq data=CS;
tables Status Reason / nocum;
run;

The ConvergenceStatus table has a numerical variable named Status, which has the value 0 if the algorithm converged. The character variable Reason contains a short description of the reason that the algorithm terminated. The results of the PROC FREQ call show that the logistic algorithm converged 90 times, failed eight times because of "complete separation," and failed twice because of "quasi-complete separation."

You can use the DATA step and the MERGE statement to merge the ConvergenceStatus information with the results of other ODS tables from the same analysis. For example, the following statements merge the convergence information with the ParameterEstimates table. PROC MEANS can be used to analyze the parameter estimates for the models that converged. The summary statistics are identical to the previous results and are not shown.

data All;
merge PE2 CS;
by SampleID;
run;

proc means data=All;
where Status = 0;
class Variable;
var Estimate;
run;

In conclusion, the ConvergenceStatus table provides information about the convergence for each BY group analysis. When used as part of a simulation study, you can use the ConvergenceStatus table to manage the analysis of the simulated data. You can count the number of samples that did not converge, you can tabulate the reasons for nonconvergence, and you can exclude nonconverged estimates from your Monte Carlo summaries.
