The Wishart distribution: Covariance matrices for multivariate normal data

I've written about how to generate a sample from a multivariate normal (MVN) distribution in SAS by using the RANDNORMAL function in SAS/IML software. Last week a SAS/IML programmer showed me a program that simulated MVN data and computed the resulting covariance matrix for each simulated sample. The purpose of the program was to study properties of the covariance matrices.

The programmer was pleased when I told him that SAS/IML software provides a simpler and more efficient way to simulate covariance and correlation matrices for MVN data. You can generate the covariance matrices directly by using the RANDWISHART function, which generates matrices from the Wishart distribution.

What is the Wishart distribution?

Before thinking about covariance matrices for multivariate normal data, let's recall a theoretical result for univariate data: For a sample of size n drawn from a normal distribution, the sample variance (appropriately scaled) follows a chi-square distribution with n–1 degrees of freedom. This means that if you want to study properties of the sample variance, you don't need to generate normal data. Instead you can draw a random chi-square variate and rescale it to produce a sample variance. No normal samples required!

This result generalizes to multivariate normal data. If you draw a sample from a MVN distribution with covariance matrix Σ, the sample covariance matrix (appropriately scaled) has a sampling distribution that is called the Wishart distribution. You can think of the Wishart distribution as a multivariate generalization of the chi-square distribution. It is a distribution of symmetric positive-definite matrices. A random draw from the Wishart distribution is some matrix that, upon rescaling, is a covariance matrix for MVN data.

From data to covariance matrices

Suppose that you want to approximate the sampling distribution of the correlation coefficient between two correlated normal variables in a sample of size 50. The straightforward approach is to simulate 50 observations from the bivariate normal distribution, compute the correlation coefficient for the sample, and then repeat the process many times in order to approximate the distribution of the correlation coefficients. An implementation in PROC IML follows:

proc iml;
call randseed(12345);
N = 50;                              /* MVN sample size   */
Sigma = {9 1,                        /* population covariance; correlation = 1/3 */
         1 1};
NumSamples = 1000;                   /* number of samples in simulation */
/* First attempt: Generate MVN data; compute correlation from data */
corr = j(NumSamples, 1, .);          /* allocate space for results */
do i = 1 to NumSamples;
   X = randnormal(N, {0 0}, Sigma);  /* MVN sample of size 50 */
   corr[i] = corr(X)[2];             /* corr = off-diagonal element */
title "Distribution of Correlation Coefficient";
title2 "N=50; rho = 1/3";
call histogram(corr) xvalues=do(-2,0.7,0.1)
                     other="refline 0.333 / axis=x";

The histogram shows the approximate sampling distribution for the correlation coefficient when the population parameter is ρ = 1/3. You can see that almost all the sample correlations are positive, a few are negative, and that most correlations are close to the population parameter of 1/3.

Sampling from the Wishart distribution in SAS

In the previous section, notice that the MVN data is not used except to compute the sample correlation matrix. If we don't need it, why bother to simulate it? The following program shows how you can directly generate the covariance matrices from the Wishart distribution: draw a matrix from the Wishart distribution with n–1 degrees of freedom, then rescale by dividing the matrix by n–1.

/* More efficient: Don't generate MVN data, generate covariance matrix DIRECTLY! 
   Each row of A is scatter matrix; each row of B is a covariance matrix */
A = RandWishart(NumSamples, N-1, Sigma); /* N-1 degrees of freedom */
B = A / (N-1);                           /* rescale to form covariance matrix */
do i = 1 to NumSamples;
   cov = shape(B[i,], 2, 2);             /* convert each row to square matrix */
   corr[i] = cov2corr(cov)[2];           /* convert covariance to correlation */
call histogram(corr) xvalues=do(-2,0.7,0.1);

The histogram of the correlation coefficients is similar to the previous histogram and is not shown. Notice that the second method does not simulate any data! This can be quite a time-saver if you are studying the properties of covariance matrices for large samples with dozens of variables.

The RANDWISHART distribution actually returns a sample scatter matrix, which is equivalent to the crossproduct matrix X`X, where X is an N x p matrix of centered MVN data. You can divide by N–1 to obtain a covariance matrix.

The return value from the RANDWISHART function is a big matrix, each row of which contains a single draw from the Wishart distribution. The elements of the matrix are "flattened" so that they fit in a row in row-major order. For p-dimensional MVN data, the number of columns will be p2, which is the number of elements in the p x p covariance matrix. The following table shows the first five rows of the matrix B:


The first row contains elements for a symmetric 2 x 2 covariance matrix. The (1,1) element is 11.38, the (1,2) and (2,1) elements are 0.9, and the (2,2) element is 0.73. These sample variances and covariances are close to the population values of 9 1, and 1. You can use the SHAPE function to change the row into a 2 x 2 matrix. If necessary, you can use the COV2CORR function to convert the covariance matrix into a correlation matrix.

Next time you are conducting a simulation study that involves MVN data, think about whether you really need the data or whether you are just using the data to form a covariance or correlation matrix. If you don't need the data, use the RANDWISHART function to generate matrices from the Wishart distribution. You can speed up your simulation and avoid generating MVN data that are not needed.

Post a Comment

Overview of new features in SAS/IML 13.1

SAS software contains a lot of features, and each release adds more.To make sure that you do not miss new features that appear in the SAS/IML language, the word cloud on the right sidebar of my blog contains numbers that relate to SAS or SAS/IML releases. For example, you can click on "9.3" to read about features that first appeared in SAS 9.3. I have also written summaries of recent SAS/IML releases:

Over the past year I've blogged about features that were new to SAS/IML 13.1, which was released in December, 2013, as part of the first maintenance release of SAS 9.4 (SAS 9.4m1). This article collects all those blog posts together for easy reference.

New functions and subroutines in SAS/IML 13.1

The following blog posts discuss new functions and subroutines for data analysis in SAS/IML 13.1:

  • The CV function computes the sample coeficient of variation.
  • The SKEWNESS function computes the sample skewness.
  • The KURTOSIS function computes the sample kurtosis.
  • The LOGABSDET function computes the natural logarithm of the absolute value of the determinant of a matrix. (Say that three times fast!)
  • The PARENTNAME function enables a module to learn the name of a SAS/IML matrix that was passed in as an argument.

In addition, the SAS/IML 13.1 User's Guide documents two new functions for solving linear programming problems:

  • The LPSOLVE subroutine solve linear programming problems. LPSOLVE replaces the older LP call, which has been deprecated.
  • The MILPSOLVE subroutine is a new subroutine for solving mixed integer linear programming problems. It implements effective techniques for finding optimal solutions for linear objective functions that satisfy certain constraints.

New support for heat maps

There are also new routines for creating heat maps that visualize matrices. I produced a video about heat maps, as well as the following articles:

  • The HEATMAPCONT subroutine creates a heat map that uses a continuous color ramp to visualize a matrix.
  • The HEATMAPDISC subroutine creates a heat map that uses a discrete color ramp to visualize a matrix that contains a small number of distinct values.
  • The PALETTE function enables you to choose color palettes that reflect sound design principles.

Enhancements to functionality

There were several enhancements and language improvements in SAS/IML 13.1. For example, the ABORT and STOP statements now optionally print a user-defined message. Another change is that the order of resolution has changed for user-defined modules, so that it is easier to override built-in functions. I direct you to the "What's New" chapter of the documentation for additional new features.

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be a useful to SAS/IML users who are upgrading their version of SAS. Have you used any of these new features? Leave a comment and tell me which is your favorite.

Post a Comment

Resampling and permutation tests in SAS

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in SAS/IML. This provides an apples-to-apples comparison because both SAS/IML and R are matrix-vector languages.

This permutation test is a simple resampling exercise that could be assigned as a homework problem in a classroom. If you are at a college or university, remember that SAS/IML is available for free for all academic users through the SAS University Edition.

Permutation tests in SAS/IML

The analysis was motivated by a talk about using computational methods to illuminate statistical analyses. The data are the number of mosquitoes that were attracted to human volunteers in an experiment after each volunteer had consumed either a liter of beer (n=25) or water (n=18). The following statements assign the experimental data to two SAS/IML vectors and compute the observed difference between the means of the two groups:

proc iml;
G1 = {27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20};
G2 = {21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20};
obsdiff = mean(G1) - mean(G2);
print obsdiff;

The experimenters observed that, on average, people who drank beer attracted 4.4 more mosquitoes than people who drank water. The statistical question is, "What is the probability of observing a difference of this magnitude (or bigger) by chance if the beverages have no effect?" You can answer this question by using a permutation test to perform a nonparametric version of the t test. The null hypothesis is that there is no difference between the mean number of mosquitoes that were attracted to each experimental group (beer or water).

The permutation test enables you to generate the null distribution. Draw 25 random observations from the data and assign them to Group 1; assign the other 18 observations to Group 2. Compute the difference between the means of each group. Repeat these two steps many times to approximate the null distribution. The following SAS/IML statements use the SAMPLE function in SAS/IML to permute the data. The permutation step is repeated 9,999 times so that (adding in the original data order) there are a total of 10,000 permutations of the data:

call randseed(12345);                             /* set random number seed */
alldata = G1 // G2;                        /* stack data in a single vector */
N1 = nrow(G1);  N = N1 + nrow(G2);
NRepl = 9999;                                     /* number of permutations */
nulldist = j(NRepl,1);                   /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR");                       /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);  /* difference of means */
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;

The histogram shows the distribution of mean differences that were computed under the assumption of the null hypothesis. The observed difference between the beer and water groups (the vertical red line at 4.38) is way off in the tail. Since the null hypothesis is not a likely explanation for the observed difference, we reject it. We conclude that mosquitoes are attracted differently to the two groups (beer and water).

If you would like to compute the empirical p-value for the null distribution, that is easily accomplished:

pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
print pval;

Vectorization for permutation tests

Regular readers of my blog know that I advocate vectorizing programs whenever possible. Matrix-vector languages such as SAS/IML, R, and MATLAB work more efficiently when computations inside loops are replaced by vector or matrix computations.

Because of the way that SAS/IML loops are compiled and optimized, using loops in the SAS/IML language is not as detrimental to performance as in some other languages. For example, the previous permutation test code runs in about 0.04 seconds on my PC from 2009. Still, I like to promote vectorization because it can be important to performance.

The following statements eliminate the DO loop and implement the resampling and permutation test in two lines of SAS/IML code. The vectorized computation runs in about one-fourth the time:

x = sample(alldata, N//NRepl, "WOR");               /* create all resamples */
nulldist = x[,1:N1][,:] - x[,(N1+1):N][,:]; /* compute all mean differences */

The vectorized computation uses the colon (:) subscript reduction operator in SAS/IML to compute the mean of the first 25 and the last 18 elements for each set of permuted data.

Additional references for resampling in SAS

To learn more about efficient ways to implement resampling methods such as bootstrapping and permutation tests, consult the following references:

  • For information about bootstrapping in SAS/IML, see pages 14–17 of Wicklin (2008).
  • For another permutation test example, see pages 11–14 of Wicklin (2012).
  • Chapter 15 of my book Simulating Data with SAS describes resampling methods in both Base SAS and SAS/IML. I include useful tips and techniques that make bootstrapping in Base SAS less cumbersome, including a mention of the %BOOT and %BOOTCI macros, which enable you to implement bootstrap methods in Base SAS by using only a few program statements.
  • For an excellent introduction to resampling methods in Base SAS, see Cassell (2007).
Post a Comment

What is the coefficient of variation?

I sometimes wonder whether some functions and options in SAS software ever get used. Last week I was reviewing new features that were added to SAS/IML 13.1. One of the new functions is the CV function, which computes the sample coefficient of variation for data.

Maybe it is just me, but when I compute descriptive statistics for univariate data, the coefficient of variation is not a statistic that I look at. I don't think my undergraduate statistics course even mentioned the coefficient of variation (CV). I first encountered the idea many years later when learning about distribution theory.

The CV is a simple idea. For a distribution, the coefficient of variation is the ratio of the standard deviation to the mean: CV = σ/μ. You can estimate the coefficient of variation from a sample by using the ratio of the sample standard deviation and the sample mean, usually multiplied by 100 so that it is on the percent scale. This ratio is also known as the relative standard deviation when the data are positive.

What does the coefficient of variation mean?

The coefficient of variation is a dimensionless quantity. As such, it provides a measure of the variability of a sample without reference to the scale of the data.

Suppose I tell two people to measure the heights of some plants. The first person reports that the average height is 1.2 meters, with a standard deviation of 0.275 meters. The second person measures the same plants in centimeters. She reports that the average height is 120 centimeters, with a standard deviation of 27.5 centimeters. Obviously, these are the same answers, but one person reports a standard deviation of 0.275 (which sounds small) whereas the other person reports a standard deviation of 27.2 (which sounds big). The coefficient of variation comes to the rescue: for both sets of measurements the coefficient of variation is 22.9.

The CV can also help you compare two completely different measurements. How does variation in height compare to variation in weight? Or age? Or income? These variables are measured on different scales and use different units, but the CV (which is dimensionless) enables you to compare the variation of these variables.

How to compute the coefficient of variation in SAS

The coefficient of variation is computed by several SAS procedures: MEANS, UNIVARIATE, IML, TABULATE, and so forth. The following example shows data for the plant measurement example in the previous paragraph. The MEANS and IML procedure compute the CV for measurements on the meter and centimeter scales:

data Plants;
input height @@;
cm = height * 100;
1.6 1.5 .8 1.0 1.2 .9 1.2 1.8 1.2 1.3 1.3 .9 1.2 1.0 1.1
proc means data=Plants N mean std cv;
proc iml;
use Plants; read all var _NUM_ into X[c=varNames]; close;
cv = cv(X);
print cv[c=varNames];

Theoretical uses of the coefficient of variation

The coefficient of variation has some interesting uses as a theoretical tool. It enables you to compare the variation between different probability distributions. As I mentioned in my article on fat-tailed and long-tailed distributions, the exponential distribution is an important reference distribution in the theory of distributions. Because the standard deviation and the mean of an exponential distribution are equal, the exponential distribution has a CV equal to 1. Distributions with CV < 1 are considered low-variance distributions. Distributions with CV > 1 are high-variance distributions.

Obviously the coefficient of variation is undefined for symmetric distributions such as the normal and t distributions, which is perhaps why the CV is not widely used. The sample CV is undefined for centered data and is highly variable when the population mean is close to zero.

Do you use the coefficient of variation?

Have you ever used the coefficient of variation in a real data analysis problem? Is the CV a useful but underutilized statistic for practical data analysis? Or is it primarily a theoretical tool for comparing the variability of distributions? Leave a comment.

Post a Comment

The difference between RUN and CALL for SAS/IML subroutines

Have you ever noticed that some SAS/IML programmers use the CALL statement to call a subroutine, whereas others use the RUN statement? Have you ever wondered why the SAS/IML language has two statements that do the same thing?

It turns out that the CALL statement and the RUN statement do not do the same thing! Read on to discover how they differ.

By the way, the RUN statement in PROC IML has only one purpose: to execute a subroutine. Many other SAS procedures (such as REG, GLM, and DATASETS) use the RUN statement to tell SAS to run the procedure. PROC IML is different. Never put the RUN statement at the end of a PROC IML program.

Built-in versus user-defined subroutines

A main feature of the SAS/IML language is that it enables you to define your own modules. A module is a user-defined function or subroutine that you can call from an IML program. A module that returns a value is called a function module; a module that does not return a value is called a subroutine module.

Although I do not recommend that you do so, it is possible to define a subroutine module that has the same name as a built-in subroutine. These two routines co-exist. If you use the RUN statement, SAS/IML will call the subroutine that you wrote. If you use the CALL statement, SAS/IML will call the built-in subroutine.

So that's the difference between the RUN and CALL statements: The RUN statement looks first for a user-defined module with the specified name. If it finds it, then it calls that module. Otherwise, it calls the built-in subroutine that has the specified name. The CALL statement looks for a built-in subroutine and immediately calls that routine if it is found.

An example: Overriding the EIGEN subroutine

As I said, I don't recommend that you routinely write user-defined modules that have the same name as a SAS/IML built-in, but let's examine a situation in which it might be convenient to do so. Let's assume that you often read data sets that represent symmetric matrices, and that these data sets are stored in lower triangular form with missing value in the upper triangular portion of the matrix. For example, this is the default storage method for distance matrices that are created by using the DISTANCE procedure in SAS/STAT software, as shown below:

proc distance data=Sashelp.Cars out=Dist method=Euclid;
   where Type="Truck";
   var interval(Horsepower--Length / std=Std);
   id Model;
proc iml;
use Dist;   read all var _NUM_ into D[r=Model];   close Dist;
call heatmapcont(D) title="Distance Matrix";

The program creates a 24 x 24 symmetric matrix of distances between 24 observations in a six-dimensional space of variables. The program reads the distance matrix into a SAS/IML matrix. A heat map shows the matrix values. The dark gray color shows that the upper triangular portion of the matrix is missing. The white diagonal elements show that the diagonal of the matrix is exactly zero. The remaining cells indicate the distance between observations. Nearly white cells indicate that two observations are close together. Dark cells indicate that observations are relatively far apart.

Suppose that you want to compute the eigenvalues and eigenvectors of this matrix. The built-in EIGEN subroutine in SAS/IML can compute these quantities, but it expects a matrix that does not have any missing values. Therefore to compute the eigenvalues you should first extract the lower triangular elements of the matrix and copy them into the upper triangular portion of the matrix.

You could write a separate subroutine that only copies the lower triangular elements, but for this example I will write a subroutine that has the same name as the built-in EIGEN subroutine. The following statements define a module that inspects the upper triangular elements of a matrix. If the upper triangular elements are all missing, it replaces those missing values with the lower triangular elements. It then calls the built-in EIGEN function to compute the eigenvalues and eigenvectors:

start StrictLowerTriangular(X);    /* return lower triangular elements */
   return( remove(vech(X), cusum(1 || (ncol(X):2))) );
/* define EIGEN module as a custom override of the built-in subroutine */
start Eigen(eval, evec, A);
   r = row(A); c = col(A);
   UpperIdx = loc(c>r);
   /* if upper triangular elements are missing, copy from lower */
   if all(A[UpperIdx]=.) then
      A[UpperIdx] = StrictLowerTriangular(A);
   call eigen(eval, evec, A);          /* CALL the built-in EIGEN subroutine */
run eigen(eval, evec, D);              /* RUN the user-defined EIGEN subroutine */

In summary, the CALL and RUN statements enable you to choose between a built-in subroutine and a custom subroutine that have the same name.

The situation is a little different for user-defined functions. Beginning with SAS/IML 13.1, you can define a function that has the same name as a built-in function, and the order of resolution for calling functions ensures that a user-defined function is found before a built-in function that has the same name. This means that if you choose to override a SAS/IML function, you lose the ability to call the built-in function until you quit PROC IML.

Have you ever had the need to override a built-in SAS/IML subroutine? Let me know the details by leaving a comment.

Post a Comment

The distribution of Pythagorean triples


When I studied high school geometry, I noticed that many homework problems involved right triangles whose side lengths were integers. The canonical example is the 3-4-5 right triangle, which has legs of length 3 and 4 and a hypotenuse of length 5. The triple (3, 4, 5) is called a Pythagorean triple because it satisfies the Pythagorean theorem: 32 + 42 = 52. Similarly, (5, 12, 13) and (7, 24, 25) are Pythagorean triples that sometimes appear in geometry textbooks.

In general, a triple of natural numbers (a, b, c) is a Pythagorean triple if a2 + b2 = c2. Obviously if (a, b, c) is a Pythagorean triple, then so is (ka, kb, kc) for any natural number k. Geometrically, this transformation is not very interesting because it simply scales a right triangle into a similar triangle that has the same shape. I did not learn any trick in high school that enabled me to start with a Pythagorean triple and generate a new, geometrically different, right triangle.

Generating Pythagorean triples

Because my high school geometry book reused the same three or four triples over and over again, I assumed that it is difficult to generate Pythagorean triples. In fact, that is not the case: There are several formulas for generating Pythagorean triples, some of which were known to the ancient Greeks.

One interesting algorithm comes from the 20th-century and involves linear transformations via matrices. Here's how it works. Write any Pythagorean triple, v, as a row vector. Then the linear transformation v*L is a new Pythagorean triple, where L is any of the three following linear transformations:


For example, if you represent v = (3, 4, 5) as a row vector, then v*L1 = (5, 12, 13). You can multiply on the right again to get another new triple, such as v*L1*L3 = (45, 28, 53).

In fact, more is true. A primitive Pythagorean triple is a triple such that the three numbers have no nontrivial divisors, that is, they are relatively prime. It turns out that all primitive Pythagorean triples can be obtained by iteratively applying one of three linear transformations to the triple (3, 4, 5). In other words, the triple (3, 4, 5) is the "parent" of all primitive Pythagorean triples!

That is a fabulous result. It means that you can write a program that uses matrix multiplication to produce arbitrarily many primitive Pythagorean triples, as follows:

  • Start with v = (3, 4, 5).
  • Apply L1, L2, and L3 to create the first generation of children.
  • Apply the three transformations to each of the children to create a generation of nine grandchildren.
  • Continue this process. The ith generation will contain 3i triples.

You can implement this algorithm in the SAS/IML matrix language as follows:

proc iml;
start GenerateTriples(depth=3);
   L1 = { 1  2  2,
         -2 -1 -2, 
          2  2  3};
   L2 = { 1  2  2,
          2  1  2,
          2  2  3};
   L3 = {-1 -2 -2,
          2  1  2,
          2  2  3};
   NumTriples = sum(3##(0:Depth));  /* total number of triples */
   triples = j(NumTriples,3,.);     /* allocate array for results */
   triples[1,] = {3 4 5};           /* parent triple */
   k = 1; n = 2;
   do i = 1 to Depth;               /* for each generation */
      do j = 1 to 3##(i-1);         /* generate 3##i children */
         x = triples[k,];
         triples[n,]   = x*L1;
         triples[n+1,] = x*L2; 
         triples[n+2,] = x*L3; 
         k = k + 1;   n = n + 3;
   return( triples );
p = GenerateTriples();
print p;

The parent and the first three generations account for 1 + 3 + 9 + 27 = 40 triples. The list contains some old familiar friends, as well as a few strangers that I haven't met before.

The distribution of Pythagorean triples

What do the Pythagorean triples look like if you plot their location in a Cartesian coordinate system? It turns out that they make fascinating patterns!

Because the hypotenuse length is a function of the leg lengths, it suffices to plot the locations of a and b in the coordinate plane. However, it turns out that when the algorithm produces a particular triple such as (5, 12, 13), it does not produce the symmetric triple (12, 5, 13). To make the pattern more symmetric, we can to manually add the triple (b, a, c) whenever the triple (a, b, c) appears.

The following SAS/IML statements construct many primitive Pythagorean triples and create a scatter plot of the ones for which a and b are both less than 4,500:

p = GenerateTriples(15);                  /* generate many triples */
m = p[ loc(p[,1]<=4500 & p[,2]<=4500), ]; /* exclude large ones */
mm = m // m[,{2 1 3}];                    /* generate symmetric pairs */
ods graphics / width=600px height=600px;
title "Primitive Pythagorean Triples";
title2 "Depth = 15, Symmetric";
call scatter(mm[,1], mm[,2]) procopt="aspect=1"  
     option="markerattrs=(size=3 symbol=SquareFilled)";

The image is shown at the beginning of this article. You can see certain parabolic curves that have a high density of points. There is a similar plot in Wikipedia that includes nonprimitive triples. The nonprimitive triples "fill in" some of the apparent "gaps" in the pattern.

These interesting patterns hint at deep relationships in number theory between Diophantine equations and the geometry of the solutions. Are you tempted to generate a similar image for the generalized equation a3 + b3 = c3? Don't bother: Fermat's Last Theorem states that this equation has no integer solutions!

Post a Comment

An efficient way to increment a matrix diagonal

I was recently asked about how to use the SAS/IML language to efficiently add a constant to every element of a matrix diagonal. Mathematically, the task is to form the matrix sum A + kI, where A is an n x n matrix, k is a scalar value, and I is the identity matrix. The person who asked the question was dealing with large matrices. He knew that explicitly forming the identity matrix is inefficient and that adding n2n zeros to A is unnecessary, but he wasn't sure how to avoid it.

I have previously blogged about how to assign the diagonal elements of a matrix. In that post I wrote a little helper function that will set the diagonal of a matrix to a given vector or scalar value. The function is repeated below:

proc iml;
start SetDiag(A, v);
   diagIdx = do(1,nrow(A)*ncol(A), ncol(A)+1); /* index diagonal elements */
   A[diagIdx] = v;                             /* set diagonal elements */

This function sets the diagonal elements of the matrix A to the value v, which can be a vector or a scalar. To increment (rather than replace) the diagonal, use the VECDIAG function to extract the current diagonal values of the matrix. Add a constant and call the function like this:

m = j(5000, 5000, 1);              /* create a 5000 x 5000 matrix */
run SetDiag(m, vecdiag(m) + 100);  /* increment diagonal elements by 100 */

Timing the performance


Recall that it is easy to use the TIME function in SAS to compare the performance of two competing algorithms. The plot at the left shows the time required to increment a matrix diagonal for a sequence of large matrices, up to 15,000 x 15,000. The blue line represents the naive computation
A + kI. Notice that the time increases quadratically with the size of the matrix because this algorithm adds n2 numbers to the matrix.

In contrast, the red line shows the time that it takes to form a vector that indexes the diagonal elements of a matrix and to replace only those elements. Theoretically, the time should be linear with the size of the matrix, but the computation is so fast that the elapsed time registers as instantaneous for even the largest matrix.

In summary, when working with large matrices, you can save lots of computational effort if you avoid unnecessary computations. Never multiply or add large diagonal matrices unless absolutely necessary. Instead, think about whether it is possible to index into the matrix and change only those numbers that you need to change.

By the way, increasing the diagonal of a matrix is sometimes called "ridging a matrix" and it is used in ridge regression. Ridging a matrix is also sometimes used in numerical computations that involve estimated covariance matrices. Sometimes an estimate is not positive definite, and ridging is one approach to try to obtain an estimate for a covariance matrix that has this important property.

Post a Comment

The distribution of blood types by country

My colleague Robert Allison has a knack for finding fascinating data. Last week he did it again by locating data about how blood types and Rh factors vary among countries.

He produced a series of eight world maps, each showing the prevalence of a blood type (A+, A-, B+, B-, AB+, AB-, O+, and O-) in various countries around the world. As I studied his maps, I noticed that the distribution of blood types in certain ethnic groups (Chinese, Japanese, Indians,...) was different than the distribution in Western Europe and former European colonies.

When dealing with multivariate data, a single visualization is rarely sufficient to answer all the questions that you can ask. Robert's maps answer the question, "What is the spatial distribution of each blood type?" I was curious about a different question: "Within each country, what is the distribution of blood types?" To answer my question, I needed a different visualization of the same multivariate data.


My attempt is shown to the left. (Click to enlarge.) The graph is a stacked bar chart of the percentage of blood type for 63 countries, sorted by the percentage of types that have positive Rh factors. Blood types with positive Rh factors are plotted on the right; negative Rh factors are plotted on the left. The A+ and A- types are plotted closest to the 0% reference line. The next types are AB, B, and O, in increasing distances from the 0% reference line.

A few ethnic differences are apparent. At the top of the chart are Western European countries and former European colonies such as Brazil, Australia, and New Zealand. A little lower on the list are countries in Eastern Europe and Scandinavia.

After that, the list starts to get geographically jumbled. The United States, Canada, and South Africa were all settled by people of multiple ethnicities. The middle of the list is dominated by countries from the Middle East, Northern Africa, and the Near East.

The next set of countries include South American countries such as Argentina and Bolivia, Caribbean countries, India, and African countries.

Finally, the bottom of the list features Asian populations such as China, Japan, Mongolia, and the Philippines. These populations have almost no negative Rh factors in their blood. The distribution of blood types in those countries are similar to each other, although regional differences appear, such as the relatively small number of A+ blood in Thailand.

This one dimensional ranking of countries by blood types reflects historical connections between peoples as a result of conquest, trade, and colonization.

A few countries seem "out of place" in their list order. Lebanon, Ireland and Iceland, and Peru and Chile, are some of the countries whose distribution of blood types differ from those adjacent to them in the list.

Relationships between countries

Some of the "out of place" countries are probably a result of the fact that it is hard to linearly order the countries when there are eight variables to consider. Principal component analysis (PCA) is a statistical technique that can group observations according to similar characterisics. In SAS software, you can use the PRINCOMP procedure to conduct a principal component analysis.

The analysis reveals that 81% of the variation in the data can be explained by the first two principal components. About 92% can be explained by using three principal components, which means that the eight variables (percentages of each blood type) fit well into these lower-dimensional linear subspaces.


The score plot from a two-dimensional PCA analysis is shown to the left. (Click to enlarge.) I added colors to the data to indicate a geographical region for the countries; the regions came from the United Nations list of countries and geographic regions. This plot shows the relationships between countries based on similarities in the distribution of blood types.

The middle of the plot contains African and West Asian nations. (West Asia is the UN name for the region that many people call the Middle East.) The right side of the plot is dominated by European countries and their former colonies. The upper left quadrant contains the Asian countries. The lower left quadrant includes Caribbean, Central American, and South American countries. This presentation once again shows that the distribution of blood types in Peru and Chile are different from other countries, but are similar to each other.

You can download the data and the SAS program that analyzes it and do additional analyses.

What interesting features can you find in these data? Are there other ways to view these data? Leave a comment.

Post a Comment

Binning data by quantiles? Beware of rounded data

In my article about how to create a quantile plot, I chose not to discuss a theoretical issue that occasionally occurs. The issue is that for discrete data (which includes rounded values), it might be impossible to use quantile values to split the data into k groups where each group has approximately the same number of elements. To put it bluntly, the algorithm that I proposed for creating a quantile bin plot can fail for large values of k.

The problem can occur for data that are rounded to the nearest unit, such as age, height, weight, and blood pressure. The algorithm assumes that the data quantiles are (mostly) unique and that they divide the empirical cumulative distribution function (ECDF) of N observations into k groups that each have approximately N/k observations. However, if there are more than N/k repeated values, the repeated value can occupy more than one quantile value. In fact, this will always happen if a particular value is repeated more than 2N/k times.

For example, suppose that you want to divide the diastolic blood pressure data in the Sashelp.Heart data set into 10 groups that each have approximately 10% of the data. The following statements draw the ECDF for the diastolic measurements and mark the deciles:

proc univariate data=Sashelp.Heart;
   var diastolic;
   cdfplot diastolic / vref=(10 to 90 by 10);
   ods select CDFPlot Quantiles;

The value 80 appears in 711 of the 5209 observations, which is about 14% of the data. Notice that the value 80 intersects two reference lines, one at 30 and the other at 40. This means that 80 is both the 30th percentile and the 40th percentile. Assuming that you do not want to split the value 80 across two bins, the deciles of the data produce at most nine bins.

Another way to see this graphically is to use the RANK procedure to try to group the data into 10 groups, as described in the article "Grouping observations based on quantiles." The following statements create a new variable called Group, which for continuous data would have the values 0–9. However, for this rounded data only nine groups exist:

proc sort data=Sashelp.Heart out=Heart;
   by Diastolic;
proc rank data=Heart out=Heart groups=10 ties=high;
   var Diastolic;        /* variable on which to group */
   ranks Group;          /* name of variable to contain groups 0,1,...,k-1 */
proc sgplot data=Heart;
   scatter x=Group y=Diastolic / group=Group;
   refline 80 / axis=y;
   xaxis values=(0 to 9);

As shown by the legend on the scatter plot, only nine groups were created, instead of the 10 that were asked for. Group 3 was not created.

I've described a problem, but what can you do about it? Not a lot. The problem is in the data. If you are flexible about the number of groups, you can try decreasing the value of k. Because each group contains approximately N/k observations, decreasing k might circumvent the problem. Another possibility is to jitter the data by adding a small amount of random noise. That will eliminate the problem of repeated values.

So I'll leave you with this warning: beware of using quantiles to bin rounded data into groups. Although the technique works great when almost all of the data values are distinct, you can run into problems if you ask for many bins and your data contain many repeated values.

Post a Comment

Recent additions to the SAS/IML file exchange

It has been three months since the introduction of the SAS/IML File Exchange, so I thought I'd give a short update on recent submissions and activity.

  • The web site administrators have created an easy-to-remember shortcut to the File Exchange:
  • There have been eight new articles submitted to the File Exchange. The application areas include
    • experimental design
    • discretization of a continuous variable
    • generalized linear models for complex sample data
    • multiple imputations
    • quality improvement

What I love about the SAS/IML File Exchange is that these programs were written by experts in their subject areas. Some programs are based on research that has been presented at conferences or in professional journals. Others are shorter examples and fun programs, such as Ian Wakeling's Sudoku Puzzle Generator! Together, they hint at the wide range of applications to which matrix computations and the SAS/IML language applies.

The SAS/IML file exchange is a valuable resource, but it relies on contributions from SAS/IML programmers who want to share their knowledge and programs. Have you written an interesting SAS/IML program? See the article "How to Contribute to the SAS/IML File Exchange" to find out how easy it is to share your work with others.

If you are more of a consumer than a creator of programs, the File Exchange still needs you. Download and try out the programs that are there. Use the interactive features of the SAS Community to rate the programs, to ask questions, to submit bug reports, or just to say thanks.

The community would love to hear from you. What are you waiting for?

Post a Comment