Elementwise minimum and maximum operators

Like most programming languages, the SAS/IML language has many functions. However, the SAS/IML language also has quite a few operators. Operators can act on a matrix or on rows or columns of a matrix. They are less intuitive than functions, but they can be quite powerful because they enable you to perform computations without writing a DO loop. This article describes the elementwise minimum and maximum operators.

Recently I wrote about how to use subscript operators to compute the min or max of a row or column of a data matrix. These min/max operators are two examples of subscript operators that perform row and column operations on a data matrix. Other subscript operators enable you to compute the sum, mean, and sum of squares for rows or columns of a matrix.

But what if you need to compare the values across two different matrices? To paraphrase a popular ad campaign, "there's an op for that!"

The elementwise minimum operator

Here's an example that illustrates how the elementwise minimum operator works. Suppose that there is a teacher who always gives a bonus question on her quizzes and tests. Quizzes are worth 50 points, tests are worth 100 points, and the bonus question is worth 5 points. However, the teacher does not allow any student to score more than 50 points on a quiz or more than 100 points on a test.

The following SAS/IML statements define the raw scores for three students in the class:

proc iml;
Test = {"Quiz1" "Test1" "Quiz2" "Test2"};
Name = {"Rita", "Sam", "Tim"};
x = {55 105 50 105,
     45  95 55  90,
     15 100 55 105};
print x[r=Name c=Test L="Raw Scores"];

You can see that on several occasions a student has answered all questions correctly, including the bonus question. The teacher wants to cap the scores at some maximum value. The elementwise minimum operator (><) is an easy way to perform this operation. To ensure that 100 is the maximum possible score, the teacher could apply the >< operator, just as in the SAS DATA step:

trunc100 = x >< 100;             /* result is minimum of x[i,j] and 100 */
print trunc100[r=Name c=Test L="Max Score = 100"];

Now the maximum value of any cell is 100. However, columns 1 and 3 represent quiz scores, so they need to be truncated at the maximum value of 50. You might be tempted to loop over the columns and use the CHOOSE function to cap the maximum value of each column, as follows:

target = {50 100 50 100};    /* target[j] is max allowed value of column j */
A = j(nrow(x), ncol(x));     /* allocate new matrix for results */
do j = 1 to ncol(x);
   A[,j] = choose(x[,j] > target[j], target[j], x[,j]);
end;

However, the elementwise minimum operator enables you to compute the result without writing a loop. Because the target vector is a row vector with four columns, the following statement returns a matrix where element (i,j) is the minimum of x[i,j] and target[j]:

A = (x >< target);   /* target[j] is max allowed value of column j */
print A[r=Name c=Test L="Adjusted Scores"];

The elementwise maximum operator

In a similar way, the elementwise maximum operator (<>) enables the teacher to set the smallest possible value for a test score. The teacher might decide that extremely low scores are unduly influential when computing the average grade, so she might decide to make 25 the lowest possible score, as follows:

B = A <> 25;             /* result is maximum of A[i,j] and 25 */

The elementwise minimum and maximum operators work when the second matrix is a scalar, a row vector, a column vector, or a matrix.

  • The expression (A <> scalar) returns a matrix that is the same size as A, and no element is smaller than scalar.
  • The expression (A <> row_vector) returns a matrix that is the same size as A, and no element of the jth column is smaller than the jth element of row_vector.
  • The expression (A <> col_vector) returns a matrix that is the same size as A, and no element of the ith row is smaller than the ith element of col_vector.
  • The expression (A <> matrix) returns a matrix that is the same size as A, and the (i,j)th element is no smaller than the (i,j)th element of matrix.
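
As a minimal sketch of the four cases for the elementwise maximum operator, the following statements use a small made-up matrix A and made-up bounds (none of these values come from the teacher example). The results in the comments follow from taking the larger of each pair of values:

```sas
proc iml;
A = {1 8 3,
     7 2 9};            /* hypothetical 2 x 3 matrix */
s = 5;                  /* scalar bound */
r = {2 4 6};            /* row vector: one bound per column */
c = {3, 8};             /* column vector: one bound per row */
M = {0 9 0,
     9 0 9};            /* matrix of bounds, same size as A */

B1 = A <> s;            /* {5 8 5, 7 5 9}: every element is at least 5 */
B2 = A <> r;            /* {2 8 6, 7 4 9}: column j is at least r[j] */
B3 = A <> c;            /* {3 8 3, 8 8 9}: row i is at least c[i] */
B4 = A <> M;            /* {1 9 3, 9 2 9}: element (i,j) is at least M[i,j] */
print B1, B2, B3, B4;
quit;
```

The elementwise minimum operator (><) accepts the same four shapes and caps the values from above instead.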

Truncating vector values

The elementwise min/max operators can be used to truncate a vector of values. This often comes up when data has small negative values (possibly because of numerical round off) and you want to truncate all negative values to 0.
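
For example, the following sketch uses a hypothetical vector that contains two small negative values and replaces them with 0 by taking the elementwise maximum with 0:

```sas
proc iml;
v = {0.25, -0.000001, 0.5, -0.00002, 1};  /* made-up values with tiny negatives */
w = v <> 0;                               /* larger of v[i] and 0: negatives become 0 */
print v w;
quit;
```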

Another example is when you want to truncate the values of a function. For example, the sine function returns values in the range [–1, 1]. If you want to truncate the sine function at ±1/2, you can use the following shorthand notation:

ods graphics / width=400px height=200px;
t = do(0,12.56,0.03);
z = (-0.5 <> sin(t) >< 0.5); /* truncate within [-0.5, 0.5] */
title "Truncated Sine Function";
call series(t, z);

A Christmas tree from Pascal's triangle

O Christmas tree,
O Christmas tree,
One year a fractal made thee!
O Christmas tree,
O Christmas tree,
A heat map can display thee!

From Pascal's matrix we define!
Reflect across, divide by nine.
O Christmas tree,
O Christmas tree,
Self-similar and so divine!

Eventually I will run out of cute ways to use SAS software to create and visualize a Christmas-themed image. But not this year!

My recent article about how to create Pascal's matrix in SAS included a lower triangular matrix that exhibits self-similar patterns. If you take the Pascal matrix modulo 9 and reflect it across the Y axis, you get a triangular array that looks a little bit like a Christmas tree. You can add a "trunk" to the bottom of the tree to improve the resemblance. As shown in my previous post, you can use a heat map to visualize the matrix. In the following SAS/IML program, I use a green palette of colors to visualize the Pascal triangle values modulo 9. The resulting heat map is shown at the top of this article.

proc iml;
start PascalRule(n);
   m = j(n,n,0);              /* initialize with zeros */
   m[,1] = 1;                 /* set first column to 1 */
   j = 2:n;                   /* elements to compute */
   do k = 2 to n;
      /* for kth row, add adjacent elements from previous row */
      m[k,j] = m[k-1,j-1] + m[k-1,j];
   end;
   return( m );
finish;

/* reflect Pascal's triangle to create left-right symmetry
   and add a tree trunk to create a Christmas tree image */
start PascalTree(n);
   m = PascalRule(n);
   T = m[,n:2] || m[,2:n];    /* reflect; omit column of 1s */
   T[ loc(T=0) ] = .;         /* replace zeros with missing values */
   trunk = j(3,ncol(T),.);    /* add a "tree trunk" with value -1 */
   midpt = ncol(T)/2;         /* note that ncol(T) is even */
   halfwidth = ceil(n/10);
   trunk[,(midpt-halfwidth):(midpt+halfwidth+1)]= -1;
   return( T // trunk );
finish;
m = PascalTree(25);
k = 9;
tree = mod(m,k);
ods graphics / width=300px height=380px;
title = "Happy Holidays to All SAS Users!";
ramp = palette("greens", k);
ramp = "CX26261C" || ramp[,k:1];  /* brown (for trunk) and green palette */
call heatmapdisc(tree) colorramp=ramp displayoutlines=0 title=title;

It is remarkable that the Pascal matrix has so many amazing mathematical properties. Now you can add "makes a reasonable facsimile of a Christmas tree" to the list!

Happy holidays and a wonderful New Year to all of my readers. You are the reason that I write this blog.


The direct product (Kronecker product) in SAS

There are many ways to multiply scalars, vectors, and matrices, but the Kronecker product (also called the direct product) is multiplication on steroids.

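The Kronecker product looks scary, but it is actually simple. The Kronecker product is merely a way to pack multiples of a matrix B into a block matrix. If A is an n x p matrix, then the direct product A@B is the block matrix formed by stacking copies of B into the shape of A and multiplying the (i,j)th block by A_{ij}. In symbols:

$$A @ B = \begin{pmatrix} A_{11}B & A_{12}B & \cdots & A_{1p}B \\ A_{21}B & A_{22}B & \cdots & A_{2p}B \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1}B & A_{n2}B & \cdots & A_{np}B \end{pmatrix}$$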


In SAS software, the Kronecker product is available in the SAS/IML matrix language. The direct product operator in SAS/IML is represented by using the "at sign" (@) operator.

The first matrix in the Kronecker product determines the shape of the final (block) matrix. The following example is equivalent to the horizontal concatenation of B and 2B:

proc iml;
B = {1 2, 2 4};
A = {1 2};       /* dimension of A determines shape of block matrix */
C1 = A@B;        /* 1*B || 2*B */
print C1;

The first matrix can be any shape. To obtain a product that is a block-diagonal matrix, you can form the Kronecker product of a diagonal matrix with another matrix, as follows:

A = {2 0, 0 4};  /* = diag({2 4}) */ 
C2 = A@B;        /* = block(2*B, 4*B) */
print C2;

Block-diagonal matrices are used in mixed models. The Kronecker product makes it easy to construct block matrices where each block is a multiple of a matrix B.

Notice that the product A@B is very simple if A is a matrix of zeros and ones. In that case, the Kronecker product creates a block matrix where each block is either a copy of B or a zero block. For example, execute the statement C3 = I(3)@B to form a block diagonal matrix that contains three copies of B.
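
The statement mentioned above can be run as follows. As an addition for illustration only, the sketch also compares the result with the BLOCK function, which builds a block-diagonal matrix from its arguments:

```sas
proc iml;
B = {1 2, 2 4};
C3 = I(3) @ B;             /* 6 x 6 block-diagonal matrix: three copies of B */
C4 = block(B, B, B);       /* the same matrix, built by the BLOCK function */
isSame = all(C3 = C4);     /* 1 if every element matches */
print C3, isSame;
quit;
```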

In the early days of the SAS/IML language, the Kronecker product was used a lot. Prior to SAS version 8, matrices had to be the same dimension in order to add or subtract them. The Kronecker product can be used to coerce a vector into a matrix shape, and the Kronecker operator appears in graduate-level textbooks about matrix operations in statistics. For example, suppose that you want to center a data matrix by subtracting the mean of each column. A modern SAS/IML program would use the following shorthand expression to center the matrix:

/* read data matrix into X */
use Sashelp.Class; read all var _num_ into X;  close Sashelp.Class;
mean = x[:,];                    /* vector contains mean of each column */
CenterX = X - mean;              /* modern code: Subtract the vector */

In contrast, a SAS/IML program that was written in the 1980s or '90s would use the Kronecker product with a vector of 1s to create a matrix that is the same shape as X. Each row of the resulting matrix is a copy of the mean vector. Here's how an old program (or a textbook) might express the operation that centers a data matrix:

/* Old code: expand mean into matrix that is same shape as X */
CenterX = X - j(nrow(X),1,1) @ mean;  /* subtract matrices of same dim */
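
If you want to convince yourself that the two formulations agree, a quick check is to compute both and look at the largest absolute difference. This is only a sketch; it assumes that the Sashelp.Class data set is available, and the names Modern, Old, and maxDiff are made up for the comparison:

```sas
proc iml;
use Sashelp.Class; read all var _num_ into X; close Sashelp.Class;
mean   = X[:,];                          /* row vector of column means */
Modern = X - mean;                       /* vector is subtracted from each row */
Old    = X - j(nrow(X),1,1) @ mean;      /* Kronecker product expands mean to shape of X */
maxDiff = max(abs(Modern - Old));        /* should be 0 */
print maxDiff;
quit;
```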

Do you use the Kronecker product in your work? How does it appear? Leave a comment.


A matrix computation on Pascal's triangle

A colleague asked me a question regarding my recent post about the Pascal triangle matrix. While responding to his question, I discovered a program that I had written in 1999 that computed with a Pascal triangle matrix. Wow, I've been computing with Pascal's triangle for 15 years! I don't know whether to be proud or embarrassed.

Anyway, here is a neat result from my 1999 program. The Pascal triangle matrix, M, from my last post was lower triangular. Therefore its transpose M` is the Cholesky root of the symmetric positive definite matrix MM`. What is the matrix MM`? The following SAS/IML program computes the answer for a Pascal matrix with 10 rows:

proc iml;
start PascalRule(n);
   m = j(n,n,0);    /* initialize with zeros */
   m[,1] = 1;       /* set first column to 1 */
   j = 2:n;         /* elements to compute */
   do k = 2 to n;
      /* for kth row, add adjacent elements from previous row */
      m[k,j] = m[k-1,j-1] + m[k-1,j];
   end;
   return( m );
finish;
M = PascalRule(10);   /* the Cholesky root of some matrix */
P = M * M`;           /* symmetric positive definite (SPD) */
print P[F=5. r=("R1":"R10") c=("C1":"C10")];

Tilt your head to the left and look carefully at the antidiagonals of the matrix P. The kth antidiagonal contains values from the kth row of Pascal's triangle! In other words, the Cholesky root of a Pascal matrix contains another Pascal matrix!

I didn't discover this fact. In my 1999 program I have a comment that says

See N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 1996, for interesting facts related to the Pascal matrices. In particular, the Cholesky factor of a Pascal matrix has columns that contains the elements of Pascal's triangle!

Several readers posted comments to my previous blog that describe other features of Pascal's matrix. In particular, the inverse and square of a Pascal matrix exhibit interesting characteristics.

By the way, my program from 1999 actually started with the symmetric positive definite Pascal matrix and computed the triangular matrix by calling the ROOT function in SAS/IML. The following module computes the symmetric positive definite Pascal matrix directly:

/* compute the SPD matrix whose antidiagonals are the rows of Pascal's triangle */
start PascalMatrix(n);
  P = j(n,n);                    /* matrix of 1s: first row and column are done */
  do i = 2 to n;
    j = i:n;
    P[i,j] = comb( i+j-2, j-1 );
    P[j,i] = T(P[i,j]);          /* make symmetric */
  end;
  return ( P );
finish;
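
The reverse direction can be sketched in a few lines: start with the symmetric matrix and call the ROOT function to recover the triangular factor. The size n = 6 is an arbitrary choice for illustration, and the module is repeated here so that the example is self-contained:

```sas
proc iml;
start PascalMatrix(n);
   P = j(n,n);                      /* matrix of 1s */
   do i = 2 to n;
      j = i:n;
      P[i,j] = comb(i+j-2, j-1);    /* upper triangular part of row i */
      P[j,i] = T(P[i,j]);           /* make symmetric */
   end;
   return( P );
finish;

P = PascalMatrix(6);
U = root(P);      /* upper triangular Cholesky root: U`*U = P */
print U[F=5.];    /* columns of U hold the rows of Pascal's triangle */
quit;
```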

Pascal's triangle in SAS


Pascal's triangle is the name given to the triangular array of binomial coefficients. The nth row is the set of coefficients in the expansion of the binomial expression (1 + x)^n. Complicated stuff, right?

Well, yes and no. Pascal's triangle is known to many school children who have never heard of polynomials or coefficients because there is a fun way to construct it by using simple addition. You start by writing down the number 1. The second line is the sum of the first line and the result of shifting the first line one column to the right. This process continues: to form a new row of the triangle, add the previous row to a right-shifted copy of the previous row. To make the bookkeeping work out, you pretend that missing items are zero, as shown by the ghostly images of zeros in the image at the top of this article.

In math terms, you can find the kth element of the nth row by adding the (k–1)th and the kth elements of the previous row. This geometric construction is a result of the following recursive relationship between binomial coefficients:

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$$
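
The relationship is easy to check numerically. The following sketch (the choice n = 6 is arbitrary) uses the COMB function to compare the interior elements of row n with the sums of adjacent elements from row n–1:

```sas
proc iml;
n = 6;
k = 1:(n-1);                            /* interior positions of row n */
lhs = comb(n, k);                       /* "n choose k": 6 15 20 15 6 */
rhs = comb(n-1, k-1) + comb(n-1, k);    /* sums of adjacent elements of row n-1 */
print lhs, rhs;                         /* the two rows are identical */
quit;
```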


Recently several participants on the SAS/IML Support Community posted programs that created Pascal's triangle by using the COMB function in SAS software. This article shows how to create Pascal's triangle and how self-similar (fractal) patterns arise when you visualize the triangle in certain ways.

Creating an array of binomial coefficients

The simplest way to create Pascal's matrix is to use the COMB function to generate the values of the binomial coefficients. You can start with a matrix that contains all zeros. The first k elements of the kth row are filled with binomial coefficients. You can do this in the SAS DATA step, or in the SAS/IML language as follows:

proc iml;
start PascalTriangle(n);
   m = j(n,n,0);                    /* matrix with all zeros */
   do k = 1 to n;                   /* for the kth row...    */
      m[k,1:k] = comb(k-1, 0:k-1);  /* fill nonzero elements */
   end;
   return( m );
finish;
T10 = PascalTriangle(10);
print T10[F=3. L="Pascal's Triangle, Level 10" r=("n=0":"n=9")];

The resulting matrix is similar to the image at the top of this post, with the upper triangular elements equal to zero. Notice that the SAS/IML language enables you to pass in a vector argument to a Base SAS function. In this case, a vector (0:k-1) is passed to the COMB function and the resulting row vector overwrites some of the zeros in the matrix m.

Using the previous row to create the next row of Pascal's matrix

You can also create Pascal's matrix by using the "schoolchild method" of adding adjacent elements from the previous row to create the new row. The construction is similar to the way that you can construct cellular automata. The following SAS/IML module constructs Pascal's triangle by using addition; no need to call the COMB function!

start PascalRule(n);
   m = j(n,n,0);    /* initialize with zeros */
   m[,1] = 1;       /* set first column to 1 */
   j = 2:n;         /* elements to compute */
   do k = 2 to n;
      /* for kth row, add adjacent elements from previous row */
      m[k,j] = m[k-1,j-1] + m[k-1,j];
   end;
   return( m );
finish;
T10 = PascalRule(10);
print T10[F=3. L="Pascal's Triangle, Level 10" r=("n=0":"n=9")];

Self-similar structures in Pascal's triangle

At first glance, the numbers in Pascal's triangle have a simple structure. The edges of the triangle are all 1. The interior values increase geometrically, reaching their maximum values in the middle of the final row. However, if you label each value according to whether it is odd or even, a surprising pattern reveals itself!

The following SAS/IML program creates a Pascal matrix with 56 rows. The upper-triangular elements are set to missing values. The program then creates a discrete heat map that shows the parity (even or odd) of the remaining elements. Even numbers are displayed as white squares; odd numbers are displayed as black squares.

ods graphics / width=400px height=380px;
m = PascalRule(56);
m[ loc(col(m)>row(m)) ] = .;  /* replace zeros with missing values */
mod2 = mod(m,2);
call heatmapdisc(mod2) colorramp={WHITE BLACK}
     displayoutlines=0 title="Pascal's Triangle mod 2";

The resulting heat map bears a striking resemblance to the fractal known as Sierpinski's triangle. This fact is not widely known, but the image is comprehensible to school-age children. However, few children have the patience and stamina to color hundreds of cells, so using software to color the triangle is definitely recommended!

The fascinating self-similar pattern might inspire you to wonder what happens if the elements are colored according to some other scheme. For example, what is the pattern if you divide the elements of Pascal's triangle by 3 and visualize the remainders? Or division by 4? Or 5? The following loop creates three heat maps that visualize the numbers in Pascal's triangle modulo k, where k = 3, 4, and 5.

do k = 3 to 5;
   mod = mod(m,k);
   ramp = palette("greys", k);
   title = "Pascal's Triangle mod " + char(k,1);
   call heatmapdisc(mod) colorramp=ramp displayoutlines=0 title=title;
end;

The "mod 4" result is shown. The other heat maps are similar. Each shows a self-similar pattern. One of the surprising results of chaos theory and fractals is that these complicated self-similar structures can arise from simple iterative arithmetic operations (adding adjacent elements) followed by a modular operation.

Creating larger triangles

The astute reader will have noticed that 56 rows is a curious number. Why not 64 rows? Or 100? The answer is that the modular operations require that the numbers in Pascal's triangle be exactly representable as integers. Although you can compute more than 1,000 rows of Pascal's triangle in double precision, at some point the numbers grow so large that they can no longer be represented exactly by an 8-byte numeric value.

You can use the SAS CONSTANT function to find the largest integer, B, such that smaller integers (in magnitude) are exactly represented by an 8-byte numeric value. It turns out that the largest value in a Pascal triangle with k rows is "k choose floor(k/2)," and this value exceeds B when k=57, as shown by the following statements. Thus the modulo operations will become inaccurate for k>56.

B = constant('ExactInt'); 
print B;   
k = T(55:58);
c = comb(k, floor(k/2));
print k c[L="k choose [k/2]"] (c<B)[L="Exact?"];

I think Pascal's triangle is very cool. When did you first encounter Pascal's triangle? Were you fascinated or bored? Share your story in the comments.


Compute maximum and minimum values for rows and columns in SAS

A common question on SAS discussion forums is how to compute the minimum and maximum values across several variables. It is easy to compute statistics across rows by using the DATA step. This article shows how to compute the minimum and maximum values for each observation (across variables) and, for completeness, for each variable. If you think of your numerical data as being in a matrix, the task is to compute the minimum and maximum values for each row or for each column. You can do this in Base SAS or by using a powerful subscript operator in the SAS/IML matrix language.

The data in this article are Fisher's famous iris data, which contains the widths and lengths of the petals and sepals of 150 iris flowers. This data is distributed in the Sashelp.Iris data set as part of SAS software.

The minimum and maximum values of variables: Base SAS

The MEANS procedure is the simplest way to compute the minimum and maximum values for each numeric variable in a data set. The following statements display the extreme values for the four numerical variables in the iris data:

proc means nolabels data=Sashelp.Iris Min Max;
   output out=MinMaxCols;
run;

Because the VAR statement is not specified, all numeric variables are used. (You could also specify var _numeric_;) The OUTPUT statement writes the minimum and maximum values to the MinMaxCols data set, along with a few other useful statistics.

The minimum and maximum values of observations: Base SAS

The SAS DATA step contains the MIN and MAX functions, which return the minimum and maximum nonmissing values (respectively) from a list of variables. You can read all of the numerical variables in a data set into an array and call the MIN and MAX functions as follows:

data MinMaxRows;
   set sashelp.Iris;
   array x {*} _numeric_;    /* x[1] is 1st var,...,x[4] is 4th var */
   min = min(of x[*]);       /* min value for this observation */
   max = max(of x[*]);       /* max value for this observation */
run;

proc print data=MinMaxRows(obs=7);
   var _numeric_;
run;

You can see that the MIN variable contains the minimum value of each row and the MAX variable contains the maximum value. Notice that you can use the _NUMERIC_ keyword to automatically assign the contents of the array x. This DATA step will work for any input data that contains numeric variables because the variable names are not hard-coded! Also note that you can use the OF operator (sometimes called the OF keyword) to specify that the MIN and MAX functions should operate across all elements in the array.

The minimum and maximum values of columns in a matrix

The SAS/IML language contains a number of operators (called subscript reduction operators) that you can use to perform simple statistical operations down matrix columns or across matrix rows. This makes it easy to compute the maximum value of each row or column, and similarly for the minimum value.

The important operators are the max subscript operator (<>) and the min subscript operator (><). You use these operators like subscripts. To find extrema for columns, use these operators in place of row subscripts, as follows:

proc iml;
use Sashelp.Iris;
read all var _NUM_ into X[c=varNames];
close Sashelp.Iris;
minC = X[><, ];    /* row vector contains min of columns */
maxC = X[<>, ];    /* row vector contains max of columns */
print (minC//maxC)[r={"Min" "Max"} c=varNames];

For years I struggled to remember which combination of greater-than and less-than symbols was the min operator and which was the max operator. Eventually I developed the following mental image. Think of the minimum as being the bottom of a valley. If you are viewing a valley from the far side of a lake, the image looks like a greater-than sign placed next to a less-than sign. Similarly, think of a maximum as being the peak of a mountain. If you are viewing the mountain from across a lake, the image looks like a less-than sign placed next to a greater-than sign. These mnemonic aids are shown in the following image:

The minimum and maximum values of rows in a matrix

In a similar way, you can compute the minimum and maximum values of each row of a matrix, as follows:

minR = X[ ,><];    /* column vector contains min of rows */
maxR = X[ ,<>];    /* column vector contains max of rows */

The MinR and MaxR vectors each contain 150 elements. The value MinR[i] is the minimum value of the ith row and the value MaxR[i] is the maximum value of the ith row.
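
The other subscript reduction operators follow the same pattern. As a brief sketch (the names sumR, meanR, and ssqR are made up for this example), the following statements compute the sum, mean, and sum of squares of each row of the iris data and print the results for the first five rows:

```sas
proc iml;
use Sashelp.Iris;
read all var _NUM_ into X;
close Sashelp.Iris;
sumR  = X[ ,+];     /* column vector contains sum of each row */
meanR = X[ ,:];     /* column vector contains mean of each row */
ssqR  = X[ ,##];    /* column vector contains sum of squares of each row */
print (sumR[1:5] || meanR[1:5] || ssqR[1:5])[c={"Sum" "Mean" "SSQ"}];
quit;
```

Moving the operator to the row position, as in X[+, ], gives the corresponding statistic for each column.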

One of the nice aspects of the SAS/IML matrix language is its symmetry: operations on rows and operations on columns are often closely related to each other. This is in contrast to Base SAS, where the DATA step is often the best choice for computing statistics for observations, whereas procedures are often easier to use for computing statistics for variables.


The Wishart distribution: Covariance matrices for multivariate normal data

I've written about how to generate a sample from a multivariate normal (MVN) distribution in SAS by using the RANDNORMAL function in SAS/IML software. Last week a SAS/IML programmer showed me a program that simulated MVN data and computed the resulting covariance matrix for each simulated sample. The purpose of the program was to study properties of the covariance matrices.

The programmer was pleased when I told him that SAS/IML software provides a simpler and more efficient way to simulate covariance and correlation matrices for MVN data. You can generate the covariance matrices directly by using the RANDWISHART function, which generates matrices from the Wishart distribution.

What is the Wishart distribution?

Before thinking about covariance matrices for multivariate normal data, let's recall a theoretical result for univariate data: For a sample of size n drawn from a normal distribution, the sample variance (appropriately scaled) follows a chi-square distribution with n–1 degrees of freedom. This means that if you want to study properties of the sample variance, you don't need to generate normal data. Instead you can draw a random chi-square variate and rescale it to produce a sample variance. No normal samples required!
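
As a minimal sketch of that idea, the following statements draw chi-square variates with the RANDGEN subroutine and rescale them by σ²/(n–1) to obtain simulated sample variances. The values n = 50 and σ² = 9 are assumptions chosen for illustration:

```sas
proc iml;
call randseed(12345);
n = 50;                               /* sample size (assumed) */
sigma2 = 9;                           /* population variance (assumed) */
NumSamples = 1000;
c = j(NumSamples, 1);                 /* allocate vector for chi-square variates */
call randgen(c, "ChiSquare", n-1);    /* chi-square with n-1 degrees of freedom */
s2 = sigma2 * c / (n-1);              /* each element is a simulated sample variance */
avgS2 = s2[:];                        /* average should be close to sigma2 */
print avgS2;
quit;
```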

This result generalizes to multivariate normal data. If you draw a sample from an MVN distribution with covariance matrix Σ, the sample covariance matrix (appropriately scaled) has a sampling distribution that is called the Wishart distribution. You can think of the Wishart distribution as a multivariate generalization of the chi-square distribution. It is a distribution of symmetric positive-definite matrices. A random draw from the Wishart distribution is some matrix that, upon rescaling, is a covariance matrix for MVN data.

From data to covariance matrices

Suppose that you want to approximate the sampling distribution of the correlation coefficient between two correlated normal variables in a sample of size 50. The straightforward approach is to simulate 50 observations from the bivariate normal distribution, compute the correlation coefficient for the sample, and then repeat the process many times in order to approximate the distribution of the correlation coefficients. An implementation in PROC IML follows:

proc iml;
call randseed(12345);
N = 50;                              /* MVN sample size   */
Sigma = {9 1,                        /* population covariance; correlation = 1/3 */
         1 1};
NumSamples = 1000;                   /* number of samples in simulation */
/* First attempt: Generate MVN data; compute correlation from data */
corr = j(NumSamples, 1, .);          /* allocate space for results */
do i = 1 to NumSamples;
   X = randnormal(N, {0 0}, Sigma);  /* MVN sample of size 50 */
   corr[i] = corr(X)[2];             /* corr = off-diagonal element */
end;
title "Distribution of Correlation Coefficient";
title2 "N=50; rho = 1/3";
call histogram(corr) xvalues=do(-0.2,0.7,0.1)
                     other="refline 0.333 / axis=x";

The histogram shows the approximate sampling distribution for the correlation coefficient when the population parameter is ρ = 1/3. You can see that almost all the sample correlations are positive, a few are negative, and that most correlations are close to the population parameter of 1/3.

Sampling from the Wishart distribution in SAS

In the previous section, notice that the MVN data is not used except to compute the sample correlation matrix. If we don't need it, why bother to simulate it? The following program shows how you can directly generate the covariance matrices from the Wishart distribution: draw a matrix from the Wishart distribution with n–1 degrees of freedom, then rescale by dividing the matrix by n–1.

/* More efficient: Don't generate MVN data, generate covariance matrix DIRECTLY! 
   Each row of A is scatter matrix; each row of B is a covariance matrix */
A = RandWishart(NumSamples, N-1, Sigma); /* N-1 degrees of freedom */
B = A / (N-1);                           /* rescale to form covariance matrix */
do i = 1 to NumSamples;
   cov = shape(B[i,], 2, 2);             /* convert each row to square matrix */
   corr[i] = cov2corr(cov)[2];           /* convert covariance to correlation */
end;
call histogram(corr) xvalues=do(-0.2,0.7,0.1);

The histogram of the correlation coefficients is similar to the previous histogram and is not shown. Notice that the second method does not simulate any data! This can be quite a time-saver if you are studying the properties of covariance matrices for large samples with dozens of variables.

The RANDWISHART function actually returns a sample scatter matrix, which is equivalent to the crossproduct matrix X`X, where X is an N x p matrix of centered MVN data. You can divide by N–1 to obtain a covariance matrix.

The return value from the RANDWISHART function is a big matrix, each row of which contains a single draw from the Wishart distribution. The elements of the matrix are "flattened" so that they fit in a row in row-major order. For p-dimensional MVN data, the number of columns will be p^2, which is the number of elements in the p x p covariance matrix. The following table shows the first five rows of the matrix B:


The first row contains elements for a symmetric 2 x 2 covariance matrix. The (1,1) element is 11.38, the (1,2) and (2,1) elements are 0.9, and the (2,2) element is 0.73. These sample variances and covariances are close to the population values of 9, 1, and 1. You can use the SHAPE function to change the row into a 2 x 2 matrix. If necessary, you can use the COV2CORR function to convert the covariance matrix into a correlation matrix.
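
As a small sketch of those two steps, the following statements type in the four values quoted above (rather than reading them from B), reshape them into a square matrix, and convert the result to a correlation matrix:

```sas
proc iml;
row = {11.38 0.9 0.9 0.73};     /* first row of B, as quoted in the text */
cov = shape(row, 2, 2);         /* 2 x 2 covariance matrix, filled in row-major order */
corr = cov2corr(cov);           /* correlation matrix; off-diagonal is roughly 0.31 */
print cov, corr;
quit;
```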

Next time you are conducting a simulation study that involves MVN data, think about whether you really need the data or whether you are just using the data to form a covariance or correlation matrix. If you don't need the data, use the RANDWISHART function to generate matrices from the Wishart distribution. You can speed up your simulation and avoid generating MVN data that are not needed.


Overview of new features in SAS/IML 13.1

SAS software contains a lot of features, and each release adds more. To make sure that you do not miss new features that appear in the SAS/IML language, the word cloud on the right sidebar of my blog contains numbers that relate to SAS or SAS/IML releases. For example, you can click on "9.3" to read about features that first appeared in SAS 9.3. I have also written summaries of recent SAS/IML releases:

Over the past year I've blogged about features that were new to SAS/IML 13.1, which was released in December, 2013, as part of the first maintenance release of SAS 9.4 (SAS 9.4m1). This article collects all those blog posts together for easy reference.

New functions and subroutines in SAS/IML 13.1

The following blog posts discuss new functions and subroutines for data analysis in SAS/IML 13.1:

  • The CV function computes the sample coefficient of variation.
  • The SKEWNESS function computes the sample skewness.
  • The KURTOSIS function computes the sample kurtosis.
  • The LOGABSDET function computes the natural logarithm of the absolute value of the determinant of a matrix. (Say that three times fast!)
  • The PARENTNAME function enables a module to learn the name of a SAS/IML matrix that was passed in as an argument.

In addition, the SAS/IML 13.1 User's Guide documents two new subroutines for solving linear programming problems:

  • The LPSOLVE subroutine solves linear programming problems. LPSOLVE replaces the older LP call, which has been deprecated.
  • The MILPSOLVE subroutine is a new subroutine for solving mixed integer linear programming problems. It implements effective techniques for finding optimal solutions for linear objective functions that satisfy certain constraints.

New support for heat maps

There are also new routines for creating heat maps that visualize matrices. I produced a video about heat maps, as well as the following articles:

  • The HEATMAPCONT subroutine creates a heat map that uses a continuous color ramp to visualize a matrix.
  • The HEATMAPDISC subroutine creates a heat map that uses a discrete color ramp to visualize a matrix that contains a small number of distinct values.
  • The PALETTE function enables you to choose color palettes that reflect sound design principles.

Enhancements to functionality

There were several enhancements and language improvements in SAS/IML 13.1. For example, the ABORT and STOP statements now optionally print a user-defined message. Another change is that the order of resolution has changed for user-defined modules, so that it is easier to override built-in functions. I direct you to the "What's New" chapter of the documentation for additional new features.

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be useful to SAS/IML users who are upgrading their version of SAS. Have you used any of these new features? Leave a comment and tell me which is your favorite.


Resampling and permutation tests in SAS

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in SAS/IML. This provides an apples-to-apples comparison because both SAS/IML and R are matrix-vector languages.

This permutation test is a simple resampling exercise that could be assigned as a homework problem in a classroom. If you are at a college or university, remember that SAS/IML is available for free for all academic users through the SAS University Edition.

Permutation tests in SAS/IML

The analysis was motivated by a talk about using computational methods to illuminate statistical analyses. The data are the number of mosquitoes that were attracted to human volunteers in an experiment after each volunteer had consumed either a liter of beer (n=25) or water (n=18). The following statements assign the experimental data to two SAS/IML vectors and compute the observed difference between the means of the two groups:

proc iml;
G1 = {27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20};
G2 = {21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20};
obsdiff = mean(G1) - mean(G2);
print obsdiff;

The experimenters observed that, on average, people who drank beer attracted 4.4 more mosquitoes than people who drank water. The statistical question is, "What is the probability of observing a difference of this magnitude (or bigger) by chance if the beverages have no effect?" You can answer this question by using a permutation test to perform a nonparametric version of the t test. The null hypothesis is that there is no difference between the mean number of mosquitoes that were attracted to each experimental group (beer or water).

The permutation test enables you to generate the null distribution. Draw 25 random observations from the data and assign them to Group 1; assign the other 18 observations to Group 2. Compute the difference between the means of each group. Repeat these two steps many times to approximate the null distribution. The following SAS/IML statements use the SAMPLE function in SAS/IML to permute the data. The permutation step is repeated 9,999 times so that (adding in the original data order) there are a total of 10,000 permutations of the data:

call randseed(12345);                             /* set random number seed */
alldata = G1 // G2;                        /* stack data in a single vector */
N1 = nrow(G1);  N = N1 + nrow(G2);
NRepl = 9999;                                     /* number of permutations */
nulldist = j(NRepl,1);                   /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR");                       /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);  /* difference of means */
end;

title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;

The histogram shows the distribution of mean differences that were computed under the assumption of the null hypothesis. The observed difference between the beer and water groups (the vertical red line at 4.38) is way off in the tail. Since the null hypothesis is not a likely explanation for the observed difference, we reject it. We conclude that mosquitoes are attracted differently to the two groups (beer and water).

If you would like to compute the empirical p-value for the null distribution, that is easily accomplished:

pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
print pval;
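In symbols, that statement computes the following estimate. The notation is mine, not part of the program: d_k is the k-th permuted difference of means, d_obs is the observed difference, and B is the number of random permutations (NRepl = 9999):

```latex
\[
\hat{p} \;=\; \frac{1 + \#\{\, k : |d_k| \ge |d_{\mathrm{obs}}| \,\}}{B + 1},
\qquad B = 9999
\]
```

The absolute values make this a two-sided p-value. Adding 1 to the numerator and the denominator counts the original ordering of the data as one of the 10,000 permutations, so the estimate can never be exactly zero.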

Vectorization for permutation tests

Regular readers of my blog know that I advocate vectorizing programs whenever possible. Matrix-vector languages such as SAS/IML, R, and MATLAB work more efficiently when computations inside loops are replaced by vector or matrix computations.

Because of the way that SAS/IML loops are compiled and optimized, using loops in the SAS/IML language is not as detrimental to performance as in some other languages. For example, the previous permutation test code runs in about 0.04 seconds on my PC from 2009. Still, I like to promote vectorization because it can be important to performance.

The following statements eliminate the DO loop and implement the resampling and permutation test in two lines of SAS/IML code. The vectorized computation runs in about one-fourth the time:

x = sample(alldata, N//NRepl, "WOR");               /* create all resamples */
nulldist = x[,1:N1][,:] - x[,(N1+1):N][,:]; /* compute all mean differences */

The vectorized computation uses the colon (:) subscript reduction operator in SAS/IML to compute the mean of the first 25 and the last 18 elements for each set of permuted data.
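If the chained subscripts look cryptic, a tiny example might help. This is only a sketch on a made-up 2 x 4 matrix (the names m, left, right, and diff are hypothetical), in which each row plays the role of one permuted sample:

```sas
proc iml;
m = {1 2 3 4,
     5 6 7 8};            /* hypothetical data: 2 "resamples" of 4 values each */
left  = m[,1:2][,:];      /* extract columns 1-2, then take the mean of each row */
right = m[,3:4][,:];      /* extract columns 3-4, then take the mean of each row */
diff  = left - right;     /* one difference of means per row */
print left right diff;
```

The row means are 1.5 and 5.5 for the first two columns and 3.5 and 7.5 for the last two, so both differences are -2. The permutation test does the same thing with 25 and 18 columns and 9,999 rows.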

Additional references for resampling in SAS

To learn more about efficient ways to implement resampling methods such as bootstrapping and permutation tests, consult the following references:

  • For information about bootstrapping in SAS/IML, see pages 14–17 of Wicklin (2008).
  • For another permutation test example, see pages 11–14 of Wicklin (2012).
  • Chapter 15 of my book Simulating Data with SAS describes resampling methods in both Base SAS and SAS/IML. I include useful tips and techniques that make bootstrapping in Base SAS less cumbersome, including a mention of the %BOOT and %BOOTCI macros, which enable you to implement bootstrap methods in Base SAS by using only a few program statements.
  • For an excellent introduction to resampling methods in Base SAS, see Cassell (2007).

What is the coefficient of variation?

I sometimes wonder whether some functions and options in SAS software ever get used. Last week I was reviewing new features that were added to SAS/IML 13.1. One of the new functions is the CV function, which computes the sample coefficient of variation for data.

Maybe it is just me, but when I compute descriptive statistics for univariate data, the coefficient of variation is not a statistic that I look at. I don't think my undergraduate statistics course even mentioned the coefficient of variation (CV). I first encountered the idea many years later when learning about distribution theory.

The CV is a simple idea. For a distribution, the coefficient of variation is the ratio of the standard deviation to the mean: CV = σ/μ. You can estimate the coefficient of variation from a sample by using the ratio of the sample standard deviation and the sample mean, usually multiplied by 100 so that it is on the percent scale. This ratio is also known as the relative standard deviation when the data are positive.

What does the coefficient of variation mean?

The coefficient of variation is a dimensionless quantity. As such, it provides a measure of the variability of a sample without reference to the scale of the data.

Suppose I tell two people to measure the heights of some plants. The first person reports that the average height is 1.2 meters, with a standard deviation of 0.275 meters. The second person measures the same plants in centimeters. She reports that the average height is 120 centimeters, with a standard deviation of 27.5 centimeters. Obviously, these are the same answers, but one person reports a standard deviation of 0.275 (which sounds small) whereas the other person reports a standard deviation of 27.5 (which sounds big). The coefficient of variation comes to the rescue: for both sets of measurements the coefficient of variation is 22.9.
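To spell out the arithmetic, here is the computation for each set of units, with the ratio multiplied by 100 to put it on the percent scale. The subscripts are just labels for this illustration:

```latex
\[
\mathrm{CV}_{\text{m}} = 100 \times \frac{0.275}{1.2} \approx 22.9,
\qquad
\mathrm{CV}_{\text{cm}} = 100 \times \frac{27.5}{120} \approx 22.9
\]
```

The factor of 100 that converts meters to centimeters appears in both the numerator and the denominator, so it cancels.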

The CV can also help you compare two completely different measurements. How does variation in height compare to variation in weight? Or age? Or income? These variables are measured on different scales and use different units, but the CV (which is dimensionless) enables you to compare the variation of these variables.

How to compute the coefficient of variation in SAS

The coefficient of variation is computed by several SAS procedures: MEANS, UNIVARIATE, IML, TABULATE, and so forth. The following example shows data for the plant measurement example in the previous section. The MEANS and IML procedures compute the CV for measurements on the meter and centimeter scales:

data Plants;
input height @@;
cm = height * 100;
datalines;
1.6 1.5 .8 1.0 1.2 .9 1.2 1.8 1.2 1.3 1.3 .9 1.2 1.0 1.1
;

proc means data=Plants N mean std cv;
run;

proc iml;
use Plants; read all var _NUM_ into X[c=varNames]; close;
cv = cv(X);
print cv[c=varNames];

Theoretical uses of the coefficient of variation

The coefficient of variation has some interesting uses as a theoretical tool. It enables you to compare the variation between different probability distributions. As I mentioned in my article on fat-tailed and long-tailed distributions, the exponential distribution is an important reference distribution in the theory of distributions. Because the standard deviation and the mean of an exponential distribution are equal, the exponential distribution has a CV equal to 1. Distributions with CV < 1 are considered low-variance distributions. Distributions with CV > 1 are high-variance distributions.
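Here is a short derivation of that claim, written with the CV as a plain ratio rather than a percentage. The rate parameterization of the exponential distribution is a choice I am making for this sketch. The gamma family, which is not discussed above, is included only as a familiar example that can fall on either side of 1:

```latex
\[
X \sim \mathrm{Exp}(\lambda):\quad
\mu = \frac{1}{\lambda},\quad
\sigma = \frac{1}{\lambda},\quad
\mathrm{CV} = \frac{\sigma}{\mu} = 1
\]
\[
Y \sim \mathrm{Gamma}(\alpha, \theta):\quad
\mu = \alpha\theta,\quad
\sigma = \sqrt{\alpha}\,\theta,\quad
\mathrm{CV} = \frac{1}{\sqrt{\alpha}}
\]
```

For the gamma family, a shape parameter α > 1 gives CV < 1, and α < 1 gives CV > 1. The case α = 1 is the exponential distribution.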

Obviously the coefficient of variation is undefined for distributions whose mean is zero, such as the standard normal and t distributions, which is perhaps why the CV is not widely used. The sample CV is undefined for centered data and is highly variable when the population mean is close to zero.

Do you use the coefficient of variation?

Have you ever used the coefficient of variation in a real data analysis problem? Is the CV a useful but underutilized statistic for practical data analysis? Or is it primarily a theoretical tool for comparing the variability of distributions? Leave a comment.
