A matrix computation on Pascal's triangle

A colleague asked me a question regarding my recent post about the Pascal triangle matrix. While responding to his question, I discovered a program that I had written in 1999 that computed with a Pascal triangle matrix. Wow, I've been computing with Pascal's triangle for 15 years! I don't know whether to be proud or embarrassed.

Anyway, here is a neat result from my 1999 program. The Pascal triangle matrix, M, from my last post was lower triangular. Therefore its transpose M` is the Cholesky root of the symmetric positive definite matrix MM`. What is the matrix MM`? The following SAS/IML program computes the answer for a Pascal matrix with 10 rows:

proc iml;
start PascalRule(n);
   m = j(n,n,0);    /* initialize with zeros */
   m[,1] = 1;       /* set first column to 1 */
   j = 2:n;         /* elements to compute */
   do k = 2 to n;
      /* for kth row, add adjacent elements from previous row */
      m[k,j] = m[k-1,j-1] + m[k-1,j];
   end;
   return(m);
finish;
 
M = PascalRule(10);   /* the Cholesky root of some matrix */
P = M * M`;           /* symmetric positive definite (SPD) */
print P[F=5. r=("R1":"R10") c=("C1":"C10")];
pascalmatrix

Tilt your head to the left and look carefully at the antidiagonals of the matrix P. The kth antidiagonal contains values from the kth row of Pascal's triangle! In other words, the Cholesky root of a Pascal matrix contains another Pascal matrix!

I didn't discover this fact. In my 1999 program I have a comment that says

See N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 1996, for interesting facts related to the Pascal matrices. In particular, the Cholesky factor of a Pascal matrix has columns that contains the elements of Pascal's triangle!

Several readers posted comments to my previous blog that describe other features of Pascal's matrix. In particular, the inverse and square of a Pascal matrix exhibit interesting characteristics.

By the way, my program from 1999 actually started with the symmetric positive definite Pascal matrix and computed the triangular matrix by calling the ROOT function in SAS/IML. The following module computes the symmetric positive definite Pascal matrix directly:

/* compute the SPD matrix whose antidiagonals are the rows of Pascal's triangle */
start PascalMatrix(n);
  P = j(n,n);
  do i = 2 to n;
    j = i:n;
    P[i,j] = comb( i+j-2, j-1 );
    P[j,i] = T(P[i,j]);          /* make symmetric */
  end;
  return ( P );      
finish;
Post a Comment

Pascal's triangle in SAS

PascalsTriangle2

Pascal's triangle is the name given to the triangular array of binomial coefficients. The nth row is the set of coefficients in the expansion of the binomial expression (1 + x)n. Complicated stuff, right?

Well, yes and no. Pascal's triangle is known to many school children who have never heard of polynomials or coefficients because there is a fun way to construct it by using simple addition. You start by writing down the number 1. The second line is the sum of the first line and the result of shifting the first line one column to the right. This process continues: to form a new row of the triangle, add the previous row to a right-shifted copy of the previous row. To make the bookkeeping work out, you pretend that missing items are zero, as shown by the ghostly images of zeros in the image at the top of this article.

In math terms, you can find the kth element of the nth row, by adding the (k–1)th and the kth elements of the previous row. This geometric construction is a result of the following recursive relationship between binomial coefficients:

binomialcoefs

Recently several participants on the SAS/IML Support Community posted programs that created Pascal's triangle by using the COMB function in SAS software. This article shows how to create Pascal's triangle and how self-similar (fractal) patterns arise when you visualize the triangle in certain ways.

Creating an array of binomial coefficients

The simplest way to create Pascal's matrix is to use the COMB function to generate the values of the binomial coefficients. You can start with a matrix that contains all zeros. The first k elements of the kth row are filled with binomial coefficients. You can do this in the SAS DATA step, or in the SAS/IML language as follows:

proc iml;
start PascalTriangle(n);
   m = j(n,n,0);                    /* matrix with all zeros */
   do k = 1 to n;                   /* for the kth row...    */
      m[k,1:k] = comb(k-1, 0:k-1);  /* fill nonzero elements */
   end;
   return(m);
finish;
 
T10 = PascalTriangle(10);
print T10[F=3. L="Pascal's Triangle, Level 10" r=("n=0":"n=9")];

The resulting matrix is similar to the image at the top of this post, with the upper triangular elements equal to zero. Notice that the SAS/IML language enables you to pass in a vector argument to a Base SAS function. In this case, a vector (0:k-1) is passed to the COMB function and the resulting row vector overwrites some of the zeros in the matrix m.

Using the previous row to create the next row of Pascal's matrix

You can also create Pascal's matrix by using the "schoolchild method" of adding adjacent elements from the previous row to create the new row. The construction is similar to the way that you can construct cellular automata. The following SAS/IML module constructs Pascal's triangle by using addition; no need to call the COMB function!

start PascalRule(n);
   m = j(n,n,0);    /* initialize with zeros */
   m[,1] = 1;       /* set first column to 1 */
   j = 2:n;         /* elements to compute */
   do k = 2 to n;
      /* for kth row, add adjacent elements from previous row */
      m[k,j] = m[k-1,j-1] + m[k-1,j];
   end;
   return(m);
finish;
 
T10 = PascalRule(10);
print T10[F=3. L="Pascal's Triangle, Level 10" r=("n=0":"n=9")];

Self-similar structures in Pascal's triangle

At first glance, the numbers in Pascal triangle have a simple structure. The edges of the triangle are all 1. The interior values increase geometrically, reaching their maximum values in the middle of the final row. However, if you label each value according to whether it is odd or even, a surprising pattern reveals itself!

The following SAS/IML program creates a Pascal matrix with 56 rows. The upper-triangular elements are set to missing values. The program then creates a discrete heat map that shows the parity (even or odd) of the remaining elements. Even numbers are displayed as white squares; odd numbers are displayed as black squares.
ods graphics / width=400px height=380px;
m = PascalRule(56);
m[ loc(col(m)>row(m)) ] = .;  /* replace zeros with missing values */
mod2 = mod(m,2);
call heatmapdisc(mod2) colorramp={WHITE BLACK}
     displayoutlines=0 title="Pascal's Triangle mod 2";
PascalsTriangle3

The resulting heat map bears a striking resemblance to the fractal known as Sierpinski's triangle. This fact is not widely known, but the image is comprehensible to school-age children. However, few children have the patience and stamina to color hundreds of cells, so using software to color the triangle is definitely recommended!

The fascinating self-similar pattern might inspire you to wonder what happens if the elements are colored according to some other scheme. For example, what is the pattern if you divide the elements of Pascal's triangle by 3 and visualize the remainders? Or division by 4? Or 5? The following loop creates three heat maps that visualize the numbers in Pascal's triangle modulo k, where k = 3, 4, and 5.

do k = 3 to 5;
   mod = mod(m,k);
   ramp = palette("greys", k);
   title = "Pascal's Triangle mod " + char(k,1);
   call heatmapdisc(mod) colorramp=ramp displayoutlines=0 title=title;
end;
PascalsTriangle4

The "mod 4" result is shown. The other heat maps are similar. Each shows a self-similar pattern. One of the surprising results of chaos theory and fractals is that these complicated self-similar structures can arise from simple iterative arithmetic operations (adding adjacent elements) followed by a modular operation.

Creating larger triangles

The astute reader will have noticed that 56 rows is a curious number. Why not 64 rows? Or 100? The answer is that the modular operations require that the numbers in Pascal's triangle be exactly representable as an integer. Although you can compute more than 1,000 rows of Pascal's triangle in double precision, at some point the numbers grow so large that they can no longer be represented exactly with an 8-byte integer.

You can use the SAS CONSTANT function to find the largest integer, B, such that smaller integers (in magnitude) are exactly represented by an 8-byte numeric value. It turns out that the largest value in a Pascal triangle with k rows is "k choose floor(k/2)," and this value exceeds B when k=57, as shown by the following statements. Thus the modulo operations will become inaccurate for k>56.

B = constant('ExactInt'); 
print B;   
 
k = T(55:58);
c = comb(k, floor(k/2));
print k c[L="k choose [k/2]"] (c<B)[L="Exact?"];
PascalsTriangle5

I think Pascal's triangle is very cool. When did you first encountered Pascal's triangle? Were you fascinated or bored? Share your story in the comments.

Post a Comment

Compute maximum and minimum values for rows and columns in SAS

A common question on SAS discussion forums is how to compute the minimum and maximum values across several variables. It is easy to compute statistics across rows by using the DATA step. This article shows how to compute the minimum and maximum values for each observation (across variables) and, for completeness, for each variable. If you think of your numerical data as being in a matrix, the task is to compute the minimum and maximum values for each row or for each column. You can do this in Base SAS or by using a powerful subscript operator in the SAS/IML matrix language.

The data in this article are Fisher's famous iris data, which contains the widths and lengths of the petals and sepals of 150 iris flowers. This data is distributed in the Sashelp.Iris data set as part of SAS software.

The minimum and maximum values of variables: Base SAS

The MEANS procedure is the simplest way to compute the minimum and maximum values for each numeric variable in a data set. The following statements display the extreme values for the four numerical variables in the iris data:

proc means nolabels data=Sashelp.Iris Min Max;
   output out=MinMaxCols;
run;
t_minmax1

By not specifying the VAR statement, all numeric variables are used. (You could also specify var _numeric_;) The OUTPUT statement writes the minimum an maximum values to the MinMaxCols data set, along with a few other useful statistics.

The minimum and maximum values of observations: Base SAS

The SAS DATA step contains the MIN and MAX functions, which return the minimum and maximum nonmissing values (respectively) from a list of variables. You can read all of the numerical variables in a data set into an array and call the MIN and MAX functions as follows:

data MinMaxRows;
   set sashelp.Iris;
   array x {*} _numeric_;    /* x[1] is 1st var,...,x[4] is 4th var */
   min = min(of x[*]);       /* min value for this observation */
   max = max(of x[*]);       /* max value for this observation */
run;
proc print data=MinMaxRows(obs=7);
   var _numeric_;
run;
t_minmax2

You can see that the MIN variable contain the minimum value of each row and the MAX variable contains the maximum value. Notice that you can use the _NUMERIC_ keyword to automatically assign the contents of the array x. This DATA step will work for any input data that contains numeric variables because the variable names are not hard-coded! Also note that you can use the OF operator (sometimes called the OF keyword) to specify that the MIN and MAX functions should operate across all elements in the array.

The minimum and maximum values of columns in a matrix

The SAS/IML language contains a number of operators (called subscript reduction operators) that you can use to perform simples statistical operations down matrix columns or across matrix rows. This makes it easy to compute the maximum value of each row or column, and similarly for the minimum value.

The important operators are the max subscript operator (<>) and the min subscript operator (><). You use these operators like subscripts. To find extrema for columns, use these operators in place of row subscripts, as follows:

proc iml;
use Sashelp.Iris;
read all var _NUM_ into X[c=varNames];
close Sashelp.Iris;
 
minC = X[><, ];    /* row vector contains min of columns */
maxC = X[<>, ];    /* row vector contains max of columns */
print (minC//maxC)[r={"Min" "Max"} c=varNames];
t_minmax3

For years I struggled to remember which combination of greater-than and less-than symbols was the min operator and which was the max operator. Eventually I developed the following mental image. Think of the minimum as being the bottom of a valley. If you are viewing a valley from the far side of a lake, the image looks like a greater-than sign placed next to a less-than sign. Similarly, think of a maximum as being the peak of a mountain. If you are viewing the mountain from across a lake, the image looks like a less-than sign placed next to a great-than sign. These mnemonic aids are shown in the following image:

The minimum and maximum values of rows in a matrix

In a similar way, you can compute the minimum and maximum values of each row of a matrix, as follows:

minR = X[ ,><];    /* column vector contains min of rows */
maxR = X[ ,<>];    /* column vector contains max of rows */

The MinR and MaxR vectors each contain 150 elements. The value MinR[i] is the minimum value of the ith row and the value MaxR[i] is the maximum value of the ith row.

One of the nice aspects of the SAS/IML matrix language is its symmetry: operations on rows and operations on columns are often closely related to each other. This is in contrast to Base SAS, where the DATA step is often the best choice for computing statistics for observations, whereas procedures are often easier to use for computing statistics for variables.

Post a Comment

The Wishart distribution: Covariance matrices for multivariate normal data

I've written about how to generate a sample from a multivariate normal (MVN) distribution in SAS by using the RANDNORMAL function in SAS/IML software. Last week a SAS/IML programmer showed me a program that simulated MVN data and computed the resulting covariance matrix for each simulated sample. The purpose of the program was to study properties of the covariance matrices.

The programmer was pleased when I told him that SAS/IML software provides a simpler and more efficient way to simulate covariance and correlation matrices for MVN data. You can generate the covariance matrices directly by using the RANDWISHART function, which generates matrices from the Wishart distribution.

What is the Wishart distribution?

Before thinking about covariance matrices for multivariate normal data, let's recall a theoretical result for univariate data: For a sample of size n drawn from a normal distribution, the sample variance (appropriately scaled) follows a chi-square distribution with n–1 degrees of freedom. This means that if you want to study properties of the sample variance, you don't need to generate normal data. Instead you can draw a random chi-square variate and rescale it to produce a sample variance. No normal samples required!

This result generalizes to multivariate normal data. If you draw a sample from a MVN distribution with covariance matrix Σ, the sample covariance matrix (appropriately scaled) has a sampling distribution that is called the Wishart distribution. You can think of the Wishart distribution as a multivariate generalization of the chi-square distribution. It is a distribution of symmetric positive-definite matrices. A random draw from the Wishart distribution is some matrix that, upon rescaling, is a covariance matrix for MVN data.

From data to covariance matrices

Suppose that you want to approximate the sampling distribution of the correlation coefficient between two correlated normal variables in a sample of size 50. The straightforward approach is to simulate 50 observations from the bivariate normal distribution, compute the correlation coefficient for the sample, and then repeat the process many times in order to approximate the distribution of the correlation coefficients. An implementation in PROC IML follows:

proc iml;
call randseed(12345);
N = 50;                              /* MVN sample size   */
Sigma = {9 1,                        /* population covariance; correlation = 1/3 */
         1 1};
NumSamples = 1000;                   /* number of samples in simulation */
 
/* First attempt: Generate MVN data; compute correlation from data */
corr = j(NumSamples, 1, .);          /* allocate space for results */
do i = 1 to NumSamples;
   X = randnormal(N, {0 0}, Sigma);  /* MVN sample of size 50 */
   corr[i] = corr(X)[2];             /* corr = off-diagonal element */
end;
 
title "Distribution of Correlation Coefficient";
title2 "N=50; rho = 1/3";
call histogram(corr) xvalues=do(-2,0.7,0.1)
                     other="refline 0.333 / axis=x";
wishart

The histogram shows the approximate sampling distribution for the correlation coefficient when the population parameter is ρ = 1/3. You can see that almost all the sample correlations are positive, a few are negative, and that most correlations are close to the population parameter of 1/3.

Sampling from the Wishart distribution in SAS

In the previous section, notice that the MVN data is not used except to compute the sample correlation matrix. If we don't need it, why bother to simulate it? The following program shows how you can directly generate the covariance matrices from the Wishart distribution: draw a matrix from the Wishart distribution with n–1 degrees of freedom, then rescale by dividing the matrix by n–1.

/* More efficient: Don't generate MVN data, generate covariance matrix DIRECTLY! 
   Each row of A is scatter matrix; each row of B is a covariance matrix */
A = RandWishart(NumSamples, N-1, Sigma); /* N-1 degrees of freedom */
B = A / (N-1);                           /* rescale to form covariance matrix */
do i = 1 to NumSamples;
   cov = shape(B[i,], 2, 2);             /* convert each row to square matrix */
   corr[i] = cov2corr(cov)[2];           /* convert covariance to correlation */
end;
 
call histogram(corr) xvalues=do(-2,0.7,0.1);

The histogram of the correlation coefficients is similar to the previous histogram and is not shown. Notice that the second method does not simulate any data! This can be quite a time-saver if you are studying the properties of covariance matrices for large samples with dozens of variables.

The RANDWISHART distribution actually returns a sample scatter matrix, which is equivalent to the crossproduct matrix X`X, where X is an N x p matrix of centered MVN data. You can divide by N–1 to obtain a covariance matrix.

The return value from the RANDWISHART function is a big matrix, each row of which contains a single draw from the Wishart distribution. The elements of the matrix are "flattened" so that they fit in a row in row-major order. For p-dimensional MVN data, the number of columns will be p2, which is the number of elements in the p x p covariance matrix. The following table shows the first five rows of the matrix B:

t_wishart

The first row contains elements for a symmetric 2 x 2 covariance matrix. The (1,1) element is 11.38, the (1,2) and (2,1) elements are 0.9, and the (2,2) element is 0.73. These sample variances and covariances are close to the population values of 9 1, and 1. You can use the SHAPE function to change the row into a 2 x 2 matrix. If necessary, you can use the COV2CORR function to convert the covariance matrix into a correlation matrix.

Next time you are conducting a simulation study that involves MVN data, think about whether you really need the data or whether you are just using the data to form a covariance or correlation matrix. If you don't need the data, use the RANDWISHART function to generate matrices from the Wishart distribution. You can speed up your simulation and avoid generating MVN data that are not needed.

Post a Comment

Overview of new features in SAS/IML 13.1

SAS software contains a lot of features, and each release adds more.To make sure that you do not miss new features that appear in the SAS/IML language, the word cloud on the right sidebar of my blog contains numbers that relate to SAS or SAS/IML releases. For example, you can click on "9.3" to read about features that first appeared in SAS 9.3. I have also written summaries of recent SAS/IML releases:

Over the past year I've blogged about features that were new to SAS/IML 13.1, which was released in December, 2013, as part of the first maintenance release of SAS 9.4 (SAS 9.4m1). This article collects all those blog posts together for easy reference.

New functions and subroutines in SAS/IML 13.1

The following blog posts discuss new functions and subroutines for data analysis in SAS/IML 13.1:

  • The CV function computes the sample coeficient of variation.
  • The SKEWNESS function computes the sample skewness.
  • The KURTOSIS function computes the sample kurtosis.
  • The LOGABSDET function computes the natural logarithm of the absolute value of the determinant of a matrix. (Say that three times fast!)
  • The PARENTNAME function enables a module to learn the name of a SAS/IML matrix that was passed in as an argument.

In addition, the SAS/IML 13.1 User's Guide documents two new functions for solving linear programming problems:

  • The LPSOLVE subroutine solve linear programming problems. LPSOLVE replaces the older LP call, which has been deprecated.
  • The MILPSOLVE subroutine is a new subroutine for solving mixed integer linear programming problems. It implements effective techniques for finding optimal solutions for linear objective functions that satisfy certain constraints.

New support for heat maps

There are also new routines for creating heat maps that visualize matrices. I produced a video about heat maps, as well as the following articles:

  • The HEATMAPCONT subroutine creates a heat map that uses a continuous color ramp to visualize a matrix.
  • The HEATMAPDISC subroutine creates a heat map that uses a discrete color ramp to visualize a matrix that contains a small number of distinct values.
  • The PALETTE function enables you to choose color palettes that reflect sound design principles.

Enhancements to functionality

There were several enhancements and language improvements in SAS/IML 13.1. For example, the ABORT and STOP statements now optionally print a user-defined message. Another change is that the order of resolution has changed for user-defined modules, so that it is easier to override built-in functions. I direct you to the "What's New" chapter of the documentation for additional new features.

It can be hard to keep up with enhancements to SAS software. Hopefully this reference page will be a useful to SAS/IML users who are upgrading their version of SAS. Have you used any of these new features? Leave a comment and tell me which is your favorite.

Post a Comment

Resampling and permutation tests in SAS

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in SAS/IML. This provides an apples-to-apples comparison because both SAS/IML and R are matrix-vector languages.

This permutation test is a simple resampling exercise that could be assigned as a homework problem in a classroom. If you are at a college or university, remember that SAS/IML is available for free for all academic users through the SAS University Edition.

Permutation tests in SAS/IML

The analysis was motivated by a talk about using computational methods to illuminate statistical analyses. The data are the number of mosquitoes that were attracted to human volunteers in an experiment after each volunteer had consumed either a liter of beer (n=25) or water (n=18). The following statements assign the experimental data to two SAS/IML vectors and compute the observed difference between the means of the two groups:

proc iml;
G1 = {27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20};
G2 = {21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20};
obsdiff = mean(G1) - mean(G2);
print obsdiff;
t_permutationtest

The experimenters observed that, on average, people who drank beer attracted 4.4 more mosquitoes than people who drank water. The statistical question is, "What is the probability of observing a difference of this magnitude (or bigger) by chance if the beverages have no effect?" You can answer this question by using a permutation test to perform a nonparametric version of the t test. The null hypothesis is that there is no difference between the mean number of mosquitoes that were attracted to each experimental group (beer or water).

The permutation test enables you to generate the null distribution. Draw 25 random observations from the data and assign them to Group 1; assign the other 18 observations to Group 2. Compute the difference between the means of each group. Repeat these two steps many times to approximate the null distribution. The following SAS/IML statements use the SAMPLE function in SAS/IML to permute the data. The permutation step is repeated 9,999 times so that (adding in the original data order) there are a total of 10,000 permutations of the data:

call randseed(12345);                             /* set random number seed */
alldata = G1 // G2;                        /* stack data in a single vector */
N1 = nrow(G1);  N = N1 + nrow(G2);
NRepl = 9999;                                     /* number of permutations */
nulldist = j(NRepl,1);                   /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR");                       /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);  /* difference of means */
end;
 
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;
permutationtest

The histogram shows the distribution of mean differences that were computed under the assumption of the null hypothesis. The observed difference between the beer and water groups (the vertical red line at 4.38) is way off in the tail. Since the null hypothesis is not a likely explanation for the observed difference, we reject it. We conclude that mosquitoes are attracted differently to the two groups (beer and water).

If you would like to compute the empirical p-value for the null distribution, that is easily accomplished:

pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
print pval;
t_permutationtest2

Vectorization for permutation tests

Regular readers of my blog know that I advocate vectorizing programs whenever possible. Matrix-vector languages such as SAS/IML, R, and MATLAB work more efficiently when computations inside loops are replaced by vector or matrix computations.

Because of the way that SAS/IML loops are compiled and optimized, using loops in the SAS/IML language is not as detrimental to performance as in some other languages. For example, the previous permutation test code runs in about 0.04 seconds on my PC from 2009. Still, I like to promote vectorization because it can be important to performance.

The following statements eliminate the DO loop and implement the resampling and permutation test in two lines of SAS/IML code. The vectorized computation runs in about one-fourth the time:

x = sample(alldata, N//NRepl, "WOR");               /* create all resamples */
nulldist = x[,1:N1][,:] - x[,(N1+1):N][,:]; /* compute all mean differences */

The vectorized computation uses the colon (:) subscript reduction operator in SAS/IML to compute the mean of the first 25 and the last 18 elements for each set of permuted data.

Additional references for resampling in SAS

To learn more about efficient ways to implement resampling methods such as bootstrapping and permutation tests, consult the following references:

  • For information about bootstrapping in SAS/IML, see pages 14–17 of Wicklin (2008).
  • For another permutation test example, see pages 11–14 of Wicklin (2012).
  • Chapter 15 of my book Simulating Data with SAS describes resampling methods in both Base SAS and SAS/IML. I include useful tips and techniques that make bootstrapping in Base SAS less cumbersome, including a mention of the %BOOT and %BOOTCI macros, which enable you to implement bootstrap methods in Base SAS by using only a few program statements.
  • For an excellent introduction to resampling methods in Base SAS, see Cassell (2007).
Post a Comment

What is the coefficient of variation?

I sometimes wonder whether some functions and options in SAS software ever get used. Last week I was reviewing new features that were added to SAS/IML 13.1. One of the new functions is the CV function, which computes the sample coefficient of variation for data.

Maybe it is just me, but when I compute descriptive statistics for univariate data, the coefficient of variation is not a statistic that I look at. I don't think my undergraduate statistics course even mentioned the coefficient of variation (CV). I first encountered the idea many years later when learning about distribution theory.

The CV is a simple idea. For a distribution, the coefficient of variation is the ratio of the standard deviation to the mean: CV = σ/μ. You can estimate the coefficient of variation from a sample by using the ratio of the sample standard deviation and the sample mean, usually multiplied by 100 so that it is on the percent scale. This ratio is also known as the relative standard deviation when the data are positive.

What does the coefficient of variation mean?

The coefficient of variation is a dimensionless quantity. As such, it provides a measure of the variability of a sample without reference to the scale of the data.

Suppose I tell two people to measure the heights of some plants. The first person reports that the average height is 1.2 meters, with a standard deviation of 0.275 meters. The second person measures the same plants in centimeters. She reports that the average height is 120 centimeters, with a standard deviation of 27.5 centimeters. Obviously, these are the same answers, but one person reports a standard deviation of 0.275 (which sounds small) whereas the other person reports a standard deviation of 27.2 (which sounds big). The coefficient of variation comes to the rescue: for both sets of measurements the coefficient of variation is 22.9.

The CV can also help you compare two completely different measurements. How does variation in height compare to variation in weight? Or age? Or income? These variables are measured on different scales and use different units, but the CV (which is dimensionless) enables you to compare the variation of these variables.

How to compute the coefficient of variation in SAS

The coefficient of variation is computed by several SAS procedures: MEANS, UNIVARIATE, IML, TABULATE, and so forth. The following example shows data for the plant measurement example in the previous paragraph. The MEANS and IML procedure compute the CV for measurements on the meter and centimeter scales:

data Plants;
input height @@;
cm = height * 100;
datalines;
1.6 1.5 .8 1.0 1.2 .9 1.2 1.8 1.2 1.3 1.3 .9 1.2 1.0 1.1
;
proc means data=Plants N mean std cv;
run;
 
proc iml;
use Plants; read all var _NUM_ into X[c=varNames]; close;
cv = cv(X);
print cv[c=varNames];
cv

Theoretical uses of the coefficient of variation

The coefficient of variation has some interesting uses as a theoretical tool. It enables you to compare the variation between different probability distributions. As I mentioned in my article on fat-tailed and long-tailed distributions, the exponential distribution is an important reference distribution in the theory of distributions. Because the standard deviation and the mean of an exponential distribution are equal, the exponential distribution has a CV equal to 1. Distributions with CV < 1 are considered low-variance distributions. Distributions with CV > 1 are high-variance distributions.

Obviously the coefficient of variation is undefined for symmetric distributions such as the normal and t distributions, which is perhaps why the CV is not widely used. The sample CV is undefined for centered data and is highly variable when the population mean is close to zero.

Do you use the coefficient of variation?

Have you ever used the coefficient of variation in a real data analysis problem? Is the CV a useful but underutilized statistic for practical data analysis? Or is it primarily a theoretical tool for comparing the variability of distributions? Leave a comment.

Post a Comment

The difference between RUN and CALL for SAS/IML subroutines

Have you ever noticed that some SAS/IML programmers use the CALL statement to call a subroutine, whereas others use the RUN statement? Have you ever wondered why the SAS/IML language has two statements that do the same thing?

It turns out that the CALL statement and the RUN statement do not do the same thing! Read on to discover how they differ.

By the way, the RUN statement in PROC IML has only one purpose: to execute a subroutine. Many other SAS procedures (such as REG, GLM, and DATASETS) use the RUN statement to tell SAS to run the procedure. PROC IML is different. Never put the RUN statement at the end of a PROC IML program.

Built-in versus user-defined subroutines

A main feature of the SAS/IML language is that it enables you to define your own modules. A module is a user-defined function or subroutine that you can call from an IML program. A module that returns a value is called a function module; a module that does not return a value is called a subroutine module.

Although I do not recommend that you do so, it is possible to define a subroutine module that has the same name as a built-in subroutine. These two routines co-exist. If you use the RUN statement, SAS/IML will call the subroutine that you wrote. If you use the CALL statement, SAS/IML will call the built-in subroutine.

So that's the difference between the RUN and CALL statements: The RUN statement looks first for a user-defined module with the specified name. If it finds it, then it calls that module. Otherwise, it calls the built-in subroutine that has the specified name. The CALL statement looks for a built-in subroutine and immediately calls that routine if it is found.

An example: Overriding the EIGEN subroutine

As I said, I don't recommend that you routinely write user-defined modules that have the same name as a SAS/IML built-in, but let's examine a situation in which it might be convenient to do so. Let's assume that you often read data sets that represent symmetric matrices, and that these data sets are stored in lower triangular form with missing value in the upper triangular portion of the matrix. For example, this is the default storage method for distance matrices that are created by using the DISTANCE procedure in SAS/STAT software, as shown below:

proc distance data=Sashelp.Cars out=Dist method=Euclid;
   where Type="Truck";
   var interval(Horsepower--Length / std=Std);
   id Model;
run;
 
proc iml;
use Dist;   read all var _NUM_ into D[r=Model];   close Dist;
call heatmapcont(D) title="Distance Matrix";
runcall

The program creates a 24 x 24 symmetric matrix of distances between 24 observations in a six-dimensional space of variables. The program reads the distance matrix into a SAS/IML matrix. A heat map shows the matrix values. The dark gray color shows that the upper triangular portion of the matrix is missing. The white diagonal elements show that the diagonal of the matrix is exactly zero. The remaining cells indicate the distance between observations. Nearly white cells indicate that two observations are close together. Dark cells indicate that observations are relatively far apart.

Suppose that you want to compute the eigenvalues and eigenvectors of this matrix. The built-in EIGEN subroutine in SAS/IML can compute these quantities, but it expects a matrix that does not have any missing values. Therefore to compute the eigenvalues you should first extract the lower triangular elements of the matrix and copy them into the upper triangular portion of the matrix.

You could write a separate subroutine that only copies the lower triangular elements, but for this example I will write a subroutine that has the same name as the built-in EIGEN subroutine. The following statements define a module that inspects the upper triangular elements of a matrix. If the upper triangular elements are all missing, it replaces those missing values with the lower triangular elements. It then calls the built-in EIGEN function to compute the eigenvalues and eigenvectors:

start StrictLowerTriangular(X);    /* return lower triangular elements */
   return( remove(vech(X), cusum(1 || (ncol(X):2))) );
finish;
 
/* define EIGEN module as a custom override of the built-in subroutine */
start Eigen(eval, evec, A);
   r = row(A); c = col(A);
   UpperIdx = loc(c>r);
   /* if upper triangular elements are missing, copy from lower */
   if all(A[UpperIdx]=.) then
      A[UpperIdx] = StrictLowerTriangular(A);
   call eigen(eval, evec, A);          /* CALL the built-in EIGEN subroutine */
finish;
 
run eigen(eval, evec, D);              /* RUN the user-defined EIGEN subroutine */

In summary, the CALL and RUN statements enable you to choose between a built-in subroutine and a custom subroutine that have the same name.

The situation is a little different for user-defined functions. Beginning with SAS/IML 13.1, you can define a function that has the same name as a built-in function, and the order of resolution for calling functions ensures that a user-defined function is found before a built-in function that has the same name. This means that if you choose to override a SAS/IML function, you lose the ability to call the built-in function until you quit PROC IML.

Have you ever had the need to override a built-in SAS/IML subroutine? Let me know the details by leaving a comment.

Post a Comment

The distribution of Pythagorean triples

pythagdistrib

When I studied high school geometry, I noticed that many homework problems involved right triangles whose side lengths were integers. The canonical example is the 3-4-5 right triangle, which has legs of length 3 and 4 and a hypotenuse of length 5. The triple (3, 4, 5) is called a Pythagorean triple because it satisfies the Pythagorean theorem: 32 + 42 = 52. Similarly, (5, 12, 13) and (7, 24, 25) are Pythagorean triples that sometimes appear in geometry textbooks.

In general, a triple of natural numbers (a, b, c) is a Pythagorean triple if a2 + b2 = c2. Obviously if (a, b, c) is a Pythagorean triple, then so is (ka, kb, kc) for any natural number k. Geometrically, this transformation is not very interesting because it simply scales a right triangle into a similar triangle that has the same shape. I did not learn any trick in high school that enabled me to start with a Pythagorean triple and generate a new, geometrically different, right triangle.

Generating Pythagorean triples

Because my high school geometry book reused the same three or four triples over and over again, I assumed that it is difficult to generate Pythagorean triples. In fact, that is not the case: There are several formulas for generating Pythagorean triples, some of which were known to the ancient Greeks.

One interesting algorithm comes from the 20th-century and involves linear transformations via matrices. Here's how it works. Write any Pythagorean triple, v, as a row vector. Then the linear transformation v*L is a new Pythagorean triple, where L is any of the three following linear transformations:

pythagxform

For example, if you represent v = (3, 4, 5) as a row vector, then v*L1 = (5, 12, 13). You can multiply on the right again to get another new triple, such as v*L1*L3 = (45, 28, 53).

In fact, more is true. A primitive Pythagorean triple is a triple such that the three numbers have no nontrivial divisors, that is, they are relatively prime. It turns out that all primitive Pythagorean triples can be obtained by iteratively applying one of three linear transformations to the triple (3, 4, 5). In other words, the triple (3, 4, 5) is the "parent" of all primitive Pythagorean triples!

That is a fabulous result. It means that you can write a program that uses matrix multiplication to produce arbitrarily many primitive Pythagorean triples, as follows:

  • Start with v = (3, 4, 5).
  • Apply L1, L2, and L3 to create the first generation of children.
  • Apply the three transformations to each of the children to create a generation of nine grandchildren.
  • Continue this process. The ith generation will contain 3i triples.

You can implement this algorithm in the SAS/IML matrix language as follows:

proc iml;
start GenerateTriples(depth=3);
   L1 = { 1  2  2,
         -2 -1 -2, 
          2  2  3};
   L2 = { 1  2  2,
          2  1  2,
          2  2  3};
   L3 = {-1 -2 -2,
          2  1  2,
          2  2  3};
   NumTriples = sum(3##(0:Depth));  /* total number of triples */
   triples = j(NumTriples,3,.);     /* allocate array for results */
   triples[1,] = {3 4 5};           /* parent triple */
   k = 1; n = 2;
   do i = 1 to Depth;               /* for each generation */
      do j = 1 to 3##(i-1);         /* generate 3##i children */
         x = triples[k,];
         triples[n,]   = x*L1;
         triples[n+1,] = x*L2; 
         triples[n+2,] = x*L3; 
         k = k + 1;   n = n + 3;
      end;
   end;
   return( triples );
finish;
 
p = GenerateTriples();
print p;
t_pythag

The parent and the first three generations account for 1 + 3 + 9 + 27 = 40 triples. The list contains some old familiar friends, as well as a few strangers that I haven't met before.

The distribution of Pythagorean triples

What do the Pythagorean triples look like if you plot their location in a Cartesian coordinate system? It turns out that they make fascinating patterns!

Because the hypotenuse length is a function of the leg lengths, it suffices to plot the locations of a and b in the coordinate plane. However, it turns out that when the algorithm produces a particular triple such as (5, 12, 13), it does not produce the symmetric triple (12, 5, 13). To make the pattern more symmetric, we can to manually add the triple (b, a, c) whenever the triple (a, b, c) appears.

The following SAS/IML statements construct many primitive Pythagorean triples and create a scatter plot of the ones for which a and b are both less than 4,500:

p = GenerateTriples(15);                  /* generate many triples */
m = p[ loc(p[,1]<=4500 & p[,2]<=4500), ]; /* exclude large ones */
mm = m // m[,{2 1 3}];                    /* generate symmetric pairs */
 
ods graphics / width=600px height=600px;
title "Primitive Pythagorean Triples";
title2 "Depth = 15, Symmetric";
call scatter(mm[,1], mm[,2]) procopt="aspect=1"  
     option="markerattrs=(size=3 symbol=SquareFilled)";

The image is shown at the beginning of this article. You can see certain parabolic curves that have a high density of points. There is a similar plot in Wikipedia that includes nonprimitive triples. The nonprimitive triples "fill in" some of the apparent "gaps" in the pattern.

These interesting patterns hint at deep relationships in number theory between Diophantine equations and the geometry of the solutions. Are you tempted to generate a similar image for the generalized equation a3 + b3 = c3? Don't bother: Fermat's Last Theorem states that this equation has no integer solutions!

Post a Comment

An efficient way to increment a matrix diagonal

I was recently asked about how to use the SAS/IML language to efficiently add a constant to every element of a matrix diagonal. Mathematically, the task is to form the matrix sum A + kI, where A is an n x n matrix, k is a scalar value, and I is the identity matrix. The person who asked the question was dealing with large matrices. He knew that explicitly forming the identity matrix is inefficient and that adding n2n zeros to A is unnecessary, but he wasn't sure how to avoid it.

I have previously blogged about how to assign the diagonal elements of a matrix. In that post I wrote a little helper function that will set the diagonal of a matrix to a given vector or scalar value. The function is repeated below:

proc iml;
start SetDiag(A, v);
   diagIdx = do(1,nrow(A)*ncol(A), ncol(A)+1); /* index diagonal elements */
   A[diagIdx] = v;                             /* set diagonal elements */
finish;

This function sets the diagonal elements of the matrix A to the value v, which can be a vector or a scalar. To increment (rather than replace) the diagonal, use the VECDIAG function to extract the current diagonal values of the matrix. Add a constant and call the function like this:

m = j(5000, 5000, 1);              /* create a 5000 x 5000 matrix */
run SetDiag(m, vecdiag(m) + 100);  /* increment diagonal elements by 100 */

Timing the performance

matrixdiag

Recall that it is easy to use the TIME function in SAS to compare the performance of two competing algorithms. The plot at the left shows the time required to increment a matrix diagonal for a sequence of large matrices, up to 15,000 x 15,000. The blue line represents the naive computation
A + kI. Notice that the time increases quadratically with the size of the matrix because this algorithm adds n2 numbers to the matrix.

In contrast, the red line shows the time that it takes to form a vector that indexes the diagonal elements of a matrix and to replace only those elements. Theoretically, the time should be linear with the size of the matrix, but the computation is so fast that the elapsed time registers as instantaneous for even the largest matrix.

In summary, when working with large matrices, you can save lots of computational effort if you avoid unnecessary computations. Never multiply or add large diagonal matrices unless absolutely necessary. Instead, think about whether it is possible to index into the matrix and change only those numbers that you need to change.

By the way, increasing the diagonal of a matrix is sometimes called "ridging a matrix" and it is used in ridge regression. Ridging a matrix is also sometimes used in numerical computations that involve estimated covariance matrices. Sometimes an estimate is not positive definite, and ridging is one approach to try to obtain an estimate for a covariance matrix that has this important property.

Post a Comment