Analyzing the first 10 million digits of pi: Randomness within structure


Saturday, March 14, 2015, is Pi Day, and this year is a super-special Pi Day! This is your once-in-a-lifetime chance to celebrate the first 10 digits of pi (π) by doing something special on 3/14/15 at 9:26:53. Apologies to my European friends, but Pi Day requires that you represent dates with the month placed first in order to match the sequence 3.141592653....

Last year I celebrated Pi Day by using SAS to explore properties of the continued fraction expansion of pi. This year, I will examine statistical properties of the first 10 million digits of pi. In particular, I will show that the digits of pi exhibit statistical properties that are inherent in a random sequence of integers.

Editor's note (17Mar2015): Ken Kleinman remarked on the similarity of the analysis in this article to his own analysis from 2010. I was reading his blog regularly in 2010, so I suspect that I unconsciously remembered his chi-square and Durbin-Watson analyses and used them without properly citing him. My apologies to Ken and Nick Horton for this oversight!

Reading 10 million digits of pi

I have no desire to type in 10 million digits, so I will use SAS to read a text file at a Princeton University URL. The following statements use the FILENAME statement to point to the URL:

/* read data over the internet from a URL */
filename rawurl url ""
                /* proxy='' */ ;
data PiDigits;
   infile rawurl lrecl=10000000;
   input Digit 1. @@;
   Position = _n_;
   Diff = dif(digit);      /* compute difference between adjacent digits */
proc print data=PiDigits(obs=9);
   var Digit;

The PiDigits data set contains 10 million rows. The call to PROC PRINT displays the first few decimal digits, which are (skipping the 3) 141592653....

For other ways to use SAS to download data from the internet, search Chris Hemedinger's blog, The SAS Dummy for "PROC HTTP" and you will find several examples of how to download data from a URL.

The distribution of digits of pi

You can run many statistical tests on these numbers. It is conjectured that the digits of pi are randomly uniformly distributed in the sense that the digits 0 through 9 appear equally often, as do pairs of digits, trios of digits, and so forth.

You can call PROC FREQ to compute the frequency distribution of the first 10 million digits of pi and to test whether the digits appear to be uniformly distributed:

/* Are the digits 0-9 equally distributed? */
proc freq data = PiDigits;
tables Digit/ chisq out=DigitFreq;

The frequency analysis of the first 10 million digits shows that each digit appears about one million times. A chi-square test indicates that the digits appear to be uniformly distributed. If you turn on ODS graphics, PROC FREQ also produces a deviation plot that shows that the deviations from uniformity are tiny.

A "pi chart" of the distribution of the digits of pi

As an advocate of the #OneLessPie Chart Initiative, I am intellectually opposed to creating pie charts. However, I am going to grit my teeth and make an exception for this super-special Pi Day. You can use the Graph Template Language (GTL) to create a pie chart. Even simpler, Sanjay Matange has written a SAS macro that creates a pie chart with minimal effort. The following DATA step create a percentage variable and then calls Sanjay's macro:

data DigitFreq;
   set DigitFreq;
   Pct = Percent/100; 
   format Pct PERCENT8.2;
/* macro from */
%GTLPieChartMacro(data=DigitFreq, category=Digit, response=Pct,
         title=Distribution of First 10 Million Digits of Pi,

The pie chart appears at the top of this article. It shows that the digits 0 through 9 are equally distributed.

Any autocorrelation in the sequence?

In the DATA step that read the digits of pi, I calculated the difference between adjacent digits. You can use the SGPLOT procedure to create a histogram that shows the distribution of this quantity:

proc sgplot data=PiDigits(obs=1000000);  
   vbar Diff;

That's a pretty cool triangular distribution! I won't bore you with mathematical details, but this shape arises when you examine the difference between two independent discrete uniform random variables, which suggests that the even digits of pi are independent of the odd digits of pi.

In fact, more is true. You can run a formal test to check for autocorrelation in the sequence of numbers. The Durbin-Watson statistic, which is available in PROC REG and PROC AUTOREG, has a value near 2 if a series of values has no autocorrelation. The following call to PROC AUTOREG requests the Durbin-Watson statistic for first-order through fifth-order autocorrelation for the first one million digits of pi. The results show that there is no detectable autocorrelation through fifth order. To the Durban-Watson test, the digits of pi are indistinguishable from a random sequence:

proc autoreg data=PiDigits(obs=1000000);  
   model Digit = / dw=5 dwprob;

Are the digits of pi random?

Researchers have run dozens of statistical tests for randomness on the digits of pi. They all reach the same conclusion. Statistically speaking, the digits of pi seems to be the realization of a process that spits out digits uniformly at random.

Nevertheless, mathematicians have not yet been able to prove that the digits of pi are random. One of the leading researchers in the quest commented that if they are random then you can find in the sequence (appropriately converted into letters) the "entire works of Shakespeare" or any other message that you can imagine (Bailey and Borwein, 2013). For example, if I assign numeric values to the letters of "Pi Day" (P=16, I=9, D=4, A=1, Y=25), then the sequence "1694125" should appear somewhere in the decimal expansion of pi. I wrote a SAS program to search the decimal expansion of pi for the seven-digit "Pi Day" sequence. Here's what I found:

proc print noobs data=PiDigits(firstobs=4686485 obs=4686491);
   var Position Digit;

There it is! The numeric representation of "Pi Day" appears near the 4.7 millionth decimal place of pi. Other "messages" might not appear in the first 10 million digits, but this one did. Finding Shakespearian sonnets and plays will probably require computing more digits of pi than the current world record.

The digits of pi pass every test for randomness, yet pi is a precise mathematical value that describes the relationship between the circumference of a circle and its diameter. This dichotomy between "very random" and "very structured" is fascinating! Happy Pi Day to everyone!

Post a Comment

Matrix multiplication with missing values in SAS

Sometimes I get contacted by SAS/IML programmers who discover that the SAS/IML language does not provide built-in support for multiplication of matrices that have missing values. (SAS/IML does support elementwise operations with missing values.) I usually respond by asking what they are trying to accomplish, because mathematically matrix multiplication with missing values is not a well-defined operation. I get various answers. Some people want to exclude the missing values. Others want them propagated. Still others want them imputed.

This article discusses possible ways to multiply matrices when the matrices contain missing values. It defines a SAS/IML module that propagates missing values. The module can be used to score a linear model data that include missing values.

Linear algebra with missing values? Does it make sense?

SAS software supports scalar operations on missing values. Multiplication that involves a missing value results in a missing value. Additive operations exclude (skip) the missing value. Many SAS procedures, including PROC IML, handle missing data in statistical calculations.

Linear algebra is another story. A vector space is defined over an algebraic field, which is a set of numbers that support addition, subtraction, multiplication, and division. Every element of the field must have an additive inverse, and every nonzero element must have a multiplicative inverse (reciprocal). A missing value does not have either. There is no number that can be added to a missing value to get zero. Similarly, there is no number such that the product of the number and a missing value is 1.

Consequently, you can't include missing values in matrices and expect to preserve the usual laws of linear algebra. However, you can define new operations on numerical arrays that are reminiscent of matrix multiplication. In the rest of this article, A and B are matrices where the number of columns of A equals the number of rows of B. Read on to discover ways to compute a "matrix product" C = A*B when either matrix has a missing value.

Excluding rows with missing values

By far the most common way to handle missing values in a statistical analysis is to exclude them. See my previous blog post about how to perform listwise deletion of missing values in the DATA step and in SAS/IML.

Propagating missing values

In some software packages (including MATLAB and R), missing values are propagated by matrix multiplication. If the matrix A has a missing value anywhere in the ith row, the product A*B contains missing values everywhere in the ith row. Similarly, if B has a missing value anywhere in the jth column, the entire jth column of the product A*B contains missing values.

It is easy to define a SAS/IML module that implements this multiplication scheme: Use ordinary matrix multiplication on rows and columns that are free of missing values and put missing values everywhere else. The following function implements this multiplication method:

proc iml;
/* matrix "multiplication" where missing values are propagated */
start MVMult(A, B);
   C = j(nrow(A), ncol(B), .);
   rows = loc(countmiss(A, "ROW")=0);
   cols = loc(countmiss(B, "COL")=0);
   if ncol(rows)>0 & ncol(cols)>0 then 
      C[rows, cols] = A[rows,] * B[,cols];
A = {1 2 3,
     . 4 1,
    -1 0 1};
B = {1 2 -1,
     3 4  0,
     0 1  .};
C = MVMult(A,B);
print C;

The product matrix contains a missing value for every element of the second row because the A matrix has a missing value in the second row. The product matrix contains a missing value for every element of the third column because the B matrix has a missing value in the third column.

This a reasonable way to handle missing values when you are trying to score data according to a linear model. In that case, the matrix on the left side is an n x p data matrix (X). Each column of the right-hand matrix (B) contains p coefficients of a linear model. Often the right-hand matrix is a column vector. The matrix product P=X*B evaluates (scores) the linear model to obtain predicted values of the response. A missing value anywhere in the ith row of X indicates that you cannot predict a value for that observation. When scoring linear models, the right-hand matrix does not usually contain missing values, although PROC SCORE permits missing value for coefficients and treats them as 0.

The previous paragraph shows why propagating missing values makes sense when scoring linear models. However, I prefer not to call it matrix multiplication. For example, in true matrix multiplication, a product that involves the identity matrix results in the original matrix. That does not hold true when missing values propagate. Furthermore, in a product of three matrices for which the middle matrix contains missing values, every element of the product is missing, which limits the usefulness of this technique.

A1 = MVMult(A, I(3));   /* A*I ^= A  */
A2 = MVMult(I(3), A);   /* I*A ^= A  */
H = MVMult(A2, I(3));   /* every element of I*A*I is missing */
print A1, A2, H;

Skipping missing values

Some SAS customers have expressed interest in "skipping" missing values when they multiplying matrices. Recall that each element of a matrix product A*B is formed by multiplying elements of A and B and then summing those scalar products. It is reasonable to expect that multiplication of a missing value should result in a missing value, but that missing values are skipped in the summation, just as they are when using the SUM function.

You can define a SAS/IML module that implements this scheme, but this is not a good way to handle missing values. This scheme treats missing values as if they were zero. The following statements replace each missing value by 0, and then perform ordinary matrix multiplication. The product is the same as when you "skip" missing values during the summation:

A0 = A;  B0 = B;
A0[ loc(A=.) ] = 0;   /* replace missing values with 0 */
B0[ loc(B=.) ] = 0;
C0 = A0*B0;
print C0;

Replacing missing values with 0 is a bad idea. Substituting the value 0 is likely to bias whatever analysis you are trying to perform. If you want to impute the missing values, you should use an imputation scheme that makes sense for the data and for the analysis.


In conclusion, there are several ways to deal with multiplying matrices that contain missing values in SAS/IML software:

  • Use only complete cases of the data by deleting any observation that contains a missing value.
  • Propagate missing values by using the MVMult function in this blog post. This approach makes sense if you are evaluating a linear model on data that contain missing values.
  • Impute the missing values. There are many ways to impute missing values in SAS, but imputing them with the value 0 is not usually a good choice. Consequently, I do not recommend that you skip missing values during matrix multiplication, which is equivalent to substitution by 0.
Post a Comment

Writing data in chunks: Does the chunk size matter?

I often blog about the usefulness of vectorization in the SAS/IML language. A one-sentence summary of vectorization is "execute a small number of statements that each analyze a lot of data." In general, for matrix languages (SAS/IML, MATLAB, R, ...) vectorization is more efficient than the alternative, which is to run many statements (often inside a loop) that each analyze a small amount of data.

I usually think of vectorization as applying to computations. However, I recently realized that the same principle applies when writing data to a SAS data set. I needed to compute a large number of results and write them to a SAS data set. Because each result required a similar computation, I wrote a program that looked something like the following.

proc iml;
N = 1e5;                   /* 100,000 = number of observations to write */
x = j(1,5);
t0 = time();
call randseed(123);
create DS1 from x[colname=("x1":"x5")];   /* open data set for writing */
do i = 1 to N;
   call randgen(x, "Uniform");         /* replaces complex calculation */
   append from x;                      /* write 1 obs with 5 variables */
close DS1;                             /* close data set */
t1 = time() - t0;
print t1[F=5.3];

In the preceding program, the call to the RANDGEN function is used in place of a complicated computation. After running the program I was disappointed. Almost 7 seconds for a mere 100,000 results?! That's terrible performance! After studying the program I began to wonder whether the problem was that I was calling the APPEND statement many times, and each call does only a little bit of work (writes one observation). That is a classic vectorize-this-step situation.

I decided to see what would happen if I accumulated 1,000 results and then output those results in a single APPEND statement. My hope was that the performance would improve by calling the APPEND statement fewer times and writing more data with each call. The rewritten program looked similar to the following:

proc iml;
N = 1e5;                      /* total number of observations to write */
BlockSize = 1e3;              /* number of observations in each APPEND stmt */
NumIters = N/BlockSize;       /* number of calls to the APPEND stmt */
x = j(1,5);
t0 = time();
call randseed(123);
xChunk = j(BlockSize,5,0);
create DS2 from xChunk[colname=("x1":"x5")];
do i = 1 to NumIters;
   do j = 1 to BlockSize;
      call randgen(x, "Uniform");      /* replaces complex calculation */
      xChunk[j,] = x;                  /* accumulate results */
   append from xChunk;                 /* write 1,000 results */
close DS2;
t2 = time() - t0;
print t2[F=5.3];

Ahh! That's better! By calling the APPEND statement 100 times and writing 1,000 observations for each call, the performance of the program increased dramatically.


I wrote the second program so that the number of observations written by the APPEND statement (BlockSize) is a parameter. That enables me to run tests to examine the performance as a function of how many observations are written with each call. The following graph summarizes the situation.

The graph indicates that the major performance improvements occur when increasing the number of observations from 1 to 20. The curve flattens out after 50 observations. Beyond 100 observations there is no further improvement.

There are two lessons to learn from this experience:

  • Writing (and reading!) data in a matrix language is similar to other numerical operations. Try to avoid reading and writing small quantities of data, such as one observation at a time. Instead, read and write larger chunks of data.
  • You don't have to create huge matrices with gigabytes of data to realize the performance improvement due to vectorization. In this program, writing a matrix that has 100 rows is a vast improvement over writing one observation at a time.

The principal of vectorization tells us to execute a small number of statements that each analyze a lot of data. The same principle applies to reading and writing data in a matrix language. Avoid reading or writing one observation at a time. Instead, you can improve performance by ensuring that each I/O statement reads or writes a hundred or more observations.

Post a Comment

Create a custom PDF and CDF in SAS

In my previous post, I showed how to approximate a cumulative density function (CDF) by evaluating only the probability density function. The technique uses the trapezoidal rule of integration to approximate the CDF from the PDF.

For common probability distributions, you can use the CDF function in Base SAS to evaluate the cumulative distributions. The technique from my previous post becomes relevant if you need to compute the CDF of a distribution that is not built into SAS.

A recent question on a discussion forum was "How can I plot the PDF and CDF for [an unsupported] distribution?" It turns out that the distribution from the discussion forum has an analytical expression for the CDF, so I will use the Lévy distribution for this article. The density of the Lévy distribution is given by the following formula:


Evaluating a custom PDF in SAS

When evaluating any function in SAS, you need to make sure that you understand the domain of the function. For the Lévy distribution, the support is the semi-infinite interval [μ, ∞). The location parameter μ determines the left-hand endpoint for the support of the distribution. The scale parameter c must be a positive value. The limit as x → μ+ is 0. The following SAS/IML module evaluates the Lévy density function. The function is vectorized so that it returns a vector of densities if you call it with a vector of x values. The implementation sets default parameter values of μ=0 and c=1.

proc iml;
start LevyPDF(x, mu=0, c=1);
   pi = constant("pi");
   f = j(nrow(x), ncol(x), 0);         /* any values outside support are 0 */
   idx = loc(x>mu);                    /* find valid values of x */
   if ncol(idx)>0 then do;             /* evaluate density for valid values */
      v = x[idx];         
      f[idx] = sqrt(c/(2*pi)) # exp(-c/(2*(v-mu))) /(v-mu)##1.5; 
   return( f );
/* plot PDF on [0, 5] */
x = do(0,5,0.02);
pdf = LevyPDF(x);
title "Levy PDF, mu=0, c=1";
call Series(x, pdf) grid={x y};

Evaluating a custom CDF in SAS: The quick-and-dirty way

As shown in my previous post, you can approximate a cumulative density function (CDF) by using the trapezoidal rule to add up the area under the PDF. The following statements implement this approximation. The approximation is accurate provided that the distance between adjacent x values is small, especially when the second derivative of the PDF is large.

start CumTrapIntegral(x,y);      /* cumulative sums of trapezoidal areas */
   N = nrow(colvec(x));
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( 0 // cusum(dx # meanY) );
CDFApprox = CumTrapIntegral(x, pdf);   
title "Levy CDF, mu=0, c=1";
call Series(x, CDFApprox) yvalues=do(0,0.7,0.1) grid={x y};

Evaluating a custom CDF in SAS: The slow-and-steady way

Although the trapezoidal approximation of the CDF is very fast to compute, sometimes slow and steady wins the race. If you want to evaluate the CDF as accurately as possible, or you only need the CDF at a few locations, you can use the QUAD subroutine to numerically integrate the PDF.

To use the QUAD subroutine, the integrand must be a function of a single variable (x). Any parameters (such as μ and c) must be specified by using a GLOBAL clause. The following statements define a function named _Func that calls the LevyPDF function with the value of the global parameters g_mu and g_c. From the plot of the CDF function, it looks like the median of the distribution is approximately at x=2.2. The following statements evaluate the integral of the Lévy PDF on the interval [0, 2.2]. The integral is approximately 0.5.

start _Func(x) global(g_mu, g_c);
   return( LevyPDF(x, g_mu, g_c) );
/* looks like median is about x=2.2 */
g_mu=0; g_c=1;
call quad(result, "_Func", {0 2.2});
print result;

The call the QUAD subroutine gave the expected answer. Therefore you can create a function that calls the QUAD subroutine many times to evaluate the Lévy cumulative distribution at a vector of values:

start LevyCDF(x, mu=0, c=1) global(g_mu, g_c);
   cdf = j(nrow(x), ncol(x), 0);             /* allocate result */
   N = nrow(x)*ncol(x);
   g_mu=mu; g_c=c;
   do i = 1 to N;                            /* for each x value */
      call quad(result, "_Func", 0 || x[i]); /* compute integral on [0,x] */
      cdf[i] = result;
CDF = LevyCDF(x);
title "Levy CDF, mu=0, c=1";
call Series(x, CDF) yvalues=do(0,0.7,0.1) grid={x y};

The graph is visually indistinguishable from the previous CDF graph and is not shown. The CDF curve computed by calling the LevyCDF function is within 2e-5 of the approximate CDF curve. However, the approximate curve is computed almost instantaneously, whereas the curve that evaluates integrals requires a noticeable fraction of a second.

In summary, if you implement a custom probability distribution in SAS, try to write the PDF function so that it is vectorized. Pay attention to special points in the domain (such as 0) that might require special handling. From the PDF, you can create a CDF function. You can use the CumTrapIntegral function to evaluate an approximate CDF from values of the PDF, or you can use the QUAD function to create a computationally expensive (but more accurate) function that integrates the PDF.

Post a Comment

An easy way to approximate a cumulative distribution function

Evaluating a cumulative distribution function (CDF) can be an expensive operation. Each time you evaluate the CDF for a continuous probability distribution, the software has to perform a numerical integration. (Recall that the CDF at a point x is the integral under the probability density function (PDF) where x is the upper limit of integration. I am assuming that the PDF does not have a closed-form antiderivative.) For example, the following statements compute and graph the CDF for the standard lognormal distribution at 121 points in the domain [0,6]. To create the graph, the software has to compute 121 numerical integrations.

proc iml;
x = do(0, 6, 0.05);
CDF = cdf("Lognormal", x);
title "Standard Lognormal CDF";
call series(x, CDF) grid={x y};


An integral is an accumulation of slices

In contrast, evaluating the PDF is relatively cheap: you only need to evaluate the density function at a series of points. No integrals reqired. This suggests a clever little approximation: Instead of calling the CDF function many times, call the PDF function and use the cumulative sum function (CUSUM) to add together the areas of many slices of the density. For example, it is simple to modify the trapezoidal rule of integration to return a cumulative sum of the trapezoidal areas that approximate the area under a curve, as follows:

start CumTrapIntegral(x,y);
   N = nrow(colvec(x));
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( 0 // cusum(dx # meanY) );

With this function you can approximate the (expensive) CDF by using a series of (cheap) PDF evaluations, as follows:

pdf = pdf("Lognormal", x);             /* evaluate PDF */
CDFApprox = CumTrapIntegral(x, pdf);   /* cumulative sums of trapezoidal areas */

How good is the approximation to the CDF? You can plot the difference between the values that were computed by the CDF function and the values that were computed by using the trapeoidal areas:

Difference = CDF - CDFApprox`;
title "Difference between Lognormal CDF and Aproximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that for the lognormal distribution evaluated on [0,6], the approximate CDF differs by less than 0.001. For most values of x, CDF(x) differs from the cumulative trapezoidal approximation by less than 0.0001. This result uses PDF values with a step size of 0.05. In general, smaller step sizes lead to smaller errors, and the error for the trapezoidal rule is proportional to the square of the step size.

Adjustments to the approximation

The lognormal distribution is exactly zero for x < 0, which means that the integral from minus infinity to 0 is zero. However, for distributions that are defined on the whole real line, you need to make a small modification to the algorithm. Nanmely, if you are approximating a cumulative distribution on the interal [a, b], you need to add the CDF value at a to the values returned by the CumTrapIntegral function. For example, the following statements compare the values of the standard normal CDF on [-3, 3] with the cumulative trapezoidal sums:

/* for distributions on [-infinity, x], correct by adding constant */
x = do(-3, 3, 0.05);
CDF = cdf("Normal", x);              /* gold standard */
pdf = pdf("Normal", x);
CDFApprox = CumTrapIntegral(x, pdf); /* cumulative sums */
CDFAdd = cdf("Normal", -3);          /* integral on (-infinity, -3) */
CDFApprox = CDFAdd + CDFApprox`;     /* trapezoidal approximation */
Difference = CDF - CDFApprox;
title "Difference between Normal CDF and Aproximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that the approximation is within ±5e-5 of the values from the CDF function.

When should you use this approximation?

You might be asking the same question that my former math students would ask: "When will I ever use this stuff?" One answer is to use the approximation when you are computing the CDF for a distribution that is not supported by the SAS CDF function. You can implement the CDF by using the QUAD subroutine, but you can use the method in this post when you only need a quick-and-dirty approximation to the CDF.

In my next blog post I will show how to graph the PDF and CDF for a distribution that is not built into SAS.

Post a Comment

Avoid loops, avoid the APPLY function, vectorize!

Last week I received a message from SAS Technical Support saying that a customer's IML program was running slowly. Could I look at it to see whether it could be improved?

What I discovered is a good reminder about the importance of vectorizing user-defined modules. The program in this blog post is not the customer's, but it illustrates the main ideas that I'd like to discuss. Consider the following SAS/IML module:

proc iml;
start MinCDF(x, y);
   a = min(x, y);
   p = cdf("Normal", a);
   z = min(p, 0.5);

The function takes two scalar arguments, x and y, and does the following:

  • It assigns the smaller of x and y to the variable a.
  • It calls the CDF function to compute the probability that a normal random variable will be observed to have a value less than or equal to a. Numerically, this requires evaluating the area under the normal density curve from minus infinity to a.
  • It returns that probability or 0.5, whichever is smaller.

The details of the function are not important. The important facts are

  • It uses the MIN function to return the smaller of two scalar values.
  • It performs a computationally expensive operation. In this case, the CDF call.
  • It is a scalar function: each argument represents a scalar value and it returns a scalar value.


The function is not vectorized, which means that you cannot pass in two vectors for x and y and get back a vector of results. (Try it!) The problem is the MIN function, which returns the smallest value (a scalar) from among all elements of x and y.

If you want to call the function on many pairs of x and y values, you have two options: call the function in a loop or rewrite it so that it will return a vector of results when passed vectors for input arguments. For example, you might want to evaluate the function on a fine grid of values in order to visualize the function as a surface over the xy-plane, as shown to the left. The following statements evaluate the function in a loop. The loop calls the function more than 360,000 times, and takes about a second:

xy = ExpandGrid( do(-3,3,0.01), do(-3,3,0.01) ); /* grid of (x,y) values */
x = xy[,1]; y = xy[,2];
z = j(nrow(xy),1);       /* allocate vector for results */
t0 = time();
do i = 1 to nrow(xy);
   z[i] = MinCDF(x[i], y[i]);
t_loop = time() - t0;

The customer's program did something equivalent. The customer called the APPLY function to avoid writing a loop. If you have a function that takes scalar-valued arguments and returns a scalar argument, you can use the APPLY function to evaluate the scalar-valued function at many input values, as follows:

z = apply("MinCDF", x, y);

The APPLY function requires about the same amount of time to run as the loop. The APPLY function merely prevents you from writing the loop.

For this example, you can call the MinCDF function 360,000 in about one second. The customer's function was more computationally expensive, and the customer was using the APPLY function to call the function millions of times. Consequently, the program was taking a long time to run.

Experienced SAS/IML programmers are familiar with the potential advantages of vectorizing a computation. For this computation, the key to vectorizing the function was to eliminate the call to the MIN function and instead call the elementwise minimum operator (><), which will return the vector of minimums when x and y are vectors. The new vectorized module is below:

start VecMinCDF(x,y);
   a = x >< y;                 /* elementwise minimum */
   p = cdf("Normal", a);       /* vector of probabilities */
   z = p >< 0.5;               /* elementwise min of cdf and 0.5 */
t0 = time();
z = VecMinCDF(x, y);
t_vec = time() - t0;

Calling the vectorized function on two vectors of size 360,000 requires 0.01 seconds, which is a hundredfold speedup. For the customer's example, the speedup was even more dramatic. Notice in this case that rewriting the function was trivial: only the call to the MIN function was replaced.

Someone asked me the other day if it always possible to vectorize a computation. Unfortunately, the answer is "no." For some iterative algorithms, the second element of a result depends on the value of the first element. Algorithms of that sort cannot be vectorized, although you might be able to vectorize parts of the computation.

This example teaches an important lesson: although the APPLY function makes it easy to call a scalar-valued function on a vector of arguments, the convenience might come at a price. The APPLY function is essentially equivalent to calling a scalar-valued function in a loop, which can lead to poor performance. If you want to call a function with scalar arguments many times with different input values, investigate whether it is possible to rewrite the function so that it is vectorized. The speedup can be dramatic.

Post a Comment

Plotting multiple time series in SAS/IML (Wide to Long, Part 2)

I recently wrote about how to overlay multiple curves on a single graph by reshaping wide data (with many variables) into long data (with a grouping variable). The implementation used PROC TRANSPOSE, which is a procedure in Base SAS.

When you program in the SAS/IML language, you might encounter data in a matrix with many columns that you need to reshape. The SAS/IML language provides several functions for reshaping data:

In the PROC TRANSPOSE example, the data were reshaped by reading the data set in row-major order: The kth observation for the first variable is followed by the kth observation for the second, third, and fourth variables, and so forth. Consequently, the default behavior of the following WideToLong module is also to interleave the column values. However, the module also enables you to specify that the data should be stacked columnwise. That is, the reshaped data consists of all observations for the first variable, followed by all observations for the second variable, and so forth.

Some of the input variables might be numeric whereas others might be character, so the following module handles the X, Y, and grouping variables in separate matrices. The module returns three matrices: The long form of the Y matrix, the replicated X values, and the replicated grouping (or ID) variable. By convention, output variables are listed first in the argument list, as follows:

proc iml;
start WideToLong(OutY, OutX, OutG,                       /* output vars */
           Y, X=T(1:nrow(Y)), Group=1:ncol(Y), stack=0); /* input vars  */
   p = ncol(Y);
   N = nrow(Y);
   cX = colvec(X); 
   cG = colvec(Group);
   if stack then do;         /* stack Y in column-major order  */
      OutY = shapecol(Y, 0, 1);        /* stack cols of Y      */
      OutX = repeat(cX, p);            /* replicate x          */
      OutG = colvec(repeat(cG, 1, N)); /* duplicate Group ID   */
   else do;      /* DEFAULT: interleave Y in row-major order   */
      OutY = shape(Y, 0, 1);           /* interleave cols of Y */
      OutX = colvec(repeat(cX, 1, p)); /* replicate x          */
      OutG = repeat(cG, N);            /* duplicate Group ID   */

Let's see how the module works by testing it on the same data as in the previous blog post. For the Sashelp.Tourism data, each observation is a year in the range 1966–1994. The year is stored in a variable named Year. (Technical note: The Year data is stored as a SAS date value, so the example uses the YEAR function to convert the date value to an integer.) The exchange rates for the British pound and the Spanish peseta are stored in separate variables named ExUK and ExSp, respectively. The following statements read the data into SAS/IML matrices, and then convert the data from wide to long form by calling the WideToLong routine:

YNames = {ExUK ExSp};
GroupValues = {"UK" "Spain"};
XName = "Year";
use Sashelp.Tourism;
   read all var YNames into Y;
   read all var XName into X;
close Sashelp.Tourism;
/* For these data, X is a SAS date value. Convert to integer. */
X = year(X);  
run WideToLong(ExchangeRate, Year, Country, /* output (long) */
               Y, X, GroupValues);          /* input (Y is wide) */

As shown in the previous blog post, a useful application of reshaping data is that it is easy to overlay multiple line plots on a single graph. In the SAS/IML language, the SERIES statement creates series plots and supports the GROUP= option for overlaying multiple lines. The following statement creates a time series of the two exchange rates between 1966 and 1994:

title "Exchange Rate Indices";
call series(Year, ExchangeRate) group=Country;

By default, the data are reshaped in row-major order. To reshape in column-major order, specify the STACK=1 option, as follows:

run WideToLong(ExchangeRate, Year, Country, /* output (long) */
               Y, X, GroupValues) stack=1;  /* input (Y is wide) */
Post a Comment

Plotting multiple series: Transforming data from wide to long

Data. To a statistician, data are the observed values. To a SAS programmer, analyzing data requires knowledge of the values and how the data are arranged in a data set.

Sometimes the data are in a "wide form" in which there are many variables. However, to perform a certain analysis requires that you reshape the data into a "long form" where there is one identifying (ID) variable and another variable that encodes the value. I have previously written about how to convert data from wide to long format by using PROC TRANSPOSE.

Recently I wanted to plot several time series in a single graph. Although PROC SGPLOT supports multiple SERIES statements, it is simpler to use the GROUP= option in a single SERIES statement. The GROUP= option makes it convenient to plot arbitrarily many lines on a single graph. This article describes how to reshape the data so that you can easily plot multiple series in a single plot or in a panel of plots.

Series plots for wide data

The Sashelp sample library includes a data set named Tourism that includes an index of the exchange rates for the British pound and the Spanish peseta versus the US dollar for the years 1966–1994. Each observation is a year. The year is stored in a variable named Year; the exchange rates are stored in separate variables named ExUK and ExSp, for the UK and Spanish exchange rates, respectively. You can use separate SERIES statements to visualize the exchange rates, as follows:

proc sgplot data=Sashelp.Tourism;
   series x=Year y=ExUK / legendlabel="UK Pound";
   series x=Year y=ExSp / legendlabel="Spanish Pesetas";
   yaxis label="Exchange Rate vs Dollar";

It is not difficult to specify two SERIES statements, but suppose that you want to plot 20 series. Or 100. (Plots with many time series are sometimes called "spaghetti plots.") It is tedious to specify many SERIES statements. Furthermore, if you want to change attributes of every line (thickness, transparency, labels, and so forth), you have to override the default options for each SERIES statement. Clearly, this becomes problematic for many variables. You could write a macro solution, but the next section shows how to reshape the data so that you can easily plot multiple series.

From wide to long in Base SAS

A simpler way to plot multiple series is to reshape the data. The following call to PROC TRANSPOSE converts the data from wide to long form. It manipulates the data as follows:

  1. Rename ExUK and ExSp variables to more descriptive names (UK and Spain). This step is optional.
  2. Transpose the data. Each value of the Year variable is replicated into two new rows. The values of the UK and Spain variables are interleaved into a new variable named ExchangeRate.
  3. A new column named Country identifies which exchange rate is associated with which of the original variables.
proc transpose 
   data=Sashelp.Tourism(rename=(ExUK=UK ExSp=Spain)) /* 1 */
   out=Long(drop=_LABEL_ rename=(Col1=ExchangeRate)) /* 2 */
   name=Country;                                     /* 3 */
   by Year;                                          /* original X var  */
   var UK Spain;                                     /* original Y vars */

The original data set contained 29 rows, one X variable (Year) with unique values, and two response variables (ExUK and ExSp). The new data set contains 58 rows, one X variable (Year) with duplicated values, one response variable (ExchangeRate), and an discrete variable (Country) that identifies whether each exchange rate is for the British pound or for the Spanish Peseta. For the new data set, you can recreate the previous plot by using a single SERIES statement and specifying the GROUP=COUNTRY option:

proc sgplot data=Long;
   label ExchangeRate="Exchange Rate vs Dollar" Country="Country";
   series x=Year y=ExchangeRate / group=Country; 
   yaxis label="Exchange Rate";

The resulting graph is almost identical to the one that was created earlier. The difference is that now it is easier to change the attributes of the lines, such as their thickness or transparency.

Create a panel of plots

If you want to see each time series in its own plot, you can use the SGPANEL procedure, which also requires that the data be in the wide form:

proc sgpanel data=Long;
   label ExchangeRate="Exchange Rate vs Dollar" Country="Country";
   panelby Country;
   series x=Year y=ExchangeRate;

Notice that you do not need to sort the data by the Country variable in order to use the PANELBY statement. However, if you want to conduct a BY-group analysis of these data, then you can easily sort the data by using PROC SORT.

Which do you prefer: multiple SERIES statements or a single SERIES statement that has the GROUP= option? Or maybe you prefer paneled plots? There are advantages to each. Leave a comment.

Post a Comment

Complete cases: How to perform listwise deletion in SAS

SAS procedures usually handle missing values automatically. Univariate procedures such as PROC MEANS automatically delete missing values when computing basic descriptive statistics. Many multivariate procedures such as PROC REG delete an entire observation if any variable in the analysis has a missing value. This is called listwise deletion or using complete cases. Some procedures such as PROC FREQ and PROC CORR have options that control the way that missing values are handled in the statistical analysis.

Because SAS procedures internally handle which observations are kept or deleted, there is not usually a need for SAS programmers to create a data set that contains all of the complete cases. In contrast, SAS/IML programmers often have to worry about missing values. When implementing a multivariate algorithm, the first step is often to delete rows of the data matrix that contain missing values. This article shows how to perform listwise deletion by using the DATA step and PROC IML.

The case of missing values in numerical data is the most important case, so this article uses the following data set. The first, fourth, and fifth observations represent complete cases.

data A;
input x1-x5;
1 2 3 4 5
2 2 . 4 5
3 3 3 3 .
4 1 3 2 5
5 2 1 3 4
6 . . . 7
. 8 9 0 .
. . . . .
9 8 7 . .

Output complete cases by using the DATA step

If the variables are all numeric, you can use the NMISS function to determine which observations have missing values. You can use the subsetting IF statement to output only those observations that do not contain missing values, as follows:

data CompleteCases;
   set A;
   if nmiss(of _NUMERIC_)=0;    /* output complete cases for all numeric vars */
proc print noobs; run;

In the previous example, all numeric variables are used to determine the complete cases. You can also specify a list of variables to the NMISS function, as follows:

   if nmiss(x1, x2, x5)=0;       /* specify list of numeric vars */

For character variables, you can use the CMISS function and the _CHARACTER_ keyword. In fact, the CMISS function also counts numeric variables, so it enables you to handle numeric and character variables with a single call. The following DATA step outputs the complete cases for all variables in the Sashelp.Heart data set:

data CompleteCases;
  set Sashelp.Heart;
  if cmiss(of _ALL_)=0;  /* complete cases for all vars */

Again, you can specify a list of variable names to the CMISS function if you need to analyze on certain variables.

Complete cases in the SAS/IML language

In the SAS/IML language, you can use matrix subscripts to extract specific rows of a matrix. Therefore, the following function returns the indices for rows that do not contain missing values. The COUNTMISS function is used to count the number of missing values in each row. The LOC function returns the index for the rows that do not contain missing values, as follows:

proc iml;
/* return rows that have no missing values */
start CompleteCases(X);
   return( loc(countmiss(X, "row")=0) );
use A;  read all var _NUM_ into X;  close A;
nonMissingRows = CompleteCases(X);
print nonMissingRows;

The CompleteCases function makes it trivial to extract rows of a matrix that are complete cases. Because the COUNTMISS function accepts both numerical and character matrices, the CompleteCases function works for all matrices. The following function encapsulates the extraction of complete cases:

/* exclude any row with a missing value */
start ExtractCompleteCases(X);
   idx = CompleteCases(X);
   if ncol(idx)>0 then
      return( X[idx, ] );
      return( {} ); 
Y = ExtractCompleteCases(X);
if ncol(Y)>0 then print Y;
else print "No complete cases";

Have you ever needed to physically construct a matrix or data set that has only complete cases? What was the application? Leave a comment.

Post a Comment

Monitor the progress of a long-running SAS/IML program

When you have a long-running SAS/IML program, it is sometimes useful to be able to monitor the progress of the program. For example, suppose you need to computing statistics for 1,000 different data sets and each computation takes between 5 and 30 seconds. You might want to output a message after every 100th computation so that you know at a glance whether the program is close to finishing. (You might also want to check out some efficiency tips and ways to vectorize your program.)

Now some of you might be thinking, "What's the problem? Just put a PRINT statement inside the loop!" Right idea, but it doesn't work like you might expect. For efficiency, SAS/IML loops are compiled and executed in a very efficient manner. One of the consequences of this efficiency is that ODS output is not flushed until the entire DO loop has completed! Run the following program (which takes 10 seconds) and notice that you get NO output until the loop completes, at which point you will see all 10 messages displayed at once:

proc iml;
do i = 1 to 1000;
   if mod(i,100)=0 then do;
      print "Iteration= " i;
   call sleep(1, 0.01);  /* delay for 1/100th second */
print "Execution finished";

So how can you force SAS/IML to display information while the loop is executing? I know two ways. In the IMLPlus language (available in the SAS/IML Studio application), you can use the PRINTNOW statement to flush the ODS output. In PROC IML, you can use the SUBMIT and ENDSUBMIT statements to write messages to the SAS Log.

The IMLPlus solution: The PRINTNOW statement

In the SAS/IML Studio environment, the IMLPlus language supports the PRINTNOW statement. The PRINTNOW statement tells the IMLPlus program to retrieve and display any pending output from the SAS Workspace server. In the IMLPlus language, the following statement prints the iteration history while the DO loop is executing:

do i = 1 to 1000;
   if mod(i,100)=0 then do;
      print "Iteration= " i;
      printnow;          /* IMLPlus statement: works only in SAS/IML Studio */
   call sleep(1, 0.01);  /* delay for 1/100th second */

The PROC IML solution: The SUBMIT/ENDSUBMIT statements

If you are running PROC IML as part of a larger SAS program, you can use the SUBMIT/ENDSUBMIT statements to write messages to the SAS Log. Recall that you can pass values from SAS/IML matrices into SAS statements. If you include the name of a SAS/IML variable on the SUBMIT statement, the contents of that variable are substituted into the SAS code before the SUBMIT block is sent to SAS. In the following example, the value of the counter i is sent into a SUBMIT block, and the %PUT macro statement is used to print the iteration value to the SAS Log:

proc iml;
do i = 1 to 1000;
   if mod(i,100)=0 then do;
      submit i;
      %put Iteration= &i;
   call sleep(1, 0.01);  /* delay for 1/100th second */
print "Execution finished";

The previous program displays the iteration count in the SAS Log while the program is running.

Do you have long-running programs in SAS? Have you developed a clever way to monitor their progress while they run? Leave a comment about your technique.

Post a Comment