Visualizing the causes of airline crashes


There has been a spate of recent high-profile airline crashes (Malaysia Airlines, TransAsia Airways, Germanwings,...) so I was surprised when I saw a time series plot of the number of airline crashes by year, which indicates that the annual number of airline crashes has been decreasing since 1993. The data show that the annual number of crashes in recent years is about half of the number from 20 years ago.

A flow diagram that visualizes airline crash data

The series plot appeared in Significance magazine as part of an interview with designer David McCandless, who produces infographics about data. McCandless has recently published a new book titled Knowledge is Beautiful, which includes the following infographic (click to enlarge). The time series plot appears in the upper left. The main portion of the infographic presents information about the causes of commercial air crashes from 1993–2013. In addition to the cause of the crash, McCandless shows the "phase of flight" (take-off, en route, landing) during which the disaster occurred.


In general, infographics are designed to appeal as well as to inform. In the article, McCandless says that he is trying to show "the links, connections and relations between things." His goal is to "mediate between the data and the audience" so that people "pay attention" to the data. Many statisticians share that goal, with a possible difference being that statistical graphics attempt to objectively represent the data so that the data speak for themselves.

McCandless's goal is to create beautiful infographics (and he has succeeded!), but I would like to propose an alternative visualization of these data that trades beauty for compactness.

I have two main issues with McCandless's "flow diagram," which is reminiscent of a Sankey diagram. The first is that it uses a lot of ink to display the relationship between "Cause" and "Phase," which are each five-level categorical variables. A standard frequency analysis would display these data in a 5x5 table or visualize them with a tile-based plot with 25 tiles. I don't think that Edward Tufte would approve of McCandless's plot. Unlike Tufte's favorite plot—Minard's "Napoleon's March to Moscow," which uses a similar design—there is no temporal or geographic change for these data. The "flow" through the various "pipes" merely represents conditional frequencies.

My second objection is simply that some of the blocks in the diagram are the wrong sizes. The area of the blocks is supposed to be proportional to the number of crashes in each category. Notice that the bottom row is properly scaled: If you stack the "take-off" and "en route" boxes, which account for 50% of the cases, they approximately equal the area of the "landing" box. However, some boxes are not proportional to the frequency that they represent:

  • The top block represents 100% of the crashes, but its area is too big compared to the cumulative areas of the blocks in the "Causes" or "Phase" row.
  • The "Human Error" block (48%) is too big within its row. The other blocks on the "Causes" row account for 51% of the cases, so the sum of their areas should be greater than the area of the "Human Error" block.
  • The "Human Error" block (48%) is too big across rows. It is much larger than the "Landing" block in the bottom row, which represents 49% of the cases.

Visualizing airline crashes with a mosaic plot

At the risk of building a glass house after hurling those stones, I suggest that a mosaic plot is a standard statistical graphic that can display the same air crash data. You can download the data from McCandless's web site. The data need a little cleansing before you can analyze them, and I have provided my SAS program that creates the mosaic plot and other related plots. McCandless's diagram uses data through mid-2013, whereas the mosaic plot uses data through March 2015.

Many SAS procedures use ODS statistical graphics to display graphs as readily as they display tables. For example, the following call to PROC FREQ creates a 5x5 frequency table (click to see the table) and a mosaic plot:

proc freq data=Crash order=data;
tables Cause*Phase / plots=(mosaic(square));
run;

In a mosaic plot, the area of each tile is proportional to the number of cases. I have ordered both variables according to frequency, whereas McCandless ordered the "Phase" variable temporally (standing, take-off, en route, landing). I did this so that the most important causes and phases appear in the first row (bottom) and first column (left). The relative widths of the columns indicate the relative frequency of the one-way frequencies for the "Phase" variable. Colors connect the categories for the "Cause" variable.

The mosaic plot is far from perfect. It is hard to visually trace the categories for the "Cause" variable. For example, finding the "mechanical" category for each flight phase requires some eyeball gymnastics. Also, although the mosaic plot makes it possible to visualize the proportions of crashes for each cause and phase, the absolute frequencies are not available on the graph by default. The frequency counts are available in the 5x5 table that PROC FREQ produces. Alternatively, with additional effort you can create a mosaic plot that includes frequency counts for the largest mosaic tiles, which I think is far superior to the default display.

What do you think? Do you prefer the mosaic plot that is produced automatically by PROC FREQ, the one with frequency counts, or do you prefer McCandless's flow diagram? Which helps you to understand the causes of airline crashes? Feel free to adapt my SAS program and create your own alternative visualization!

Post a Comment

On the number of permutations supported in SAS software

There's "big," and then there is "factorial big."

If you have k items, the number of permutations is "k factorial," which is written as k!. The factorial function gets big fast. For example, the value of k! for several values of k is shown in the following table. You can use a custom algorithm to compute accurate, extended-precision values of the factorial function for k>18.


Base SAS has several functions for working with permutations, including the ALLPERM function (which generates all permutations of k items) and the RANPERM function (which generates a random permutation of k items). The documentation for the ALLPERM function mentions that you can specify no more than 18 items. That is, it only enables you to compute all permutations when k ≤ 18.

A SAS programmer recently asked why k=18 is the limit in these factorial computations. At first glance, 18 does seem to be an unusual choice. Why not k=16, since computer scientists love powers of 2? Or why not k=20, since most of us count in base 10?

The explanation is that k=18 is the largest value for which k! is smaller than the largest integer that can be represented exactly in an 8-byte SAS numeric variable. The first argument to the ALLPERM function is an index that enumerates the permutations, and this argument cannot be represented exactly when k>18. You can use the CONSTANT function in SAS to find the value of the largest exactly representable 8-byte integer:

data _null_;
Fact18 = fact(18);                 /* 18! */
ExactInt = constant("ExactInt");   /* largest exact 8-byte integer */
put Fact18=;
put ExactInt=;
run;
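As a cross-check outside of SAS, the following Python sketch (mine, not part of the SAS documentation) verifies that 18! is the largest factorial that fits below 2^53, the largest integer such that all smaller integers are exactly representable in an 8-byte floating-point number:

```python
import math

EXACT_INT = 2**53   # largest integer k such that all integers <= k are exact in an 8-byte double

fact18 = math.factorial(18)   # 6,402,373,705,728,000
fact19 = math.factorial(19)   # 121,645,100,408,832,000

print(fact18 <= EXACT_INT)    # True: 18! is exactly representable
print(fact19 <= EXACT_INT)    # False: 19! exceeds the exact-integer limit
```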

From a programming perspective, having a limit also prevents us programmers from accidentally creating a loop that iterates a billion billion times (which is 20!). When dealing with permutations and combinations, implementing a smart algorithm is almost always preferable to a brute-force computation that iterates over every possible permutation.

Do you have any tales of woe from when you inadvertently tried to compute with a large combinatorial number? Leave a comment.

Post a Comment

Vectors that have a fractional number of elements

The title of this article makes no sense. How can the number of elements (in fact, the number of anything!) not be a whole number? In fact, it can't. However, the title refers to the fact that you might compute a quantity that ought to be an integer, but is not an integer when computed by using floating-point arithmetic.

Most programmers are aware of examples of finite-precision numerical arithmetic that seem to violate mathematical truths. One of my favorite examples is that (0.1 + 0.1 + 0.1) ≠ 0.3 in finite precision arithmetic:

data _null_;
x = 0.1 + 0.1 + 0.1;
y = 0.3;
if x=y then put "x = y";
else put "x is not equal to y";
run;

This is a classic example from computer science that shows why you should avoid testing floating-point values for equality. Furthermore, if you add 0.1 a multiple of 10 times, you will not obtain an integer. This example is not unique, so you should use a truncation function such as ROUND, INT, FLOOR, or CEIL if you need to ensure that a floating-point computation results in an integer.
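The same behavior occurs in any language that uses IEEE double-precision arithmetic. Here is a Python sketch of the same example, along with the tolerance-based comparison that avoids the problem:

```python
x = 0.1 + 0.1 + 0.1
y = 0.3
print(x == y)               # False in double-precision arithmetic
print(abs(x - y) < 1e-12)   # True: compare within a tolerance instead

# adding 0.1 ten times does not give an exact integer, either
s = sum([0.1] * 10)
print(s == 1.0)             # False
print(round(s) == 1)        # True after rounding
```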

With that example in mind, let's return to the title of this blog post. A SAS programmer asked on the SAS Support Communities why a SAS/IML vector had a smaller number of elements than he had asked for. After looking at his computation, I realized that he had computed the number of elements by summing a series of floating-point values. Mathematically, the sum should have been 7, but in floating-point arithmetic the value was 6.99999.

So what happens if you ask PROC IML for a vector with 6.99999 elements? The software truncates the number, and you get a vector with 6 elements, as shown by the following program:

proc iml;
n = 6.99999;
v = j(n, 1);    /* allocate a vector with n elements */
print "The vector has " (nrow(v)) " elements";

A more accurate title for this article might be "What do you get if you ask for a vector that has a fractional number of elements?" The answer is that you get a vector where the number of elements is the truncation of the number that you specified. If you ever find yourself in this situation, you can use the ROUND function to round, rather than truncate, the non-integer value.
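If you want to experiment with this behavior outside of SAS, the following Python sketch shows the analogous truncation and the rounding fix:

```python
n = 6.99999        # a "number of elements" computed in floating point

print(int(n))      # 6: truncation silently drops the fractional part
print(round(n))    # 7: rounding recovers the intended integer
```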

There are other operations in SAS (for example, subscripting an array) that similarly truncate non-integer values. However, there is no need to fear an "unexpected truncation" because most programmers do not sum up fractional values when they specify indices and dimensions. Of course, when you use integer-valued functions such as NROW, NCOL, and LENGTH to specify a dimension, then you get what you expect.

In closing, analyze this DATA step that intentionally uses fractional values to index into an array. Do you think it runs? What values do you think are assigned to the array? Is this example interesting or an abomination?

data A(drop=h);
array x[5];
/* index variable takes on fractional values */
do h = 1 to 5 by 0.5;  
   x[h] = h;
end;
run;
proc print; run;
Post a Comment

Finding observations that match a target value

Imagine that you have one million rows of numerical data and you want to determine if a particular "target" value occurs. How might you find where the value occurs?

For univariate data, this is an easy problem. In the SAS DATA step you can use a WHERE clause or a subsetting IF statement to find the rows that contain the target value. In the SAS/IML language you can use the LOC function to find the rows.

For multivariate data, you can use a similar approach, except the code becomes messier. For example, suppose that you have a data set that contains five numeric variables and you want to test whether the 5-tuple (1, 2, 3, 4, 5) appears in the data. In the DATA step you might write the following statement:

if x1=1 & x2=2 & x3=3 & x4=4 & x5=5 then ...;

A general mathematical principle is that a "target problem" is equivalent to a root-finding problem. The standard mathematical trick is to subtract the target value and then find zeros of the resulting set of numbers. This trick provides the basis for writing a general SAS/IML program that is vectorized and that can handle an arbitrary number of variables. The following statements generate a matrix that has one million rows and five columns. Each element is an integer 0 through 9.

proc iml;
call randseed(123);
x = floor( 10*randfun({1e6 5}, "Uniform") ); /* matrix of values 0-9 */

Suppose that you want to find whether there is a row that matches the target value {1 2 3 4 5}. The following statements subtract the target value from the data and then find all rows for which the difference is the zero vector:

target = {1 2 3 4 5};
y = x - target;             /* match target ==> y={0 0...0}   */
count = (y=0)[,+];          /* count number of 0s in each row */
idx = loc(count=ncol(target));      /* rows that match target */
if ncol(idx)>0 then 
   print "Target pattern found in " (ncol(idx)) " rows";
else print "Target pattern not found";

The variable idx contains the row numbers for which the pattern occurs. For this random integer matrix, the target pattern appeared four times. Other random matrices might have six, 12, or 15 rows that match the target. It is easy to show that in a random matrix with one million rows, the target value is expected to appear 10 times.
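The expected count is a quick computation. As a sketch (in Python, for readers outside SAS): there are 10^5 equally likely 5-tuples, so one million random rows should contain about 10 matches:

```python
n_rows = 10**6       # rows in the random matrix
n_cols = 5           # columns, each an independent uniform digit 0-9
n_values = 10        # possible values per cell

n_patterns = n_values ** n_cols    # 100,000 equally likely 5-tuples
expected = n_rows / n_patterns     # expected number of rows matching any fixed target

print(expected)      # 10.0
```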

The example in this article shows how to search a numerical matrix for rows that have a particular value. For character matrices, you can use the ROWCATC function in SAS/IML to concatenate the column values into a single vector of strings. You can then use the LOC function to find the rows that match the pattern.

Post a Comment

How to pass parameters to a SAS program

This article shows how to run a SAS program in batch mode and send parameters into the program by specifying the parameters when you run SAS from a command line interface. This technique has many uses, one of which is to split a long-running SAS computation into a series of less time-intensive computations. There is a large class of problems that naturally divide into smaller pieces: programs in which the same computation is repeated many times for different values of parameters. A computation such as this is an example of an "embarrassingly parallel" computation.

To give a concrete example, suppose that you are running a simulation that involves generating many samples from the Beta(a,b) distribution over a wide range of (a,b) values, such as 0 < a ≤ 4 and 0 < b ≤ 4. You could write a single program that loops over a regular grid of (a,b) values, but here are some reasons that you might want to divide the computation into smaller pieces:

  • Some parameter values might require more iterations than others in order to obtain accurate estimates.
  • Certain parameter values might require a different estimation technique or an asymptotic approximation.
  • An error in one portion of your program (for example, when a parameter is zero) will not result in the loss of earlier computations.
  • You can begin analyzing the first set of results even while you are computing more results.
  • If you have access to multiple computers, you can submit different parameter values to each computer, thus achieving parallel processing with little effort and no cost.

With apologies to Neil Sedaka, breaking up (a program) is easy to do.

An example SAS program

A simple DATA step is sufficient to illustrate the process. It lacks the motivation (there is no actual need to break up the program), but it illustrates the main ideas in a simple way. Start the break-up process by using macro variables to replace the upper and lower limits of the parameters, as follows:

%let aMin = 0.1;         /* lower limit of parameter 'a' */
%let aMax = 4;           /* upper limit of parameter 'a' */
%let bMin = 0.1;         /* lower limit of parameter 'b' */
%let bMax = 4;           /* upper limit of parameter 'b' */
%let DSName = Out1;      /* name of data set that contains results for params */
libname dest ".";        /* put results in current working directory */
data dest.&DSName(keep = a b kurtosis);
do a = &aMin to &aMax by 0.1;                 /* loop over a */
   do b = &bMin to &bMax by 0.1;              /* loop over b */
      /* compute kurtosis of Beta(a,b) distribution */
      numer = 6*((a-b)**2 * (a+b+1)-a*b*(a+b+2));
      denom = a*b*(a+b+2)*(a+b+3);
      kurtosis = numer / denom;
      output;
   end;
end;
run;

The program computes the kurtosis of the Beta distribution for each (a,b) value on a regular grid 0.1 ≤ a ≤ 4 and 0.1 ≤ b ≤ 4. Notice that the DEST libref ensures that the results are stored in the current directory.
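As a sanity check on the formula in the DATA step, here is the same computation as a Python sketch; for the symmetric case Beta(2,2), the excess kurtosis is the known value -6/7:

```python
def beta_excess_kurtosis(a, b):
    # same formula as the DATA step: excess kurtosis of the Beta(a,b) distribution
    numer = 6 * ((a - b)**2 * (a + b + 1) - a*b*(a + b + 2))
    denom = a * b * (a + b + 2) * (a + b + 3)
    return numer / denom

print(beta_excess_kurtosis(2, 2))   # -6/7, the known value for Beta(2,2)
```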

Breaking up the computation into smaller pieces

The %LET statements define the range of the a and b parameters and the name of the output data set. (You can also use this technique to write a simulation that accepts a seed value as a parameter to the RAND function.) I can use the values of the macro variables to divide the computation into a series of smaller computations. For example, I could divide the computation into the following four smaller computations:

  1. Compute the results on 0.1 ≤ a ≤ 2 and 0.1 ≤ b ≤ 2. Store these results in the data set Out1.
  2. Compute the results on 0.1 ≤ a ≤ 2 and 2.1 ≤ b ≤ 4. Store these results in Out2.
  3. Compute the results on 2.1 ≤ a ≤ 4 and 0.1 ≤ b ≤ 2. Store these results in Out3.
  4. Compute the results on 2.1 ≤ a ≤ 4 and 2.1 ≤ b ≤ 4. Store these results in Out4.

You could change the parameter values manually, but SAS provides a feature that makes it simple to specify parameter values on the command line. At the top of the program, you can replace the simple %LET statements with calls to the %SYSGET function, as follows:

/* get values of environment variables from SAS command line */
%let aMin = %sysget(aMin);
%let aMax = %sysget(aMax);
%let bMin = %sysget(bMin);
%let bMax = %sysget(bMax);
%let DSName = %sysget(DSName);

With this change, you can run the DATA step in batch mode and use the -SET option on the SAS command line to change the parameter values for each invocation. For example, if the program is stored in a file, then the first invocation from a Windows command prompt could be

> "C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" 
    -set DSName "Out1" 
    -set aMin 0.1 -set aMax 2 -set bMin 0.1 -set bMax 2

When the SAS program runs, the %SYSGET function gets the values that you specified by using the -SET command line option. Notice that the options are specified as keyword/value pairs such as -set aMin 0.1. SAS runs the program in batch mode and creates a SAS data set Out1.sas7bdat in the current directory. In a similar way you can run the program three more times to create the data sets Out2, Out3, and Out4.
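The pattern is not unique to SAS: the analogue of %SYSGET in most languages is reading an environment variable and falling back to a default. In this Python sketch, the parameter names simply mirror the macro variables in my program and are illustrative, not part of the SAS code:

```python
import os

def get_param(name, default):
    """Read a numeric parameter from an environment variable, as %SYSGET does in SAS."""
    return float(os.environ.get(name, default))

# illustrative names that mirror the SAS program's macro variables
a_min = get_param("aMin", "0.1")
a_max = get_param("aMax", "2")
```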

After all the data sets are created, you can concatenate them together to form the complete data set, which contains results for the complete range of parameter values:

data All;
set dest.Out:;           /* use colon to specify Out1, Out2, Out3, and Out4 */
run;
/* analyze the data and/or make plots */
proc sgrender data=All template=ContourPlotParm;
dynamic _TITLE="Kurtosis of Beta(a,b) Distribution"
        _X="a" _Y="b" _Z="Kurtosis";
run;

The call to the SGRENDER procedure uses a GTL template from a previous article about how to create a contour plot in SAS. The graph shows that the kurtosis of the beta distribution is small when the values of a and b are approximately equal, but the kurtosis can be arbitrarily large when a or b are close to zero.

Although I demonstrated this technique for a simple DATA step, you can use the approach to control the parameters that are passed to any SAS procedure. For example, the following PROC IML statements might begin a long-running simulation in the SAS/IML language:

proc iml;
aGrid = do(&aMin, &aMax, 0.1); /* evenly spaced [aMin, aMax] */
bGrid = do(&bMin, &bMax, 0.1); /* evenly spaced [bMin, bMax] */

So next time you have a long-running SAS program that you want to run during lunch (or overnight), think about using the -SET command line option to specify parameter values that are passed to your program. Inside your program, use the %SYSGET function to receive the values. This technique enables you to run SAS programs in batch mode and pass in program parameters. It also enables you to break up a long-running program into smaller, less time-intensive programs.
Post a Comment

Analyzing the first 10 million digits of pi: Randomness within structure


Saturday, March 14, 2015, is Pi Day, and this year is a super-special Pi Day! This is your once-in-a-lifetime chance to celebrate the first 10 digits of pi (π) by doing something special on 3/14/15 at 9:26:53. Apologies to my European friends, but Pi Day requires that you represent dates with the month placed first in order to match the sequence 3.141592653....

Last year I celebrated Pi Day by using SAS to explore properties of the continued fraction expansion of pi. This year, I will examine statistical properties of the first 10 million digits of pi. In particular, I will show that the digits of pi exhibit statistical properties that are inherent in a random sequence of integers.

Editor's note (17Mar2015): Ken Kleinman remarked on the similarity of the analysis in this article to his own analysis from 2010. I was reading his blog regularly in 2010, so I suspect that I unconsciously remembered his chi-square and Durbin-Watson analyses and used them without properly citing him. My apologies to Ken and Nick Horton for this oversight!

Reading 10 million digits of pi

I have no desire to type in 10 million digits, so I will use SAS to read a text file at a Princeton University URL. The following statements use the FILENAME statement to point to the URL:

/* read data over the internet from a URL */
filename rawurl url ""
                /* proxy='' */ ;
data PiDigits;
   infile rawurl lrecl=10000000;
   input Digit 1. @@;
   Position = _n_;
   Diff = dif(digit);      /* compute difference between adjacent digits */
run;

proc print data=PiDigits(obs=9);
   var Digit;
run;

The PiDigits data set contains 10 million rows. The call to PROC PRINT displays the first few decimal digits, which are (skipping the 3) 141592653....

For other ways to use SAS to download data from the internet, search Chris Hemedinger's blog, The SAS Dummy, for "PROC HTTP" and you will find several examples of how to download data from a URL.

The distribution of digits of pi

You can run many statistical tests on these numbers. It is conjectured that the digits of pi are randomly uniformly distributed in the sense that the digits 0 through 9 appear equally often, as do pairs of digits, trios of digits, and so forth.

You can call PROC FREQ to compute the frequency distribution of the first 10 million digits of pi and to test whether the digits appear to be uniformly distributed:

/* Are the digits 0-9 equally distributed? */
proc freq data=PiDigits;
tables Digit / chisq out=DigitFreq;
run;

The frequency analysis of the first 10 million digits shows that each digit appears about one million times. A chi-square test indicates that the digits appear to be uniformly distributed. If you turn on ODS graphics, PROC FREQ also produces a deviation plot that shows that the deviations from uniformity are tiny.
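The chi-square statistic that PROC FREQ reports is easy to sketch by hand: it sums the scaled squared deviations of the observed counts from the expected (uniform) counts. The Python sketch below uses small hypothetical counts, not the actual pi tallies:

```python
def chi_square_uniform(counts):
    """Chi-square statistic for testing that all categories are equally likely."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# hypothetical counts for 10 equally likely digits
print(chi_square_uniform([10] * 10))            # 0.0: perfectly uniform
print(chi_square_uniform([12, 8] + [10] * 8))   # 0.8: a small deviation
```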

A "pi chart" of the distribution of the digits of pi

As an advocate of the #OneLessPie Chart Initiative, I am intellectually opposed to creating pie charts. However, I am going to grit my teeth and make an exception for this super-special Pi Day. You can use the Graph Template Language (GTL) to create a pie chart. Even simpler, Sanjay Matange has written a SAS macro that creates a pie chart with minimal effort. The following DATA step creates a percentage variable and then calls Sanjay's macro:

data DigitFreq;
   set DigitFreq;
   Pct = Percent/100; 
   format Pct PERCENT8.2;
run;
/* macro from */
%GTLPieChartMacro(data=DigitFreq, category=Digit, response=Pct,
         title=Distribution of First 10 Million Digits of Pi);

The pie chart appears at the top of this article. It shows that the digits 0 through 9 are equally distributed.

Any autocorrelation in the sequence?

In the DATA step that read the digits of pi, I calculated the difference between adjacent digits. You can use the SGPLOT procedure to create a histogram that shows the distribution of this quantity:

proc sgplot data=PiDigits(obs=1000000);  
   vbar Diff;
run;

That's a pretty cool triangular distribution! I won't bore you with mathematical details, but this shape arises when you examine the difference between two independent discrete uniform random variables, which suggests that the even digits of pi are independent of the odd digits of pi.
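You can verify the triangular shape analytically: if X and Y are independent uniform digits, then P(X-Y = k) = (10-|k|)/100. The following Python sketch enumerates all 100 pairs to confirm this:

```python
from collections import Counter

# distribution of the difference of two independent uniform digits 0-9
diffs = Counter(x - y for x in range(10) for y in range(10))

print(diffs[0] / 100)   # 0.1: the peak of the triangle (x == y)
print(diffs[9] / 100)   # 0.01: a difference of 9 occurs only as 9-0
print(all(diffs[k] == 10 - abs(k) for k in range(-9, 10)))   # True
```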

In fact, more is true. You can run a formal test to check for autocorrelation in the sequence of numbers. The Durbin-Watson statistic, which is available in PROC REG and PROC AUTOREG, has a value near 2 if a series of values has no autocorrelation. The following call to PROC AUTOREG requests the Durbin-Watson statistic for first-order through fifth-order autocorrelation for the first one million digits of pi. The results show that there is no detectable autocorrelation through fifth order. To the Durbin-Watson test, the digits of pi are indistinguishable from a random sequence:

proc autoreg data=PiDigits(obs=1000000);  
   model Digit = / dw=5 dwprob;
run;

Are the digits of pi random?

Researchers have run dozens of statistical tests for randomness on the digits of pi. They all reach the same conclusion. Statistically speaking, the digits of pi seem to be the realization of a process that spits out digits uniformly at random.

Nevertheless, mathematicians have not yet been able to prove that the digits of pi are random. One of the leading researchers in the quest commented that if they are random then you can find in the sequence (appropriately converted into letters) the "entire works of Shakespeare" or any other message that you can imagine (Bailey and Borwein, 2013). For example, if I assign numeric values to the letters of "Pi Day" (P=16, I=9, D=4, A=1, Y=25), then the sequence "1694125" should appear somewhere in the decimal expansion of pi. I wrote a SAS program to search the decimal expansion of pi for the seven-digit "Pi Day" sequence. Here's what I found:

proc print noobs data=PiDigits(firstobs=4686485 obs=4686491);
   var Position Digit;
run;

There it is! The numeric representation of "Pi Day" appears near the 4.7 millionth decimal place of pi. Other "messages" might not appear in the first 10 million digits, but this one did. Finding Shakespearian sonnets and plays will probably require computing more digits of pi than the current world record.
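The letter-to-number encoding is simply each letter's position in the alphabet. Here is a small Python sketch of the mapping (my own helper, not part of the search program):

```python
def word_to_digits(word):
    """Map letters to their positions in the alphabet: A=1, B=2, ..., Z=26."""
    return "".join(str(ord(ch.upper()) - ord("A") + 1) for ch in word if ch.isalpha())

print(word_to_digits("Pi Day"))   # "1694125"
```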

The digits of pi pass every test for randomness, yet pi is a precise mathematical value that describes the relationship between the circumference of a circle and its diameter. This dichotomy between "very random" and "very structured" is fascinating! Happy Pi Day to everyone!

Post a Comment

Matrix multiplication with missing values in SAS

Sometimes I get contacted by SAS/IML programmers who discover that the SAS/IML language does not provide built-in support for multiplication of matrices that have missing values. (SAS/IML does support elementwise operations with missing values.) I usually respond by asking what they are trying to accomplish, because mathematically matrix multiplication with missing values is not a well-defined operation. I get various answers. Some people want to exclude the missing values. Others want them propagated. Still others want them imputed.

This article discusses possible ways to multiply matrices when the matrices contain missing values. It defines a SAS/IML module that propagates missing values. The module can be used to score a linear model on data that include missing values.

Linear algebra with missing values? Does it make sense?

SAS software supports scalar operations on missing values. Multiplication that involves a missing value results in a missing value. Additive operations exclude (skip) the missing value. Many SAS procedures, including PROC IML, handle missing data in statistical calculations.

Linear algebra is another story. A vector space is defined over an algebraic field, which is a set of numbers that support addition, subtraction, multiplication, and division. Every element of the field must have an additive inverse, and every nonzero element must have a multiplicative inverse (reciprocal). A missing value does not have either. There is no number that can be added to a missing value to get zero. Similarly, there is no number such that the product of the number and a missing value is 1.

Consequently, you can't include missing values in matrices and expect to preserve the usual laws of linear algebra. However, you can define new operations on numerical arrays that are reminiscent of matrix multiplication. In the rest of this article, A and B are matrices where the number of columns of A equals the number of rows of B. Read on to discover ways to compute a "matrix product" C = A*B when either matrix has a missing value.

Excluding rows with missing values

By far the most common way to handle missing values in a statistical analysis is to exclude them. See my previous blog post about how to perform listwise deletion of missing values in the DATA step and in SAS/IML.

Propagating missing values

In some software packages (including MATLAB and R), missing values are propagated by matrix multiplication. If the matrix A has a missing value anywhere in the ith row, the product A*B contains missing values everywhere in the ith row. Similarly, if B has a missing value anywhere in the jth column, the entire jth column of the product A*B contains missing values.

It is easy to define a SAS/IML module that implements this multiplication scheme: Use ordinary matrix multiplication on rows and columns that are free of missing values and put missing values everywhere else. The following function implements this multiplication method:

proc iml;
/* matrix "multiplication" where missing values are propagated */
start MVMult(A, B);
   C = j(nrow(A), ncol(B), .);
   rows = loc(countmiss(A, "ROW")=0);
   cols = loc(countmiss(B, "COL")=0);
   if ncol(rows)>0 & ncol(cols)>0 then 
      C[rows, cols] = A[rows,] * B[,cols];
   return( C );
finish;
A = {1 2 3,
     . 4 1,
    -1 0 1};
B = {1 2 -1,
     3 4  0,
     0 1  .};
C = MVMult(A,B);
print C;

The product matrix contains a missing value for every element of the second row because the A matrix has a missing value in the second row. The product matrix contains a missing value for every element of the third column because the B matrix has a missing value in the third column.
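The propagation rule is not specific to the SAS/IML language. The following Python sketch (my own translation, using nested lists with None as the missing value) reproduces the behavior of the MVMult module on the same matrices:

```python
def mvmult(A, B):
    """Matrix "multiplication" that propagates missing values (None).
    Any row of A or column of B that contains None yields missing results."""
    n, p, m = len(A), len(B), len(B[0])
    C = [[None] * m for _ in range(n)]
    ok_cols = [j for j in range(m) if all(B[k][j] is not None for k in range(p))]
    for i in range(n):
        if any(a is None for a in A[i]):
            continue                     # entire row of the product stays missing
        for j in ok_cols:
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(p))
    return C

A = [[1, 2, 3], [None, 4, 1], [-1, 0, 1]]
B = [[1, 2, -1], [3, 4, 0], [0, 1, None]]
print(mvmult(A, B))   # row 2 and column 3 are entirely missing
```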

This is a reasonable way to handle missing values when you are trying to score data according to a linear model. In that case, the matrix on the left side is an n x p data matrix (X). Each column of the right-hand matrix (B) contains p coefficients of a linear model. Often the right-hand matrix is a column vector. The matrix product P=X*B evaluates (scores) the linear model to obtain predicted values of the response. A missing value anywhere in the ith row of X indicates that you cannot predict a value for that observation. When scoring linear models, the right-hand matrix does not usually contain missing values, although PROC SCORE permits missing values for coefficients and treats them as 0.

The previous paragraph shows why propagating missing values makes sense when scoring linear models. However, I prefer not to call it matrix multiplication. For example, in true matrix multiplication, a product that involves the identity matrix results in the original matrix. That does not hold true when missing values propagate. Furthermore, in a product of three matrices for which the middle matrix contains missing values, every element of the product is missing, which limits the usefulness of this technique.

A1 = MVMult(A, I(3));   /* A*I ^= A  */
A2 = MVMult(I(3), A);   /* I*A ^= A  */
H = MVMult(A2, I(3));   /* every element of I*A*I is missing */
print A1, A2, H;

Skipping missing values

Some SAS customers have expressed interest in "skipping" missing values when they multiply matrices. Recall that each element of a matrix product A*B is formed by multiplying elements of A and B and then summing those scalar products. It is reasonable to expect that multiplication that involves a missing value should result in a missing value, but that missing values are skipped in the summation, just as they are when using the SUM function.

You can define a SAS/IML module that implements this scheme, but this is not a good way to handle missing values. This scheme treats missing values as if they were zero. The following statements replace each missing value by 0, and then perform ordinary matrix multiplication. The product is the same as when you "skip" missing values during the summation:

A0 = A;  B0 = B;
A0[ loc(A=.) ] = 0;   /* replace missing values with 0 */
B0[ loc(B=.) ] = 0;
C0 = A0*B0;
print C0;
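The equivalence is easy to verify in plain Python (a standalone sketch with hypothetical helper names, using `None` for missing values): skipping missing terms in each element's sum gives exactly the same product as replacing every missing value with 0 first.

```python
def skip_mult(A, B):
    """Product in which terms that involve a missing value (None)
    are simply omitted from each element's sum."""
    return [[sum(A[i][k] * B[k][j]
                 for k in range(len(B))
                 if A[i][k] is not None and B[k][j] is not None)
             for j in range(len(B[0]))] for i in range(len(A))]

def zero_mult(A, B):
    """Replace missing values with 0, then multiply normally."""
    fix = lambda M: [[0 if v is None else v for v in row] for row in M]
    A0, B0 = fix(A), fix(B)
    return [[sum(A0[i][k] * B0[k][j] for k in range(len(B0)))
             for j in range(len(B0[0]))] for i in range(len(A0))]

A = [[1, 2, 3], [4, None, 6]]
B = [[1, 0], [None, 1], [2, 2]]
# skip_mult(A, B) and zero_mult(A, B) produce the identical matrix.
```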

Replacing missing values with 0 is a bad idea. Substituting the value 0 is likely to bias whatever analysis you are trying to perform. If you want to impute the missing values, you should use an imputation scheme that makes sense for the data and for the analysis.


In conclusion, there are several ways to deal with multiplying matrices that contain missing values in SAS/IML software:

  • Use only complete cases of the data by deleting any observation that contains a missing value.
  • Propagate missing values by using the MVMult function in this blog post. This approach makes sense if you are evaluating a linear model on data that contain missing values.
  • Impute the missing values. There are many ways to impute missing values in SAS, but imputing them with the value 0 is not usually a good choice. Consequently, I do not recommend that you skip missing values during matrix multiplication, which is equivalent to substitution by 0.

Writing data in chunks: Does the chunk size matter?

I often blog about the usefulness of vectorization in the SAS/IML language. A one-sentence summary of vectorization is "execute a small number of statements that each analyze a lot of data." In general, for matrix languages (SAS/IML, MATLAB, R, ...) vectorization is more efficient than the alternative, which is to run many statements (often inside a loop) that each analyze a small amount of data.

I usually think of vectorization as applying to computations. However, I recently realized that the same principle applies when writing data to a SAS data set. I needed to compute a large number of results and write them to a SAS data set. Because each result required a similar computation, I wrote a program that looked something like the following.

proc iml;
N = 1e5;                   /* 100,000 = number of observations to write */
x = j(1,5);
t0 = time();
call randseed(123);
create DS1 from x[colname=("x1":"x5")];   /* open data set for writing */
do i = 1 to N;
   call randgen(x, "Uniform");         /* replaces complex calculation */
   append from x;                      /* write 1 obs with 5 variables */
end;
close DS1;                             /* close data set */
t1 = time() - t0;
print t1[F=5.3];

In the preceding program, the call to the RANDGEN function is used in place of a complicated computation. After running the program I was disappointed. Almost 7 seconds for a mere 100,000 results?! That's terrible performance! After studying the program I began to wonder whether the problem was that I was calling the APPEND statement many times, and each call does only a little bit of work (writes one observation). That is a classic vectorize-this-step situation.

I decided to see what would happen if I accumulated 1,000 results and then output those results in a single APPEND statement. My hope was that the performance would improve by calling the APPEND statement fewer times and writing more data with each call. The rewritten program looked similar to the following:

proc iml;
N = 1e5;                      /* total number of observations to write */
BlockSize = 1e3;              /* number of observations in each APPEND stmt */
NumIters = N/BlockSize;       /* number of calls to the APPEND stmt */
x = j(1,5);
t0 = time();
call randseed(123);
xChunk = j(BlockSize,5,0);
create DS2 from xChunk[colname=("x1":"x5")];
do i = 1 to NumIters;
   do j = 1 to BlockSize;
      call randgen(x, "Uniform");      /* replaces complex calculation */
      xChunk[j,] = x;                  /* accumulate results */
   end;
   append from xChunk;                 /* write 1,000 results */
end;
close DS2;
t2 = time() - t0;
print t2[F=5.3];

Ahh! That's better! By calling the APPEND statement 100 times and writing 1,000 observations for each call, the performance of the program increased dramatically.


I wrote the second program so that the number of observations written by the APPEND statement (BlockSize) is a parameter. That enables me to run tests to examine the performance as a function of how many observations are written with each call. The following graph summarizes the situation.

The graph indicates that the major performance improvements occur when increasing the number of observations from 1 to 20. The curve flattens out after 50 observations. Beyond 100 observations there is no further improvement.

There are two lessons to learn from this experience:

  • Writing (and reading!) data in a matrix language is similar to other numerical operations. Try to avoid reading and writing small quantities of data, such as one observation at a time. Instead, read and write larger chunks of data.
  • You don't have to create huge matrices with gigabytes of data to realize the performance improvement due to vectorization. In this program, writing a matrix that has 100 rows is a vast improvement over writing one observation at a time.

The principle of vectorization tells us to execute a small number of statements that each analyze a lot of data. The same principle applies to reading and writing data in a matrix language. Avoid reading or writing one observation at a time. Instead, you can improve performance by ensuring that each I/O statement reads or writes a hundred or more observations.
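The same buffering idea applies in any language. Here is a minimal stdlib-only Python sketch (my own illustration, not SAS code, with no timing harness): writing rows one at a time versus accumulating a chunk and writing it in one call produces identical output, but the chunked version issues far fewer write calls.

```python
import io

def write_by_row(rows):
    buf = io.StringIO()
    for row in rows:                       # one write call per observation
        buf.write(",".join(map(str, row)) + "\n")
    return buf.getvalue()

def write_by_chunk(rows, block_size=1000):
    buf = io.StringIO()
    chunk = []
    for row in rows:
        chunk.append(",".join(map(str, row)))
        if len(chunk) == block_size:       # flush a full chunk at once
            buf.write("\n".join(chunk) + "\n")
            chunk = []
    if chunk:                              # flush any partial final chunk
        buf.write("\n".join(chunk) + "\n")
    return buf.getvalue()

rows = [(i, i * i) for i in range(10_000)]
# Identical output; roughly 10 write calls instead of 10,000.
```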


Create a custom PDF and CDF in SAS

In my previous post, I showed how to approximate a cumulative distribution function (CDF) by evaluating only the probability density function. The technique uses the trapezoidal rule of integration to approximate the CDF from the PDF.

For common probability distributions, you can use the CDF function in Base SAS to evaluate the cumulative distributions. The technique from my previous post becomes relevant if you need to compute the CDF of a distribution that is not built into SAS.

A recent question on a discussion forum was "How can I plot the PDF and CDF for [an unsupported] distribution?" It turns out that the distribution from the discussion forum has an analytical expression for the CDF, so I will use the Lévy distribution for this article. The density of the Lévy distribution is given by the following formula:

    f(x; μ, c) = sqrt( c/(2π) ) · exp( −c/(2(x−μ)) ) / (x−μ)^(3/2),   for x > μ

Evaluating a custom PDF in SAS

When evaluating any function in SAS, you need to make sure that you understand the domain of the function. For the Lévy distribution, the support is the semi-infinite interval [μ, ∞). The location parameter μ determines the left-hand endpoint for the support of the distribution. The scale parameter c must be a positive value. The limit of the density as x → μ+ is 0. The following SAS/IML module evaluates the Lévy density function. The function is vectorized so that it returns a vector of densities if you call it with a vector of x values. The implementation sets default parameter values of μ=0 and c=1.

proc iml;
start LevyPDF(x, mu=0, c=1);
   pi = constant("pi");
   f = j(nrow(x), ncol(x), 0);         /* any values outside support are 0 */
   idx = loc(x>mu);                    /* find valid values of x */
   if ncol(idx)>0 then do;             /* evaluate density for valid values */
      v = x[idx];         
      f[idx] = sqrt(c/(2*pi)) # exp(-c/(2*(v-mu))) / (v-mu)##1.5;
   end;
   return( f );
finish;
/* plot PDF on [0, 5] */
x = do(0,5,0.02);
pdf = LevyPDF(x);
title "Levy PDF, mu=0, c=1";
call Series(x, pdf) grid={x y};
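For comparison, the same density is easy to state outside SAS. The following stdlib-only Python version is my own sketch, mirroring the domain handling above (zero for x ≤ μ):

```python
import math

def levy_pdf(x, mu=0.0, c=1.0):
    """Levy density: sqrt(c/(2*pi)) * exp(-c/(2*(x-mu))) / (x-mu)**1.5
    for x > mu; zero elsewhere, matching the support [mu, inf)."""
    if x <= mu:
        return 0.0
    v = x - mu
    return math.sqrt(c / (2 * math.pi)) * math.exp(-c / (2 * v)) / v**1.5
```

With the default parameters, the density at x=1 reduces to exp(-1/2)/sqrt(2π).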

Evaluating a custom CDF in SAS: The quick-and-dirty way

As shown in my previous post, you can approximate a cumulative distribution function (CDF) by using the trapezoidal rule to add up the area under the PDF. The following statements implement this approximation. The approximation is accurate provided that the distance between adjacent x values is small, especially when the second derivative of the PDF is large.

start CumTrapIntegral(x,y);      /* cumulative sums of trapezoidal areas */
   N = nrow(colvec(x));
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( 0 // cusum(dx # meanY) );
finish;
CDFApprox = CumTrapIntegral(x, pdf);   
title "Levy CDF, mu=0, c=1";
call Series(x, CDFApprox) yvalues=do(0,0.7,0.1) grid={x y};

Evaluating a custom CDF in SAS: The slow-and-steady way

Although the trapezoidal approximation of the CDF is very fast to compute, sometimes slow and steady wins the race. If you want to evaluate the CDF as accurately as possible, or you only need the CDF at a few locations, you can use the QUAD subroutine to numerically integrate the PDF.

To use the QUAD subroutine, the integrand must be a function of a single variable (x). Any parameters (such as μ and c) must be specified by using a GLOBAL clause. The following statements define a function named _Func that calls the LevyPDF function with the value of the global parameters g_mu and g_c. From the plot of the CDF function, it looks like the median of the distribution is approximately at x=2.2. The following statements evaluate the integral of the Lévy PDF on the interval [0, 2.2]. The integral is approximately 0.5.

start _Func(x) global(g_mu, g_c);
   return( LevyPDF(x, g_mu, g_c) );
finish;
/* looks like median is about x=2.2 */
g_mu=0; g_c=1;
call quad(result, "_Func", {0 2.2});
print result;
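As a cross-check outside SAS: the Lévy CDF has the closed form erfc(sqrt(c/(2(x−μ)))), so numerically integrating the density on [0, 2.2] should land very close to 0.5. The following stdlib-only Python sketch uses the composite Simpson rule (just to illustrate the idea; it is not the algorithm behind the QUAD subroutine):

```python
import math

def levy_pdf(x, mu=0.0, c=1.0):
    """Levy density; zero for x <= mu."""
    if x <= mu:
        return 0.0
    v = x - mu
    return math.sqrt(c / (2 * math.pi)) * math.exp(-c / (2 * v)) / v**1.5

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * k - 1) * h) for k in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2 * k * h) for k in range(1, n // 2))
    return s * h / 3

result = simpson(levy_pdf, 0.0, 2.2)            # approximate CDF(2.2)
closed = math.erfc(math.sqrt(1.0 / (2 * 2.2)))  # exact Levy CDF at 2.2
```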

The call to the QUAD subroutine gave the expected answer. Therefore you can create a function that calls the QUAD subroutine many times to evaluate the Lévy cumulative distribution at a vector of values:

start LevyCDF(x, mu=0, c=1) global(g_mu, g_c);
   cdf = j(nrow(x), ncol(x), 0);             /* allocate result */
   N = nrow(x)*ncol(x);
   g_mu=mu; g_c=c;
   do i = 1 to N;                            /* for each x value */
      call quad(result, "_Func", 0 || x[i]); /* compute integral on [0,x] */
      cdf[i] = result;
   end;
   return( cdf );
finish;

CDF = LevyCDF(x);
title "Levy CDF, mu=0, c=1";
call Series(x, CDF) yvalues=do(0,0.7,0.1) grid={x y};

The graph is visually indistinguishable from the previous CDF graph and is not shown. The CDF curve computed by calling the LevyCDF function is within 2e-5 of the approximate CDF curve. However, the approximate curve is computed almost instantaneously, whereas the curve that evaluates integrals requires a noticeable fraction of a second.

In summary, if you implement a custom probability distribution in SAS, try to write the PDF function so that it is vectorized. Pay attention to special points in the domain (such as 0) that might require special handling. From the PDF, you can create a CDF function. You can use the CumTrapIntegral function to evaluate an approximate CDF from values of the PDF, or you can use the QUAD function to create a computationally expensive (but more accurate) function that integrates the PDF.


An easy way to approximate a cumulative distribution function

Evaluating a cumulative distribution function (CDF) can be an expensive operation. Each time you evaluate the CDF for a continuous probability distribution, the software has to perform a numerical integration. (Recall that the CDF at a point x is the integral under the probability density function (PDF) where x is the upper limit of integration. I am assuming that the PDF does not have a closed-form antiderivative.) For example, the following statements compute and graph the CDF for the standard lognormal distribution at 121 points in the domain [0,6]. To create the graph, the software has to compute 121 numerical integrations.

proc iml;
x = do(0, 6, 0.05);
CDF = cdf("Lognormal", x);
title "Standard Lognormal CDF";
call series(x, CDF) grid={x y};


An integral is an accumulation of slices

In contrast, evaluating the PDF is relatively cheap: you only need to evaluate the density function at a series of points. No integrals required. This suggests a clever little approximation: Instead of calling the CDF function many times, call the PDF function and use the cumulative sum function (CUSUM) to add together the areas of many slices of the density. For example, it is simple to modify the trapezoidal rule of integration to return a cumulative sum of the trapezoidal areas that approximate the area under a curve, as follows:

start CumTrapIntegral(x,y);
   N = nrow(colvec(x));
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( 0 // cusum(dx # meanY) );
finish;

With this function you can approximate the (expensive) CDF by using a series of (cheap) PDF evaluations, as follows:

pdf = pdf("Lognormal", x);             /* evaluate PDF */
CDFApprox = CumTrapIntegral(x, pdf);   /* cumulative sums of trapezoidal areas */
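The cumulative-trapezoid trick is only a few lines in any language. Here is a stdlib-only Python sketch (my own illustration, not SAS code), applied to the same standard lognormal density on [0, 6] with step 0.05:

```python
import math

def lognormal_pdf(x):
    """Standard lognormal density; the limit at x = 0 is 0."""
    if x <= 0:
        return 0.0
    return math.exp(-math.log(x)**2 / 2) / (x * math.sqrt(2 * math.pi))

def cum_trap(x, y):
    """Running trapezoidal sums: approximate CDF values at each x."""
    cdf = [0.0]
    for i in range(1, len(x)):
        cdf.append(cdf[-1] + (x[i] - x[i-1]) * (y[i] + y[i-1]) / 2)
    return cdf

xs = [0.05 * i for i in range(121)]       # 121 points on [0, 6]
cdf_approx = cum_trap(xs, [lognormal_pdf(x) for x in xs])
```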

How good is the approximation to the CDF? You can plot the difference between the values that were computed by the CDF function and the values that were computed by using the trapezoidal areas:

Difference = CDF - CDFApprox`;
title "Difference between Lognormal CDF and Approximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that for the lognormal distribution evaluated on [0,6], the approximate CDF differs by less than 0.001. For most values of x, CDF(x) differs from the cumulative trapezoidal approximation by less than 0.0001. This result uses PDF values with a step size of 0.05. In general, smaller step sizes lead to smaller errors, and the error for the trapezoidal rule is proportional to the square of the step size.

Adjustments to the approximation

The lognormal distribution is exactly zero for x < 0, which means that the integral from minus infinity to 0 is zero. However, for distributions that are defined on the whole real line, you need to make a small modification to the algorithm. Namely, if you are approximating a cumulative distribution on the interval [a, b], you need to add the CDF value at a to the values returned by the CumTrapIntegral function. For example, the following statements compare the values of the standard normal CDF on [-3, 3] with the cumulative trapezoidal sums:

/* for distributions on [-infinity, x], correct by adding constant */
x = do(-3, 3, 0.05);
CDF = cdf("Normal", x);              /* gold standard */
pdf = pdf("Normal", x);
CDFApprox = CumTrapIntegral(x, pdf); /* cumulative sums */
CDFAdd = cdf("Normal", -3);          /* integral on (-infinity, -3) */
CDFApprox = CDFAdd + CDFApprox`;     /* trapezoidal approximation */
Difference = CDF - CDFApprox;
title "Difference between Normal CDF and Approximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that the approximation is within ±5e-5 of the values from the CDF function.
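The same adjustment is easy to replicate in stdlib-only Python (my own sketch; `math.erf` supplies the reference normal CDF): compute the running trapezoidal sums and add the exact left-tail value at -3.

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

xs = [-3 + 0.05 * i for i in range(121)]      # [-3, 3], step 0.05
pdf = [normal_pdf(x) for x in xs]
approx = [0.0]
for i in range(1, len(xs)):                   # cumulative trapezoidal sums
    approx.append(approx[-1] + 0.05 * (pdf[i] + pdf[i-1]) / 2)
offset = normal_cdf(-3)                       # integral on (-inf, -3)
approx = [offset + a for a in approx]
max_err = max(abs(normal_cdf(x) - a) for x, a in zip(xs, approx))
```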

When should you use this approximation?

You might be asking the same question that my former math students would ask: "When will I ever use this stuff?" One answer is to use the approximation when you are computing the CDF for a distribution that is not supported by the SAS CDF function. You can implement the CDF by using the QUAD subroutine, but you can use the method in this post when you only need a quick-and-dirty approximation to the CDF.

In my next blog post I will show how to graph the PDF and CDF for a distribution that is not built into SAS.
