Sum a series in SAS

A customer asked:

How do we go about summing a finite series in SAS? For example, I want to compute series
for various integers n ≥ 3. I want to output two columns, one for the natural numbers and one for the summation of the series.

Summations arise often in statistical computing, so this is a great question. I don't understand why the upper limit only goes to n–2, but I think you'll see that the programs below can compute sums for any upper limit.

You can compute summations in the DATA step by using a DO loop. To compute them efficiently in the SAS/IML language, you usually want to construct a vector such that the ith element of the vector is the ith term of the series. Then you use the SUM function to add the terms of the series.

A summation by using the DATA step

It is easy to sum a series by using the DATA step. You set the value of the sum to 0, then loop over the values of i, summing up each term as you go. For this example, the ith term is i / floor(n/i), and the summation is over the terms i=1 to i=n–2. The following DATA step computes this summation for a sequence of values of n:

data Series;
do n = 3 to 10;
   sum = 0;
   do i = 1 to n-2;
      sum = sum + i/ floor(n/i);
   end;
   output;
end;
keep n sum;
run;
 
proc print data=Series noobs; run;
t_summation2

A vectorized summation

In a vector language such as R, MATLAB, or SAS/IML, an efficient summation begins by constructing vectors that contain the elements that are being summed. Suppose that you define i to be a vector that contains the sequence 1, 2, ..., n–2. Then the expression floor(n/i) is also a vector, as is the elementwise ratio of these vectors. These three vectors are shown in the rows of the following table:

t_summation

The bottom row of the table contains the terms of the series. Notice that the terms sum to 6.333, which agrees with the output from the DATA step in the previous section. Use the SUM function to add the terms, as shown in the following function, which computes the summation Sn for an arbitrary value of n > 2:

proc iml;
start SumSeries(n);
   i = 1:(n-2);                    /* index of terms */
   return( sum(i / floor(n/i)) );  /* sum of terms */
finish;

If you want the summation for several values of n, you can use a DO loop to iterate over values of n. The result is the same as for the DATA step program.

n = T(3:10);
sum = j(nrow(n),1);     /* allocate a vector for the results */
do k = 1 to nrow(n);
   sum[k] = SumSeries( n[k] );
end;
print n sum;

A matrix approach to summation

The previous section shows an efficient way to use vectorized operations to compute a summation. However, just for fun, let's see if we can compute the summation for MANY values of n without writing ANY loops! As is typical, a matrix approach uses more memory.

The main idea is to construct a lower triangular matrix whose nth row contains the terms for Sn. You can start by constructing the lower triangular matrix A whose rows contains the vector 1:(n–2) for various values of n. In order to avoid dividing by 0 when you form the expression floor(n/i), you should use missing values for the upper triangular elements of the matrix. The ROW and COL functions are useful for constructing the matrix, as shown below. The ROW and COL functions were introduced in SAS/IML 12.3, but if you are running an earlier version of SAS you can find the definitions in a previous blog post.

nMax = 10;
A = col( j(nMax-2, nMax-2) );      /* each row is {1 2 3 ... 10} */
A[loc(col(A)>row(A))] = .;         /* set upper triangular elements to missing */
*print A;

If you define the column vector n = {3, 4, ..., 10}, then each row of the matrix floor(n/A) contains the denominators for the series. To compute the summation for all values of n, you can form the expression A / floor(n/A) and compute the sums across each row by using the summation subscript reduction operator, as follows:

n = T(3:nMax);         /* column vector */
B = A / floor( n/A );  /* each row contains terms of series for a given n value */
sum = B[,+];           /* sum across rows */
print n sum;

Although I wouldn't ordinarily use the matrix method to sum a series, the technique is useful for constructing structured matrices whose elements are given by a formula. A canonical example is the Hilbert matrix.

Post a Comment

Create a waterfall plot in SAS

waterfall1

In clinical trials, a waterfall plot is often used to indicate how patients in the study responded to treatment. In oncology trials, the response variable might be the percent change in the size of a tumor from the individual's baseline value at the start of the trial. The percent change is plotted for each patient, who are usually ordered from worst (the tumor grew) to best (the tumor vanished). Tumors that grow have a positive change; a reduction in size corresponds to a negative value between 0% and -100%. Clinical trials in other areas would have a different response variable, but the ideas are the same.

Clinical researchers use a bar chart to show the distribution of the response variable because the color of the bar can encode additional characteristics for each patient, such as the patient's Response Evaluation Criteria in Solid Tumors (RECIST) scores. The RECIST score classifies a tumor's response to treatment into categories such as "Progressive Disease," "Stable Disease," "Partial Response," and "Complete Response." In the waterfall plot at the top of this article, these values are used to color each bar.

This article creates a waterfall plot that is based on Figure 2 in "Understanding Waterfall Plots" (Gillespie 2012). The data are based on Kwak et al (2010). I did not have access to the original data, so I estimated values from the published graph.

Creating a waterfall plot as a bar plot

The waterfall plot is essentially a bar chart where each bar represents a patient and the bars are colored by a discrete variable. Waterfall plots are limited to studies that have a few hundred patients because the width of each bar requires several pixels.

The following DATA step defines the data. The response variable (REPONSE) is the tumor's percent change from baseline measurements. A decimal value represents the percentages (for example, -0.75 represents -75% change, and the PERCENTN7. format is used to format the response. Discrete categories in SAS are (by default) arranged in alphabetical order, so I use the values 1, 2, 3, and 4 to encode the RECIST values and I create a user-defined format to display those values as text. You can download the complete program that creates the data and waterfall plot.

proc format;
value RECISTFmt   1='Disease progression'  2='Stable disease'
                  3='Partial response'     4='Complete response';
run;
 
data Tumor;
format PatientID z4. Response percentN7. RECIST RECISTFmt.;
input  PatientID Response RECIST;
datalines;
1001 -0.75 2 
1002 -0.73 3 
1003 -0.51 2 
1004 -0.09 2 
1005 -0.10 2 
... more data ...
;

The first step is to sort the data by the response variable. For a response variable that measures a change in tumor size, sort the data in descending order so that first patient represents the worst outcome and the last patient represents the most desirable outcome. Each observation in the sorted data is then assigned an index that indicates the position in the sorted list. This index variable will be used to plot the response along a horizontal axis:

proc sort data=Tumor out=Waterfall;
by descending Response RECIST;  *use RECIST code to break ties;
run;
 
data Waterfall;
set Waterfall;
Position + 1;                   /* Position = _N_ */
run;

The data are now ready to plot. It is ironic that the WATERFALL statement the SGPLOT procedure is not used to create a waterfall plot! Instead, use the VBAR statement to create the waterfall plot. The GROUP= option enables you to color the bars according to the RECIST values.

proc sgplot data=Waterfall;
   refline -0.3 / axis=y lineattrs=(pattern=shortdash);
   vbar Position / response=Response group=RECIST;
   xaxis label="Patient Number" fitpolicy=thin;
   yaxis label="Change from baseline (%)" values=(-1 to 1 by 0.2) valueshint;
   keylegend / location=inside down=2;
run;

The plot appears at the top of this article. Each statement is explained below:

  • The REFLINE statement puts a horizontal reference line to indicate tumors that changed by 30%, which is a clinically important value. You might also want to put a reference line at 0.2 to indicate tumors that grew by 20% or more. If you put the REFLINE statement first, the reference line vanishes behind the bars. To overlay the reference line on top of the bars, put the VBAR statement first.
  • The VBAR statement draws the main bar chart. The RESPONSE= option scales each bar according to the specified response variable. The GROUP= option enables you to color to bars according to a categorical variable.
  • The XAXIS statement controls options for the horizontal axis. Use the FITPOLICY= option if you have so many patients that individual labels begin to overlap. To suppress the patient numbers, you can use the DISPLAY=NONE option.
  • The YAXIS statement controls options for the vertical axis. Use the VALUES= option to specify the location of tick marks. Use the VALUESHINT option to specify that the minimum and maximum values of the axis are determined by data values, not by the VALUES= option.
  • The KEYLEGEND statement is optional. It enables you to place the legend in a convenient location.

If you want to specify particular colors for the bars, use the STYLEATTRS statement as shown in Sanjay Matange's blog post about Clinical Graphs.

The previous example shows how to create a waterfall plot in SAS for a small to moderate number of patients. However, even for this small sample (79 patients), the labels for the patients overlap and so you need to use the FITPOLICY= option to make the tick labels look nicer. Even after thinning the labels, this example looks a little strange because the tick labels are the "non-round" numbers 1, 5, 9, 13, 17,....

An alternative waterfall plot

An alternative way to create the waterfall plot is to replace the VBAR statement (which assumes a categorical X variable) with a NEEDLE statement (which assumes a continuous X variable). You can set the widths of the needles to be thick, thereby simulating the appearance of a bar chart. However, the continuous X axis enables you to specify the VALUES= option in the XAXIS statement, thereby producing "round numbers" for the tick values, as shown below:

proc sgplot data=Waterfall;
   refline -0.3 / axis=y lineattrs=(pattern=shortdash);
   needle x=Position y=Response / baseline=0 group=RECIST
          lineattrs=(thickness=5px);
   xaxis label="Patient Number" values=(1, 5 to 75 by 5, 79) integer;
   yaxis label="Change from baseline (%)" values=(-1 to 1 by 0.2) valueshint;
   keylegend / location=inside down=2;
run;
waterfall2

The new waterfall plot (click to enlarge) has axis labels that indicate the first and last patients; intermediate patient numbers differ by 5. You might need to play with the relative widths of the plot and of the needles in order to eliminate irregular gaps between adjacent needles.

In summary, you can create a basic waterfall plot in SAS software by using the VBAR or NEEDLE statements in the SGPLOT procedure. More complicated waterfall plots are discussed in the following papers by Pandya (2012) and Sarkar (2014).

References

Post a Comment

The distribution of Pythagorean triples by angle

Last week I was chatting with some mathematicians and I mentioned the blog post that I wrote last year on the distribution of Pythagorean triples.

In my previous article, I showed that there is an algorithm that uses matrix multiplication to generate every primitive Pythagorean triple by starting with the simple (3,4,5) right triangle. The algorithm does not systematically generate triangles with large legs from triangles with smaller legs. Instead, it might generate several triangles with large legs, followed by one or more triangles with small legs. Given a specific radius r, it would be interesting to prove how many iterations of the algorithm are required to guarantee that the algorithm has generated all primitive Pythagorean triples whose hypothesis is less than r.

From a statistical point of view, it is interesting to estimate the distribution of the smallest angle in primitive Pythagorean right triangles. Suppose that you consider only the primitive triangles whose hypotenuse is less than 106. The following histogram shows the distribution of the smallest angle (arctan(b/a) if a is the longer leg) for angles ranging from 0 to 45 degrees in increments of 0.5 degrees. The histogram is formed from 98,151 triangles, which is considerably less than the 159,139 primitive right triangles whose hypotenuses are less than one million.

pythagangle

The histogram has several interesting features:

  • Apparently there are relatively few long-and-skinny triangles. In this sample, few have angles less than 5 degrees. The smallest angle is 3.47 degrees, which corresponds to the (33, 544, 545) right triangle.
  • The distribution is not uniform. In fact, there are noticeable gaps in the density distribution. The biggest gaps are near 37, 23, and 28 degrees.
  • Some Pythagorean triangles that are nearly isosceles. In this sample, the (137903, 137904, 195025) triangle has an angle equal to 44.99979 degrees. This triangle is an example of a twin-leg Pythagorean triple in which the legs lengths differ by 1.

The gaps are very interesting. I conjecture that the gaps in the distribution are not random, but that they are related to low-order triangles. The following statements compute the smallest angle of some right triangles whose legs are small integers:

proc iml;
v = {  4  3  5, 
      12  5 13,
      15  8 17,
      24  7 25,
      21 20 29,
      35 12 37 };
a = atan2(v[,2], v[,1]) * 180/constant('pi');
print v[L="" c={"a" "b" "c"}] a[L="Angle"];
t_pythagangle

If you overlay lines at each of the angles above, they fall right into the gaps of the histogram. The (3,4,5) triangle is centered at the biggest gap (37 degrees). The (5, 12, 13) triangle is responsible for the next largest gap (23 degrees), and so forth. You can overlay these lines on the histogram, or overlay the lines on a fringe plot, as shown in the following image in which the red lines represent the angles of six low-order triangles:

pythagangle3

If this histogram is representative of the true density distribution of the smallest angle in a Pythagorean triangle, the low-order triangles seem to be surrounded by a region of reduced density. If I were to study this problem further, I would conjecture that the gap induced by a low-order triangle is related to Farey sequences. Farey sequences arise in the darndest places, ranging from the dynamics of planetary objects to the study of Diophantine equations. In fact, the fringe plot reminds me of the gaps and divisions in Saturn's rings, which is another place where Farey sequences have an application. Of course, these gaps could also be an artifact of the algorithm that I am using to generate the primitive Pythagorean triangles, but that can be checked by using a different algorithm to generate the Pythagorean triples. Personally, I think the gaps are produced by a real number-theoretic phenomenon.

The density distribution of the smallest angle is interesting, don't you think? Mathematicians who work in number theory can probably explain why the gaps exist, but they were a surprise to me. Who would have guessed that Pythagorean triangles would lead to such an interesting distribution? It's a beautiful connection between probability, statistics, and high-school geometry.

Addendum 16APR2015: Well, my conjecture was wrong. Ian Wakeling implemented a different algorithm for generating all primitive Pythagorean triples that have a hypotenuse length that is less than 1,000,000. I have linked to his SAS/IML program. His program indicates that the angles of the Pythagorean triangles are uniformly distributed in the interval (0, 45) degrees, which is itself an interesting result! The artifacts that I saw are a result of the algorithm that I was using to construct the triples, not a property of the triples themselves. Thanks, Ian!

The algorithm that Ian used enables you to investigate "twin-leg triangles" in which the legs differ by 1. For long legs, these triangles are nearly isosceles. Or investigate "twin leg-hypotenuse triangles" in which the hypotenuse and the longer leg differ by 1. For long legs, these triangles are long and skinny. It is also interesting to ask questions such as "which number is associated with a lot of Pythagorean triangles" (60,060 is one).

Post a Comment

DO loop = 1 TO 600;

Today is my 600th blog post for The DO Loop. I have written about many topics that are related to statistical programming, math, statistics, simulation, numerical analysis, matrix computations, and more. The right sidebar of my blog contains a tag cloud that links to many topics.

What topics do you, my reader, enjoy the most? The least? What topics would you like to see more of in the future? Here are some topics that I have written about frequently.

More than 200 of my posts have been "Getting Started" articles in which I describe some feature of SAS software or matrix programming that is useful for the novice programmer. I have to admit that it is getting harder to think of new introductory topics after blogging for so many years, so please send me any statistical programming topics that perplex you or your colleagues. (I don't blog about DATA steps techniques, PROC SQL, or macro programming, because those topics are routinely covered in SAS Global Forum papers and on the SAS Support Communities.)

In a SAS program, a DO loop is always terminated with an END statement. An END for this DO Loop will occur inevitably, but for now I will keep iterating.

Feedback is always welcome, so leave a comment. Which topics do you enjoy the most and which topics make your eyes glaze over with disinterest?

Post a Comment

Compute the rank of a matrix in SAS

A common question from statistical programmers is how to compute the rank of a matrix in SAS. Recall that the rank of a matrix is defined as the number of linearly independent columns in the matrix. (Equivalently, the number of linearly independent rows.) This article describes how to compute the rank of a matrix in SAS by using functions in SAS/IML software.

An important application of the rank is to determine whether a square matrix is nonsingular. The techniques in this article can also help you decide whether a square matrix is singular.

The rank of a matrix is one of those vexing problems in numerical linear algebra. Mathematically, the question is not difficult. However, in finite-precision arithmetic, computing the rank of a matrix (and whether it is nonsingular) can be challenging. This article suggests some best practices. The details have filled many books, journal articles, and PhD dissertations.

The mathematical rank of a matrix

In theory, you can use Gaussian elimination to compute the rank of a matrix. You reduce the matrix to row echelon form; the rank is the number of rows that contain a nonzero element.

For square matrices, the same mathematical process determines whether a matrix is nonsingular. Other mathematical facts are also true: a singular matrix has a determinant equal to 0, at least one eigenvalue that is 0, and at least one singular value that is 0.

However, these mathematical facts are not good ways to numerically determine whether a matrix is rank-deficient or singular. A naive implementation of Gaussian elimination is numerically unstable. A numerical determinant that should be mathematically zero might be computed as a very tiny nonzero number in finite-precision arithmetic. Or an ill-conditioned but non-singular matrix might have a determinant so small that it is numerically indistinguishable from a singular matrix. In any case, you should never compare a numerical value to 0.

The numerical rank of a matrix

So what's a statistical programmer to do? For the rank problem, there are several numerically robust techniques, but I will describe two that are based on the singular value decomposition (SVD). Some researchers use methods based on the QR decomposition, which is faster to compute but less reliable.

Mathematically, the rank of a matrix equals the number of nonzero singular values. Therefore a simple algorithm to estimate the rank is to use the SVD routine in SAS/IML to compute the singular values and then declare that any singular value smaller than some cutoff value is numerically the same as zero. The following example computes and displays the singular values for a 6x5 matrix by using the SVD:

proc iml;
/* rank(A)=4 because only four linearly independent columns */
A = {1 0 1 0 0,
     1 0 0 1 0,
     1 0 0 0 1,
     0 1 1 0 0,
     0 1 0 1 0,
     0 1 0 0 1 };
 
call svd(U, Q, V, A);
print Q[L="SingVals"];
t_rankmatrix1

You can see that the last singular value is tiny, and in fact the matrix is rank 4. Although you might be tempted to use a fixed cutoff value such as 1e-8 to determine which singular values are "small," it is usually better to incorporate factors that reflect the scale of the problem. A common cutoff value (used by MATLAB) is δ = d σmax ε where d is the maximum dimension of A, σmax is the maximum singular value of A, and ε is "machine epsilon." The machine epsilon can be computed by using the CONSTANT function in SAS. The following example computes the cutoff value and estimates the rank of the matrix:

tol = max(dimension(A)) * constant("maceps") * (max(Q));
rankSVD = sum(Q > tol);
print tol rankSVD;
t_rankmatrix2

There are other techniques that you can use to estimate the rank of a matrix. The SAS/IML documentation recommends using the generalized-inverse technique. The mathematical properties of the generalized inverse (also known as the Moore-Penrose inverse) are explained in a short 1978 article by M. James in The Mathematical Gazette.

The generalized inverse always exists, even for nonsquare matrices and for singular matrices. (If A is nonsingular, then the generalized inverse is the same as the inverse matrix A-1.) To reveal the rank of A, compute the trace of the product of the generalized inverse with A. For a nonsingular matrix, the trace is obviously the dimension of A. For nonsquare or singular matrices, the (rounded) trace is an estimate for the rank of the matrix (Penrose, 1954, p. 408), as follows:

rankGInv = round(trace(ginv(A)*A));
print rankGInv;
t_rankmatrix3

The generalized-inverse technique does not enable you to specify a tolerance/scaling parameter, but a parameter is used internally by the GINV function to compute the generalized inverse.

It is useful to package up these techniques into a module, as follows:

start RankMatrix(A, method="SVD", tol=);
   if upcase(method)="SVD" then do;
      call svd(U, Q, V, A);
      if IsEmpty(tol) then 
         tol = max(dimension(A))*constant("maceps")*(max(Q));
      return( sum(Q > tol) );
   end;
   else do;
      return( round(trace(ginv(a)*a)) ); 
   end;
finish;
 
rankSVD = RankMatrix(A);
rankGInv = RankMatrix(A, "GINV");

A numerical test for singularity

Let's test the rank algorithms on a notorious ill-conditioned matrix, the Hilbert matrix. For an n x n Hilbert matrix, the determinant approaches zero quickly, but is always positive, which means that the Hilbert matrix is nonsingular for all values of n.

The following table shows the result of computing the rank for five Hilbert matrices.

t_rankmatrix4

The first column shows the size of the problem, which is also the true rank of the Hilbert matrix. The second column shows the determinant of the matrix, which shows that the matrices are very close to being singular. The third column shows the estimated rank by using the SVD algorithm. The SVD algorithm performs very well. It correctly computes the rank of the Hilbert matrix for n ≤ 10, and correctly finds the rank of a matrix with determinant 2E-53. The fourth column shows the estimated rank by using the generalized inverse algorithm. It also performs well and correctly computes the rank of the Hilbert matrix for n ≤ 9 and for a determinant as tiny as 1E-42.

In summary, both methods perform well. The SVD method is slightly faster and performed better on this test, so I recommend that you use the SVD method to estimate the rank of rectangular matrices and to determine whether a square matrix is nonsingular. The SVD method is the default method for the RankMatrix function.

Post a Comment

Let's talk at SAS Global Forum 2015

The 2015 SAS Global Forum is in Dallas, TX, and I'll be there. There are many talks to see and people to meet, so thank goodness for the agenda builder, which enables you to create a schedule in advance.

I always enjoy talking with SAS customers about statistics, simulations, matrix computations, and the SAS/IML product. In fact, my main reason to attend this conference is to talk to customers, so please find me and say hello! Here's my schedule:

Sunday, April 26, 2015

  • At 5:00 p.m. I will give a Super Demo in The Quad (formerly known as the SAS Support and Demo Area). The topic is "Visualizing SAS/IML Programs" and I will show how you can use graphics to find errors, detect inefficiencies, and create high-quality statistical programs in the SAS/IML language.
  • I will be at the SAS/IML booth in The Quad during 4:30 – 6:30 p.m. I am sharing the booth with the SAS Studio folks, so come and see the new browser-based interface to running SAS programs.
  • Find me at the Get-Acquainted Dessert Reception from 8:30 – 9:00 p.m.

Monday, April 27

  • At 10:30 a.m. I give a presentation "Ten Tips for Simulating Data with SAS" in Room C148. This talk features techniques that enable you to write efficient simulations in SAS. The tips are from my book Simulating Data with SAS. Whether you are a beginner or an experienced SAS programmer, I promise that you'll learn something interesting about simulating data with SAS.
  • You will find me at the SAS/IML booth in The Quad during the Customer Appreciation Mixer from 6:30 – 8:00 p.m. Grab a drink and some hors d'oeuvres and talk to me about the statistical and data analysis problems you face at your company.

Tuesday, April 28

  • I'll be at the SAS/IML booth in The Quad from 10:00 a.m. – 12:00 p.m.,
  • At 11:00 a.m. I present a Super Demo titled "Ten Tips for Simulating Data with SAS." If you missed my talk on Monday, come to The Quad to hear an informal presentation about the same topic.
  • Have you ever considered writing a book for SAS Press? Join me and other SAS Press authors for lunch in Ballroom D2 and find out what it takes to write a book. Those who attend will have a chance to win a free book (your choice) from the SAS Bookstore.
  • From 6:30 p.m. – 8:00 p.m., please join me and other SAS professionals for a linkup of online communities, which will include leading members of the SAS online communities, including representatives from SAS-L, sasCommunity.org, the SAS Support Communities, and other online networking sites for SAS professionals.
  • .

During the conference I will occasionally tweet from my Twitter handle @RickWicklin to the hashtag #SASGF15.

So whether you want to discuss statistical programming, comment on a blog topic, or tell me about a cool project that you are working on, please introduce yourself at the conference. I look forward to meeting old friends... and making some new ones.

If you are going to SAS Global Forum 2015, what are you looking forward to? The talks? The social events? Leave a comment about what you hope to gain by attending this conference.

Post a Comment

Simulate the Monty Hall Problem in SAS

The Monty Hall Problem is one of the most famous problems in elementary probability. It is famous because the correct solution is counter-intuitive and because it caused an uproar when it appeared in the "Ask Marilyn" column in Parade magazine in 1990. Discussing the problem has been known to create a Jekyll-and-Hyde effect among mathematicians that transforms ordinarily calm logicians into frenzied, red-faced, name-calling beasts.

The following statement from Wikipedia is the usual statement of the Monty Hall Problem:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host [Monty Hall], who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

The implicit assumption here is that the host does not act randomly: he always chooses to reveal one of the two goats. (The other implicit assumption is that "winning" means choosing the car, which might shock rural goat farmers!) The surprising mathematical result is that, on average, if you (the player) switch from your original choice to the other unrevealed door, you will win the car two-thirds of the time. Thus your best strategy is to switch. If you like videos, watch this scene from the movie 21 in which Kevin Spacey uses the Monty Hall Problem to identify a brilliant math student.

A simple analysis of why this "always switch" strategy is advantageous goes like this:

  • Case 1: The door you chose hides a goat. The host shows the other goat. If you switch doors, you always win the car. This case occurs with probability 2/3.
  • Case 2: The door you chose hides the car. If you switch doors, you lose. This case occurs with probability 1/3.

A section of the Wikipedia article criticizes this simple solution and includes a long discussion of various real and fallacious arguments.

The Monty Hall Problem became (in)famous when thousands of eminent mathematicians and physicists wrote Parade magazine "explaining" why switching could not possibly affect the probability of winning the car. Apparently the great mathematician Paul Erdös refused to believe that switching was a better strategy even after being presented with several mathematical proofs. He was not convinced until he was shown the results of a computer simulation.

Perhaps you, like Erdös, need to see a simulation before you are convinced? Many SAS programmers have written a simulation of the Monty Hall Problem by using the DATA step. The following SAS-oriented discussions are worth reading:

What I haven't seen is a simulation in the SAS/IML language. Because of its vectorized nature, you can often write a simulation in SAS/IML without writing any loops. That is the case for the Monty Hall Problem.

Without loss of generality (WLOG), assume that the contestant always picks Door Number 1. The host then shows one of the goats, which is behind either Door Number 2 or Door Number 3. In the following simulation, the probability of winning is computed if you stay with your original guess (which is obviously 1/3) versus if you switch your guess:

proc iml;
call randseed(54321);
NumSim = 1e5;                               /* number of simulated games */
car = randfun(NumSim, "Table", {1 1 1}/3);  /* unknown door hides car */
guess = j(NumSim, 1, 1);                    /* WLOG, guess door=1 */
show = choose(car=2, 3, 2);                 /* host shows a goat */
switch = choose(show=2, 3, 2);              /* the door you could switch to */
 
winIfStay = mean(guess=car);                /* (P(win | do not switch) */
winIfSwitch = mean(switch=car);             /* (P(win | switch) */
print winIfStay winIfSwitch;
t_montyhall

The simulation (which requires less than 10 lines in SAS/IML) convincingly shows that the probability of winning if you do not switch is 1/3, whereas the probability of winning the car if you switch is 2/3.

The simulation uses three SAS/IML functions that are worth explaining:

I'd like to thank my colleague Stephen Mistler for recently reminding me about the Monty Hall Problem and for advocating that simulations provide clarity to probability problems.

Don't like my approach? Think you can do better? Still don't believe the result? Leave a comment.

Post a Comment

Visualizing the causes of airline crashes

aircrashseries

There has been a spate of recent high-profile airline crashes (Malaysia Airlines, TransAsia Airways, Germanwings,...) so I was surprised when I saw a time series plot of the number of airline crashes by year, which indicates that the annual number of airline crashes has been decreasing since 1993. The data show that the annual number of crashes in recent years is about half of the number from 20 years ago.

A flow diagram that visualizes airline crash data

The series plot appeared in Significance magazine as part of an interview with designer David McCandless, who produces infographics about data. McCandless has recently published a new book titled Knowledge is Beautiful, which includes the following infographic (click to enlarge). The time series plot appears in the upper left. The main portion of the infographic presents information about the causes of commercial air crashes from 1993–2013. In addition to the cause of the crash, McCandless shows the "phase of flight" (take-off, en route, landing) during which the disaster occurred.

aircrashOrig

In general, infographics are designed to appeal as well as to inform. In the article, McCandless says that he is trying to show "the links, connections and relations between things." His goal is to "mediate between the data and the audience" so that people "pay attention" to the data. Many statisticians share that goal, with a possible difference being that statistical graphics attempt to objectively represent the data so that the data speak for themselves.

McCandless's goal is to create beautiful infographics (and he has succeeded!), but I would like to propose an alternative visualization of these data that trades beauty for compactness.

I have two main issues with McCandless's "flow diagram," which is reminiscent of a Sankey diagram. The first is that it uses a lot ink to display the relationship between "Cause" and "Phase," which are each five-level categorical variables. A standard frequency analysis would display this data in a 5x5 table or visualize it with a tile-based plot with 25 tiles. I don't think that Edward Tufte would approve of McCandless's plot. Unlike Tufte's favorite plot—Minard's "Napoleon’s March to Moscow", which uses a similar design—there is no temporal or geographic change for these data. The "flow" through the various "pipes" merely represents conditional frequencies.

My second objection is simply that some of the blocks in the diagram are the wrong sizes. The area of the blocks is supposed to be proportional to the number of crashes in each category. Notice that the bottom row is properly scaled: If you stack the "take-off" and "en route" boxes, which account for 50% of the cases, they approximately equal the area of the "landing" box. However, some boxes are not proportional to the frequency that they represent:

  • The top block represents 100% of the crashes, but its area is too big compared to the cumulative areas of the blocks in the "Causes" or "Phase" row.
  • The "Human Error" block (48%) is too big within its row. The other blocks on the "Causes" row account for 51% of the cases, so the sum of their areas should be greater than the area of the "Human Error" block.
  • The "Human Error" block (48%) is too big across rows. It is much larger than the "Landing" block in the bottom row, which represents 49% of the cases.

Visualizing airline crashes with a mosaic plot

At the risk of building a glass house after hurling those stones, I suggest that a mosaic plot is a standard statistical graphic that can display the same air crash data. You can download the data from McCandless's web site. The data needs a little data cleansing before you can analyze it, and I have provided my SAS program that creates the mosaic plot and other related plots. McCandless's diagram uses data through mid-2013, whereas the mosaic plot uses data through March 2015.

Many SAS procedures use ODS statistical graphics to display graphs as readily as they display tables. For example, the following call to PROC FREQ creates a 5x5 frequency table (click to see the table) and a mosaic plot:

proc freq data=Crash order=data;
tables Cause*Phase / plots=(mosaic(square));
run;
aircrashMosaic

In a mosaic plot, the area of each tile is proportional to the number of cases. I have ordered both variables according to frequency, whereas McCandless ordered the "Phase" variable temporally (standing, take-off, en route, landing). I did this so that the most important causes and phases appear in the first row (bottom) and first column (left). The relative widths of the columns indicate the relative frequency of the one-way frequencies for the "Phase" variable. Colors connect the categories for the "Cause" variable.

The mosaic plot is far from perfect. It is hard to visually trace the categories for the "Cause" variable. For example, finding the "mechanical" category for each flight phase requires some eyeball gymnastics. Also, although the mosaic plot makes it possible to visualize the proportions of crashes for each cause and phase, the absolute frequencies are not available on the graph by default. The frequency counts are available in the 5x5 table that PROC FREQ produces. Alternatively, with additional effort you can create a mosaic plot that includes frequency counts for the largest mosaic tiles, which I think is far superior to the default display.

What do you think? Do you prefer the mosaic plot that is produced automatically by PROC FREQ, the one with frequency counts, or do you prefer McCandless's flow diagram? Which helps you to understand the causes of airline crashes? Feel free to adapt my SAS program and create your own alternative visualization!

Addendum

(April 13, 2015) Several visualization experts wrote in with alternate suggestions and comments.

Antony Unwin says that he prefers to visualize these data by using multiple bar charts. I also like multiple bar charts, which show the conditional distribution of cause given the flight phase. In SAS, you can use the SGPANEL procedure to produce a panel of bar charts (click to enlarge), as follows, although I might choose to combine the "unknown" and "standing" phases to save room:

ods graphics / reset width=600px;
proc sort data=Crash;
by descending CauseN PhaseN ;
run;
proc sgpanel data=Crash;
panelby Phase /layout=rowlattice columns=1 novarname 
               sort=data rowheaderpos=right onepanel;
vbar Cause / stat=percent;
rowaxis discreteorder=data grid;
run;
aircrashseries3

Michael Friendly noted that although mosaic plots are useful for visualizing cross-tabulations, they also are useful for evaluating various statistical models. For example, a statistician might want to assess whether there is an association between flight phase and the cause of an airline crash. As I have written before, you can color the cells in a mosaic plot according to the residuals in a model, thereby visualizing how the observed data deviate from the model. For more about using mosaic plots in categorical models, see Friendly (1999).

Xan Gregg pointed me to an article by Stephen Few in which Few argues that an innovative design is not necessarily a better design. (Be sure to read the comments!) Few is more critical of McCandless than I am, but I agree that a different design is not necessarily a better design. If a standard statistical graphic can display the data well, then I prefer to use it rather than to design something new and more complex. My main goal is to communicate as simply as possible. However, I have no problem with artists who choose to create data-inspired art, especially if they convince the public that data are beautiful.

Post a Comment

On the number of permutations supported in SAS software

There's "big," and then there is "factorial big."

If you have k items, the number of permutations is "k factorial," which is written as k!. The factorial function gets big fast. For example, the value of k! for several values of k is shown in the following table. You can use a custom algorithm to compute accurate, extended-precision values of the factorial function for k>18.

t_maxperm

Base SAS has several functions for working with permutations, including the ALLPERM function (which generates all permutations of k items) and the RANPERM function (which generates a random permutation of k items). The documentation for the ALLPERM function mentions that you can specify no more than 18 items. That is, it only enables you to compute all permutations when k≤ 18.

A SAS programmer recently asked why k=18 is the limit in these factorial computations. At first glance, 18 does seem to be an unusual choice. Why not k=16, since computer scientists love powers of 2? Or why not k=20, since most of us count in base 10?

The explanation is that k=18 is the largest value for which k! can be represented by an 8-byte integer. The first argument to the ALLPERM function is an index that enumerates the permutation, and this argument does not fit into an 8-byte integer when k>18. You can use the CONSTANT function in SAS to find the value of the largest 8-byte integer:

data _null_;
Fact18 = fact(18);
ExactInt=constant("exactint");
put Fact18=;
put ExactInt=;
run;
t_maxperm2

From a programming perspective, having a limit also prevents us programmers from accidentally creating a loop that iterates a billion billion times (which is 20!). When dealing with permutations and combinations, implementing a smart algorithm is almost always preferable to a brute-force computation that iterates over every possible permutation.

Do you have any tales of woe from when you inadvertently tried to compute with a large combinatorial number? Leave a comment.

Post a Comment

Vectors that have a fractional number of elements

The title of this article makes no sense. How can the number of elements (in fact, the number of anything!) not be a whole number? In fact, it can't. However, the title refers to the fact that you might compute a quantity that ought to be an integer, but is not an integer when computed by using floating-point arithmetic.

Most programmers are aware of examples of finite-precision numerical arithmetic that seem to violate mathematical truths. One of my favorite examples is that (0.1 + 0.1 + 0.1) ≠ 0.3 in finite precision arithmetic:

data _null_;
x = 0.1 + 0.1 + 0.1;
y = 0.3;
if x=y then put "x = y";
else put "x is not equal to y";
run;
t_fractvec

This is a classic example from computer science that shows why you should avoid testing floating-point values for equality. Furthermore, if you add 0.1 a multiple of 10 times, you will not obtain an integer. This example is not unique, so you should use a truncation function such as ROUND, INT, FLOOR, or CEIL if you need to ensure that a floating-point computation results in an integer.

With that example in mind, let's return to the title of this blog post. A SAS programmer asked on the SAS Support Communities why a SAS/IML vector had a smaller number of elements than he had asked for. After looking at his computation, I realized that he had computed the number of elements by summing a series of floating-point values. Mathematically, the sum should have been 7, but in floating-point arithmetic the value was 6.99999.

So what happens if you ask PROC IML for a vector with 6.99999 elements? The software truncates the number, and you get a vector with 6 elements, as shown by the following program:

proc iml;
n = 6.99999;
v = j(n, 1);    /* allocate a vector with n elements */
print "The vector has " (nrow(v)) " elements";
t_fractvec

A more accurate title for this article might be "What do you get if you ask for a vector that has a fractional number of elements?" The answer is that you get a vector where the number of elements is the truncation of the number that you specified. If you ever find yourself in this situation, you can use the ROUND function to round, rather than truncate, the non-integer value.

There are other operations in SAS (for example, subscripting an array) that similarly truncate non-integer values. However, there is no need to fear an "unexpected truncation" because most programmers do not sum up fractional values when they specify indices and dimensions. Of course, when you use integer-valued functions such as NROW, NCOL, and LENGTH to specify a dimension, then you get what you expect.

In closing, analyze this DATA step that intentionally uses fractional values to index into an array. Do you think it runs? What values do you think are assigned to the array? Is this example interesting or an abomination?

data A(drop=h);
array x[5];
/* index variable takes on fractional values */
do h = 1 to 5 by 0.5;  
  x[h] = h;
end;
run;
proc print; run;
Post a Comment