The distribution of Pythagorean triples by angle

Last week I was chatting with some mathematicians and I mentioned the blog post that I wrote last year on the distribution of Pythagorean triples.

In my previous article, I showed that there is an algorithm that uses matrix multiplication to generate every primitive Pythagorean triple by starting with the simple (3,4,5) right triangle. The algorithm does not systematically generate triangles with large legs from triangles with smaller legs. Instead, it might generate several triangles with large legs, followed by one or more triangles with small legs. Given a specific radius r, it would be interesting to prove how many iterations of the algorithm are required to guarantee that the algorithm has generated all primitive Pythagorean triples whose hypotenuse is less than r.
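As a reminder of the flavor of such an algorithm, the following SAS/IML statements show one level of the classical three-matrix "tree" of primitive triples, which may differ in its details from the program in my earlier post. Each matrix maps the (3,4,5) triple to a new primitive triple:

proc iml;
M1 = { 1 -2  2,
       2 -1  2,
       2 -2  3 };
M2 = { 1  2  2,
       2  1  2,
       2  2  3 };
M3 = {-1  2  2,
      -2  1  2,
      -2  2  3 };
t = {3, 4, 5};                             /* the (3,4,5) triple as a column vector */
print (M1*t)[L="Child 1"] (M2*t)[L="Child 2"] (M3*t)[L="Child 3"];
/* the children are (5,12,13), (21,20,29), and (15,8,17) */

Applying the three matrices again to each child produces nine "grandchildren," and so on; every primitive triple appears exactly once somewhere in the tree.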

From a statistical point of view, it is interesting to estimate the distribution of the smallest angle in primitive Pythagorean right triangles. Suppose that you consider only the primitive triangles whose hypotenuse is less than 10^6. The following histogram shows the distribution of the smallest angle (arctan(b/a) if a is the longer leg) for angles ranging from 0 to 45 degrees in increments of 0.5 degrees. The histogram is formed from 98,151 triangles, which is considerably fewer than the 159,139 primitive right triangles whose hypotenuses are less than one million.

[Figure: pythagangle]

The histogram has several interesting features:

  • Apparently there are relatively few long-and-skinny triangles. In this sample, few have angles less than 5 degrees. The smallest angle is 3.47 degrees, which corresponds to the (33, 544, 545) right triangle.
  • The distribution is not uniform. In fact, there are noticeable gaps in the density distribution. The biggest gaps are near 37, 23, and 28 degrees.
  • Some Pythagorean triangles are nearly isosceles. In this sample, the (137903, 137904, 195025) triangle has an angle equal to 44.99979 degrees. This triangle is an example of a twin-leg Pythagorean triple in which the leg lengths differ by 1.

The gaps are very interesting. I conjecture that the gaps in the distribution are not random, but that they are related to low-order triangles. The following statements compute the smallest angle of some right triangles whose legs are small integers:

proc iml;
v = {  4  3  5, 
      12  5 13,
      15  8 17,
      24  7 25,
      21 20 29,
      35 12 37 };
a = atan2(v[,2], v[,1]) * 180/constant('pi');
print v[L="" c={"a" "b" "c"}] a[L="Angle"];
[Output: t_pythagangle]

If you overlay lines at each of the angles above, they fall right into the gaps of the histogram. The (3,4,5) triangle is centered at the biggest gap (37 degrees). The (5, 12, 13) triangle is responsible for the next largest gap (23 degrees), and so forth. You can overlay these lines on the histogram, or overlay the lines on a fringe plot, as shown in the following image in which the red lines represent the angles of six low-order triangles:

[Figure: pythagangle3]

If this histogram is representative of the true density distribution of the smallest angle in a Pythagorean triangle, the low-order triangles seem to be surrounded by a region of reduced density. If I were to study this problem further, I would conjecture that the gap induced by a low-order triangle is related to Farey sequences. Farey sequences arise in the darndest places, ranging from the dynamics of planetary objects to the study of Diophantine equations. In fact, the fringe plot reminds me of the gaps and divisions in Saturn's rings, which is another place where Farey sequences have an application. Of course, these gaps could also be an artifact of the algorithm that I am using to generate the primitive Pythagorean triangles, but that can be checked by using a different algorithm to generate the Pythagorean triples. Personally, I think the gaps are produced by a real number-theoretic phenomenon.

The density distribution of the smallest angle is interesting, don't you think? Mathematicians who work in number theory can probably explain why the gaps exist, but they were a surprise to me. Who would have guessed that Pythagorean triangles would lead to such an interesting distribution? It's a beautiful connection between probability, statistics, and high-school geometry.

Addendum 16APR2015: Well, my conjecture was wrong. Ian Wakeling implemented a different algorithm for generating all primitive Pythagorean triples that have a hypotenuse length that is less than 1,000,000. I have linked to his SAS/IML program. His program indicates that the angles of the Pythagorean triangles are uniformly distributed in the interval (0, 45) degrees, which is itself an interesting result! The artifacts that I saw are a result of the algorithm that I was using to construct the triples, not a property of the triples themselves. Thanks, Ian!

The algorithm that Ian used enables you to investigate "twin-leg triangles" in which the legs differ by 1. For long legs, these triangles are nearly isosceles. Or investigate "twin leg-hypotenuse triangles" in which the hypotenuse and the longer leg differ by 1. For long legs, these triangles are long and skinny. It is also interesting to ask questions such as "which number is associated with a lot of Pythagorean triangles" (60,060 is one).


DO loop = 1 TO 600;

Today is my 600th blog post for The DO Loop. I have written about many topics that are related to statistical programming, math, statistics, simulation, numerical analysis, matrix computations, and more. The right sidebar of my blog contains a tag cloud that links to many topics.

What topics do you, my reader, enjoy the most? The least? What topics would you like to see more of in the future? Here are some topics that I have written about frequently.

More than 200 of my posts have been "Getting Started" articles in which I describe some feature of SAS software or matrix programming that is useful for the novice programmer. I have to admit that it is getting harder to think of new introductory topics after blogging for so many years, so please send me any statistical programming topics that perplex you or your colleagues. (I don't blog about DATA step techniques, PROC SQL, or macro programming, because those topics are routinely covered in SAS Global Forum papers and on the SAS Support Communities.)

In a SAS program, a DO loop is always terminated with an END statement. An END for this DO Loop will occur inevitably, but for now I will keep iterating.

Feedback is always welcome, so leave a comment. Which topics do you enjoy the most and which topics make your eyes glaze over with disinterest?


Compute the rank of a matrix in SAS

A common question from statistical programmers is how to compute the rank of a matrix in SAS. Recall that the rank of a matrix is defined as the number of linearly independent columns in the matrix. (Equivalently, the number of linearly independent rows.) This article describes how to compute the rank of a matrix in SAS by using functions in SAS/IML software.

An important application of the rank is to determine whether a square matrix is nonsingular. Consequently, the techniques in this article also help you decide whether a square matrix is singular.

Computing the rank of a matrix is one of those vexing problems in numerical linear algebra. Mathematically, the question is not difficult. However, in finite-precision arithmetic, computing the rank of a matrix (and determining whether it is nonsingular) can be challenging. This article suggests some best practices. The details have filled many books, journal articles, and PhD dissertations.

The mathematical rank of a matrix

In theory, you can use Gaussian elimination to compute the rank of a matrix. You reduce the matrix to row echelon form; the rank is the number of rows that contain a nonzero element.
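If you want to see this process on a small integer matrix, you can use the ECHELON function in SAS/IML, as in the following sketch. This works only for exact (integer or rational) examples; as the next paragraphs explain, naive elimination is not a numerically robust way to find the rank of a general floating-point matrix.

proc iml;
A = {1 2 3,
     2 4 6,        /* row 2 = 2 * row 1, so rank(A) = 2 */
     1 0 1 };
E = echelon(A);                    /* row echelon form of A                         */
rankA = sum( (E^=0)[,+] > 0 );     /* number of rows that contain a nonzero element */
print E, rankA;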

For square matrices, the same mathematical process determines whether a matrix is nonsingular. Other mathematical facts are also true: a singular matrix has a determinant equal to 0, at least one eigenvalue that is 0, and at least one singular value that is 0.

However, these mathematical facts are not good ways to numerically determine whether a matrix is rank-deficient or singular. A naive implementation of Gaussian elimination is numerically unstable. A numerical determinant that should be mathematically zero might be computed as a very tiny nonzero number in finite-precision arithmetic. Or an ill-conditioned but non-singular matrix might have a determinant so small that it is numerically indistinguishable from a singular matrix. In any case, you should never compare a numerical value to 0.

The numerical rank of a matrix

So what's a statistical programmer to do? For the rank problem, there are several numerically robust techniques, but I will describe two that are based on the singular value decomposition (SVD). Some researchers use methods based on the QR decomposition, which is faster to compute but less reliable.

Mathematically, the rank of a matrix equals the number of nonzero singular values. Therefore a simple algorithm to estimate the rank is to use the SVD routine in SAS/IML to compute the singular values and then declare that any singular value smaller than some cutoff value is numerically the same as zero. The following example computes and displays the singular values for a 6x5 matrix by using the SVD:

proc iml;
/* rank(A)=4 because only four linearly independent columns */
A = {1 0 1 0 0,
     1 0 0 1 0,
     1 0 0 0 1,
     0 1 1 0 0,
     0 1 0 1 0,
     0 1 0 0 1 };
 
call svd(U, Q, V, A);
print Q[L="SingVals"];
[Output: t_rankmatrix1]

You can see that the last singular value is tiny, and in fact the matrix is rank 4. Although you might be tempted to use a fixed cutoff value such as 1e-8 to determine which singular values are "small," it is usually better to incorporate factors that reflect the scale of the problem. A common cutoff value (used by MATLAB) is δ = d · σ_max · ε, where d is the maximum dimension of A, σ_max is the maximum singular value of A, and ε is "machine epsilon." The machine epsilon can be computed by using the CONSTANT function in SAS. The following example computes the cutoff value and estimates the rank of the matrix:

tol = max(dimension(A)) * constant("maceps") * (max(Q));
rankSVD = sum(Q > tol);
print tol rankSVD;
[Output: t_rankmatrix2]

There are other techniques that you can use to estimate the rank of a matrix. The SAS/IML documentation recommends using the generalized-inverse technique. The mathematical properties of the generalized inverse (also known as the Moore-Penrose inverse) are explained in a short 1978 article by M. James in The Mathematical Gazette.

The generalized inverse always exists, even for nonsquare matrices and for singular matrices. (If A is nonsingular, then the generalized inverse is the same as the inverse matrix A^(-1).) To reveal the rank of A, compute the trace of the product of the generalized inverse with A. For a nonsingular matrix, the trace is obviously the dimension of A. For nonsquare or singular matrices, the (rounded) trace is an estimate for the rank of the matrix (Penrose, 1954, p. 408), as follows:

rankGInv = round(trace(ginv(A)*A));
print rankGInv;
[Output: t_rankmatrix3]

The generalized-inverse technique does not enable you to specify a tolerance/scaling parameter, but a parameter is used internally by the GINV function to compute the generalized inverse.

It is useful to package up these techniques into a module, as follows:

start RankMatrix(A, method="SVD", tol=);
   if upcase(method)="SVD" then do;
      call svd(U, Q, V, A);
      if IsEmpty(tol) then 
         tol = max(dimension(A))*constant("maceps")*(max(Q));
      return( sum(Q > tol) );
   end;
   else do;
      return( round(trace(ginv(a)*a)) ); 
   end;
finish;
 
rankSVD = RankMatrix(A);
rankGInv = RankMatrix(A, "GINV");

A numerical test for singularity

Let's test the rank algorithms on a notorious ill-conditioned matrix, the Hilbert matrix. For an n x n Hilbert matrix, the determinant approaches zero quickly, but is always positive, which means that the Hilbert matrix is nonsingular for all values of n.
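The following sketch shows one way to set up such a test. It assumes that the RankMatrix module defined above is still available in the PROC IML session; the matrix sizes n = 6, ..., 10 are my choice and might not match the exact cases shown in the table.

nVals = 6:10;                          /* matrix sizes to test                 */
result = j(ncol(nVals), 4);
do k = 1 to ncol(nVals);
   n = nVals[k];
   rowIdx = repeat(T(1:n), 1, n);      /* rowIdx[i,j] = i                      */
   colIdx = repeat(1:n, n, 1);         /* colIdx[i,j] = j                      */
   H = 1 / (rowIdx + colIdx - 1);      /* n x n Hilbert matrix                 */
   result[k,] = n || det(H) || RankMatrix(H) || RankMatrix(H, "GINV");
end;
print result[c={"n" "det" "rankSVD" "rankGInv"}];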

The following table shows the result of computing the rank for five Hilbert matrices.

[Table: t_rankmatrix4]

The first column shows the size of the problem, which is also the true rank of the Hilbert matrix. The second column shows the determinant of the matrix, which shows that the matrices are very close to being singular. The third column shows the estimated rank by using the SVD algorithm. The SVD algorithm performs very well. It correctly computes the rank of the Hilbert matrix for n ≤ 10, and correctly finds the rank of a matrix with determinant 2E-53. The fourth column shows the estimated rank by using the generalized inverse algorithm. It also performs well and correctly computes the rank of the Hilbert matrix for n ≤ 9 and for a determinant as tiny as 1E-42.

In summary, both methods perform well. The SVD method is slightly faster and performed better on this test, so I recommend that you use the SVD method to estimate the rank of rectangular matrices and to determine whether a square matrix is nonsingular. The SVD method is the default method for the RankMatrix function.


Let's talk at SAS Global Forum 2015

The 2015 SAS Global Forum is in Dallas, TX, and I'll be there. There are many talks to see and people to meet, so thank goodness for the agenda builder, which enables you to create a schedule in advance.

I always enjoy talking with SAS customers about statistics, simulations, matrix computations, and the SAS/IML product. In fact, my main reason to attend this conference is to talk to customers, so please find me and say hello! Here's my schedule:

Sunday, April 26, 2015

  • At 5:00 p.m. I will give a Super Demo in The Quad (formerly known as the SAS Support and Demo Area). The topic is "Visualizing SAS/IML Programs" and I will show how you can use graphics to find errors, detect inefficiencies, and create high-quality statistical programs in the SAS/IML language.
  • I will be at the SAS/IML booth in The Quad during 4:30 – 6:30 p.m. I am sharing the booth with the SAS Studio folks, so come and see the new browser-based interface for running SAS programs.
  • Find me at the Get-Acquainted Dessert Reception from 8:30 – 9:00 p.m.

Monday, April 27

  • At 10:30 a.m. I give a presentation "Ten Tips for Simulating Data with SAS" in Room C148. This talk features techniques that enable you to write efficient simulations in SAS. The tips are from my book Simulating Data with SAS. Whether you are a beginner or an experienced SAS programmer, I promise that you'll learn something interesting about simulating data with SAS.
  • You will find me at the SAS/IML booth in The Quad during the Customer Appreciation Mixer from 6:30 – 8:00 p.m. Grab a drink and some hors d'oeuvres and talk to me about the statistical and data analysis problems you face at your company.

Tuesday, April 28

  • I'll be at the SAS/IML booth in The Quad from 10:00 a.m. – 12:00 p.m.
  • At 11:00 a.m. I present a Super Demo titled "Ten Tips for Simulating Data with SAS." If you missed my talk on Monday, come to The Quad to hear an informal presentation about the same topic.
  • Have you ever considered writing a book for SAS Press? Join me and other SAS Press authors for lunch in Ballroom D2 and find out what it takes to write a book. Those who attend will have a chance to win a free book (your choice) from the SAS Bookstore.
  • From 6:30 p.m. – 8:00 p.m., please join me and other SAS professionals for a linkup of online communities, which will include leading members of SAS-L, sasCommunity.org, the SAS Support Communities, and other online networking sites for SAS professionals.

During the conference I will occasionally tweet from my Twitter account @RickWicklin by using the hashtag #SASGF15.

So whether you want to discuss statistical programming, comment on a blog topic, or tell me about a cool project that you are working on, please introduce yourself at the conference. I look forward to meeting old friends... and making some new ones.

If you are going to SAS Global Forum 2015, what are you looking forward to? The talks? The social events? Leave a comment about what you hope to gain by attending this conference.


Simulate the Monty Hall Problem in SAS

The Monty Hall Problem is one of the most famous problems in elementary probability. It is famous because the correct solution is counter-intuitive and because it caused an uproar when it appeared in the "Ask Marilyn" column in Parade magazine in 1990. Discussing the problem has been known to create a Jekyll-and-Hyde effect among mathematicians that transforms ordinarily calm logicians into frenzied, red-faced, name-calling beasts.

The following statement from Wikipedia is the usual statement of the Monty Hall Problem:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host [Monty Hall], who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

The implicit assumption here is that the host does not act randomly: he always chooses to reveal one of the two goats. (The other implicit assumption is that "winning" means choosing the car, which might shock rural goat farmers!) The surprising mathematical result is that, on average, if you (the player) switch from your original choice to the other unrevealed door, you will win the car two-thirds of the time. Thus your best strategy is to switch. If you like videos, watch this scene from the movie 21 in which Kevin Spacey uses the Monty Hall Problem to identify a brilliant math student.

A simple analysis of why this "always switch" strategy is advantageous goes like this (a short enumeration of these cases follows the list):

  • Case 1: The door you chose hides a goat. The host shows the other goat. If you switch doors, you always win the car. This case occurs with probability 2/3.
  • Case 2: The door you chose hides the car. If you switch doors, you lose. This case occurs with probability 1/3.
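You can verify this case analysis exactly, without any randomness. The following DATA step is a minimal sketch (the data set name is arbitrary) that enumerates the three equally likely locations of the car and assumes that the contestant's original guess is door 1:

data MontyCases;
do car = 1 to 3;                    /* the car is equally likely to be behind each door          */
   winIfStay   = (car = 1);         /* staying wins only when the original guess is correct      */
   winIfSwitch = (car ^= 1);        /* switching wins exactly when the original guess is wrong   */
   output;
end;
run;
proc means data=MontyCases mean;
   var winIfStay winIfSwitch;
run;

The means of the two indicator variables are 1/3 and 2/3, which agree with the case analysis.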

A section of the Wikipedia article criticizes this simple solution and includes a long discussion of various real and fallacious arguments.

The Monty Hall Problem became (in)famous when thousands of eminent mathematicians and physicists wrote Parade magazine "explaining" why switching could not possibly affect the probability of winning the car. Apparently the great mathematician Paul Erdös refused to believe that switching was a better strategy even after being presented with several mathematical proofs. He was not convinced until he was shown the results of a computer simulation.

Perhaps you, like Erdös, need to see a simulation before you are convinced? Many SAS programmers have written a simulation of the Monty Hall Problem by using the DATA step. The following SAS-oriented discussions are worth reading:

What I haven't seen is a simulation in the SAS/IML language. Because of its vectorized nature, you can often write a simulation in SAS/IML without writing any loops. That is the case for the Monty Hall Problem.

Without loss of generality (WLOG), assume that the contestant always picks Door Number 1. The host then shows one of the goats, which is behind either Door Number 2 or Door Number 3. In the following simulation, the probability of winning is computed if you stay with your original guess (which is obviously 1/3) versus if you switch your guess:

proc iml;
call randseed(54321);
NumSim = 1e5;                               /* number of simulated games */
car = randfun(NumSim, "Table", {1 1 1}/3);  /* unknown door hides car */
guess = j(NumSim, 1, 1);                    /* WLOG, guess door=1 */
show = choose(car=2, 3, 2);                 /* host shows a goat */
switch = choose(show=2, 3, 2);              /* the door you could switch to */
 
winIfStay = mean(guess=car);                /* P(win | do not switch) */
winIfSwitch = mean(switch=car);             /* P(win | switch)        */
print winIfStay winIfSwitch;
[Output: t_montyhall]

The simulation (which requires less than 10 lines in SAS/IML) convincingly shows that the probability of winning if you do not switch is 1/3, whereas the probability of winning the car if you switch is 2/3.

The simulation uses a few SAS/IML functions that are worth explaining: the RANDFUN function generates a vector of random values from a named distribution (here, the "Table" distribution places the car behind each door with probability 1/3); the CHOOSE function performs an elementwise if-then-else operation on a vector of conditions; and the MEAN of a logical expression such as guess=car is the proportion of simulated games that are won.

I'd like to thank my colleague Stephen Mistler for recently reminding me about the Monty Hall Problem and for advocating that simulations provide clarity to probability problems.

Don't like my approach? Think you can do better? Still don't believe the result? Leave a comment.


Visualizing the causes of airline crashes

[Figure: aircrashseries]

There has been a spate of recent high-profile airline crashes (Malaysia Airlines, TransAsia Airways, Germanwings,...) so I was surprised when I saw a time series plot of the number of airline crashes by year, which indicates that the annual number of airline crashes has been decreasing since 1993. The data show that the annual number of crashes in recent years is about half of the number from 20 years ago.

A flow diagram that visualizes airline crash data

The series plot appeared in Significance magazine as part of an interview with designer David McCandless, who produces infographics about data. McCandless has recently published a new book titled Knowledge is Beautiful, which includes the following infographic (click to enlarge). The time series plot appears in the upper left. The main portion of the infographic presents information about the causes of commercial air crashes from 1993–2013. In addition to the cause of the crash, McCandless shows the "phase of flight" (take-off, en route, landing) during which the disaster occurred.

[Figure: aircrashOrig]

In general, infographics are designed to appeal as well as to inform. In the article, McCandless says that he is trying to show "the links, connections and relations between things." His goal is to "mediate between the data and the audience" so that people "pay attention" to the data. Many statisticians share that goal, with a possible difference being that statistical graphics attempt to objectively represent the data so that the data speak for themselves.

McCandless's goal is to create beautiful infographics (and he has succeeded!), but I would like to propose an alternative visualization of these data that trades beauty for compactness.

I have two main issues with McCandless's "flow diagram," which is reminiscent of a Sankey diagram. The first is that it uses a lot of ink to display the relationship between "Cause" and "Phase," which are each five-level categorical variables. A standard frequency analysis would display these data in a 5x5 table or visualize them with a tile-based plot with 25 tiles. I don't think that Edward Tufte would approve of McCandless's plot. Unlike Tufte's favorite plot—Minard's "Napoleon’s March to Moscow", which uses a similar design—there is no temporal or geographic change for these data. The "flow" through the various "pipes" merely represents conditional frequencies.

My second objection is simply that some of the blocks in the diagram are the wrong sizes. The area of the blocks is supposed to be proportional to the number of crashes in each category. Notice that the bottom row is properly scaled: If you stack the "take-off" and "en route" boxes, which account for 50% of the cases, they approximately equal the area of the "landing" box. However, some boxes are not proportional to the frequency that they represent:

  • The top block represents 100% of the crashes, but its area is too big compared to the cumulative areas of the blocks in the "Causes" or "Phase" row.
  • The "Human Error" block (48%) is too big within its row. The other blocks on the "Causes" row account for 51% of the cases, so the sum of their areas should be greater than the area of the "Human Error" block.
  • The "Human Error" block (48%) is too big across rows. It is much larger than the "Landing" block in the bottom row, which represents 49% of the cases.

Visualizing airline crashes with a mosaic plot

At the risk of building a glass house after hurling those stones, I suggest that a mosaic plot is a standard statistical graphic that can display the same air crash data. You can download the data from McCandless's web site. The data need a little cleansing before you can analyze them, and I have provided my SAS program that creates the mosaic plot and other related plots. McCandless's diagram uses data through mid-2013, whereas the mosaic plot uses data through March 2015.

Many SAS procedures use ODS statistical graphics to display graphs as readily as they display tables. For example, the following call to PROC FREQ creates a 5x5 frequency table (click to see the table) and a mosaic plot:

proc freq data=Crash order=data;
tables Cause*Phase / plots=(mosaic(square));
run;
[Figure: aircrashMosaic]

In a mosaic plot, the area of each tile is proportional to the number of cases. I have ordered both variables according to frequency, whereas McCandless ordered the "Phase" variable temporally (standing, take-off, en route, landing). I did this so that the most important causes and phases appear in the first row (bottom) and first column (left). The relative widths of the columns indicate the relative frequency of the one-way frequencies for the "Phase" variable. Colors connect the categories for the "Cause" variable.

The mosaic plot is far from perfect. It is hard to visually trace the categories for the "Cause" variable. For example, finding the "mechanical" category for each flight phase requires some eyeball gymnastics. Also, although the mosaic plot makes it possible to visualize the proportions of crashes for each cause and phase, the absolute frequencies are not available on the graph by default. The frequency counts are available in the 5x5 table that PROC FREQ produces. Alternatively, with additional effort you can create a mosaic plot that includes frequency counts for the largest mosaic tiles, which I think is far superior to the default display.

What do you think? Do you prefer the mosaic plot that is produced automatically by PROC FREQ, the one with frequency counts, or do you prefer McCandless's flow diagram? Which helps you to understand the causes of airline crashes? Feel free to adapt my SAS program and create your own alternative visualization!

Addendum

(April 13, 2015) Several visualization experts wrote in with alternate suggestions and comments.

Antony Unwin says that he prefers to visualize these data by using multiple bar charts. I also like multiple bar charts, which show the conditional distribution of cause given the flight phase. In SAS, you can use the SGPANEL procedure to produce a panel of bar charts (click to enlarge), as follows, although I might choose to combine the "unknown" and "standing" phases to save room:

ods graphics / reset width=600px;
proc sort data=Crash;
by descending CauseN PhaseN ;
run;
proc sgpanel data=Crash;
panelby Phase /layout=rowlattice columns=1 novarname 
               sort=data rowheaderpos=right onepanel;
vbar Cause / stat=percent;
rowaxis discreteorder=data grid;
run;
[Figure: aircrashseries3]

Michael Friendly noted that although mosaic plots are useful for visualizing cross-tabulations, they also are useful for evaluating various statistical models. For example, a statistician might want to assess whether there is an association between flight phase and the cause of an airline crash. As I have written before, you can color the cells in a mosaic plot according to the residuals in a model, thereby visualizing how the observed data deviate from the model. For more about using mosaic plots in categorical models, see Friendly (1999).

Xan Gregg pointed me to an article by Stephen Few in which Few argues that an innovative design is not necessarily a better design. (Be sure to read the comments!) Few is more critical of McCandless than I am, but I agree that a different design is not necessarily a better design. If a standard statistical graphic can display the data well, then I prefer to use it rather than to design something new and more complex. My main goal is to communicate as simply as possible. However, I have no problem with artists who choose to create data-inspired art, especially if they convince the public that data are beautiful.


On the number of permutations supported in SAS software

There's "big," and then there is "factorial big."

If you have k items, the number of permutations is "k factorial," which is written as k!. The factorial function gets big fast. For example, the value of k! for several values of k is shown in the following table. You can use a custom algorithm to compute accurate, extended-precision values of the factorial function for k>18.

[Table: t_maxperm]

Base SAS has several functions for working with permutations, including the ALLPERM function (which generates all permutations of k items) and the RANPERM function (which generates a random permutation of k items). The documentation for the ALLPERM function mentions that you can specify no more than 18 items. That is, it only enables you to compute all permutations when k ≤ 18.
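To illustrate, the following DATA step is a minimal sketch (the data set name and the items are arbitrary) that uses CALL ALLPERM to generate all 3! = 6 permutations of three items:

data AllPerms;
array c[3] $ 1 ('A' 'B' 'C');      /* three items to permute        */
do i = 1 to fact(3);               /* 3! = 6 permutations           */
   call allperm(i, of c[*]);       /* i enumerates the permutations */
   output;
end;
run;
proc print data=AllPerms noobs;
   var i c1-c3;
run;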

A SAS programmer recently asked why k=18 is the limit in these factorial computations. At first glance, 18 does seem to be an unusual choice. Why not k=16, since computer scientists love powers of 2? Or why not k=20, since most of us count in base 10?

The explanation is that k=18 is the largest value for which k! is smaller than the largest integer that can be stored exactly in an 8-byte numeric value. The first argument to the ALLPERM function is an index that enumerates the permutations, and this argument does not fit into an 8-byte integer when k > 18. You can use the CONSTANT function in SAS to find the value of the largest exactly representable 8-byte integer:

data _null_;
Fact18 = fact(18);
ExactInt=constant("exactint");
put Fact18=;
put ExactInt=;
run;
[Output: t_maxperm2]

From a programming perspective, having a limit also prevents us programmers from accidentally creating a loop that iterates a billion billion times or more (20! is about 2.4 billion billion). When dealing with permutations and combinations, implementing a smart algorithm is almost always preferable to a brute-force computation that iterates over every possible permutation.

Do you have any tales of woe from when you inadvertently tried to compute with a large combinatorial number? Leave a comment.


Vectors that have a fractional number of elements

The title of this article makes no sense. How can the number of elements (in fact, the number of anything!) not be a whole number? In fact, it can't. However, the title refers to the fact that you might compute a quantity that ought to be an integer, but is not an integer when computed by using floating-point arithmetic.

Most programmers are aware of examples of finite-precision numerical arithmetic that seem to violate mathematical truths. One of my favorite examples is that (0.1 + 0.1 + 0.1) ≠ 0.3 in finite precision arithmetic:

data _null_;
x = 0.1 + 0.1 + 0.1;
y = 0.3;
if x=y then put "x = y";
else put "x is not equal to y";
run;
[Output: t_fractvec]

This is a classic example from computer science that shows why you should avoid testing floating-point values for equality. Furthermore, if you add 0.1 repeatedly (say, ten times), the sum is not exactly an integer. This example is not unique, so you should use a rounding or truncation function such as ROUND, INT, FLOOR, or CEIL if you need to ensure that a floating-point computation results in an integer.
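For example, the following minimal sketch (the rounding unit 1e-9 is an arbitrary choice) shows that the earlier comparison succeeds after you round both quantities:

data _null_;
x = 0.1 + 0.1 + 0.1;
y = 0.3;
/* round both values to the nearest multiple of 1e-9 before comparing */
if round(x, 1e-9) = round(y, 1e-9) then put "x = y after rounding";
else put "x is not equal to y, even after rounding";
run;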

With that example in mind, let's return to the title of this blog post. A SAS programmer asked on the SAS Support Communities why a SAS/IML vector had a smaller number of elements than he had asked for. After looking at his computation, I realized that he had computed the number of elements by summing a series of floating-point values. Mathematically, the sum should have been 7, but in floating-point arithmetic the value was 6.99999.

So what happens if you ask PROC IML for a vector with 6.99999 elements? The software truncates the number, and you get a vector with 6 elements, as shown by the following program:

proc iml;
n = 6.99999;
v = j(n, 1);    /* allocate a vector with n elements */
print "The vector has " (nrow(v)) " elements";
[Output: t_fractvec]

A more accurate title for this article might be "What do you get if you ask for a vector that has a fractional number of elements?" The answer is that you get a vector where the number of elements is the truncation of the number that you specified. If you ever find yourself in this situation, you can use the ROUND function to round, rather than truncate, the non-integer value.

There are other operations in SAS (for example, subscripting an array) that similarly truncate non-integer values. However, there is no need to fear an "unexpected truncation" because most programmers do not sum up fractional values when they specify indices and dimensions. Of course, when you use integer-valued functions such as NROW, NCOL, and LENGTH to specify a dimension, then you get what you expect.

In closing, analyze this DATA step that intentionally uses fractional values to index into an array. Do you think it runs? What values do you think are assigned to the array? Is this example interesting or an abomination?

data A(drop=h);
array x[5];
/* index variable takes on fractional values */
do h = 1 to 5 by 0.5;  
  x[h] = h;
end;
run;
proc print; run;

Finding observations that match a target value

Imagine that you have one million rows of numerical data and you want to determine if a particular "target" value occurs. How might you find where the value occurs?

For univariate data, this is an easy problem. In the SAS DATA step you can use a WHERE clause or a subsetting IF statement to find the rows that contain the target value. In the SAS/IML language you can use the LOC function to find the rows.

For multivariate data, you can use a similar approach, except the code becomes messier. For example, suppose that you have a data set that contains five numeric variables and you want to test whether the 5-tuple (1, 2, 3, 4, 5) appears in the data. In the DATA step you might write the following statement:

if x1=1 & x2=2 & x3=3 & x4=4 & x5=5 then ...;
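For example, a minimal sketch of the complete step might use a subsetting IF, where Have is a hypothetical name for the input data set:

data Matches;
set Have;                                     /* input data with variables x1-x5      */
if x1=1 & x2=2 & x3=3 & x4=4 & x5=5;          /* keep only rows that match the target */
run;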

A general mathematical principle is that a "target problem" is equivalent to a root-finding problem. The standard mathematical trick is to subtract the target value and then find zeros of the resulting set of numbers. This trick provides the basis for writing a general SAS/IML program that is vectorized and that can handle an arbitrary number of variables. The following statements generate a matrix that has one million rows and five columns. Each element is an integer 0 through 9.

proc iml;
call randseed(123);
x = floor( 10*randfun({1e6 5}, "Uniform") ); /* matrix of values 0-9 */

Suppose that you want to find whether there is a row that matches the target value {1 2 3 4 5}. The following statements subtract the target value from the data and then find all rows for which the difference is the zero vector:

target = {1 2 3 4 5};
y = x - target;             /* match target ==> y={0 0...0}   */
count = (y=0)[,+];          /* count number of 0s in each row */
idx = loc(count=ncol(target));      /* rows that match target */
if ncol(idx)>0 then 
   print "Target pattern found in " (ncol(idx)) " rows";
else print "Target pattern not found";
[Output: t_matchtarget]

The variable idx contains the row numbers for which the pattern occurs. For this random integer matrix, the target pattern appeared four times. Other random matrices might have six, 12, or 15 rows that match the target. It is easy to show that in a random matrix with one million rows, the target value is expected to appear 10 times: each row matches the target with probability (1/10)^5 = 10^-5, so the expected number of matches is 10^6 × 10^-5 = 10.

The example in this article shows how to search a numerical matrix for rows that have a particular value. For character matrices, you can use the ROWCATC function in SAS/IML to concatenate the column values into a single vector of strings. You can then use the LOC function to find the rows that match the pattern.
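The following minimal sketch (with hypothetical data) illustrates the character case:

proc iml;
c = {"A" "B" "C",
     "X" "Y" "Z",
     "A" "B" "C"};                 /* small character matrix                     */
s = rowcatc(c);                    /* concatenate each row into a single string  */
idx = loc(s = "ABC");              /* rows that match the target pattern         */
print idx;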


How to pass parameters to a SAS program

This article shows how to run a SAS program in batch mode and send parameters into the program by specifying the parameters when you run SAS from a command line interface. This technique has many uses, one of which is to split a long-running SAS computation into a series of less time-intensive computations. There is a large class of problems that naturally divide into smaller pieces: programs in which the same computation is repeated many times for different values of parameters. A computation such as this is an example of an "embarrassingly parallel" computation.

To give a concrete example, suppose that you are running a simulation that involves generating many samples from the Beta(a,b) distribution over a wide range of (a,b) values, such as 0 < a ≤ 4 and 0 < b ≤ 4. You could write a single program that loops over a regular grid of (a,b) values, but here are some reasons that you might want to divide the computation into smaller pieces:

  • Some parameter values might require more iterations than others in order to obtain accurate estimates.
  • Certain parameter values might require a different estimation technique or an asymptotic approximation.
  • An error in one portion of your program (for example, when a parameter is zero) will not result in the loss of earlier computations.
  • You can begin analyzing the first set of results even while you are computing more results.
  • If you have access to multiple computers, you can submit different parameter values to each computer, thus achieving parallel processing with little effort and no cost.

With apologies to Neil Sedaka, breaking up (a program) is easy to do.

An example SAS program

A simple DATA step is sufficient to illustrate the process. It lacks the motivation (there is no actual need to break up the program), but it illustrates the main ideas in a simple way. Start the break-up process by using macro variables to replace the upper and lower limits of the parameters, as follows:

%let aMin = 0.1;         /* lower limit of parameter 'a' */
%let aMax = 4;           /* upper limit of parameter 'a' */
%let bMin = 0.1;         /* lower limit of parameter 'b' */
%let bMax = 4;           /* upper limit of parameter 'b' */
%let DSName = Out1;      /* name of the data set that contains results for these params */
 
libname dest ".";        /* put results in current working directory */
data dest.&DSName(keep = a b kurtosis);
do a = &aMin to &aMax by 0.1;                 /* loop over a */
   do b = &bMin to &bMax by 0.1;              /* loop over b */
      /* compute kurtosis of Beta(a,b) distribution */
      numer = 6*((a-b)**2 * (a+b+1)-a*b*(a+b+2));
      denom = a*b*(a+b+2)*(a+b+3);
      kurtosis = numer / denom;
      output;
   end;
end;
run;

The program computes the kurtosis of the Beta distribution for each (a,b) value on a regular grid 0.1 ≤ a ≤ 4 and 0.1 ≤ b ≤ 4. Notice that the DEST libref ensures that the results are stored in the current directory.

Breaking up the computation into smaller pieces

The %LET statements define the range of the a and b parameters and the name of the output data set. (You can also use this technique to write a simulation that accepts a seed value as a parameter to the RAND function.) I can use the values of the macro variables to divide the computation into a series of smaller computations. For example, I could divide the computation into the following four smaller computations:

  1. Compute the results on 0.1 ≤ a ≤ 2 and 0.1 ≤ b ≤ 2. Store these results in the data set Out1.
  2. Compute the results on 0.1 ≤ a ≤ 2 and 2.1 ≤ b ≤ 4. Store these results in Out2.
  3. Compute the results on 2.1 ≤ a ≤ 4 and 0.1 ≤ b ≤ 2. Store these results in Out3.
  4. Compute the results on 2.1 ≤ a ≤ 4 and 2.1 ≤ b ≤ 4. Store these results in Out4.

You could change the parameter values manually, but SAS provides a feature that makes it simple to specify parameter values on the command line. At the top of the program, you can replace the simple %LET statements with calls to the %SYSGET function, as follows:

/* get values of environment variables from SAS command line */
%let aMin = %sysget(aMin);
%let aMax = %sysget(aMax);
%let bMin = %sysget(bMin);
%let bMax = %sysget(bMax);
%let DSName = %sysget(DSName);

With this change, you can run the DATA step in batch mode and use the -SET option on the SAS command line to change the parameter values for each invocation. For example, if the program is stored in the file PassParams.sas then the first invocation from a Windows command prompt could be

> "C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" PassParams.sas 
    -set DSName "Out1" 
    -set aMin 0.1 -set aMax 2 -set bMin 0.1 -set bMax 2

When the SAS program runs, the %SYSGET function gets the values that you specified by using the -SET command line option. Notice that the options are specified as keyword/value pairs such as -set aMin 0.1. SAS runs the program in batch mode and creates a SAS data set Out1.sas7bdat in the current directory. In a similar way you can run the program three more times to create the data sets Out2, Out3, and Out4.
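For example, the second invocation, which covers the region 0.1 ≤ a ≤ 2 and 2.1 ≤ b ≤ 4 and stores its results in Out2, could be

> "C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" PassParams.sas 
    -set DSName "Out2" 
    -set aMin 0.1 -set aMax 2 -set bMin 2.1 -set bMax 4

The third and fourth invocations differ only in the parameter ranges and the data set names.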

After all the data sets are created, you can concatenate them together to form the complete data set, which contains results for the complete range of parameter values:

data All;
set dest.Out:;           /* use colon to specify Out1, Out2, Out3, and Out4 */
run;
 
/* analyze the data and/or make plots */
proc sgrender data=All template=ContourPlotParm;
dynamic _TITLE="Kurtosis of Beta(a,b) Distribution"
        _X="a" _Y="b" _Z="Result";
run;
[Figure: passparams]

The call to the SGRENDER procedure uses a GTL template from a previous article about how to create a contour plot in SAS. The graph shows that the kurtosis of the beta distribution is small when the values of a and b are approximately equal, but the kurtosis can be arbitrarily large when a or b are close to zero.

Although I demonstrated this technique for a simple DATA step, you can use the approach to control the parameters that are passed to any SAS procedures. For example, the following PROC IML statements might begin a long-running simulation in the SAS/IML language:

proc iml;
aGrid = do(&aMin, &aMax, 0.1); /* evenly spaced [aMin, aMax] */
bGrid = do(&bMin, &bMax, 0.1); /* evenly spaced [bMin, bMax] */

So next time you have a long-running SAS program that you want to run during lunch (or overnight), think about using the -SET command line option to specify parameter values that are passed to your program. Inside your program, use the %SYSGET function to receive the values. This technique enables you to run SAS programs in batch mode and pass in program parameters. It also enables you to break up a long-running program into smaller, less time-intensive programs.