The distribution of Pythagorean triples

pythagdistrib

When I studied high school geometry, I noticed that many homework problems involved right triangles whose side lengths were integers. The canonical example is the 3-4-5 right triangle, which has legs of length 3 and 4 and a hypotenuse of length 5. The triple (3, 4, 5) is called a Pythagorean triple because it satisfies the Pythagorean theorem: 32 + 42 = 52. Similarly, (5, 12, 13) and (7, 24, 25) are Pythagorean triples that sometimes appear in geometry textbooks.

In general, a triple of natural numbers (a, b, c) is a Pythagorean triple if a2 + b2 = c2. Obviously if (a, b, c) is a Pythagorean triple, then so is (ka, kb, kc) for any natural number k. Geometrically, this transformation is not very interesting because it simply scales a right triangle into a similar triangle that has the same shape. I did not learn any trick in high school that enabled me to start with a Pythagorean triple and generate a new, geometrically different, right triangle.

Generating Pythagorean triples

Because my high school geometry book reused the same three or four triples over and over again, I assumed that it is difficult to generate Pythagorean triples. In fact, that is not the case: There are several formulas for generating Pythagorean triples, some of which were known to the ancient Greeks.

One interesting algorithm comes from the 20th-century and involves linear transformations via matrices. Here's how it works. Write any Pythagorean triple, v, as a row vector. Then the linear transformation v*L is a new Pythagorean triple, where L is any of the three following linear transformations:

pythagxform

For example, if you represent v = (3, 4, 5) as a row vector, then v*L1 = (5, 12, 13). You can multiply on the right again to get another new triple, such as v*L1*L3 = (45, 28, 53).

In fact, more is true. A primitive Pythagorean triple is a triple such that the three numbers have no nontrivial divisors, that is, they are relatively prime. It turns out that all primitive Pythagorean triples can be obtained by iteratively applying one of three linear transformations to the triple (3, 4, 5). In other words, the triple (3, 4, 5) is the "parent" of all primitive Pythagorean triples!

That is a fabulous result. It means that you can write a program that uses matrix multiplication to produce arbitrarily many primitive Pythagorean triples, as follows:

  • Start with v = (3, 4, 5).
  • Apply L1, L2, and L3 to create the first generation of children.
  • Apply the three transformations to each of the children to create a generation of nine grandchildren.
  • Continue this process. The ith generation will contain 3i triples.

You can implement this algorithm in the SAS/IML matrix language as follows:

proc iml;
start GenerateTriples(depth=3);
   L1 = { 1  2  2,
         -2 -1 -2, 
          2  2  3};
   L2 = { 1  2  2,
          2  1  2,
          2  2  3};
   L3 = {-1 -2 -2,
          2  1  2,
          2  2  3};
   NumTriples = sum(3##(0:Depth));  /* total number of triples */
   triples = j(NumTriples,3,.);     /* allocate array for results */
   triples[1,] = {3 4 5};           /* parent triple */
   k = 1; n = 2;
   do i = 1 to Depth;               /* for each generation */
      do j = 1 to 3##(i-1);         /* generate 3##i children */
         x = triples[k,];
         triples[n,]   = x*L1;
         triples[n+1,] = x*L2; 
         triples[n+2,] = x*L3; 
         k = k + 1;   n = n + 3;
      end;
   end;
   return( triples );
finish;
 
p = GenerateTriples();
print p;
t_pythag

The parent and the first three generations account for 1 + 3 + 9 + 27 = 40 triples. The list contains some old familiar friends, as well as a few strangers that I haven't met before.

The distribution of Pythagorean triples

What do the Pythagorean triples look like if you plot their location in a Cartesian coordinate system? It turns out that they make fascinating patterns!

Because the hypotenuse length is a function of the leg lengths, it suffices to plot the locations of a and b in the coordinate plane. However, it turns out that when the algorithm produces a particular triple such as (5, 12, 13), it does not produce the symmetric triple (12, 5, 13). To make the pattern more symmetric, we can to manually add the triple (b, a, c) whenever the triple (a, b, c) appears.

The following SAS/IML statements construct many primitive Pythagorean triples and create a scatter plot of the ones for which a and b are both less than 4,500:

p = GenerateTriples(15);                  /* generate many triples */
m = p[ loc(p[,1]<=4500 & p[,2]<=4500), ]; /* exclude large ones */
mm = m // m[,{2 1 3}];                    /* generate symmetric pairs */
 
ods graphics / width=600px height=600px;
title "Primitive Pythagorean Triples";
title2 "Depth = 15, Symmetric";
call scatter(mm[,1], mm[,2]) procopt="aspect=1"  
     option="markerattrs=(size=3 symbol=SquareFilled)";

The image is shown at the beginning of this article. You can see certain parabolic curves that have a high density of points. There is a similar plot in Wikipedia that includes nonprimitive triples. The nonprimitive triples "fill in" some of the apparent "gaps" in the pattern.

These interesting patterns hint at deep relationships in number theory between Diophantine equations and the geometry of the solutions. Are you tempted to generate a similar image for the generalized equation a3 + b3 = c3? Don't bother: Fermat's Last Theorem states that this equation has no integer solutions!

Post a Comment

An efficient way to increment a matrix diagonal

I was recently asked about how to use the SAS/IML language to efficiently add a constant to every element of a matrix diagonal. Mathematically, the task is to form the matrix sum A + kI, where A is an n x n matrix, k is a scalar value, and I is the identity matrix. The person who asked the question was dealing with large matrices. He knew that explicitly forming the identity matrix is inefficient and that adding n2n zeros to A is unnecessary, but he wasn't sure how to avoid it.

I have previously blogged about how to assign the diagonal elements of a matrix. In that post I wrote a little helper function that will set the diagonal of a matrix to a given vector or scalar value. The function is repeated below:

proc iml;
start SetDiag(A, v);
   diagIdx = do(1,nrow(A)*ncol(A), ncol(A)+1); /* index diagonal elements */
   A[diagIdx] = v;                             /* set diagonal elements */
finish;

This function sets the diagonal elements of the matrix A to the value v, which can be a vector or a scalar. To increment (rather than replace) the diagonal, use the VECDIAG function to extract the current diagonal values of the matrix. Add a constant and call the function like this:

m = j(5000, 5000, 1);              /* create a 5000 x 5000 matrix */
run SetDiag(m, vecdiag(m) + 100);  /* increment diagonal elements by 100 */

Timing the performance

matrixdiag

Recall that it is easy to use the TIME function in SAS to compare the performance of two competing algorithms. The plot at the left shows the time required to increment a matrix diagonal for a sequence of large matrices, up to 15,000 x 15,000. The blue line represents the naive computation
A + kI. Notice that the time increases quadratically with the size of the matrix because this algorithm adds n2 numbers to the matrix.

In contrast, the red line shows the time that it takes to form a vector that indexes the diagonal elements of a matrix and to replace only those elements. Theoretically, the time should be linear with the size of the matrix, but the computation is so fast that the elapsed time registers as instantaneous for even the largest matrix.

In summary, when working with large matrices, you can save lots of computational effort if you avoid unnecessary computations. Never multiply or add large diagonal matrices unless absolutely necessary. Instead, think about whether it is possible to index into the matrix and change only those numbers that you need to change.

By the way, increasing the diagonal of a matrix is sometimes called "ridging a matrix" and it is used in ridge regression. Ridging a matrix is also sometimes used in numerical computations that involve estimated covariance matrices. Sometimes an estimate is not positive definite, and ridging is one approach to try to obtain an estimate for a covariance matrix that has this important property.

Post a Comment

The distribution of blood types by country

My colleague Robert Allison has a knack for finding fascinating data. Last week he did it again by locating data about how blood types and Rh factors vary among countries.

He produced a series of eight world maps, each showing the prevalence of a blood type (A+, A-, B+, B-, AB+, AB-, O+, and O-) in various countries around the world. As I studied his maps, I noticed that the distribution of blood types in certain ethnic groups (Chinese, Japanese, Indians,...) was different than the distribution in Western Europe and former European colonies.

When dealing with multivariate data, a single visualization is rarely sufficient to answer all the questions that you can ask. Robert's maps answer the question, "What is the spatial distribution of each blood type?" I was curious about a different question: "Within each country, what is the distribution of blood types?" To answer my question, I needed a different visualization of the same multivariate data.

bloodtypes

My attempt is shown to the left. (Click to enlarge.) The graph is a stacked bar chart of the percentage of blood type for 63 countries, sorted by the percentage of types that have positive Rh factors. Blood types with positive Rh factors are plotted on the right; negative Rh factors are plotted on the left. The A+ and A- types are plotted closest to the 0% reference line. The next types are AB, B, and O, in increasing distances from the 0% reference line.

A few ethnic differences are apparent. At the top of the chart are Western European countries and former European colonies such as Brazil, Australia, and New Zealand. A little lower on the list are countries in Eastern Europe and Scandinavia.

After that, the list starts to get geographically jumbled. The United States, Canada, and South Africa were all settled by people of multiple ethnicities. The middle of the list is dominated by countries from the Middle East, Northern Africa, and the Near East.

The next set of countries include South American countries such as Argentina and Bolivia, Caribbean countries, India, and African countries.

Finally, the bottom of the list features Asian populations such as China, Japan, Mongolia, and the Philippines. These populations have almost no negative Rh factors in their blood. The distribution of blood types in those countries are similar to each other, although regional differences appear, such as the relatively small number of A+ blood in Thailand.

This one dimensional ranking of countries by blood types reflects historical connections between peoples as a result of conquest, trade, and colonization.

A few countries seem "out of place" in their list order. Lebanon, Ireland and Iceland, and Peru and Chile, are some of the countries whose distribution of blood types differ from those adjacent to them in the list.

Relationships between countries

Some of the "out of place" countries are probably a result of the fact that it is hard to linearly order the countries when there are eight variables to consider. Principal component analysis (PCA) is a statistical technique that can group observations according to similar characterisics. In SAS software, you can use the PRINCOMP procedure to conduct a principal component analysis.

The analysis reveals that 81% of the variation in the data can be explained by the first two principal components. About 92% can be explained by using three principal components, which means that the eight variables (percentages of each blood type) fit well into these lower-dimensional linear subspaces.

bloodtypes2

The score plot from a two-dimensional PCA analysis is shown to the left. (Click to enlarge.) I added colors to the data to indicate a geographical region for the countries; the regions came from the United Nations list of countries and geographic regions. This plot shows the relationships between countries based on similarities in the distribution of blood types.

The middle of the plot contains African and West Asian nations. (West Asia is the UN name for the region that many people call the Middle East.) The right side of the plot is dominated by European countries and their former colonies. The upper left quadrant contains the Asian countries. The lower left quadrant includes Caribbean, Central American, and South American countries. This presentation once again shows that the distribution of blood types in Peru and Chile are different from other countries, but are similar to each other.

You can download the data and the SAS program that analyzes it and do additional analyses.

What interesting features can you find in these data? Are there other ways to view these data? Leave a comment.

Post a Comment

Binning data by quantiles? Beware of rounded data

In my article about how to create a quantile plot, I chose not to discuss a theoretical issue that occasionally occurs. The issue is that for discrete data (which includes rounded values), it might be impossible to use quantile values to split the data into k groups where each group has approximately the same number of elements. To put it bluntly, the algorithm that I proposed for creating a quantile bin plot can fail for large values of k.

The problem can occur for data that are rounded to the nearest unit, such as age, height, weight, and blood pressure. The algorithm assumes that the data quantiles are (mostly) unique and that they divide the empirical cumulative distribution function (ECDF) of N observations into k groups that each have approximately N/k observations. However, if there are more than N/k repeated values, the repeated value can occupy more than one quantile value. In fact, this will always happen if a particular value is repeated more than 2N/k times.

For example, suppose that you want to divide the diastolic blood pressure data in the Sashelp.Heart data set into 10 groups that each have approximately 10% of the data. The following statements draw the ECDF for the diastolic measurements and mark the deciles:

proc univariate data=Sashelp.Heart;
   var diastolic;
   cdfplot diastolic / vref=(10 to 90 by 10);
   ods select CDFPlot Quantiles;
run;
QuantCDFPlot

The value 80 appears in 711 of the 5209 observations, which is about 14% of the data. Notice that the value 80 intersects two reference lines, one at 30 and the other at 40. This means that 80 is both the 30th percentile and the 40th percentile. Assuming that you do not want to split the value 80 across two bins, the deciles of the data produce at most nine bins.

Another way to see this graphically is to use the RANK procedure to try to group the data into 10 groups, as described in the article "Grouping observations based on quantiles." The following statements create a new variable called Group, which for continuous data would have the values 0–9. However, for this rounded data only nine groups exist:

proc sort data=Sashelp.Heart out=Heart;
   by Diastolic;
run;
proc rank data=Heart out=Heart groups=10 ties=high;
   var Diastolic;        /* variable on which to group */
   ranks Group;          /* name of variable to contain groups 0,1,...,k-1 */
run;
proc sgplot data=Heart;
   scatter x=Group y=Diastolic / group=Group;
   refline 80 / axis=y;
   xaxis values=(0 to 9);
run;
QuantRankPlot

As shown by the legend on the scatter plot, only nine groups were created, instead of the 10 that were asked for. Group 3 was not created.

I've described a problem, but what can you do about it? Not a lot. The problem is in the data. If you are flexible about the number of groups, you can try decreasing the value of k. Because each group contains approximately N/k observations, decreasing k might circumvent the problem. Another possibility is to jitter the data by adding a small amount of random noise. That will eliminate the problem of repeated values.

So I'll leave you with this warning: beware of using quantiles to bin rounded data into groups. Although the technique works great when almost all of the data values are distinct, you can run into problems if you ask for many bins and your data contain many repeated values.

Post a Comment

Recent additions to the SAS/IML file exchange

It has been three months since the introduction of the SAS/IML File Exchange, so I thought I'd give a short update on recent submissions and activity.

  • The web site administrators have created an easy-to-remember shortcut to the File Exchange:
    http://communities.sas.com/sas-iml-file-exchange
  • There have been eight new articles submitted to the File Exchange. The application areas include
    • experimental design
    • discretization of a continuous variable
    • generalized linear models for complex sample data
    • multiple imputations
    • quality improvement

What I love about the SAS/IML File Exchange is that these programs were written by experts in their subject areas. Some programs are based on research that has been presented at conferences or in professional journals. Others are shorter examples and fun programs, such as Ian Wakeling's Sudoku Puzzle Generator! Together, they hint at the wide range of applications to which matrix computations and the SAS/IML language applies.

The SAS/IML file exchange is a valuable resource, but it relies on contributions from SAS/IML programmers who want to share their knowledge and programs. Have you written an interesting SAS/IML program? See the article "How to Contribute to the SAS/IML File Exchange" to find out how easy it is to share your work with others.

If you are more of a consumer than a creator of programs, the File Exchange still needs you. Download and try out the programs that are there. Use the interactive features of the SAS Community to rate the programs, to ask questions, to submit bug reports, or just to say thanks.

The community would love to hear from you. What are you waiting for?

Post a Comment

The next power of 2 and other tricks with logarithms

The other day I was doing some computations that caused me to wonder, "What is the smallest power of 2 that is greater than a given number?" The mathematics is straightforward. Given a number n, find the least value of k such that 2kn or, equivalently,
k ≥ log2(n).

It is easy to implement this formula in the SAS DATA step or in the SAS/IML language. The LOG2 function computes the base-2 logarithm of a number, and you can use the CEIL function to return the next-higher integer. With a little effort, you can even return a missing value if n is not a positive number. Here is a SAS/IML function that implements the formula:

proc iml;
start NextPow2(n);
   p = j(nrow(n), ncol(n), .);         /* allocate array to missing values */
   idx = loc(n>0);                     /* identify valid inputs */
   if ncol(idx)>0 then 
      p[idx] = 2##ceil(log2(n[idx]));  /* apply formula to valid inputs */
   return( p );
finish;
 
/* test the function on valid and invalid data */
y = {-2 0.01 0.124 0.49 2 4.5 50 8100}`;
pow2 = NextPow2(y);  /* next greater power of 2 */
k = log2(pow2);      /* just get the exponent   */
print y pow2[format=FRACT.] k;
t_pow2

Notice that the FRACTw.d format enables you to format decimals as fractions in SAS.

By changing the 2s to 10s, you can use the same algorithm (and the LOG10 function) to find the next power of 10.

However, if you want to find the next power of an arbitrary number b, then you need to use a little logarithm trick. If b is the base of a logarithm, then logb(x) = log10(x) / log10(b). Because of this identity, the LOG10 function is the only function you need to compute logarithms for any base! By using this handy mathematical identity, you can generalize the NextPow2 function and write a function that computes the next power of b, where b is any positive number.

The following SAS/IML function computes the next highest power of b for a given set of input values:

/* given n, compute the next highest power of b */
start NextPow(n, b);
   p = j(nrow(n), ncol(n), .);         /* allocate array to missing values */
   idx = loc(n>0);                     /* identify valid inputs */
   if ncol(idx)>0 then do;
      x = n[idx];
      p[idx] = b##ceil(log10(x)/log10(b));
   end;
   return( p );
finish;
 
pow2 = NextPow(y,2);  /* next greater power of 2 */
pow3 = NextPow(y,3);  /* next greater power of 3 */
pow5 = NextPow(y,5);  /* next greater power of 5 */
print y pow2[format=FRACT.] pow3[format=FRACT.] pow5[format=FRACT.];
t_powb

I think this TRICK is a real TREAT!

Post a Comment

How to use frequency analysis to crack the Cryptoquote puzzle

Many people enjoy solving word games such as the daily Cryptoquote puzzle, which uses a simple substitution cipher to disguise a witty or wise quote by a famous person. A common way to attack the puzzle is frequency analysis. In frequency analysis you identify letters and pairs of letters (bigrams) that occur often in the enciphered text. It is a fact that certain letters (e, t, a, o, i, n,....) and bigrams (th, he, in, er, an, re, on,...) appear frequently in English text. You can use this fact to guess that the most frequently occurring symbol in the text might represent e, t, or a. You can use the bigram frequencies in the text to help you make statistically likely guesses.

Usually frequency analysis is used for the initial guesses. Soon recognizable snippets of words begin to appear in the partially deciphered text, and you can often guess small words such as 'the', 'and', 'of', and 'for'. Then, like pulling a string on a sweater, the puzzle unravels. Use the context of the message to guess the remaining letters.

This article shows how to use frequency analysis to solve a cryptogram. I use a computer program to count the frequency of letters and bigrams and to quickly substitute guessed letters into the puzzle. However, the computer program merely provides an implementation of a tedious and repetitive task. You can perform the same analysis with pencil and paper. The program does not make any guesses itself, but merely makes it easier for a human to make a guess. (Type 'cryptogram solver' into an internet seach engine if you want an online solver.)

I use the convention that capital letters are symbols for the enciphered text whereas lowercase letters indicate the plaintext. Thus the plaintext message "word games are fun" might be enciphered as "RHES ADLUI DEU VYP." A partially deciphered solution might appear as "RHrS AaLeI are VYn," where the uppercase letters represent yet-to-be-solved letters.

Some functions that help decipher Cryptoquotes

In this blog I usually discuss statistical programming in the SAS/IML language. This post uses statistics and programming, but the discussion is about deciphering the Cryptoquote. The program that I use in this article is available for download for SAS users who have access to SAS/IML software. I wrote the following SAS/IML functions:

  • The PrintFreqRef routine prints reference tables that show how often certain letters and pairs of letters appear in English text. The tables are the results from previous blog posts and show the most frequently appearing letters, the most frequent bigrams, and the most frequent double-letter bigrams.
  • The FreqAnalysis routine reports the distribution of letters, bigrams, and double letters in the enciphered quote.
  • The SplitByWords function constructs an array from a quote. Each word is on one row; each letter is in a separate column. The main purpose of this routine is so that I can say "look at the 7th word" and you will be able to find it easily.
  • The ApplySubs function enables you to quickly see the result of replacing one or more cipher symbols with a guess. You can use the function to replace 'Q' with 't' and 'F' with 'e' and see whether the substitutions appear to make sense in the quote. If the substitution results in nonsense—such as 'te' as a two-letter word—then the substitution was not good and you can try a different guess.

Solving the Cryptoquote by using frequency analysis

The following statements start PROC IML, load the utility modules, and print the reference tables that show the expected frequencies of letters and bigrams in English text.

proc iml;
load module=_all_;  /* load the utility modules */
run PrintFreqRef();
cryptofreqref

The reference tables show the relative frequencies of the most common letters, such as e, t, a, o, i, n, and s. The most frequent pairs of letters are th, he, in, er, and so forth. The most common double-letter bigrams are ll, ss, ee, oo, and so forth. If you like to solve word puzzles by pencil and paper, you can print this table and keep a copy in your wallet or purse. Click the table to get it to appear on a separate web page.

Now let's look at an enciphered quote and perform a frequency analysis on its symbols:

msg = "KBSCS OCS KPUSH XBSW DOCSWKBTTG HSSUH WTKBPWJ " +
      "IAK LSSGPWJ KBS UTAKB KBOK IPKSH MTA. " +
      "~ DSKSC GS FCPSH";
run FreqAnalysis(msg);
cryptofreq

These tables look similar to the reference tables, but these tables are specific to this quote. You can create these table by hand if you have the time and patience. In this quote, almost 20% of the symbols are an S. The next most frequent symbols are K and B. Furthermore, the bigrams KB and BS appear frequently. This co-occurrence of three symbols and two bigrams is a "trifecta" that gives strong statistical evidence that S=e, K=t, and B=h. The S=e guess is further supported by the fact that SS appears twice in the message, and 'ee' is a common double-letter bigram.

Let's make the substitutions:

w = SplitByWords(msg);                /* split into array */
rownum = char(1:nrow(w));             /* row numbers, for reference */
w1 = ApplySubs(w, {K B S}, {t h e});  /* guess K=t, B=h, and S=e */
print w1[r=rownum];
cryptoquote1

It looks like we made a great guess! The first word is probably 'there', so the next substitution should include C=r. This guess is bolstered by the fact that CS appears several times in this message, and 're' is a frequent bigram. (The first word could also be 'these', but 'se' occurs less often than 're'.)

If that guess is correct, then we can further guess that O=a because that makes the second word 'are' and the 12th word 'that'.

To generate another guess, notice that the double-letter bigram TT appears in this quote in the 5th word. The bigram follows the plaintext bigram 'th', so TT must represent a double vowel. Because we have already deduced the symbol for 'e', we guess that TT is 'oo', which is the only other common double-vowel bigram. Therefore T=o.

Lastly, the 16th word is the author's first name. Since C=r, we guess that the first name might be 'peter'. The following statement makes these substitutions:

w2 = ApplySubs(w1, {O C}, {a r});
w3 = ApplySubs(w2, {T D}, {o p});
print w3[r=rownum];
cryptoquote2

Frequency analysis can suggest additional guesses. The symbols H, P, and W appear frequently in the enciphered quote. We also see that SH (=eH) is a frequent bigram. From the bigram frequency analysis, we might guess that H=n or H=s. By looking at the partially deciphered quote, it looks like W=n because the 5th word is probably 'parenthood', so we will guess H=s.

By looking at where the symbol P appears in the quote, you might guess that P is a vowel. The remaining vowel that occurs frequently is 'i'. This guess makes sense because the trigram PWJ appears twice, and this trigram could be 'ing'.

The following statement make these substitutions.

w4 = ApplySubs(w3, {H W G P J}, {s n d i g});
print w4[r=rownum];

Click this link to see the partially deciphered text that results from these substitutions. At this point, frequency analysis has revealed all but seven of the symbols in the enciphered quote. The remaining symbols are easily deduced by looking at the partially deciphered words and guessing the remaining letters. The deciphered quote is:

There are times when parenthood seems nothing but feeding the mouth that bites you.
     ~ Peter de Vries

Frequency analysis enabled us to guess values for certain frequently occurring symbols. By correctly guessing these key symbols, the puzzle was solved. Notice that the frequency of bigrams was essential in this process. If you use only the frequency of single letters, you might be able to guess the 'e' and 't', but the other letters are indistinguishable by looking at the empirical frequencies. However, by using bigrams, you can guess more symbols. For example, in this puzzle the bigrams helped us to discover the symbols for 'h', 'o', 's', and 'n'.

Of course, this process is not always linear. Sometimes you will make a guess that turns out to be wrong. In that case, eliminate the wrong guess and try another potential substitution.

To conclude this article, I will leave you with another Cryptoquote. I will generate the frequency analysis for you, in case you want to solve the puzzle by hand. You can download the SAS/IML program that deciphers this quote. The program contains comments that explain how frequency analysis is used to solve the puzzle.

cryptofreq2
CG EKLVFX WLU MVXZG IGLIFG YH UKGSB IGPQ LR GNDGFFGWDG; YVU YH UKG XSEUPWDG UKGH KPTG UBPTGFGX RBLA UKG ILSWU CKGBG UKGH EUPBUGX.
     ~ KGWBH CPBX YGGDKGB

Post a Comment

Video: Using heat maps to visualize matrices

One of my presentations at SAS Global Forum 2014 was about the new heat map functions in SAS/IML 13.1. Over the summer I created a short video of my presentation, which gives an overview of visualizing matrices with heat maps, and describes how to choose colors for heat maps:



If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read articles about heat maps so that you can study the concepts and cut and paste examples, here are a few recent blog posts that are based on my SAS Global Forum presentation:

For a fun application of discrete heat maps, see my article about implementing a one-dimensional cellular automata in SAS.

Post a Comment

Does this kurtosis make my tail look fat?

What is kurtosis? What does negative or positive kurtosis mean, and why should you care? How do you compute kurtosis in SAS software?

It is not clear from the definition of kurtosis what (if anything) kurtosis tells us about the shape of a distribution, or why kurtosis is relevant to the practicing data analyst. Mathematically, the kurtosis of a distribution is defined in terms of the standardized fourth central moment. The estimate of kurtosis from a sample also involves a summation that accumulates the fourth powers of a standardized variable. In SAS software, the formula for the sample kurtosis is given in the "Simple Statistics" section of the documentation.

This article presents an intuitive explanation of kurtosis that is often true in practice.

What does kurtosis mean?

Let's compute the sample kurtosis for two variables in the Sashelp.Heart data set and see whether the kurtosis values tell us anything about the shape of the data distributions. The variables are AgeAtStart (the age of a patient when he or she entered the study) and Systolic (his or her systolic blood pressure). The following statements call the MEANS procedure to compute estimates for the mean, median, standard deviation, skewness, and kurtosis:

proc means data=sashelp.heart mean median std skew kurtosis;
var AgeAtStart Systolic;
run;
t_kurt1

The last column shows that the kurtosis of the AgeAtStart data is negative, whereas the kurtosis of the Systolic data is positive. What does that mean? Well, first recall that whereas the mean, median, and standard deviation are expressed in the same units as the data, kurtosis (like skewness) is a dimensionless quantity. The zero value corresponds to a standard reference situation. For kurtosis, the reference distribution is the normal distribution, which is defined to have a kurtosis of zero. (Technically, I am describing the excess kurtosis, since this is the value returned by statistical software packages. Researchers sometimes consider the full kurtosis, which is 3 more than the excess kurtosis.)

The interpretation of kurtosis in terms of the shape of the distribution can be tricky, but for many unimodal distributions the kurtosis compares the peak and tails of the distribution to the normal distribution:

  • A data distribution with negative kurtosis is often broader, flatter, and has thinner tails than the normal distribution.
  • A data distribution with positive kurtosis is often narrower at its peak and has fatter tails than the normal distribution.

Negative kurtosis for a sample indicates that the sample contains many observations that are a moderate distance from the center, but few outliers that are far from the center. The peak—if there is one—looks more like a butte or a rounded hilltop than a jutting spire. The canonical distribution that has negative kurtosis is the continuous uniform distribution, which has a kurtosis of –1.2.

In contrast, positive kurtosis indicates that many values will be near the center, but that you should also expect a certain proportion of values that are far from the center. A heavy-tailed distribution has large kurtosis. The canonical distribution that has a large positive kurtosis is the t distribution with a small number of degrees of freedom. Some heavy-tailed distributions have infinite kurtosis.

Graphically testing the kurtosis interpretation

You can run a little experiment to see how the shape of the AgeAtStart and Systolic distributions compares with the normal distribution. What would you expect to see if you overlay a normal curve on a histogram of those variables? For the AgeAtStart variable, which has negative kurtosis, you should observe a distribution that is "broad in the shoulders" and that has thin tails. For the Systolic variable, you should observe a distribution that has many observations clustered together, drops off faster than a normal distribution, and has at least one heavy tail. The following statements use PROC SGPLOT to overlay a histogram with a normal curve centered at the median:

ods graphics / width=330px height=250px;
proc sgplot data=sashelp.heart noautolegend;
histogram AgeAtStart / binwidth=5 binstart=30 showbins;
density AgeAtStart / type=normal(mu=43); /* center at median */
run;
 
proc sgplot data=sashelp.heart noautolegend;
histogram Systolic / binwidth=10 binstart=80 showbins;
density Systolic / type=normal(mu=132); /* center at median */
run;
kurt1

The histogram for the AgeAtStart variable shows that the distribution is broad and flat. It does not have a pronounced central peak, and there is substantial mass along most of the range of the data. The distribution appears to be bounded: Apparently the heart study did not accept any patients who were younger than 28 or older than 62. This implies that there are no tails for the population distribution. Overall, the AgeAtStart variable provides a prototypical example of a distribution that has negative kurtosis.


kurt2

The histogram of the Systolic variable has a narrow peak and a long right tail. There are many patients in the study with systolic blood pressure in the range 120–140. (The American Heart Association calls that range prehypertensive.) Compared to the normal distribution, there are relatively few observations in the ranges (80, 120) and (140, 170). However, the distribution has more extreme observations than would be expected for a normally distributed variable, so the distribution has a fat tail.

Computing kurtosis in SAS software

There are three SAS procedures that compute the kurtosis of data distributions. PROC MEANS was used earlier in this article. Use PROC UNIVARIATE when you want to fit a univariate model to the data. PROC IML includes the KURTOSIS function in SAS/IML 13.1, as shown in the following example:

proc iml;
varNames = {"AgeAtStart" "Systolic"};
use Sashelp.Heart;  read all var varNames into X;  close Sashelp.Heart;
 
kurt = kurtosis(X);
print kurt[colname=varNames];
t_kurt2

When the intuitive interpretation of kurtosis can fail

The relationship between kurtosis and the shape of the distribution are not mathematical rules, they are merely intuitive ideas that often hold true for unimodal distributions. If your distribution is multimodal or is a mixture of distributions, these intuitive rules might not apply.

There have been many articles about kurtosis and its interpretation. My favorite is an article by Balanda and MacGillivray (1988, TAS). They give examples of three different densities that each have the same kurtosis value but that look very different. They conclude that "because of the averaging process involved in its definition, a given value of [kurtosis] can correspond to several different distributional shapes" (p. 114).

They recommend that kurtosis be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. In particular, this definition implies that peakedness and tail weight are best viewed as components [emphasis mine] of kurtosis.... This definition is necessarily vague because the movement can be formalized in many ways" (p. 116). In other words, the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features.

The tail of the distribution is the most important contributor. Although Balanda and MacGillivray do not mention it, the kurtosis is a non-robust statistic that can be severely influenced by the value of a single outlier. For example, if you choose 999 observations from a normal distribution, the sample kurtosis will be close to 0. However, if you add a single observation that has the value 100, the sample kurtosis jumps to more than 800!

If you like pathological distributions and mathematically contrived examples of all the ways that kurtosis can fail to indicate the shape of a distribution, see Balanda and MacGillivray's article. However, the intuitive notions in this article hold true for many unimodal data distributions that arise in practice.

Post a Comment

Compute the log-determinant of an arbitrary matrix

A few years ago I wrote an article that shows how to compute the log-determinant of a covariance matrix in SAS. This computation is often required to evaluate a log-likelihood function. My algorithm used the ROOT function in SAS/IML to compute a Cholesky decomposition of the covariance matrix. The Cholesky decomposition exists only for symmetric positive definite matrices, which means that this implementation cannot be used to compute the log-determinant of an arbitrary square matrix.

For an arbitrary square matrix, the determinant can be negative or zero. The log function is undefined for nonpositive values, but the log of the absolute value of the determinant always exists. This computation, along with the sign of the determinant, enables you to reconstruct the determinant of any matrix.

SAS/IML 13.1 supports the LOGABSDET function, which computes the logarithm of the absolute value of the determinant for any square matrix. The function returns a vector with two elements: the first element is the logarithm, the second is the sign of the determinant.

In my earlier article I wrote:

The determinant of a Vandermonde matrix with consecutive integer elements increases super-factorially with the dimension of the matrix! For a 30 x 30 integer Vandermonde matrix, the determinant is too large to represent as a standard double-precision floating-point number.

A Vandermonde matrix is not symmetric, so let's construct a Vandermonde matrix and compute its log-determinant by using the LOGABSDET function. The following SAS/IML function computes a Vandermonde matrix as defined in Golub and van Loan. The Wikipedia article uses the transpose of this matrix as the definition.

proc iml;
   /* Given a vector v = {v1 v2 v3 ... vn}, this function returns 
      the Vandermonde matrix defined by
      [ 1    1    1    ...   1        ]
      [ v1   v2   v3   ...   vn       ]
      [ v1^2 v2^2 v3^2 ...   vn^2     ]
      [ ...  ...  ...  ...    ...     ]
      [ v1^{n-1}  ...  ...   vn^{n-1} ]
   */
start CreateVandermondeMatrix( _v );
   v = rowvec(_v);
   m = ncol(v);
   mVander = j(m,m,1);
   do i = 2 to m;
      mVander[i,] = mVander[i-1,] # v;
   end;
   return ( mVander );
finish;
 
v = 1:10;
m = CreateVandermondeMatrix(v);
logDet = logabsdet(m);
print logDet[colname={"log(det)" "sign(det)"}];
t_logabsdet

For this 1 x 10 matrix, the log-determinant is about 49, which means that the determinant is exp(49)  ≈ 1.8 x 1021. For the 30 x 30 nonsymmetric matrix generated from v = 1:30, the determinant is too large to be computed by using the DET function, but the LOGABSDET function has no problem computing the log-determinant, which is approximately 901.

Post a Comment