Does this kurtosis make my tail look fat?

What is kurtosis? What does negative or positive kurtosis mean, and why should you care? How do you compute kurtosis in SAS software?

It is not clear from the definition of kurtosis what (if anything) kurtosis tells us about the shape of a distribution, or why kurtosis is relevant to the practicing data analyst. Mathematically, the kurtosis of a distribution is defined in terms of the standardized fourth central moment. The estimate of kurtosis from a sample also involves a summation that accumulates the fourth powers of a standardized variable. In SAS software, the formula for the sample kurtosis is given in the "Simple Statistics" section of the documentation.

This article presents an intuitive explanation of kurtosis that is often true in practice.

What does kurtosis mean?

Let's compute the sample kurtosis for two variables in the Sashelp.Heart data set and see whether the kurtosis values tell us anything about the shape of the data distributions. The variables are AgeAtStart (the age of a patient when he or she entered the study) and Systolic (his or her systolic blood pressure). The following statements call the MEANS procedure to compute estimates for the mean, median, standard deviation, skewness, and kurtosis:

proc means data=sashelp.heart mean median std skew kurtosis;
var AgeAtStart Systolic;
run;
t_kurt1

The last column shows that the kurtosis of the AgeAtStart data is negative, whereas the kurtosis of the Systolic data is positive. What does that mean? Well, first recall that whereas the mean, median, and standard deviation are expressed in the same units as the data, kurtosis (like skewness) is a dimensionless quantity. The zero value corresponds to a standard reference situation. For kurtosis, the reference distribution is the normal distribution, which is defined to have a kurtosis of zero. (Technically, I am describing the excess kurtosis, since this is the value returned by statistical software packages. Researchers sometimes consider the full kurtosis, which is 3 more than the excess kurtosis.)

The interpretation of kurtosis in terms of the shape of the distribution can be tricky, but for many unimodal distributions the kurtosis compares the peak and tails of the distribution to the normal distribution:

  • A data distribution with negative kurtosis is broader, flatter, and has thinner tails than the normal distribution.
  • A data distribution with positive kurtosis is narrower at its peak and has fatter tails than the normal distribution.

Negative kurtosis for a sample indicates that the sample contains many observations that are a moderate distance from the center, but few outliers that are far from the center. The peak—if there is one—looks more like a butte or a rounded hilltop than a jutting spire. The canonical distribution that has negative kurtosis is the continuous uniform distribution, which has a kurtosis of –1.2.

In contrast, positive kurtosis indicates that many values will be near the center, but that you should also expect a certain proportion of values that are far from the center. A heavy-tailed distribution has large kurtosis. The canonical distribution that has a large positive kurtosis is the t distribution with a small number of degrees of freedom. Some heavy-tailed distributions have infinite kurtosis.

Graphically testing the kurtosis interpretation

You can run a little experiment to see how the shape of the AgeAtStart and Systolic distributions compares with the normal distribution. What would you expect to see if you overlay a normal curve on a histogram of those variables? For the AgeAtStart variable, which has negative kurtosis, you should observe a distribution that does not have a very pronounced peak, that is "broad in the shoulders," and that has thin tails. For the Systolic variable, you should observe a distribution that has a pronounced peak, drops off faster than a normal distribution, and has at least one heavy tail. The following statements use PROC SGPLOT to overlay a histogram with a normal curve centered at the median:

ods graphics / width=330px height=250px;
proc sgplot data=sashelp.heart noautolegend;
histogram AgeAtStart / binwidth=5 binstart=30 showbins;
density AgeAtStart / type=normal(mu=43); /* center at median */
run;
 
proc sgplot data=sashelp.heart noautolegend;
histogram Systolic / binwidth=10 binstart=80 showbins;
density Systolic / type=normal(mu=132); /* center at median */
run;
kurt1

The histogram for the AgeAtStart variable shows that the distribution is broad and flat. It does not have a pronounced central peak, and there is substantial mass along most of the range of the data. The distribution appears to be bounded: Apparently the heart study did not accept any patients who were younger than 28 or older than 62. This implies that there are no tails for the population distribution. Overall, the AgeAtStart variable provides a prototypical example of a distribution that has negative kurtosis.


kurt2

The histogram of the Systolic variable has a narrow peak and a long right tail. There are many patients in the study with systolic blood pressure in the range 120–140. (The American Heart Association calls that range prehypertensive.) Compared to the normal distribution, there are relatively few observations in the ranges (80, 120) and (140, 170). However, the distribution has more extreme observations than would be expected for a normally distributed variable, so the distribution has a fat tail.

Computing kurtosis in SAS software

There are three SAS procedures that compute the kurtosis of data distributions. PROC MEANS was used earlier in this article. Use PROC UNIVARIATE when you want to fit a univariate model to the data. PROC IML includes the KURTOSIS function in SAS/IML 13.1, as shown in the following example:

proc iml;
varNames = {"AgeAtStart" "Systolic"};
use Sashelp.Heart;  read all var varNames into X;  close Sashelp.Heart;
 
kurt = kurtosis(X);
print kurt[colname=varNames];
t_kurt2

When the intuitive interpretation of kurtosis can fail

The relationship between kurtosis and the shape of the distribution are not mathematical rules, they are merely intuitive ideas that often hold true for unimodal distributions. If your distribution is multimodal or is a mixture of distributions, these intuitive rules might not apply.

There have been many articles about kurtosis and its interpretation. My favorite is an article by Balanda and MacGillivray (1988, TAS). They give examples of three different densities that each have the same kurtosis value but that look very different. They conclude that "because of the averaging process involved in its definition, a given value of [kurtosis] can correspond to several different distributional shapes" (p. 114).

They recommend that kurtosis be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. In particular, this definition implies that peakedness and tail weight are best viewed as components [emphasis mine] of kurtosis.... This definition is necessarily vague because the movement can be formalized in many ways" (p. 116). In other words, the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features.

Although Balanda and MacGillivray do not mention it, it is also worth remembering that kurtosis is a non-robust statistic that can be severely influenced by the value of a single outlier. For example, if you choose 999 observations from a normal distribution, the sample kurtosis will be close to 0. However, if you add a single observation that has the value 100, the sample kurtosis jumps to more than 800!

If you like pathological distributions and mathematically contrived examples of all the ways that kurtosis can fail to indicate the shape of a distribution, see Balanda and MacGillivray's article. However, the intuitive notions in this article hold true for many unimodal data distributions that arise in practice.

Post a Comment

Compute the log-determinant of an arbitrary matrix

A few years ago I wrote an article that shows how to compute the log-determinant of a covariance matrix in SAS. This computation is often required to evaluate a log-likelihood function. My algorithm used the ROOT function in SAS/IML to compute a Cholesky decomposition of the covariance matrix. The Cholesky decomposition exists only for symmetric positive definite matrices, which means that this implementation cannot be used to compute the log-determinant of an arbitrary square matrix.

For an arbitrary square matrix, the determinant can be negative or zero. The log function is undefined for nonpositive values, but the log of the absolute value of the determinant always exists. This computation, along with the sign of the determinant, enables you to reconstruct the determinant of any matrix.

SAS/IML 13.1 supports the LOGABSDET function, which computes the logarithm of the absolute value of the determinant for any square matrix. The function returns a vector with two elements: the first element is the logarithm, the second is the sign of the determinant.

In my earlier article I wrote:

The determinant of a Vandermonde matrix with consecutive integer elements increases super-factorially with the dimension of the matrix! For a 30 x 30 integer Vandermonde matrix, the determinant is too large to represent as a standard double-precision floating-point number.

A Vandermonde matrix is not symmetric, so let's construct a Vandermonde matrix and compute its log-determinant by using the LOGABSDET function. The following SAS/IML function computes a Vandermonde matrix as defined in Golub and van Loan. The Wikipedia article uses the transpose of this matrix as the definition.

proc iml;
   /* Given a vector v = {v1 v2 v3 ... vn}, this function returns 
      the Vandermonde matrix defined by
      [ 1    1    1    ...   1        ]
      [ v1   v2   v3   ...   vn       ]
      [ v1^2 v2^2 v3^2 ...   vn^2     ]
      [ ...  ...  ...  ...    ...     ]
      [ v1^{n-1}  ...  ...   vn^{n-1} ]
   */
start CreateVandermondeMatrix( _v );
   v = rowvec(_v);
   m = ncol(v);
   mVander = j(m,m,1);
   do i = 2 to m;
      mVander[i,] = mVander[i-1,] # v;
   end;
   return ( mVander );
finish;
 
v = 1:10;
m = CreateVandermondeMatrix(v);
logDet = logabsdet(m);
print logDet[colname={"log(det)" "sign(det)"}];
t_logabsdet

For this 1 x 10 matrix, the log-determinant is about 49, which means that the determinant is exp(49)  ≈ 1.8 x 1021. For the 30 x 30 nonsymmetric matrix generated from v = 1:30, the determinant is too large to be computed by using the DET function, but the LOGABSDET function has no problem computing the log-determinant, which is approximately 901.

Post a Comment

Wolfram's Rule 30 in SAS

My previous blog post describes how to implement Conway's Game of Life by using the dynamically linked graphics in SAS/IML Studio. But the Game of Life is not the only kind of cellular automata. This article describes a system of cellular automata that is known as Wolfram's Rule 30. In this blog post I show how to compute and visualize this system in SAS by using traditional PROC IML and a simple heat map.

Wolfram's Rule 30

Conway's Game of Life is a set of rules for evolving cellular automata on a two-dimensional grid. However, one-dimensional automata are simpler to describe and to compute. A well-known one-dimensional example is Wolfram's Rule 30 (1983, Rev. Mod. Phys.). Consider a sequence of binary symbols, such as 0 and 1. An initial configuration determines the next generation by using the following two rules for each cell:

  1. If the cell and its right-hand neighbor are both 0, then the new value of the cell is the value of its left-hand neighbor.
  2. Otherwise, the new value of the cell is the opposite of the value of its left-hand neighbor.

The rules are often summarized by using three-term sequences that specify the value of the left, center, and right cells. The sequence 100 means that the center cell is 0, its left neighbor is 1, and its right neighbor is 0. According to the first rule, the value of the center cell for the next generation will be 1. The sequence 000 means that the center cell and both its neighbors are 0. According to the first rule, the value of the center cell for the next generation will be 0. All other sequences are governed by the second rule. For example, the sequence 010 implies that the center cell for the next generation will be 1, which is the opposite of the value of the left neighbor.

There are several ways that you can implement Rule 30 in a matrix language such as SAS/IML. The following SAS/IML program defines a function that returns the next generation of binary values when given the values for the current generation. As for the Game of Life, I define the leftmost and rightmost cells to be neighbors. The evolution function is vectorized, which means that no loops are used to compute a new generation from the previous generation. The second function computes each new generation as a row in a matrix. You can print that matrix or use the HEATDISC subroutine in SAS/IML 13.1 to create a black-and-white heat map of the result.

proc iml;
start Rule30Evolve(x);                     /* x is binary row vector */
   y = x[ncol(X)] || x || x[1];            /* extend x periodically */
   N = ncol(y);
   j = 2:N-1;    xj = y[,j];               /* the original cells */
   R = y[,j+1];  L = y[,j-1];              /* R=right neighbor; L=left neighbor */
   idxSame = loc(xj=0 & R=0);              /* x00 --> center cell is x */
   if ^IsEmpty(idxSame) then y[,idxSame+1] = L[,idxSame];      
   idxDiff = setdif(1:N-2, idxSame);       /* otherwise, center cell is ^x */
   if ^IsEmpty(idxDiff) then y[,idxDiff+1] = ^L[,idxDiff];
   return( y[,j] );
finish;
 
start Rule30(x0, Iters=ceil(ncol(x0)/2));
   N = ncol(x0);
   M = j(Iters,N,0);                       /* initialize array to 0 */
   M[1,] = x0;                             /* set initial configuration */
   do i = 2 to Iters;
      M[i,] = Rule30Evolve(M[i-1,]);       /* put i_th generation into i_th row */
   end;
   return( M );
finish;
 
N = 101;
x0 = j(1,N,0);                       /* initialize array to 0 */
x0[ceil(N/2)] = 1;                   /* set initial condition for 1st row */
M = Rule30(x0, N);
ods graphics / width=800px height=600px;
call heatmapdisc(M) displayoutlines=0 colorramp={White Black}
     title = "Wolfram's Rule 30 (N=101)";
rule30

The discrete heat map uses white to indicate cells that are 0 and black to indicate cells that are 1. The initial configuration has 100 cells of white and a single black cell in the middle of the array. That configuration is shown in the first row of the heat map. (Click to enlarge.) The second generation has three black cells in the middle, as shown in the second row. The central region of cells grows two cells larger for each generation. For the first 50 generations, the left boundary is always 001 and the right boundary alternates between 111 and 001. The pattern of the interior cells is not periodic, although certain interesting patterns appear, such as "martini glasses" of various sizes.

After 50 generations, the left and right edges begin to interact in a nontrivial way. Periodicity at the edges is replaced by irregular patterns.

Supposedly the evolution of the middle cell of the array generates nearly random 0-1 values. Mathematica, a computer software package created by Stephen Wolfram, uses the central column of a Rule-30 cellular automata as a random number generator. From a simple deterministic rule comes complex behavior that seems indistinguishable from randomness.

The heat map shows all generations at a single glance by stacking them as rows in a single image. This is different from the Game of Life visualization, which used animation to show how an initial configuration evolves from one generation to the next. You can also use animation to visualize the Rule-30 system. The following animated GIF shows the same 100 generations of an initial configuration that consists of a single black cell in the middle of a 101-cell array. This animation shows how that configuration evolves in time.

rule30

It is fun to play with Rule 30. To begin, I suggest an array of 21 cells that evolves for 20 generations. The resulting black-and-white heat map is small enough to study, and you can use the DISPLAYOUTLINES=1 option to better visualize which cells are white during the evolution. You can invent your own initial configuration or use random initial conditions, as shown below:

N = 21;
x0 = T( randfun(N, "Bernoulli", 4/N) );   /* random; expect 4 black cells */
M = Rule30(x0);
call heatmapdisc(M) displayoutlines=1 colorramp={White Black}
     title = "Wolfram's Rule 30 (N=21)";

Enjoy!

Post a Comment

Cellular automata and the Game of Life in SAS

A colleague jokingly teases me whenever I write a blog that demonstrates how to write fun and exciting programs by using SAS software. "Why do you get to have all the fun?" he mock-chides.

Today I'm ready to face his ribbing, because this article is about Conway's Game of Life and how to implement cellular automata in the SAS/IML language.

Cellular automata and Conway's Game of Life

Conway's Game of Life is a set of simple rules that give rise to beautiful regular and irregular patterns. The rules describes how cellular automata live or die based on the presence or absence of neighbors on a rectangular grid. This dynamical system is a "game" because it is fun to create some initial distribution of automata and watch the evolution of the system from that initial condition. Like many programmers, I wrap the edges of the grid so that the cells on the right and left sides of the grid are neighbors, and similarly for the cells on the top and bottom edges.

The following rules were used to evolve the automata:

  • Birth Rule: An organism is born into any empty cell that has exactly three living neighbors.
  • Survival Rule: An organism with either two or three neighbors survives from one generation to the next.
  • Death Rule: An organism with four or more neighbors dies from overcrowding. An organism with fewer than two neighbors dies from loneliness.

The Game of Life in SAS/IML Studio

Almost every beginning programmer has attempted to program the Game of Life. In a matrix language such as SAS/IML, the implementation of the rules are particularly easy and compact. I first implemented the Game of Life in SAS/IML Studio in 2001. My program is part of the standard set of demo programs that are distributed with SAS/IML Studio. To access the program from the SAS/IML Studio menu, select File ⇒ Open ⇒ File. Then click Go to Installation directory. Select the Programs and Demos folders, then open the file Life.sx.

When you run the program, you are prompted for the size of a square grid (I like 20 rows and columns) and told to click on the grid to create an initial distribution of the automata. (Hold down the SHIFT key while you click.) The following animated GIF shows 100 steps that evolve from a simple initial condition that contains an "exploder" pattern and a "glider" pattern.

lifeani

The interesting thing about this implementation in SAS/IML Studio is that I didn't program the animation at all. The "live" automata are observations that are selected in the grid. The program figures out which cells are alive and dead at the next generation and then simply tells the in-memory copy of the data to update the selected observations. The plot automatically updates to reflect the newly selected observations. To learn more about dynamically linked graphics and how to select observations in SAS/IML Studio, see my book Statistical Programming with SAS/IML Software, especially Chapters 6 and 10.

Recall that SAS/IML Studio is a free download and can be installed on a Windows PC and used by anyone who has a license for SAS/IML and SAS/STAT software. Unfortunately, SAS/IML Studio is not available for users of the free SAS University Edition. If you do not have access to SAS/IML Studio, here is a version of the Life.sx program. You can use the computational portions of the program in PROC IML, even though you cannot create the IMLPlus graphics.

Creating an animated GIF in SAS/IML Studio

When you run the program in SAS/IML Studio, the graphics are displayed on your PC monitor. So how did I create an animated GIF to embed in this blog post? It was surprisingly easy! In the IMLPlus program, the on-screen graph is controlled through a Java variable called plot. You can use the SaveToFile method to create a bitmap version of the graph. If you call that method within the loop that updates the graph for each new generation of automata, you get a sequence of bitmaps in some local directory, like this:

path = 'C:\Temp\AnimGif\Life\';       /* directory to store images */
plot.SaveToFile(path+"Life000.bmp");  /* save the initial configuration */
do i = 1 to nGenerations;             /* for each new generation */
   run update_generation( grid );     /* compute new configuration for automata */
   run Grid2Selection( dobj, grid );  /* update the selected observations */
   plot.SetTitleText("Game of Life: Generation = "+char(i,3));  /* update title */
   suffix = trim(putn(i, "Z3."));     /* 001, 002, ..., 100 */
   plot.SaveToFile(path+"Life"+suffix+".bmp"); /* save image to Life001.bmp */
end;

After creating 100 bitmap images in a directory, you can create an animated GIF by using a program that stitches the images together in sequence. I used a freeware Java program called GiftedMotion, which does not require you to install any software. Just download the JAR file to your desktop, double click, and navigate to the directory that contains the images.

Post a Comment

Fat-tailed and long-tailed distributions

The tail of a probability distribution is an important notion in probability and statistics, but did you know that there is not a rigorous definition for the "tail"? The term is primarily used intuitively to mean the part of a distribution that is far from the distribution's peak or center. There are several reasons why a formal definition of "tail" is challenging:

  • Where does the tail start? The tail is "far away from the mean," but how far away? For the standard normal distribution, should the tail start three standard deviations from the mean? Five?
  • Not all distributions are as well-behaved as the normal distribution. The Student t distribution with 1 degree of freedom (also called the Cauchy distribution) is the classic example of a distribution that does not have a well-defined mean or variance. Consequently, it is not always possible to define the tail in terms of some number of standard deviations away from the mean or median.
  • Does the uniform distribution on [0, 1] have a tail? Many people would say no. Thus not every distribution has a tail. In particular, if you want "the behavior of the tail" to describe the characteristics of the probability density function (PDF) when |x| gets large, then bounded distributions do not have tails. This example also shows that you can't merely use the 95th percentile to define the tail.

Nevertheless, some features of tails can be quantified. In particular, by using limits and asymptotic behavior you can define the notion of heavy tails. This article discusses heavy-tailed distribution and two important subclasses: the fat-tailed distributions and the long-tailed distributions.

Heavy-tailed distributions

When discussing how much mass is in the tail of a probability density function, it is convenient to use the exponential distribution as a reference. The PDF of the exponential distribution approaches zero exponentially fast. That is, the PDF looks like exp(-λx) for large values of x. Thus you can divide distributions into two categories according to the behavior of their PDFs for large values of |x|.

  • Probability distribution functions that decay faster than an exponential are called thin-tailed distributions. The canonical example of a thin-tailed distribution is the normal distribution, whose PDF decreases like exp(-x2/2) for large values of |x|. A thin-tailed distribution does not have much mass in the tail, so it serves as a model for situations in which extreme events are unlikely to occur.
  • Probability distribution functions that decay slower than an exponential are called heavy-tailed distributions. The canonical example of a heavy-tailed distribution is the t distribution. The tails of many heavy-tailed distributions follow a power law (like |x|–α) for large values of |x|. A heavy-tailed distribution has substantial mass in the tail, so it serves as a model for situations in which extreme events occur somewhat frequently.

Fat-tailed distributions

From a modeling perspective, fat-tailed distributions are important when extreme events must be part of the model. For example, models of claims in the home insurance industry have to account for an intense category 5 hurricane hitting the Miami area, or Superstorm Sandy flooding the New York area. Although the probability of these extreme events is small, the size of the payout is so large that these events are vital to models of insurance claims.

cauchypdf

From a mathematical perspective, the most fascinating aspect of fat-tailed distributions is that the mean, variance, and other measures that describe the shape of the distribution might not be defined. You might wonder how this can be. For the symmetric Cauchy distribution, pictured at the left, it looks like the mean should be at zero. After all, the function is symmetric. However, although the Cauchy distribution has a well-defined median and mode at 0, the mean is not defined.

Mathematically, the problem is that a certain integral does not exist. However, a more convincing demonstration is to run a simulation that draws random values from the Cauchy distribution and computes the mean as the sample size increases. If the mean exists, then the running mean should converge to some value. But it doesn't. Instead, every so often an arbitrarily large value is generated, which causes the mean to jump by a large amount. This is shown by the following simulation and visualization in SAS software:

proc iml;
call randseed(4321);
N = 1e6;                         /* number of random draws */
x = randfun(N, "Cauchy");        /* random draws from Cauchy distribution */
RunningMean = cusum(x)/ T(1:N);  /* mean for each sample size */
 
t = do(10000, N, 10000);         /* plot the mean only for these sample sizes */
RunningMean = RunningMean[t];
title "Running Mean of Random Cauchy Data";
ods graphics / width=800px height=400px;   /* size of image */
call scatter(t, RunningMean) other="refline 0 / axis=y;";
cauchyrunningmean

As an exercise, you can rerun the program for other distributions. Change the distribution family to "Normal" to see how quickly the running mean for the normal distribution converges to 0. Or change the family to "Exponential" and watch the running mean converge to 1.

Long-tailed distributions

Another kind of heavy-tailed distribution is the long-tailed distribution, which is used to model many internet-era phenomena such as the frequency distribution of book titles sold at Amazon.com or the frequency of internet search terms.

People submit billions of unique search terms to search engines. Imagine sorting all internet searches from a certain time period according to the number of times that the search term was submitted. The top-ranked terms will probably be for celebrities (Kim Kardashian, Katy Perry,...) along with terms related to world events such as wars, natural disasters, and politics. According to research on Google searches, the top-1000 search terms might account for 10% of the total searches. But there are many millions of search terms that appear only a few times, or perhaps only one time. How many search terms do you think account for 50% of the search traffic? 5,000? 10,000? The correct answer is much, much bigger! Author Chris Anderson, a former editor-in-chief at Wired magazine, says "if search were represented by a tiny lizard with a one-inch head, the tail of that lizard would stretch for 221 miles."

I see the same long-tailed distribution when I look at the number of views for my 500+ blog articles. The most popular 20 posts account for about 50% of the total views. But the frequency distribution of posts has a long tail, with about 180 posts getting 0.1% or more of the views, and about 300 posts getting 0.05% or more of the views. During a typical month, almost every article that I have ever written gets viewed at least once.

Attempts have been made to classify and rank distributions according to the rate of decay in their tails. For an introduction to this work, see Parzen (1979, JASA), but be sure to also read the rejoinder by John Tukey.

Summary

Although the location of the tail of a probability distribution is not rigorously defined, nevertheless a lot of research has been conducted on the asymptotic behavior of distributions. The normal and exponential distributions serve as convenient reference distributions. Distributions that have a lot of probability mass in their tails are known as heavy-tailed distributions. Two subclasses of heavy-tailed distributions are the fat-tailed distributions and the long-tailed distributions. Each is useful for modeling real-world phenomena.

Have you encountered data that exhibits a heavy tail? What was the application? How did you model the data?

Post a Comment

The frequency of double-letters in Cryptoquotes

It usually takes more than three weeks to prepare a good impromptu speech.
       --Mark Twain

In the popular Cryptoquote puzzle, you are presented with an enciphered version of a quote by a famous person. One of the appeals of the puzzle for me is reading the deciphered quote, such as the witty aphorism by Mark Twain at the top of this post. Last week I discussed the frequency of double-letter bigrams in a huge English corpus. I asserted that a frequency analysis of "double letters can help the cryptanalyst ... to solve simple substitution ciphers."

In practice, double letters might not appear at all in a short quote. Even if they do appear, is the frequency distribution of double letters in short quotes the same as in Peter Norvig's analysis of a huge 744-billion-word corpus? If a certain double-letter bigram is expected to appear 0.05% of the time in lengthy texts, how often might it appear in pithy phrases that contains about 100 letters? Obviously the bigram will appear zero times for many phrases. Should we expect it to ever appear three times within a 100-letter quote? Four times?

To find out, I collected 120 (unenciphered) quotes. Some I took from Cryptoquote puzzles that I have been saving. Others I took from puzzles that appear on the DailyCryptogram.com web site. A few I sampled from the BrainyQuote web site. You can download the program that defines the data and produces the analyses in this article.

The quotes had an average length of 97 characters (mean=96.8; standard deviation 14.5). The minimum quote length was 66 characters; the maximum was 127 characters. The characters include spaces and punctuation. The average quote contains 59 bigrams, of which 1.8 were double-letter bigrams. The Mark Twain quote at the top of this article is atypical in that this 87-character quote contains five double-letter bigrams (out of 55).

doublelettervar1

The heat map at the left shows the count for double-letter bigrams in 20 quotes that were chosen at random from the collection of 120 quotes. The vertical axis is ordered by the total count of bigrams. For example, the first row of the matrix says that the 97th quote in the data base has seven double letters: four are 'LL', two are 'OO', and one is 'NN'. The 51st quote has six double letters: three are 'LL' and the others are 'OO', 'PP', and 'RR'. Three of the quotes have only one double-letter combinations, and four have no double letters.

doublelettervar3

The bar chart at the left shows the distribution of counts for double-letter bigrams in the complete collection of 120 quotes. There are 21 quotes (18%) that do not contain a single double-letter. For these quotes, knowing the distribution of double-letter bigrams does not help to solve the puzzle. Most quotes contain one or two double-letters. For these puzzles you should probably guess that the double-letter bigram is a common one (LL, SS, EE, or OO) and see if your guess makes sense for the remainder of the puzzle. For quotes that have four or more double letters, it might be harder to guess which enciphered double-letter combinations corresponds to which common double-letter bigrams.

If you average the bigram counts, you can compute the proportion of times that 'AA', 'BB', 'CC', and so forth appear in the sample of 120 quotes. You can use statistics to compute a 95% confidence interval for the proportion of each double letter. The following graph (click to enlarge) displays the following information:

  • An estimate for the proportion of each double-letter bigram. The estimate is the number of times that the bigram appears in the sample, divided by 120 (the sample size). These estimates are shown in the graph by a circle.
  • A 95% confidence interval for each proportion, assuming that this sample of quotes is representative of the population of all quotes that have roughly 100 characters.
  • The bigram proportions from Norvig's analysis of the Google corpus. These proportions are joined by a blue line to make it easier for your eye to traverse from one bigram to the next.
doublelettervar2

In this sample of quotes, the LL, OO, and RR bigrams appeared more often than you might expect from their frequencies in the corpus. The FF and HH bigrams appeared less often than expected.

So what have we learned?

  • The heat map shows that there is considerable variation in the number and types of double-letter bigrams that appear in a short quote, such as is used in the Cryptoquote puzzle.
  • The bar chart shows that one or two double-letter bigrams are expected in a short Cryptoquote. However, about 18% of the quotes did not contain any double letters.
  • Although the frequency distribution of the double-letter bigrams was not identical to the distribution in longer texts, it is fairly close. For the purpose of solving word puzzles, you can safely use the frequency of double letters that I reported in my previous article.
Post a Comment

The name of a parameter in the parent environment

SAS/IML 13.1 includes a handy function for programmers who write a lot of modules. The PARENTNAME function obtains the name of the symbol that was passed in as a parameter to a user-defined module.

How is this useful? Well, suppose that you want to create a SAS/IML module that prints simple descriptive statistics about a data vector, similar to this call to PROC MEANS:

proc means nolabels data=sashelp.cars N mean std Q1 median Q3;
var MPG_City;
run;
t_parentname1

Look at the heading of the table. The heading displays the name of the numerical variable that the statistics summarize. If you want to display this table from within a SAS/IML module, the module somehow needs to discover the name of the data vector that was passed into the module. That is, it needs information from the parent (or calling) environment.

This problem was resolved in SAS/IML 13.1 by introducing the PARENTNAME function. The following SAS/IML statements create a module that prints descriptive statistics for a data vector. The module calls the PARENTNAME function to discover the name of the variable that was passed via the y parameter. The name is used to create a heading for the table:

proc iml;
/* create function that prints simple descriptive statistics */
start DescStats(y);            /* the local variable is "y" */
   varName = ParentName("y");  /* find the name of "y" in calling environment */
   stats = {"N" "Mean" "Std Dev" "Lower Quartile" "Median" "Upper Quartile"};
   N = countN(y);  mean = mean(y);  std = std(y);
   call qntl(quartiles, y, {0.25 0.5 0.75});
   print (N || mean || std || quartiles`)[label=varName colname=stats];
finish;
 
/* test module on a variable from Sashelp.Cars */
use sashelp.cars; read all var {"MPG_City"}; close sashelp.cars;
 
run DescStats( MPG_City );    /* the name of variable passed in is "MPG_City" */
t_parentname2

I think this is a very cool feature. The PARENTNAME function returns the name of the variable that was passed in as the y parameter. The call returns the string "MPG_City". That string is then used in the LABEL= option for the PRINT statement. The result is that the table displays the name of the MPG_City variable in the heading.

The PARENTNAME function is used internally by the modules that create standard statistical graphics in SAS/IML. For example, the following statement creates a histogram of the MPG_City variable and labels the horizontal axis with the name of the variable:

call histogram( MPG_City );

What happens if the argument to the DescStats module is a temporary variable? In this case, the PARENTNAME function returns a blank string. Try running the statement call histogram(MPG_City+5) to see the results. If necessary, a programmer can detect that the PARENTNAME function returned a blank string and act accordingly. For example, when the SAS/IML statistical graphics routines encounter a temporary variable, they display "X" or "Y" as the label for the graph axis.

Post a Comment

Convert hexadecimal colors to RGB

In response to my recent post about how to use the PALETTE function in SAS/IML to generate color ramps, a reader wrote the following:

The PALETTE function returns an array of hexadecimal values such as CXF03B20. For those of us who think about colors as RGB values, is there an easy way to convert from hex to RGB?

Yes. In a previous article I explained how to convert from RGB triplets to hexadecimal values. Going the other way is no more difficult. I suggest using the SUBSTR function to extract each pair of hex values. Then use the INPUTN function to apply the HEX2. informat, which converts the hex value to its decimal equivalent. The SAS/IML implementation follows:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* test the function on a five-color example */
ramp = palette("YLORRD", 5);  /* five hex values for SAS colors */
RGB = Hex2RGB(ramp);          /* convert to 5 x 3 matrix of RGB values */
print RGB[c={"Red" "Green" "Blue"} r=ramp];
t_hexrgb

In the DATA step, you can use the same functions to convert hexadecimal colors. However, the SUBSTR function in the DATA step does not accept a vector of positions, so you need to make three separate function calls.

In a 2004 paper, Perry Watts provides a SAS macro called %RGBDec that you can use to convert a hexadecimal color into a string that contains the RGB values.

Post a Comment

Which double letters appear most frequently in English text?

Double, double toil and trouble;
Fire burn, and caldron bubble.
    Macbeth, Act IV, Scene I

For the cyptanalyst or recreational puzzle solver, "double double" does not lead to toil or trouble. Just the opposite: The occurrence of a double-letter bigram in an enciphered word puzzle is quite fortunate. Certain double letters appear more frequently in English text than other double letters, which means that double letters can help the cryptanalyst use frequency analysis to solve simple substitution ciphers.

For example, in the famous ten-word quote at the top of this article, the double letter 'BB' occurs in the word 'bubble.' Suppose we count all instances of double letters in the complete speech by the three witches at the beginning of Act IV of Macbeth. In the witches' 210-word incantation, the following double letters appear:

  • OO: 8 times
  • BB: 4 times
  • LL: 3 times
  • DD: 2 times
  • EE, GG, NN, and MM: 1 time

Of course, 210 words is not a very long chunk of text, and the witches' incantation is not typical of modern English text. Nevertheless, we can see from this simple exercise that some double letters appear more frequently than others. Presumably some double letters never occur (QQ and JJ, anyone?).

The frequency of double letters in an English corpus

Are the double letters in the witches' speech representative of the frequency with which double letters occur in a typical English text? To find out, let's take another look at the frequency of bigrams in Peter Norvig's analysis of a huge 744-billion-word corpus of documents that were digitized at Google. The following SAS/IML statements continue the program that analyzes bigrams. The matrix M is a 26 x 26 matrix that contains the proportion of every bigram in the corpus:

/* separate post on double letter combinations */
Letters = "A":"Z";                      
Doublets = vecdiag(M);              /* extract matrix diagonal */
call sortndx(ndx, Doublets, 1, 1);  /* create sorting index */
D = Doublets[ndx]; L = Letters[ndx];/* sort the bigrams */
print (D`)[c=L F=percent8.3];
t_doublebigram

The diagonal elements of the bigram matrix contain the proportions of double-letter bigrams: AA, BB, CC, and so forth. By sorting the diagonal elements, you can find the double-letter combinations that appear most frequently in the corpus. The most common double letter is L, with LL accounting for 0.6% of all bigrams. Other common double-letter bigrams are SS, EE, OO, and TT. Some double letters did not appear in the corpus: JJ, KK, QQ, VV, WW, and YY.

How to make sense of certain rare bigram frequencies?

I find it puzzling that the bigrams AA appear as often as the bigram ZZ. I would think articles about blizzards, puzzles, jazz, and pizza would completely swamp the few articles about aardvarks. I think the resolution to this quandary is that the corpus includes proper nouns, not just dictionary words. The double-A bigram will show up every time that there is a mention of AA and AAA batteries, the American Automobile Association (AAA) and proper nouns such as Paas egg-dying kits, Alderaan, and any boy named Aaron, Isaac, Jamaal, or Rashaad.

Similarly, although not many English words contain a double-X, the XX bigram shows up as often as ZZ. Presumably there are many articles that discuss the ExxonMobil energy company and the Exxon Valdez oil spill. The double-X bigram can also occur in Roman numerals and sporting events like Super Bowl XXXIV. And let's not forget the ubiquitous use of 'XXX' on the internet, which contributes two double-X bigrams to the count each time that it appears!

The distribution of frequencies for common bigrams

The following SAS/IML statements create a graph that shows the most common double-letter bigrams:

idx = loc(D>0.0001);     /* get rid of rare or impossible bigrams */
D = D[idx]; L = L[idx];
 
ods graphics / width=600px height=300px;
title "Relative Frequency of Double Letters in Corpus";
call scatter(L, D) grid={x y} 
            label={"Letter" "Proportion"} datalabel=rowcatc(L||L) debug=1;
doublebigram

The graph (click to enlarge) shows that the top three double-letter bigrams are LL, SS, and EE. These occur more than twice as often as the next set of double-letter bigrams, which includes OO, TT, FF, PP, and RR.

Returning to the Three Witches' incantation in Macbeth, we note that the most common double letters in the speech are different from the most common letters in the Google corpus. This is to be expected: the frequencies in a small sample almost always deviate from the frequencies in a population or in a large sample. Nevertheless, there is some similarity. The OO bigram is the most frequent double-letter bigram in the witches' speech, and it is also fairly common (#4) among all double-letter bigrams in the Google corpus. The LL bigram also appears frequently in the incantation and in the corpus (#1). However, the BB bigram appears much more often in the incantation than would be expected by looking at the corpus because the "double double... caldron bubble" refrain is repeated four times in the short passage.

This leads to an interesting statistical question: how much variation is there in the frequencies? The Google corpus provides an estimate for frequencies "in the wild," which we can think of as being extremely close to the frequencies in the "population" of all written English text. Obviously a random passage of text of a certain length will exhibit sample variation. There is also variation due to the type of text. The distribution of words (and therefore letters) is different between scholarly writing, journalism, poetry, and Twitter messages. (U think? AFAIK, LOL!)

In a future blog post, I will discuss the variation in these frequencies. Then I think it is time to get "cracking" and apply all this frequency analysis to the problem of solving a simple substitution cipher such as you might encounter in the Cryptoquote word puzzle.

Post a Comment

How to choose colors for maps and heat maps

Have you ever looked as a statistical graph that uses bright garish colors and thought, "Why in the world did that guy choose those awful colors?" Don't be "that guy"! Your choice of colors for a graph can make a huge difference in how well your visualization is perceived by the reader. This is especially true for heat maps and choropleth maps in which the colors of small area elements must be distinguishable. This article describes how you can choose good color schemes for heat maps and choropleth maps. Most of the article is language-independent, but at the end I show how to create graphs with good color schemes in SAS.

All color schemes are not created equal

I first became aware of the importance of color design in statistical graphics when I looked at the incredibly beautiful Atlas of United States Mortality by Linda Pickle and her colleagues (1997). Image from Pickle, et al (1997), Atlas of United States Mortality One of the reasons that the Atlas is so successful at conveying complex statistical data on maps is that it uses carefully designed color schemes and visualization strategies that were the result of pioneering research by Cynthia Brewer, Dan Carr, and many others. An example from the Atlas is shown to the left. The color schemes were designed so that one color does not visually dominate the others. Bright reds and yellows are nowhere to be found. Fully saturated colors are avoided in favor of slightly muted colors. The result is that graphs in the Atlas enable you to see variations in the data without being biased by visual artifacts that can result from poor color choices.

Choosing good colors for maps and graphs is not easy, which is why every statistician and graphics designer should know about Cynthia Brewer's ColorBrewer.org web site. The web site enables you to choose from dozens of color schemes that were carefully chosen to be aesthetically pleasing as well as statistically unbiased.

Sequential, diverging, and qualitative schemes

colorbrewerseqplot

Brewer's color schemes are implemented in SAS/IML 13.1 software through the PALETTE function. (Other statistical languages also provide access to these schemes, such as via the RColorBrewer package in R.) The PALETTE function enables you to get an array of colors by knowing the name of a color scheme. Some of the available color schemes are shown at the left. Brewer's "BLUES" color scheme is single-hue progression that is similar to the two-color ramp in the SAS HTMLBlue ODS style. The "YLORRD" color scheme is a yellow-orange-red sequential color ramp that shows a blended progression from yellow to red.

The color ramps at the left are appropriate for visualizing data when it is important to differentiate high values from low values. For these so-called sequential color schemes, the reference level is the lowest value.

There are other possible color schemes. When the reference value is in the middle of the data range (such as zero or an average value), you should use a diverging color scheme, which uses a neutral color for the reference value. Low values are represented by using one hue and high values by using a different hue.

colorbrewerdivplot

Examples of diverging color schemes are shown to the right. The first diverging color scheme ("BRBG" for brown-blue-green) is the default color ramp that is used to construct heat maps of discrete data in SAS/IML software. For example, a five-color version of that color ramp was used to visualize a binned correlation matrix in my recent article about how to create heat maps with a discrete color ramp. The "BDBU" scheme (a red-blue bipolar progression) is often used when one extreme is "good" (blue) and the other extreme is "bad" (red). The "SPECTRAL" color scheme is a useful alternative to the often-seen "rainbow" color ramp, which is notoriously bad for visualization and should be avoided. A spectral progression is often used for weather-related intensities, such as storm severity.

colorbrewerqualplot

Finally, there are qualitative color schemes, as shown to the left. These schemes are useful for visualizing data that have no inherent ordering, such as ethnicities of people, different diseases, types of music, and other nominal categories.

Using color schemes with heat maps in SAS

Although these color schemes were designed for choropleth maps, the same principles are useful for designing heat maps with a small number of discrete categories. You can use the PALETTE function in SAS/IML software to choose effective colors for heat maps and other statistical graphics.

The PALETTE function is easy to use: Specify the name of the color scheme and the number of colors that you want to generate. These schemes are designed for a small number of discrete categories, so the schemes support between eight and 12 distinct and distinguishable colors.

As an example, suppose that you have computed a binned correlation matrix, but you want to override the default color scheme. For your application, you want negative values to be red and positive to be blue, so you decide on the bipolar "RDBU" color scheme. The following SAS/IML statements continue the example from my previous post:

/* The disCorr matrix is defined in the previous example. */
ramp = palette("RDBU", 5);                 /* create diverging 5-color ramp */
call HeatmapDisc(disCorr) colorramp=ramp   /* use COLORRAMP= option */
     title="Binned Correlations" xvalues=varNames yvalues=varNames;
heatmapdisc3
heatmapdisc4

As a second example, the heat map at the left use a discrete set of colors to visualize the most frequently appearing bigrams (two-letter combinations) in an English corpus. The values of the matrix have been binned into five categories. Red cells indicate bigrms that appear more than 2% of the time. Dark orange cells indicate bigrams that appear between 1% and 2%. Light orange represents between 0.5% and 1%; light yellow represents less than 0.5% of the time. Gray cells represent bigrams that appear rarely or not at all. You can download the SAS program that computes this discrete heat map of bigram frequency.

Incidentally, you can also use the PALETTE function as an easy way to generate a continuous color ramp. You can use the PALETTE function to obtain three or five anchor colors and interpolate between those colors to visualize intermediate values. In my article about hexagonal bin plots, the continuous color map used for the density map of ZIP codes was created by using the colors from the call palette("YLORRD",5).

What method do you use to choose colors for maps or heat maps that display categories? Do you have a favorite color scheme? Leave a comment.

Post a Comment