How to use frequency analysis to crack the Cryptoquote puzzle

Many people enjoy solving word games such as the daily Cryptoquote puzzle, which uses a simple substitution cipher to disguise a witty or wise quote by a famous person. A common way to attack the puzzle is frequency analysis. In frequency analysis you identify letters and pairs of letters (bigrams) that occur often in the enciphered text. It is a fact that certain letters (e, t, a, o, i, n,....) and bigrams (th, he, in, er, an, re, on,...) appear frequently in English text. You can use this fact to guess that the most frequently occurring symbol in the text might represent e, t, or a. You can use the bigram frequencies in the text to help you make statistically likely guesses.

Usually frequency analysis is used for the initial guesses. Soon recognizable snippets of words begin to appear in the partially deciphered text, and you can often guess small words such as 'the', 'and', 'of', and 'for'. Then, like pulling a string on a sweater, the puzzle unravels. Use the context of the message to guess the remaining letters.

This article shows how to use frequency analysis to solve a cryptogram. I use a computer program to count the frequency of letters and bigrams and to quickly substitute guessed letters into the puzzle. However, the computer program merely provides an implementation of a tedious and repetitive task. You can perform the same analysis with pencil and paper. The program does not make any guesses itself, but merely makes it easier for a human to make a guess. (Type 'cryptogram solver' into an internet seach engine if you want an online solver.)

I use the convention that capital letters are symbols for the enciphered text whereas lowercase letters indicate the plaintext. Thus the plaintext message "word games are fun" might be enciphered as "RHES ADLUI DEU VYP." A partially deciphered solution might appear as "RHrS AaLeI are VYn," where the uppercase letters represent yet-to-be-solved letters.

Some functions that help decipher Cryptoquotes

In this blog I usually discuss statistical programming in the SAS/IML language. This post uses statistics and programming, but the discussion is about deciphering the Cryptoquote. The program that I use in this article is available for download for SAS users who have access to SAS/IML software. I wrote the following SAS/IML functions:

  • The PrintFreqRef routine prints reference tables that show how often certain letters and pairs of letters appear in English text. The tables are the results from previous blog posts and show the most frequently appearing letters, the most frequent bigrams, and the most frequent double-letter bigrams.
  • The FreqAnalysis routine reports the distribution of letters, bigrams, and double letters in the enciphered quote.
  • The SplitByWords function constructs an array from a quote. Each word is on one row; each letter is in a separate column. The main purpose of this routine is so that I can say "look at the 7th word" and you will be able to find it easily.
  • The ApplySubs function enables you to quickly see the result of replacing one or more cipher symbols with a guess. You can use the function to replace 'Q' with 't' and 'F' with 'e' and see whether the substitutions appear to make sense in the quote. If the substitution results in nonsense—such as 'te' as a two-letter word—then the substitution was not good and you can try a different guess.

Solving the Cryptoquote by using frequency analysis

The following statements start PROC IML, load the utility modules, and print the reference tables that show the expected frequencies of letters and bigrams in English text.

proc iml;
load module=_all_;  /* load the utility modules */
run PrintFreqRef();
cryptofreqref

The reference tables show the relative frequencies of the most common letters, such as e, t, a, o, i, n, and s. The most frequent pairs of letters are th, he, in, er, and so forth. The most common double-letter bigrams are ll, ss, ee, oo, and so forth. If you like to solve word puzzles by pencil and paper, you can print this table and keep a copy in your wallet or purse. Click the table to get it to appear on a separate web page.

Now let's look at an enciphered quote and perform a frequency analysis on its symbols:

msg = "KBSCS OCS KPUSH XBSW DOCSWKBTTG HSSUH WTKBPWJ " +
      "IAK LSSGPWJ KBS UTAKB KBOK IPKSH MTA. " +
      "~ DSKSC GS FCPSH";
run FreqAnalysis(msg);
cryptofreq

These tables look similar to the reference tables, but these tables are specific to this quote. You can create these table by hand if you have the time and patience. In this quote, almost 20% of the symbols are an S. The next most frequent symbols are K and B. Furthermore, the bigrams KB and BS appear frequently. This co-occurrence of three symbols and two bigrams is a "trifecta" that gives strong statistical evidence that S=e, K=t, and B=h. The S=e guess is further supported by the fact that SS appears twice in the message, and 'ee' is a common double-letter bigram.

Let's make the substitutions:

w = SplitByWords(msg);                /* split into array */
rownum = char(1:nrow(w));             /* row numbers, for reference */
w1 = ApplySubs(w, {K B S}, {t h e});  /* guess K=t, B=h, and S=e */
print w1[r=rownum];
cryptoquote1

It looks like we made a great guess! The first word is probably 'there', so the next substitution should include C=r. This guess is bolstered by the fact that CS appears several times in this message, and 're' is a frequent bigram. (The first word could also be 'these', but 'se' occurs less often than 're'.)

If that guess is correct, then we can further guess that O=a because that makes the second word 'are' and the 12th word 'that'.

To generate another guess, notice that the double-letter bigram TT appears in this quote in the 5th word. The bigram follows the plaintext bigram 'th', so TT must represent a double vowel. Because we have already deduced the symbol for 'e', we guess that TT is 'oo', which is the only other common double-vowel bigram. Therefore T=o.

Lastly, the 16th word is the author's first name. Since C=r, we guess that the first name might be 'peter'. The following statement makes these substitutions:

w2 = ApplySubs(w1, {O C}, {a r});
w3 = ApplySubs(w2, {T D}, {o p});
print w3[r=rownum];
cryptoquote2

Frequency analysis can suggest additional guesses. The symbols H, P, and W appear frequently in the enciphered quote. We also see that SH (=eH) is a frequent bigram. From the bigram frequency analysis, we might guess that H=n or H=s. By looking at the partially deciphered quote, it looks like W=n because the 5th word is probably 'parenthood', so we will guess H=s.

By looking at where the symbol P appears in the quote, you might guess that P is a vowel. The remaining vowel that occurs frequently is 'i'. This guess makes sense because the trigram PWJ appears twice, and this trigram could be 'ing'.

The following statement make these substitutions.

w4 = ApplySubs(w3, {H W G P J}, {s n d i g});
print w4[r=rownum];

Click this link to see the partially deciphered text that results from these substitutions. At this point, frequency analysis has revealed all but seven of the symbols in the enciphered quote. The remaining symbols are easily deduced by looking at the partially deciphered words and guessing the remaining letters. The deciphered quote is:

There are times when parenthood seems nothing but feeding the mouth that bites you.
     ~ Peter de Vries

Frequency analysis enabled us to guess values for certain frequently occurring symbols. By correctly guessing these key symbols, the puzzle was solved. Notice that the frequency of bigrams was essential in this process. If you use only the frequency of single letters, you might be able to guess the 'e' and 't', but the other letters are indistinguishable by looking at the empirical frequencies. However, by using bigrams, you can guess more symbols. For example, in this puzzle the bigrams helped us to discover the symbols for 'h', 'o', 's', and 'n'.

Of course, this process is not always linear. Sometimes you will make a guess that turns out to be wrong. In that case, eliminate the wrong guess and try another potential substitution.

To conclude this article, I will leave you with another Cryptoquote. I will generate the frequency analysis for you, in case you want to solve the puzzle by hand. You can download the SAS/IML program that deciphers this quote. The program contains comments that explain how frequency analysis is used to solve the puzzle.

cryptofreq2
CG EKLVFX WLU MVXZG IGLIFG YH UKGSB IGPQ LR GNDGFFGWDG; YVU YH UKG XSEUPWDG UKGH KPTG UBPTGFGX RBLA UKG ILSWU CKGBG UKGH EUPBUGX.
     ~ KGWBH CPBX YGGDKGB

Post a Comment

Video: Using heat maps to visualize matrices

One of my presentations at SAS Global Forum 2014 was about the new heat map functions in SAS/IML 13.1. Over the summer I created a short video of my presentation, which gives an overview of visualizing matrices with heat maps, and describes how to choose colors for heat maps:



If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read articles about heat maps so that you can study the concepts and cut and paste examples, here are a few recent blog posts that are based on my SAS Global Forum presentation:

For a fun application of discrete heat maps, see my article about implementing a one-dimensional cellular automata in SAS.

Post a Comment

Does this kurtosis make my tail look fat?

What is kurtosis? What does negative or positive kurtosis mean, and why should you care? How do you compute kurtosis in SAS software?

It is not clear from the definition of kurtosis what (if anything) kurtosis tells us about the shape of a distribution, or why kurtosis is relevant to the practicing data analyst. Mathematically, the kurtosis of a distribution is defined in terms of the standardized fourth central moment. The estimate of kurtosis from a sample also involves a summation that accumulates the fourth powers of a standardized variable. In SAS software, the formula for the sample kurtosis is given in the "Simple Statistics" section of the documentation.

This article presents an intuitive explanation of kurtosis that is often true in practice.

What does kurtosis mean?

Let's compute the sample kurtosis for two variables in the Sashelp.Heart data set and see whether the kurtosis values tell us anything about the shape of the data distributions. The variables are AgeAtStart (the age of a patient when he or she entered the study) and Systolic (his or her systolic blood pressure). The following statements call the MEANS procedure to compute estimates for the mean, median, standard deviation, skewness, and kurtosis:

proc means data=sashelp.heart mean median std skew kurtosis;
var AgeAtStart Systolic;
run;
t_kurt1

The last column shows that the kurtosis of the AgeAtStart data is negative, whereas the kurtosis of the Systolic data is positive. What does that mean? Well, first recall that whereas the mean, median, and standard deviation are expressed in the same units as the data, kurtosis (like skewness) is a dimensionless quantity. The zero value corresponds to a standard reference situation. For kurtosis, the reference distribution is the normal distribution, which is defined to have a kurtosis of zero. (Technically, I am describing the excess kurtosis, since this is the value returned by statistical software packages. Researchers sometimes consider the full kurtosis, which is 3 more than the excess kurtosis.)

The interpretation of kurtosis in terms of the shape of the distribution can be tricky, but for many unimodal distributions the kurtosis compares the peak and tails of the distribution to the normal distribution:

  • A data distribution with negative kurtosis is often broader, flatter, and has thinner tails than the normal distribution.
  • A data distribution with positive kurtosis is often narrower at its peak and has fatter tails than the normal distribution.

Negative kurtosis for a sample indicates that the sample contains many observations that are a moderate distance from the center, but few outliers that are far from the center. The peak—if there is one—looks more like a butte or a rounded hilltop than a jutting spire. The canonical distribution that has negative kurtosis is the continuous uniform distribution, which has a kurtosis of –1.2.

In contrast, positive kurtosis indicates that many values will be near the center, but that you should also expect a certain proportion of values that are far from the center. A heavy-tailed distribution has large kurtosis. The canonical distribution that has a large positive kurtosis is the t distribution with a small number of degrees of freedom. Some heavy-tailed distributions have infinite kurtosis.

Graphically testing the kurtosis interpretation

You can run a little experiment to see how the shape of the AgeAtStart and Systolic distributions compares with the normal distribution. What would you expect to see if you overlay a normal curve on a histogram of those variables? For the AgeAtStart variable, which has negative kurtosis, you should observe a distribution that is "broad in the shoulders" and that has thin tails. For the Systolic variable, you should observe a distribution that has many observations clustered together, drops off faster than a normal distribution, and has at least one heavy tail. The following statements use PROC SGPLOT to overlay a histogram with a normal curve centered at the median:

ods graphics / width=330px height=250px;
proc sgplot data=sashelp.heart noautolegend;
histogram AgeAtStart / binwidth=5 binstart=30 showbins;
density AgeAtStart / type=normal(mu=43); /* center at median */
run;
 
proc sgplot data=sashelp.heart noautolegend;
histogram Systolic / binwidth=10 binstart=80 showbins;
density Systolic / type=normal(mu=132); /* center at median */
run;
kurt1

The histogram for the AgeAtStart variable shows that the distribution is broad and flat. It does not have a pronounced central peak, and there is substantial mass along most of the range of the data. The distribution appears to be bounded: Apparently the heart study did not accept any patients who were younger than 28 or older than 62. This implies that there are no tails for the population distribution. Overall, the AgeAtStart variable provides a prototypical example of a distribution that has negative kurtosis.


kurt2

The histogram of the Systolic variable has a narrow peak and a long right tail. There are many patients in the study with systolic blood pressure in the range 120–140. (The American Heart Association calls that range prehypertensive.) Compared to the normal distribution, there are relatively few observations in the ranges (80, 120) and (140, 170). However, the distribution has more extreme observations than would be expected for a normally distributed variable, so the distribution has a fat tail.

Computing kurtosis in SAS software

There are three SAS procedures that compute the kurtosis of data distributions. PROC MEANS was used earlier in this article. Use PROC UNIVARIATE when you want to fit a univariate model to the data. PROC IML includes the KURTOSIS function in SAS/IML 13.1, as shown in the following example:

proc iml;
varNames = {"AgeAtStart" "Systolic"};
use Sashelp.Heart;  read all var varNames into X;  close Sashelp.Heart;
 
kurt = kurtosis(X);
print kurt[colname=varNames];
t_kurt2

When the intuitive interpretation of kurtosis can fail

The relationship between kurtosis and the shape of the distribution are not mathematical rules, they are merely intuitive ideas that often hold true for unimodal distributions. If your distribution is multimodal or is a mixture of distributions, these intuitive rules might not apply.

There have been many articles about kurtosis and its interpretation. My favorite is an article by Balanda and MacGillivray (1988, TAS). They give examples of three different densities that each have the same kurtosis value but that look very different. They conclude that "because of the averaging process involved in its definition, a given value of [kurtosis] can correspond to several different distributional shapes" (p. 114).

They recommend that kurtosis be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. In particular, this definition implies that peakedness and tail weight are best viewed as components [emphasis mine] of kurtosis.... This definition is necessarily vague because the movement can be formalized in many ways" (p. 116). In other words, the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features.

The tail of the distribution is the most important contributor. Although Balanda and MacGillivray do not mention it, the kurtosis is a non-robust statistic that can be severely influenced by the value of a single outlier. For example, if you choose 999 observations from a normal distribution, the sample kurtosis will be close to 0. However, if you add a single observation that has the value 100, the sample kurtosis jumps to more than 800!

If you like pathological distributions and mathematically contrived examples of all the ways that kurtosis can fail to indicate the shape of a distribution, see Balanda and MacGillivray's article. However, the intuitive notions in this article hold true for many unimodal data distributions that arise in practice.

Post a Comment

Compute the log-determinant of an arbitrary matrix

A few years ago I wrote an article that shows how to compute the log-determinant of a covariance matrix in SAS. This computation is often required to evaluate a log-likelihood function. My algorithm used the ROOT function in SAS/IML to compute a Cholesky decomposition of the covariance matrix. The Cholesky decomposition exists only for symmetric positive definite matrices, which means that this implementation cannot be used to compute the log-determinant of an arbitrary square matrix.

For an arbitrary square matrix, the determinant can be negative or zero. The log function is undefined for nonpositive values, but the log of the absolute value of the determinant always exists. This computation, along with the sign of the determinant, enables you to reconstruct the determinant of any matrix.

SAS/IML 13.1 supports the LOGABSDET function, which computes the logarithm of the absolute value of the determinant for any square matrix. The function returns a vector with two elements: the first element is the logarithm, the second is the sign of the determinant.

In my earlier article I wrote:

The determinant of a Vandermonde matrix with consecutive integer elements increases super-factorially with the dimension of the matrix! For a 30 x 30 integer Vandermonde matrix, the determinant is too large to represent as a standard double-precision floating-point number.

A Vandermonde matrix is not symmetric, so let's construct a Vandermonde matrix and compute its log-determinant by using the LOGABSDET function. The following SAS/IML function computes a Vandermonde matrix as defined in Golub and van Loan. The Wikipedia article uses the transpose of this matrix as the definition.

proc iml;
   /* Given a vector v = {v1 v2 v3 ... vn}, this function returns 
      the Vandermonde matrix defined by
      [ 1    1    1    ...   1        ]
      [ v1   v2   v3   ...   vn       ]
      [ v1^2 v2^2 v3^2 ...   vn^2     ]
      [ ...  ...  ...  ...    ...     ]
      [ v1^{n-1}  ...  ...   vn^{n-1} ]
   */
start CreateVandermondeMatrix( _v );
   v = rowvec(_v);
   m = ncol(v);
   mVander = j(m,m,1);
   do i = 2 to m;
      mVander[i,] = mVander[i-1,] # v;
   end;
   return ( mVander );
finish;
 
v = 1:10;
m = CreateVandermondeMatrix(v);
logDet = logabsdet(m);
print logDet[colname={"log(det)" "sign(det)"}];
t_logabsdet

For this 1 x 10 matrix, the log-determinant is about 49, which means that the determinant is exp(49)  ≈ 1.8 x 1021. For the 30 x 30 nonsymmetric matrix generated from v = 1:30, the determinant is too large to be computed by using the DET function, but the LOGABSDET function has no problem computing the log-determinant, which is approximately 901.

Post a Comment

Wolfram's Rule 30 in SAS

My previous blog post describes how to implement Conway's Game of Life by using the dynamically linked graphics in SAS/IML Studio. But the Game of Life is not the only kind of cellular automata. This article describes a system of cellular automata that is known as Wolfram's Rule 30. In this blog post I show how to compute and visualize this system in SAS by using traditional PROC IML and a simple heat map.

Wolfram's Rule 30

Conway's Game of Life is a set of rules for evolving cellular automata on a two-dimensional grid. However, one-dimensional automata are simpler to describe and to compute. A well-known one-dimensional example is Wolfram's Rule 30 (1983, Rev. Mod. Phys.). Consider a sequence of binary symbols, such as 0 and 1. An initial configuration determines the next generation by using the following two rules for each cell:

  1. If the cell and its right-hand neighbor are both 0, then the new value of the cell is the value of its left-hand neighbor.
  2. Otherwise, the new value of the cell is the opposite of the value of its left-hand neighbor.

The rules are often summarized by using three-term sequences that specify the value of the left, center, and right cells. The sequence 100 means that the center cell is 0, its left neighbor is 1, and its right neighbor is 0. According to the first rule, the value of the center cell for the next generation will be 1. The sequence 000 means that the center cell and both its neighbors are 0. According to the first rule, the value of the center cell for the next generation will be 0. All other sequences are governed by the second rule. For example, the sequence 010 implies that the center cell for the next generation will be 1, which is the opposite of the value of the left neighbor.

There are several ways that you can implement Rule 30 in a matrix language such as SAS/IML. The following SAS/IML program defines a function that returns the next generation of binary values when given the values for the current generation. As for the Game of Life, I define the leftmost and rightmost cells to be neighbors. The evolution function is vectorized, which means that no loops are used to compute a new generation from the previous generation. The second function computes each new generation as a row in a matrix. You can print that matrix or use the HEATDISC subroutine in SAS/IML 13.1 to create a black-and-white heat map of the result.

proc iml;
start Rule30Evolve(x);                     /* x is binary row vector */
   y = x[ncol(X)] || x || x[1];            /* extend x periodically */
   N = ncol(y);
   j = 2:N-1;    xj = y[,j];               /* the original cells */
   R = y[,j+1];  L = y[,j-1];              /* R=right neighbor; L=left neighbor */
   idxSame = loc(xj=0 & R=0);              /* x00 --> center cell is x */
   if ^IsEmpty(idxSame) then y[,idxSame+1] = L[,idxSame];      
   idxDiff = setdif(1:N-2, idxSame);       /* otherwise, center cell is ^x */
   if ^IsEmpty(idxDiff) then y[,idxDiff+1] = ^L[,idxDiff];
   return( y[,j] );
finish;
 
start Rule30(x0, Iters=ceil(ncol(x0)/2));
   N = ncol(x0);
   M = j(Iters,N,0);                       /* initialize array to 0 */
   M[1,] = x0;                             /* set initial configuration */
   do i = 2 to Iters;
      M[i,] = Rule30Evolve(M[i-1,]);       /* put i_th generation into i_th row */
   end;
   return( M );
finish;
 
N = 101;
x0 = j(1,N,0);                       /* initialize array to 0 */
x0[ceil(N/2)] = 1;                   /* set initial condition for 1st row */
M = Rule30(x0, N);
ods graphics / width=800px height=600px;
call heatmapdisc(M) displayoutlines=0 colorramp={White Black}
     title = "Wolfram's Rule 30 (N=101)";
rule30

The discrete heat map uses white to indicate cells that are 0 and black to indicate cells that are 1. The initial configuration has 100 cells of white and a single black cell in the middle of the array. That configuration is shown in the first row of the heat map. (Click to enlarge.) The second generation has three black cells in the middle, as shown in the second row. The central region of cells grows two cells larger for each generation. For the first 50 generations, the left boundary is always 001 and the right boundary alternates between 111 and 001. The pattern of the interior cells is not periodic, although certain interesting patterns appear, such as "martini glasses" of various sizes.

After 50 generations, the left and right edges begin to interact in a nontrivial way. Periodicity at the edges is replaced by irregular patterns.

Supposedly the evolution of the middle cell of the array generates nearly random 0-1 values. Mathematica, a computer software package created by Stephen Wolfram, uses the central column of a Rule-30 cellular automata as a random number generator. From a simple deterministic rule comes complex behavior that seems indistinguishable from randomness.

The heat map shows all generations at a single glance by stacking them as rows in a single image. This is different from the Game of Life visualization, which used animation to show how an initial configuration evolves from one generation to the next. You can also use animation to visualize the Rule-30 system. The following animated GIF shows the same 100 generations of an initial configuration that consists of a single black cell in the middle of a 101-cell array. This animation shows how that configuration evolves in time.

rule30

It is fun to play with Rule 30. To begin, I suggest an array of 21 cells that evolves for 20 generations. The resulting black-and-white heat map is small enough to study, and you can use the DISPLAYOUTLINES=1 option to better visualize which cells are white during the evolution. You can invent your own initial configuration or use random initial conditions, as shown below:

N = 21;
x0 = T( randfun(N, "Bernoulli", 4/N) );   /* random; expect 4 black cells */
M = Rule30(x0);
call heatmapdisc(M) displayoutlines=1 colorramp={White Black}
     title = "Wolfram's Rule 30 (N=21)";

Enjoy!

Post a Comment

Cellular automata and the Game of Life in SAS

A colleague jokingly teases me whenever I write a blog that demonstrates how to write fun and exciting programs by using SAS software. "Why do you get to have all the fun?" he mock-chides.

Today I'm ready to face his ribbing, because this article is about Conway's Game of Life and how to implement cellular automata in the SAS/IML language.

Cellular automata and Conway's Game of Life

Conway's Game of Life is a set of simple rules that give rise to beautiful regular and irregular patterns. The rules describes how cellular automata live or die based on the presence or absence of neighbors on a rectangular grid. This dynamical system is a "game" because it is fun to create some initial distribution of automata and watch the evolution of the system from that initial condition. Like many programmers, I wrap the edges of the grid so that the cells on the right and left sides of the grid are neighbors, and similarly for the cells on the top and bottom edges.

The following rules were used to evolve the automata:

  • Birth Rule: An organism is born into any empty cell that has exactly three living neighbors.
  • Survival Rule: An organism with either two or three neighbors survives from one generation to the next.
  • Death Rule: An organism with four or more neighbors dies from overcrowding. An organism with fewer than two neighbors dies from loneliness.

The Game of Life in SAS/IML Studio

Almost every beginning programmer has attempted to program the Game of Life. In a matrix language such as SAS/IML, the implementation of the rules are particularly easy and compact. I first implemented the Game of Life in SAS/IML Studio in 2001. My program is part of the standard set of demo programs that are distributed with SAS/IML Studio. To access the program from the SAS/IML Studio menu, select File ⇒ Open ⇒ File. Then click Go to Installation directory. Select the Programs and Demos folders, then open the file Life.sx.

When you run the program, you are prompted for the size of a square grid (I like 20 rows and columns) and told to click on the grid to create an initial distribution of the automata. (Hold down the SHIFT key while you click.) The following animated GIF shows 100 steps that evolve from a simple initial condition that contains an "exploder" pattern and a "glider" pattern.

lifeani

The interesting thing about this implementation in SAS/IML Studio is that I didn't program the animation at all. The "live" automata are observations that are selected in the grid. The program figures out which cells are alive and dead at the next generation and then simply tells the in-memory copy of the data to update the selected observations. The plot automatically updates to reflect the newly selected observations. To learn more about dynamically linked graphics and how to select observations in SAS/IML Studio, see my book Statistical Programming with SAS/IML Software, especially Chapters 6 and 10.

Recall that SAS/IML Studio is a free download and can be installed on a Windows PC and used by anyone who has a license for SAS/IML and SAS/STAT software. Unfortunately, SAS/IML Studio is not available for users of the free SAS University Edition. If you do not have access to SAS/IML Studio, here is a version of the Life.sx program. You can use the computational portions of the program in PROC IML, even though you cannot create the IMLPlus graphics.

Creating an animated GIF in SAS/IML Studio

When you run the program in SAS/IML Studio, the graphics are displayed on your PC monitor. So how did I create an animated GIF to embed in this blog post? It was surprisingly easy! In the IMLPlus program, the on-screen graph is controlled through a Java variable called plot. You can use the SaveToFile method to create a bitmap version of the graph. If you call that method within the loop that updates the graph for each new generation of automata, you get a sequence of bitmaps in some local directory, like this:

path = 'C:\Temp\AnimGif\Life\';       /* directory to store images */
plot.SaveToFile(path+"Life000.bmp");  /* save the initial configuration */
do i = 1 to nGenerations;             /* for each new generation */
   run update_generation( grid );     /* compute new configuration for automata */
   run Grid2Selection( dobj, grid );  /* update the selected observations */
   plot.SetTitleText("Game of Life: Generation = "+char(i,3));  /* update title */
   suffix = trim(putn(i, "Z3."));     /* 001, 002, ..., 100 */
   plot.SaveToFile(path+"Life"+suffix+".bmp"); /* save image to Life001.bmp */
end;

After creating 100 bitmap images in a directory, you can create an animated GIF by using a program that stitches the images together in sequence. I used a freeware Java program called GiftedMotion, which does not require you to install any software. Just download the JAR file to your desktop, double click, and navigate to the directory that contains the images.

Post a Comment

Fat-tailed and long-tailed distributions

The tail of a probability distribution is an important notion in probability and statistics, but did you know that there is not a rigorous definition for the "tail"? The term is primarily used intuitively to mean the part of a distribution that is far from the distribution's peak or center. There are several reasons why a formal definition of "tail" is challenging:

  • Where does the tail start? The tail is "far away from the mean," but how far away? For the standard normal distribution, should the tail start three standard deviations from the mean? Five?
  • Not all distributions are as well-behaved as the normal distribution. The Student t distribution with 1 degree of freedom (also called the Cauchy distribution) is the classic example of a distribution that does not have a well-defined mean or variance. Consequently, it is not always possible to define the tail in terms of some number of standard deviations away from the mean or median.
  • Does the uniform distribution on [0, 1] have a tail? Many people would say no. Thus not every distribution has a tail. In particular, if you want "the behavior of the tail" to describe the characteristics of the probability density function (PDF) when |x| gets large, then bounded distributions do not have tails. This example also shows that you can't merely use the 95th percentile to define the tail.

Nevertheless, some features of tails can be quantified. In particular, by using limits and asymptotic behavior you can define the notion of heavy tails. This article discusses heavy-tailed distribution and two important subclasses: the fat-tailed distributions and the long-tailed distributions.

Heavy-tailed distributions

When discussing how much mass is in the tail of a probability density function, it is convenient to use the exponential distribution as a reference. The PDF of the exponential distribution approaches zero exponentially fast. That is, the PDF looks like exp(-λx) for large values of x. Thus you can divide distributions into two categories according to the behavior of their PDFs for large values of |x|.

  • Probability distribution functions that decay faster than an exponential are called thin-tailed distributions. The canonical example of a thin-tailed distribution is the normal distribution, whose PDF decreases like exp(-x2/2) for large values of |x|. A thin-tailed distribution does not have much mass in the tail, so it serves as a model for situations in which extreme events are unlikely to occur.
  • Probability distribution functions that decay slower than an exponential are called heavy-tailed distributions. The canonical example of a heavy-tailed distribution is the t distribution. The tails of many heavy-tailed distributions follow a power law (like |x|–α) for large values of |x|. A heavy-tailed distribution has substantial mass in the tail, so it serves as a model for situations in which extreme events occur somewhat frequently.

Fat-tailed distributions

From a modeling perspective, fat-tailed distributions are important when extreme events must be part of the model. For example, models of claims in the home insurance industry have to account for an intense category 5 hurricane hitting the Miami area, or Superstorm Sandy flooding the New York area. Although the probability of these extreme events is small, the size of the payout is so large that these events are vital to models of insurance claims.

cauchypdf

From a mathematical perspective, the most fascinating aspect of fat-tailed distributions is that the mean, variance, and other measures that describe the shape of the distribution might not be defined. You might wonder how this can be. For the symmetric Cauchy distribution, pictured at the left, it looks like the mean should be at zero. After all, the function is symmetric. However, although the Cauchy distribution has a well-defined median and mode at 0, the mean is not defined.

Mathematically, the problem is that a certain integral does not exist. However, a more convincing demonstration is to run a simulation that draws random values from the Cauchy distribution and computes the mean as the sample size increases. If the mean exists, then the running mean should converge to some value. But it doesn't. Instead, every so often an arbitrarily large value is generated, which causes the mean to jump by a large amount. This is shown by the following simulation and visualization in SAS software:

proc iml;
call randseed(4321);
N = 1e6;                         /* number of random draws */
x = randfun(N, "Cauchy");        /* random draws from Cauchy distribution */
RunningMean = cusum(x)/ T(1:N);  /* mean for each sample size */
 
t = do(10000, N, 10000);         /* plot the mean only for these sample sizes */
RunningMean = RunningMean[t];
title "Running Mean of Random Cauchy Data";
ods graphics / width=800px height=400px;   /* size of image */
call scatter(t, RunningMean) other="refline 0 / axis=y;";
cauchyrunningmean

As an exercise, you can rerun the program for other distributions. Change the distribution family to "Normal" to see how quickly the running mean for the normal distribution converges to 0. Or change the family to "Exponential" and watch the running mean converge to 1.

Long-tailed distributions

Another kind of heavy-tailed distribution is the long-tailed distribution, which is used to model many internet-era phenomena such as the frequency distribution of book titles sold at Amazon.com or the frequency of internet search terms.

People submit billions of unique search terms to search engines. Imagine sorting all internet searches from a certain time period according to the number of times that the search term was submitted. The top-ranked terms will probably be for celebrities (Kim Kardashian, Katy Perry,...) along with terms related to world events such as wars, natural disasters, and politics. According to research on Google searches, the top-1000 search terms might account for 10% of the total searches. But there are many millions of search terms that appear only a few times, or perhaps only one time. How many search terms do you think account for 50% of the search traffic? 5,000? 10,000? The correct answer is much, much bigger! Author Chris Anderson, a former editor-in-chief at Wired magazine, says "if search were represented by a tiny lizard with a one-inch head, the tail of that lizard would stretch for 221 miles."

I see the same long-tailed distribution when I look at the number of views for my 500+ blog articles. The most popular 20 posts account for about 50% of the total views. But the frequency distribution of posts has a long tail, with about 180 posts getting 0.1% or more of the views, and about 300 posts getting 0.05% or more of the views. During a typical month, almost every article that I have ever written gets viewed at least once.

Attempts have been made to classify and rank distributions according to the rate of decay in their tails. For an introduction to this work, see Parzen (1979, JASA), but be sure to also read the rejoinder by John Tukey.

Summary

Although the location of the tail of a probability distribution is not rigorously defined, nevertheless a lot of research has been conducted on the asymptotic behavior of distributions. The normal and exponential distributions serve as convenient reference distributions. Distributions that have a lot of probability mass in their tails are known as heavy-tailed distributions. Two subclasses of heavy-tailed distributions are the fat-tailed distributions and the long-tailed distributions. Each is useful for modeling real-world phenomena.

Have you encountered data that exhibits a heavy tail? What was the application? How did you model the data?

Post a Comment

The frequency of double-letters in Cryptoquotes

It usually takes more than three weeks to prepare a good impromptu speech.
       --Mark Twain

In the popular Cryptoquote puzzle, you are presented with an enciphered version of a quote by a famous person. One of the appeals of the puzzle for me is reading the deciphered quote, such as the witty aphorism by Mark Twain at the top of this post. Last week I discussed the frequency of double-letter bigrams in a huge English corpus. I asserted that a frequency analysis of "double letters can help the cryptanalyst ... to solve simple substitution ciphers."

In practice, double letters might not appear at all in a short quote. Even if they do appear, is the frequency distribution of double letters in short quotes the same as in Peter Norvig's analysis of a huge 744-billion-word corpus? If a certain double-letter bigram is expected to appear 0.05% of the time in lengthy texts, how often might it appear in pithy phrases that contains about 100 letters? Obviously the bigram will appear zero times for many phrases. Should we expect it to ever appear three times within a 100-letter quote? Four times?

To find out, I collected 120 (unenciphered) quotes. Some I took from Cryptoquote puzzles that I have been saving. Others I took from puzzles that appear on the DailyCryptogram.com web site. A few I sampled from the BrainyQuote web site. You can download the program that defines the data and produces the analyses in this article.

The quotes had an average length of 97 characters (mean=96.8; standard deviation 14.5). The minimum quote length was 66 characters; the maximum was 127 characters. The characters include spaces and punctuation. The average quote contains 59 bigrams, of which 1.8 were double-letter bigrams. The Mark Twain quote at the top of this article is atypical in that this 87-character quote contains five double-letter bigrams (out of 55).

doublelettervar1

The heat map at the left shows the count for double-letter bigrams in 20 quotes that were chosen at random from the collection of 120 quotes. The vertical axis is ordered by the total count of bigrams. For example, the first row of the matrix says that the 97th quote in the data base has seven double letters: four are 'LL', two are 'OO', and one is 'NN'. The 51st quote has six double letters: three are 'LL' and the others are 'OO', 'PP', and 'RR'. Three of the quotes have only one double-letter combinations, and four have no double letters.

doublelettervar3

The bar chart at the left shows the distribution of counts for double-letter bigrams in the complete collection of 120 quotes. There are 21 quotes (18%) that do not contain a single double-letter. For these quotes, knowing the distribution of double-letter bigrams does not help to solve the puzzle. Most quotes contain one or two double-letters. For these puzzles you should probably guess that the double-letter bigram is a common one (LL, SS, EE, or OO) and see if your guess makes sense for the remainder of the puzzle. For quotes that have four or more double letters, it might be harder to guess which enciphered double-letter combinations corresponds to which common double-letter bigrams.

If you average the bigram counts, you can compute the proportion of times that 'AA', 'BB', 'CC', and so forth appear in the sample of 120 quotes. You can use statistics to compute a 95% confidence interval for the proportion of each double letter. The following graph (click to enlarge) displays the following information:

  • An estimate for the proportion of each double-letter bigram. The estimate is the number of times that the bigram appears in the sample, divided by 120 (the sample size). These estimates are shown in the graph by a circle.
  • A 95% confidence interval for each proportion, assuming that this sample of quotes is representative of the population of all quotes that have roughly 100 characters.
  • The bigram proportions from Norvig's analysis of the Google corpus. These proportions are joined by a blue line to make it easier for your eye to traverse from one bigram to the next.
doublelettervar2

In this sample of quotes, the LL, OO, and RR bigrams appeared more often than you might expect from their frequencies in the corpus. The FF and HH bigrams appeared less often than expected.

So what have we learned?

  • The heat map shows that there is considerable variation in the number and types of double-letter bigrams that appear in a short quote, such as is used in the Cryptoquote puzzle.
  • The bar chart shows that one or two double-letter bigrams are expected in a short Cryptoquote. However, about 18% of the quotes did not contain any double letters.
  • Although the frequency distribution of the double-letter bigrams was not identical to the distribution in longer texts, it is fairly close. For the purpose of solving word puzzles, you can safely use the frequency of double letters that I reported in my previous article.
Post a Comment

The name of a parameter in the parent environment

SAS/IML 13.1 includes a handy function for programmers who write a lot of modules. The PARENTNAME function obtains the name of the symbol that was passed in as a parameter to a user-defined module.

How is this useful? Well, suppose that you want to create a SAS/IML module that prints simple descriptive statistics about a data vector, similar to this call to PROC MEANS:

proc means nolabels data=sashelp.cars N mean std Q1 median Q3;
var MPG_City;
run;
t_parentname1

Look at the heading of the table. The heading displays the name of the numerical variable that the statistics summarize. If you want to display this table from within a SAS/IML module, the module somehow needs to discover the name of the data vector that was passed into the module. That is, it needs information from the parent (or calling) environment.

This problem was resolved in SAS/IML 13.1 by introducing the PARENTNAME function. The following SAS/IML statements create a module that prints descriptive statistics for a data vector. The module calls the PARENTNAME function to discover the name of the variable that was passed via the y parameter. The name is used to create a heading for the table:

proc iml;
/* create function that prints simple descriptive statistics */
start DescStats(y);            /* the local variable is "y" */
   varName = ParentName("y");  /* find the name of "y" in calling environment */
   stats = {"N" "Mean" "Std Dev" "Lower Quartile" "Median" "Upper Quartile"};
   N = countN(y);  mean = mean(y);  std = std(y);
   call qntl(quartiles, y, {0.25 0.5 0.75});
   print (N || mean || std || quartiles`)[label=varName colname=stats];
finish;
 
/* test module on a variable from Sashelp.Cars */
use sashelp.cars; read all var {"MPG_City"}; close sashelp.cars;
 
run DescStats( MPG_City );    /* the name of variable passed in is "MPG_City" */
t_parentname2

I think this is a very cool feature. The PARENTNAME function returns the name of the variable that was passed in as the y parameter. The call returns the string "MPG_City". That string is then used in the LABEL= option for the PRINT statement. The result is that the table displays the name of the MPG_City variable in the heading.

The PARENTNAME function is used internally by the modules that create standard statistical graphics in SAS/IML. For example, the following statement creates a histogram of the MPG_City variable and labels the horizontal axis with the name of the variable:

call histogram( MPG_City );

What happens if the argument to the DescStats module is a temporary variable? In this case, the PARENTNAME function returns a blank string. Try running the statement call histogram(MPG_City+5) to see the results. If necessary, a programmer can detect that the PARENTNAME function returned a blank string and act accordingly. For example, when the SAS/IML statistical graphics routines encounter a temporary variable, they display "X" or "Y" as the label for the graph axis.

Post a Comment

Convert hexadecimal colors to RGB

In response to my recent post about how to use the PALETTE function in SAS/IML to generate color ramps, a reader wrote the following:

The PALETTE function returns an array of hexadecimal values such as CXF03B20. For those of us who think about colors as RGB values, is there an easy way to convert from hex to RGB?

Yes. In a previous article I explained how to convert from RGB triplets to hexadecimal values. Going the other way is no more difficult. I suggest using the SUBSTR function to extract each pair of hex values. Then use the INPUTN function to apply the HEX2. informat, which converts the hex value to its decimal equivalent. The SAS/IML implementation follows:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* test the function on a five-color example */
ramp = palette("YLORRD", 5);  /* five hex values for SAS colors */
RGB = Hex2RGB(ramp);          /* convert to 5 x 3 matrix of RGB values */
print RGB[c={"Red" "Green" "Blue"} r=ramp];
t_hexrgb

In the DATA step, you can use the same functions to convert hexadecimal colors. However, the SUBSTR function in the DATA step does not accept a vector of positions, so you need to make three separate function calls.

In a 2004 paper, Perry Watts provides a SAS macro called %RGBDec that you can use to convert a hexadecimal color into a string that contains the RGB values.

Post a Comment