What is the coefficient of variation?

I sometimes wonder whether some functions and options in SAS software ever get used. Last week I was reviewing new features that were added to SAS/IML 13.1. One of the new functions is the CV function, which computes the sample coefficient of variation for data.

Maybe it is just me, but when I compute descriptive statistics for univariate data, the coefficient of variation is not a statistic that I look at. I don't think my undergraduate statistics course even mentioned the coefficient of variation (CV). I first encountered the idea many years later when learning about distribution theory.

The CV is a simple idea. For a distribution, the coefficient of variation is the ratio of the standard deviation to the mean: CV = σ/μ. You can estimate the coefficient of variation from a sample by using the ratio of the sample standard deviation and the sample mean, usually multiplied by 100 so that it is on the percent scale. This ratio is also known as the relative standard deviation when the data are positive.
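To make the formula concrete, here is a small Python sketch (not SAS code; purely illustrative, using the standard statistics module) that computes the sample CV on the percent scale:

```python
import statistics

def sample_cv(data):
    """Sample coefficient of variation on the percent scale: 100 * stddev / mean."""
    return 100 * statistics.stdev(data) / statistics.mean(data)

heights_m  = [1.2, 0.9, 1.5, 1.0, 1.4]        # heights in meters
heights_cm = [100 * h for h in heights_m]     # the same plants, in centimeters

# The CV is dimensionless, so both unit systems give the same answer
print(round(sample_cv(heights_m), 1))    # 21.2
print(round(sample_cv(heights_cm), 1))   # 21.2
```

Because the standard deviation and the mean carry the same units, the units cancel in the ratio, which is why the two calls print the same value.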

What does the coefficient of variation mean?

The coefficient of variation is a dimensionless quantity. As such, it provides a measure of the variability of a sample without reference to the scale of the data.

Suppose I tell two people to measure the heights of some plants. The first person reports that the average height is 1.2 meters, with a standard deviation of 0.275 meters. The second person measures the same plants in centimeters. She reports that the average height is 120 centimeters, with a standard deviation of 27.5 centimeters. Obviously, these are the same answers, but one person reports a standard deviation of 0.275 (which sounds small) whereas the other person reports a standard deviation of 27.5 (which sounds big). The coefficient of variation comes to the rescue: for both sets of measurements the coefficient of variation is 22.9.

The CV can also help you compare two completely different measurements. How does variation in height compare to variation in weight? Or age? Or income? These variables are measured on different scales and use different units, but the CV (which is dimensionless) enables you to compare the variation of these variables.

How to compute the coefficient of variation in SAS

The coefficient of variation is computed by several SAS procedures: MEANS, UNIVARIATE, IML, TABULATE, and so forth. The following example shows data for the plant measurement example in the previous paragraph. The MEANS and IML procedures compute the CV for measurements on the meter and centimeter scales:

data Plants;
input height @@;
cm = height * 100;      /* same measurements, in centimeters */
datalines;
1.6 1.5 .8 1.0 1.2 .9 1.2 1.8 1.2 1.3 1.3 .9 1.2 1.0 1.1
;
run;

proc means data=Plants N mean std cv;
run;

proc iml;
use Plants; read all var _NUM_ into X[c=varNames]; close;
cv = cv(X);
print cv[c=varNames];

Theoretical uses of the coefficient of variation

The coefficient of variation has some interesting uses as a theoretical tool. It enables you to compare the variation between different probability distributions. As I mentioned in my article on fat-tailed and long-tailed distributions, the exponential distribution is an important reference distribution in the theory of distributions. Because the standard deviation and the mean of an exponential distribution are equal, the exponential distribution has a CV equal to 1. Distributions with CV < 1 are considered low-variance distributions. Distributions with CV > 1 are high-variance distributions.

Obviously the coefficient of variation is undefined for any distribution whose mean is zero, such as the normal and t distributions centered at the origin, which is perhaps why the CV is not widely used. The sample CV is undefined for centered data and is highly variable when the population mean is close to zero.

Do you use the coefficient of variation?

Have you ever used the coefficient of variation in a real data analysis problem? Is the CV a useful but underutilized statistic for practical data analysis? Or is it primarily a theoretical tool for comparing the variability of distributions? Leave a comment.

Post a Comment

The difference between RUN and CALL for SAS/IML subroutines

Have you ever noticed that some SAS/IML programmers use the CALL statement to call a subroutine, whereas others use the RUN statement? Have you ever wondered why the SAS/IML language has two statements that do the same thing?

It turns out that the CALL statement and the RUN statement do not do the same thing! Read on to discover how they differ.

By the way, the RUN statement in PROC IML has only one purpose: to execute a subroutine. Many other SAS procedures (such as REG, GLM, and DATASETS) use the RUN statement to tell SAS to run the procedure. PROC IML is different. Never put the RUN statement at the end of a PROC IML program.

Built-in versus user-defined subroutines

A main feature of the SAS/IML language is that it enables you to define your own modules. A module is a user-defined function or subroutine that you can call from an IML program. A module that returns a value is called a function module; a module that does not return a value is called a subroutine module.

Although I do not recommend that you do so, it is possible to define a subroutine module that has the same name as a built-in subroutine. These two routines co-exist. If you use the RUN statement, SAS/IML will call the subroutine that you wrote. If you use the CALL statement, SAS/IML will call the built-in subroutine.

So that's the difference between the RUN and CALL statements: The RUN statement looks first for a user-defined module with the specified name. If it finds it, then it calls that module. Otherwise, it calls the built-in subroutine that has the specified name. The CALL statement looks for a built-in subroutine and immediately calls that routine if it is found.
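The two resolution orders can be sketched in a language-neutral way. The following plain-Python analogy (with made-up table names; this is an illustration of the lookup order, not how PROC IML is implemented) captures the difference:

```python
builtin_table = {"eigen": lambda: "built-in EIGEN"}
user_modules = {}      # filled in when you define your own modules

def run(name):
    """RUN: prefer a user-defined module; fall back to the built-in."""
    routine = user_modules.get(name, builtin_table.get(name))
    return routine()

def call(name):
    """CALL: go straight to the built-in subroutine."""
    return builtin_table[name]()

user_modules["eigen"] = lambda: "user-defined EIGEN"   # override the name
print(run("eigen"))    # user-defined EIGEN
print(call("eigen"))   # built-in EIGEN
```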

An example: Overriding the EIGEN subroutine

As I said, I don't recommend that you routinely write user-defined modules that have the same name as a SAS/IML built-in, but let's examine a situation in which it might be convenient to do so. Let's assume that you often read data sets that represent symmetric matrices, and that these data sets are stored in lower triangular form with missing values in the upper triangular portion of the matrix. For example, this is the default storage method for distance matrices that are created by using the DISTANCE procedure in SAS/STAT software, as shown below:

proc distance data=Sashelp.Cars out=Dist method=Euclid;
   where Type="Truck";
   var interval(Horsepower--Length / std=Std);
   id Model;
run;
proc iml;
use Dist;   read all var _NUM_ into D[r=Model];   close Dist;
call heatmapcont(D) title="Distance Matrix";

The program creates a 24 x 24 symmetric matrix of distances between 24 observations in a six-dimensional space of variables. The program reads the distance matrix into a SAS/IML matrix. A heat map shows the matrix values. The dark gray color shows that the upper triangular portion of the matrix is missing. The white diagonal elements show that the diagonal of the matrix is exactly zero. The remaining cells indicate the distance between observations. Nearly white cells indicate that two observations are close together. Dark cells indicate that observations are relatively far apart.

Suppose that you want to compute the eigenvalues and eigenvectors of this matrix. The built-in EIGEN subroutine in SAS/IML can compute these quantities, but it expects a matrix that does not have any missing values. Therefore to compute the eigenvalues you should first extract the lower triangular elements of the matrix and copy them into the upper triangular portion of the matrix.

You could write a separate subroutine that only copies the lower triangular elements, but for this example I will write a subroutine that has the same name as the built-in EIGEN subroutine. The following statements define a module that inspects the upper triangular elements of a matrix. If the upper triangular elements are all missing, it replaces those missing values with the lower triangular elements. It then calls the built-in EIGEN function to compute the eigenvalues and eigenvectors:

start StrictLowerTriangular(X);    /* return lower triangular elements */
   return( remove(vech(X), cusum(1 || (ncol(X):2))) );
finish;

/* define EIGEN module as a custom override of the built-in subroutine */
start Eigen(eval, evec, A);
   r = row(A); c = col(A);
   UpperIdx = loc(c>r);
   /* if upper triangular elements are missing, copy from lower */
   if all(A[UpperIdx]=.) then
      A[UpperIdx] = StrictLowerTriangular(A);
   call eigen(eval, evec, A);          /* CALL the built-in EIGEN subroutine */
finish;

run eigen(eval, evec, D);              /* RUN the user-defined EIGEN subroutine */
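For readers who work outside SAS, the same copy-the-lower-triangle idea is easy to express elsewhere. Here is a NumPy sketch that I wrote for illustration (my own code, not part of the SAS program; it assumes the missing upper-triangular entries are stored as NaN):

```python
import numpy as np

def symmetrize_lower(A):
    """If the strict upper triangle is all NaN, mirror the lower triangle into it."""
    A = A.copy()
    upper = np.triu_indices_from(A, k=1)   # indices of the strict upper triangle
    if np.all(np.isnan(A[upper])):
        A[upper] = A.T[upper]              # A.T[i,j] is the lower-triangular A[j,i]
    return A

D = np.array([[0.0,    np.nan, np.nan],
              [1.0,    0.0,    np.nan],
              [2.0,    3.0,    0.0   ]])
S = symmetrize_lower(D)
eigvals = np.linalg.eigvalsh(S)            # eigenvalues of the full symmetric matrix
```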

In summary, the CALL and RUN statements enable you to choose between a built-in subroutine and a custom subroutine that have the same name.

The situation is a little different for user-defined functions. Beginning with SAS/IML 13.1, you can define a function that has the same name as a built-in function, and the order of resolution for calling functions ensures that a user-defined function is found before a built-in function that has the same name. This means that if you choose to override a SAS/IML function, you lose the ability to call the built-in function until you quit PROC IML.

Have you ever had the need to override a built-in SAS/IML subroutine? Let me know the details by leaving a comment.

Post a Comment

The distribution of Pythagorean triples


When I studied high school geometry, I noticed that many homework problems involved right triangles whose side lengths were integers. The canonical example is the 3-4-5 right triangle, which has legs of length 3 and 4 and a hypotenuse of length 5. The triple (3, 4, 5) is called a Pythagorean triple because it satisfies the Pythagorean theorem: 3² + 4² = 5². Similarly, (5, 12, 13) and (7, 24, 25) are Pythagorean triples that sometimes appear in geometry textbooks.

In general, a triple of natural numbers (a, b, c) is a Pythagorean triple if a² + b² = c². Obviously if (a, b, c) is a Pythagorean triple, then so is (ka, kb, kc) for any natural number k. Geometrically, this transformation is not very interesting because it simply scales a right triangle into a similar triangle that has the same shape. I did not learn any trick in high school that enabled me to start with a Pythagorean triple and generate a new, geometrically different, right triangle.

Generating Pythagorean triples

Because my high school geometry book reused the same three or four triples over and over again, I assumed that it is difficult to generate Pythagorean triples. In fact, that is not the case: There are several formulas for generating Pythagorean triples, some of which were known to the ancient Greeks.

One interesting algorithm comes from the 20th century and involves linear transformations via matrices. Here's how it works. Write any Pythagorean triple, v, as a row vector. Then the linear transformation v*L is a new Pythagorean triple, where L is any of the three following linear transformations:


For example, if you represent v = (3, 4, 5) as a row vector, then v*L1 = (5, 12, 13). You can multiply on the right again to get another new triple, such as v*L1*L3 = (45, 28, 53).

In fact, more is true. A primitive Pythagorean triple is a triple in which the three numbers have no common divisor greater than 1; that is, they are relatively prime. It turns out that all primitive Pythagorean triples can be obtained by iteratively applying one of three linear transformations to the triple (3, 4, 5). In other words, the triple (3, 4, 5) is the "parent" of all primitive Pythagorean triples!

That is a fabulous result. It means that you can write a program that uses matrix multiplication to produce arbitrarily many primitive Pythagorean triples, as follows:

  • Start with v = (3, 4, 5).
  • Apply L1, L2, and L3 to create the first generation of children.
  • Apply the three transformations to each of the children to create a generation of nine grandchildren.
  • Continue this process. The ith generation will contain 3ⁱ triples.

You can implement this algorithm in the SAS/IML matrix language as follows:

proc iml;
start GenerateTriples(depth=3);
   L1 = { 1  2  2,
         -2 -1 -2, 
          2  2  3};
   L2 = { 1  2  2,
          2  1  2,
          2  2  3};
   L3 = {-1 -2 -2,
          2  1  2,
          2  2  3};
   NumTriples = sum(3##(0:Depth));  /* total number of triples */
   triples = j(NumTriples,3,.);     /* allocate array for results */
   triples[1,] = {3 4 5};           /* parent triple */
   k = 1; n = 2;
   do i = 1 to Depth;               /* for each generation */
      do j = 1 to 3##(i-1);         /* generate 3##i children */
         x = triples[k,];
         triples[n,]   = x*L1;
         triples[n+1,] = x*L2; 
         triples[n+2,] = x*L3; 
         k = k + 1;   n = n + 3;
      end;
   end;
   return( triples );
finish;

p = GenerateTriples();
print p;

The parent and the first three generations account for 1 + 3 + 9 + 27 = 40 triples. The list contains some old familiar friends, as well as a few strangers that I haven't met before.
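If you want to cross-check the construction outside of SAS, here is an equivalent sketch in plain Python (the function names are mine, not from the SAS program):

```python
# The three transformation matrices from the text
L1 = [[ 1,  2,  2], [-2, -1, -2], [2, 2, 3]]
L2 = [[ 1,  2,  2], [ 2,  1,  2], [2, 2, 3]]
L3 = [[-1, -2, -2], [ 2,  1,  2], [2, 2, 3]]

def matvec(v, M):
    """Row vector times matrix: return v*M as a tuple."""
    return tuple(sum(v[i] * M[i][j] for i in range(3)) for j in range(3))

def triples(depth):
    """Generate the parent (3,4,5) plus `depth` generations of children."""
    gen = [(3, 4, 5)]
    out = list(gen)
    for _ in range(depth):
        gen = [matvec(v, M) for v in gen for M in (L1, L2, L3)]
        out.extend(gen)
    return out

ts = triples(3)        # 1 + 3 + 9 + 27 = 40 triples
```

Every generated triple satisfies a² + b² = c², which you can verify by looping over the list.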

The distribution of Pythagorean triples

What do the Pythagorean triples look like if you plot their location in a Cartesian coordinate system? It turns out that they make fascinating patterns!

Because the hypotenuse length is a function of the leg lengths, it suffices to plot the locations of a and b in the coordinate plane. However, it turns out that when the algorithm produces a particular triple such as (5, 12, 13), it does not produce the symmetric triple (12, 5, 13). To make the pattern more symmetric, we can manually add the triple (b, a, c) whenever the triple (a, b, c) appears.

The following SAS/IML statements construct many primitive Pythagorean triples and create a scatter plot of the ones for which a and b are both less than 4,500:

p = GenerateTriples(15);                  /* generate many triples */
m = p[ loc(p[,1]<=4500 & p[,2]<=4500), ]; /* exclude large ones */
mm = m // m[,{2 1 3}];                    /* generate symmetric pairs */
ods graphics / width=600px height=600px;
title "Primitive Pythagorean Triples";
title2 "Depth = 15, Symmetric";
call scatter(mm[,1], mm[,2]) procopt="aspect=1"  
     option="markerattrs=(size=3 symbol=SquareFilled)";

The image is shown at the beginning of this article. You can see certain parabolic curves that have a high density of points. There is a similar plot in Wikipedia that includes nonprimitive triples. The nonprimitive triples "fill in" some of the apparent "gaps" in the pattern.

These interesting patterns hint at deep relationships in number theory between Diophantine equations and the geometry of the solutions. Are you tempted to generate a similar image for the generalized equation a³ + b³ = c³? Don't bother: Fermat's Last Theorem states that this equation has no solutions in positive integers!

Post a Comment

An efficient way to increment a matrix diagonal

I was recently asked about how to use the SAS/IML language to efficiently add a constant to every element of a matrix diagonal. Mathematically, the task is to form the matrix sum A + kI, where A is an n x n matrix, k is a scalar value, and I is the identity matrix. The person who asked the question was dealing with large matrices. He knew that explicitly forming the identity matrix is inefficient and that adding n² − n zeros to A is unnecessary, but he wasn't sure how to avoid it.

I have previously blogged about how to assign the diagonal elements of a matrix. In that post I wrote a little helper function that will set the diagonal of a matrix to a given vector or scalar value. The function is repeated below:

proc iml;
start SetDiag(A, v);
   diagIdx = do(1,nrow(A)*ncol(A), ncol(A)+1); /* index diagonal elements */
   A[diagIdx] = v;                             /* set diagonal elements */
finish;

This function sets the diagonal elements of the matrix A to the value v, which can be a vector or a scalar. To increment (rather than replace) the diagonal, use the VECDIAG function to extract the current diagonal values of the matrix. Add a constant and call the function like this:

m = j(5000, 5000, 1);              /* create a 5000 x 5000 matrix */
run SetDiag(m, vecdiag(m) + 100);  /* increment diagonal elements by 100 */
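The same idea carries over directly to other matrix languages. Here is a NumPy sketch for illustration (my own code, using a smaller matrix than the 5000 x 5000 example to keep it light):

```python
import numpy as np

n, k = 1000, 100
A = np.ones((n, n))

# Naive approach: forms an n x n identity matrix and performs n*n additions
#   A_slow = A + k * np.eye(n)

# Efficient approach: touch only the n diagonal elements, in place
idx = np.arange(n)
A[idx, idx] += k       # increment the diagonal by the scalar k
```

NumPy also provides np.fill_diagonal for the replacement case, which plays the role of the SetDiag helper function.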

Timing the performance


Recall that it is easy to use the TIME function in SAS to compare the performance of two competing algorithms. The plot at the left shows the time required to increment a matrix diagonal for a sequence of large matrices, up to 15,000 x 15,000. The blue line represents the naive computation
A + kI. Notice that the time increases quadratically with the size of the matrix because this algorithm adds n² numbers to the matrix.

In contrast, the red line shows the time that it takes to form a vector that indexes the diagonal elements of a matrix and to replace only those elements. Theoretically, the time should be linear with the size of the matrix, but the computation is so fast that the elapsed time registers as instantaneous for even the largest matrix.

In summary, when working with large matrices, you can save lots of computational effort if you avoid unnecessary computations. Never multiply or add large diagonal matrices unless absolutely necessary. Instead, think about whether it is possible to index into the matrix and change only those numbers that you need to change.

By the way, increasing the diagonal of a matrix is sometimes called "ridging a matrix" and it is used in ridge regression. Ridging a matrix is also sometimes used in numerical computations that involve estimated covariance matrices. Sometimes an estimate is not positive definite, and ridging is one approach to try to obtain an estimate for a covariance matrix that has this important property.

Post a Comment

The distribution of blood types by country

My colleague Robert Allison has a knack for finding fascinating data. Last week he did it again by locating data about how blood types and Rh factors vary among countries.

He produced a series of eight world maps, each showing the prevalence of a blood type (A+, A-, B+, B-, AB+, AB-, O+, and O-) in various countries around the world. As I studied his maps, I noticed that the distribution of blood types in certain ethnic groups (Chinese, Japanese, Indians,...) was different than the distribution in Western Europe and former European colonies.

When dealing with multivariate data, a single visualization is rarely sufficient to answer all the questions that you can ask. Robert's maps answer the question, "What is the spatial distribution of each blood type?" I was curious about a different question: "Within each country, what is the distribution of blood types?" To answer my question, I needed a different visualization of the same multivariate data.


My attempt is shown to the left. (Click to enlarge.) The graph is a stacked bar chart of the percentage of blood type for 63 countries, sorted by the percentage of types that have positive Rh factors. Blood types with positive Rh factors are plotted on the right; negative Rh factors are plotted on the left. The A+ and A- types are plotted closest to the 0% reference line. The next types are AB, B, and O, in increasing distances from the 0% reference line.

A few ethnic differences are apparent. At the top of the chart are Western European countries and former European colonies such as Brazil, Australia, and New Zealand. A little lower on the list are countries in Eastern Europe and Scandinavia.

After that, the list starts to get geographically jumbled. The United States, Canada, and South Africa were all settled by people of multiple ethnicities. The middle of the list is dominated by countries from the Middle East, Northern Africa, and the Near East.

The next set of countries include South American countries such as Argentina and Bolivia, Caribbean countries, India, and African countries.

Finally, the bottom of the list features Asian populations such as China, Japan, Mongolia, and the Philippines. These populations have almost no negative Rh factors in their blood. The distributions of blood types in those countries are similar to each other, although regional differences appear, such as the relatively low frequency of A+ blood in Thailand.

This one-dimensional ranking of countries by blood types reflects historical connections between peoples as a result of conquest, trade, and colonization.

A few countries seem "out of place" in their list order. Lebanon, Ireland and Iceland, and Peru and Chile are some of the countries whose distributions of blood types differ from those of the countries adjacent to them in the list.

Relationships between countries

Some of the "out of place" countries are probably a result of the fact that it is hard to linearly order the countries when there are eight variables to consider. Principal component analysis (PCA) is a statistical technique that can group observations according to similar characteristics. In SAS software, you can use the PRINCOMP procedure to conduct a principal component analysis.

The analysis reveals that 81% of the variation in the data can be explained by the first two principal components. About 92% can be explained by using three principal components, which means that the eight variables (percentages of each blood type) fit well into these lower-dimensional linear subspaces.
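The "percent of variation explained" figures come from the eigenvalues of the covariance matrix of the variables. A generic NumPy sketch with simulated stand-in data (not the actual blood-type data) shows the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(63, 8))             # stand-in for 63 countries x 8 blood-type percentages

Xc = X - X.mean(axis=0)                  # center each column
cov = (Xc.T @ Xc) / (len(X) - 1)         # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first

explained = eigvals / eigvals.sum()      # proportion of variance per component
cum2 = explained[:2].sum()               # variance explained by the first two PCs
```

For the real data, the analogous quantity cum2 is the 81% figure quoted in the previous paragraph.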


The score plot from a two-dimensional PCA analysis is shown to the left. (Click to enlarge.) I added colors to the data to indicate a geographical region for the countries; the regions came from the United Nations list of countries and geographic regions. This plot shows the relationships between countries based on similarities in the distribution of blood types.

The middle of the plot contains African and West Asian nations. (West Asia is the UN name for the region that many people call the Middle East.) The right side of the plot is dominated by European countries and their former colonies. The upper left quadrant contains the Asian countries. The lower left quadrant includes Caribbean, Central American, and South American countries. This presentation once again shows that the distribution of blood types in Peru and Chile are different from other countries, but are similar to each other.

You can download the data and the SAS program that analyzes it and do additional analyses.

What interesting features can you find in these data? Are there other ways to view these data? Leave a comment.

Post a Comment

Binning data by quantiles? Beware of rounded data

In my article about how to create a quantile plot, I chose not to discuss a theoretical issue that occasionally occurs. The issue is that for discrete data (which includes rounded values), it might be impossible to use quantile values to split the data into k groups where each group has approximately the same number of elements. To put it bluntly, the algorithm that I proposed for creating a quantile bin plot can fail for large values of k.

The problem can occur for data that are rounded to the nearest unit, such as age, height, weight, and blood pressure. The algorithm assumes that the data quantiles are (mostly) unique and that they divide the empirical cumulative distribution function (ECDF) of N observations into k groups that each have approximately N/k observations. However, if there are more than N/k repeated values, the repeated value can occupy more than one quantile value. In fact, this will always happen if a particular value is repeated more than 2N/k times.
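You can see the failure mode in a few lines of Python (invented data, not the Sashelp.Heart values): when one value is repeated more than 2N/k times, several quantile cut points coincide, so fewer than k distinct bins exist:

```python
import statistics

# Invented data: half of the 100 observations equal the single value 80
data = sorted([80] * 50 + list(range(50, 75)) + list(range(85, 110)))

k = 10
cuts = statistics.quantiles(data, n=k)   # the 9 interior decile cut points
distinct = sorted(set(cuts))             # 80 occupies several deciles at once
```

Here the value 80 is itself the 30th, 40th, 50th, 60th, and 70th percentiles, so the deciles cannot produce ten nonempty bins.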

For example, suppose that you want to divide the diastolic blood pressure data in the Sashelp.Heart data set into 10 groups that each have approximately 10% of the data. The following statements draw the ECDF for the diastolic measurements and mark the deciles:

proc univariate data=Sashelp.Heart;
   var diastolic;
   cdfplot diastolic / vref=(10 to 90 by 10);
   ods select CDFPlot Quantiles;
run;

The value 80 appears in 711 of the 5209 observations, which is about 14% of the data. Notice that the value 80 intersects two reference lines, one at 30 and the other at 40. This means that 80 is both the 30th percentile and the 40th percentile. Assuming that you do not want to split the value 80 across two bins, the deciles of the data produce at most nine bins.

Another way to see this graphically is to use the RANK procedure to try to group the data into 10 groups, as described in the article "Grouping observations based on quantiles." The following statements create a new variable called Group, which for continuous data would have the values 0–9. However, for this rounded data only nine groups exist:

proc sort data=Sashelp.Heart out=Heart;
   by Diastolic;
run;

proc rank data=Heart out=Heart groups=10 ties=high;
   var Diastolic;        /* variable on which to group */
   ranks Group;          /* name of variable to contain groups 0,1,...,k-1 */
run;

proc sgplot data=Heart;
   scatter x=Group y=Diastolic / group=Group;
   refline 80 / axis=y;
   xaxis values=(0 to 9);
run;

As shown by the legend on the scatter plot, only nine groups were created, instead of the 10 that were asked for. Group 3 was not created.

I've described a problem, but what can you do about it? Not a lot. The problem is in the data. If you are flexible about the number of groups, you can try decreasing the value of k. Because each group contains approximately N/k observations, decreasing k might circumvent the problem. Another possibility is to jitter the data by adding a small amount of random noise. That will eliminate the problem of repeated values.

So I'll leave you with this warning: beware of using quantiles to bin rounded data into groups. Although the technique works great when almost all of the data values are distinct, you can run into problems if you ask for many bins and your data contain many repeated values.

Post a Comment

Recent additions to the SAS/IML file exchange

It has been three months since the introduction of the SAS/IML File Exchange, so I thought I'd give a short update on recent submissions and activity.

  • The web site administrators have created an easy-to-remember shortcut to the File Exchange:
  • There have been eight new articles submitted to the File Exchange. The application areas include
    • experimental design
    • discretization of a continuous variable
    • generalized linear models for complex sample data
    • multiple imputation
    • quality improvement

What I love about the SAS/IML File Exchange is that these programs were written by experts in their subject areas. Some programs are based on research that has been presented at conferences or in professional journals. Others are shorter examples and fun programs, such as Ian Wakeling's Sudoku Puzzle Generator! Together, they hint at the wide range of applications to which matrix computations and the SAS/IML language apply.

The SAS/IML file exchange is a valuable resource, but it relies on contributions from SAS/IML programmers who want to share their knowledge and programs. Have you written an interesting SAS/IML program? See the article "How to Contribute to the SAS/IML File Exchange" to find out how easy it is to share your work with others.

If you are more of a consumer than a creator of programs, the File Exchange still needs you. Download and try out the programs that are there. Use the interactive features of the SAS Community to rate the programs, to ask questions, to submit bug reports, or just to say thanks.

The community would love to hear from you. What are you waiting for?

Post a Comment

The next power of 2 and other tricks with logarithms

The other day I was doing some computations that caused me to wonder, "What is the smallest power of 2 that is greater than a given number?" The mathematics is straightforward. Given a number n, find the least value of k such that 2ᵏ ≥ n or, equivalently, k ≥ log2(n).

It is easy to implement this formula in the SAS DATA step or in the SAS/IML language. The LOG2 function computes the base-2 logarithm of a number, and you can use the CEIL function to return the next-higher integer. With a little effort, you can even return a missing value if n is not a positive number. Here is a SAS/IML function that implements the formula:

proc iml;
start NextPow2(n);
   p = j(nrow(n), ncol(n), .);         /* allocate array to missing values */
   idx = loc(n>0);                     /* identify valid inputs */
   if ncol(idx)>0 then 
      p[idx] = 2##ceil(log2(n[idx]));  /* apply formula to valid inputs */
   return( p );
finish;
/* test the function on valid and invalid data */
y = {-2 0.01 0.124 0.49 2 4.5 50 8100}`;
pow2 = NextPow2(y);  /* next greater power of 2 */
k = log2(pow2);      /* just get the exponent   */
print y pow2[format=FRACT.] k;

Notice that the FRACTw.d format enables you to format decimals as fractions in SAS.

By changing the 2s to 10s, you can use the same algorithm (and the LOG10 function) to find the next power of 10.

However, if you want to find the next power of an arbitrary number b, then you need to use a little logarithm trick. If b is the base of a logarithm, then logb(x) = log10(x) / log10(b). Because of this identity, the LOG10 function is the only function you need to compute logarithms for any base! By using this handy mathematical identity, you can generalize the NextPow2 function and write a function that computes the next power of b, where b is any positive number.

The following SAS/IML function computes the next highest power of b for a given set of input values:

/* given n, compute the next highest power of b */
start NextPow(n, b);
   p = j(nrow(n), ncol(n), .);         /* allocate array to missing values */
   idx = loc(n>0);                     /* identify valid inputs */
   if ncol(idx)>0 then do;
      x = n[idx];
      p[idx] = b##ceil(log10(x)/log10(b));
   end;
   return( p );
finish;
pow2 = NextPow(y,2);  /* next greater power of 2 */
pow3 = NextPow(y,3);  /* next greater power of 3 */
pow5 = NextPow(y,5);  /* next greater power of 5 */
print y pow2[format=FRACT.] pow3[format=FRACT.] pow5[format=FRACT.];
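The change-of-base trick ports directly to other languages. For example, Python's math.log accepts an explicit base argument, which is just this identity in disguise (an illustrative sketch of the same formula, not the SAS code):

```python
import math

def next_pow(n, b):
    """Smallest power of b that is greater than or equal to n; None if n <= 0."""
    if n <= 0:
        return None
    return b ** math.ceil(math.log(n, b))

print(next_pow(50, 2))    # 64
print(next_pow(50, 3))    # 81
print(next_pow(0.49, 2))  # 0.5
```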

I think this TRICK is a real TREAT!

Post a Comment

How to use frequency analysis to crack the Cryptoquote puzzle

Many people enjoy solving word games such as the daily Cryptoquote puzzle, which uses a simple substitution cipher to disguise a witty or wise quote by a famous person. A common way to attack the puzzle is frequency analysis. In frequency analysis you identify letters and pairs of letters (bigrams) that occur often in the enciphered text. It is a fact that certain letters (e, t, a, o, i, n,....) and bigrams (th, he, in, er, an, re, on,...) appear frequently in English text. You can use this fact to guess that the most frequently occurring symbol in the text might represent e, t, or a. You can use the bigram frequencies in the text to help you make statistically likely guesses.
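Counting the letter and bigram frequencies is only a few lines of code in most languages. Here is a Python sketch for illustration (the ciphertext and function names are my own, not part of the downloadable program):

```python
from collections import Counter

def frequencies(text):
    """Count letters and within-word bigrams in an enciphered message."""
    words = ["".join(ch for ch in w if ch.isalpha()) for w in text.upper().split()]
    letters = Counter(ch for w in words for ch in w)
    bigrams = Counter(w[i:i+2] for w in words for i in range(len(w) - 1))
    return letters, bigrams

cipher = "RHES ADLUI DEU VYP"
letter_counts, bigram_counts = frequencies(cipher)
top_letters = letter_counts.most_common(3)   # candidates for e, t, or a
```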

Usually frequency analysis is used for the initial guesses. Soon recognizable snippets of words begin to appear in the partially deciphered text, and you can often guess small words such as 'the', 'and', 'of', and 'for'. Then, like pulling a string on a sweater, the puzzle unravels. Use the context of the message to guess the remaining letters.

This article shows how to use frequency analysis to solve a cryptogram. I use a computer program to count the frequency of letters and bigrams and to quickly substitute guessed letters into the puzzle. However, the computer program merely automates a tedious and repetitive task; you can perform the same analysis with pencil and paper. The program does not make any guesses itself, but merely makes it easier for a human to make a guess. (Type 'cryptogram solver' into an internet search engine if you want an online solver.)

I use the convention that capital letters are symbols for the enciphered text whereas lowercase letters indicate the plaintext. Thus the plaintext message "word games are fun" might be enciphered as "RHES ADLUI DEU VYP." A partially deciphered solution might appear as "RHrS AaLeI are VYn," where the uppercase letters represent yet-to-be-solved letters.

Some functions that help decipher Cryptoquotes

In this blog I usually discuss statistical programming in the SAS/IML language. This post uses statistics and programming, but the discussion is about deciphering the Cryptoquote. The program that I use in this article is available for download for SAS users who have access to SAS/IML software. I wrote the following SAS/IML functions:

  • The PrintFreqRef routine prints reference tables that show how often certain letters and pairs of letters appear in English text. The tables are the results from previous blog posts and show the most frequently appearing letters, the most frequent bigrams, and the most frequent double-letter bigrams.
  • The FreqAnalysis routine reports the distribution of letters, bigrams, and double letters in the enciphered quote.
  • The SplitByWords function constructs an array from a quote. Each word is on its own row; each letter is in a separate column. The main purpose of this routine is to let me say "look at the 7th word" and have you find that word easily.
  • The ApplySubs function enables you to quickly see the result of replacing one or more cipher symbols with a guess. You can use the function to replace 'Q' with 't' and 'F' with 'e' and see whether the substitutions appear to make sense in the quote. If the substitution results in nonsense—such as 'te' as a two-letter word—then the substitution was not good and you can try a different guess.
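For readers without SAS/IML, the behavior of ApplySubs can be sketched in a few lines of Python (the name and signature mirror the description above; this is not the author's implementation):

```python
def apply_subs(words, symbols, letters):
    """Replace each uppercase cipher symbol with its lowercase guess,
    leaving unsolved symbols in uppercase (mimics the ApplySubs idea)."""
    table = str.maketrans(dict(zip(symbols, letters)))
    return [w.translate(table) for w in words]

w = "RHES ADLUI DEU VYP".split()        # enciphered "word games are fun"
print(apply_subs(w, "DEUP", "aren"))    # ['RHrS', 'AaLeI', 'are', 'VYn']
```

Because substitutions are cheap to apply and undo, you can test a guess, inspect the partially deciphered words, and back out if the result is nonsense.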

Solving the Cryptoquote by using frequency analysis

The following statements start PROC IML, load the utility modules, and print the reference tables that show the expected frequencies of letters and bigrams in English text.

proc iml;
load module=_all_;  /* load the utility modules */
run PrintFreqRef();

The reference tables show the relative frequencies of the most common letters, such as e, t, a, o, i, n, and s. The most frequent pairs of letters are th, he, in, er, and so forth. The most common double-letter bigrams are ll, ss, ee, oo, and so forth. If you like to solve word puzzles by pencil and paper, you can print this table and keep a copy in your wallet or purse. Click the table to get it to appear on a separate web page.

Now let's look at an enciphered quote and perform a frequency analysis on its symbols:

/* msg = the enciphered quote (most of the text is omitted here);
   the last line, which encodes the author's name, is shown below */
      "~ DSKSC GS FCPSH";
run FreqAnalysis(msg);

These tables look similar to the reference tables, but they are specific to this quote. You can create these tables by hand if you have the time and patience. In this quote, almost 20% of the symbols are an S. The next most frequent symbols are K and B. Furthermore, the bigrams KB and BS appear frequently. This co-occurrence of three symbols and two bigrams is a "trifecta" that gives strong statistical evidence that S=e, K=t, and B=h. The S=e guess is further supported by the fact that SS appears twice in the message, and 'ee' is a common double-letter bigram.
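You can automate a crude version of this reasoning by pairing the most frequent symbols in the message with the most frequent English letters. A Python heuristic sketch (the 'etaoin...' ranking is the standard English ordering; this generates candidate guesses only and is not the author's program):

```python
from collections import Counter

ENGLISH_RANK = "etaoinshrdlu"   # common English letters, most frequent first

def suggest_guesses(ciphertext, k=3):
    """Pair the k most frequent cipher symbols with the k most frequent
    English letters to generate candidate substitutions."""
    counts = Counter(c for c in ciphertext if c.isalpha())
    top_symbols = [sym for sym, _ in counts.most_common(k)]
    return dict(zip(top_symbols, ENGLISH_RANK[:k]))

print(suggest_guesses("SSSSKKKBB"))     # {'S': 'e', 'K': 't', 'B': 'a'}
```

Note that for this quote the heuristic correctly pairs S with 'e' and K with 't', but it would propose B='a' rather than the correct B='h'. The bigram evidence (KB and BS resembling 'th' and 'he') is what distinguishes 'h' from 'a'.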

Let's make the substitutions:

w = SplitByWords(msg);                /* split into array */
rownum = char(1:nrow(w));             /* row numbers, for reference */
w1 = ApplySubs(w, {K B S}, {t h e});  /* guess K=t, B=h, and S=e */
print w1[r=rownum];

It looks like we made a great guess! The first word is probably 'there', so the next substitution should include C=r. This guess is bolstered by the fact that CS appears several times in this message, and 're' is a frequent bigram. (The first word could also be 'these', but 'se' occurs less often than 're'.)

If that guess is correct, then we can further guess that O=a because that makes the second word 'are' and the 12th word 'that'.

To generate another guess, notice that the double-letter bigram TT appears in this quote in the 5th word. The bigram follows the plaintext bigram 'th', so TT must represent a double vowel. Because we have already deduced the symbol for 'e', we guess that TT is 'oo', which is the only other common double-vowel bigram. Therefore T=o.

Lastly, the 16th word is the author's first name. Since C=r, we guess that the first name might be 'peter'. The following statement makes these substitutions:

w2 = ApplySubs(w1, {O C}, {a r});
w3 = ApplySubs(w2, {T D}, {o p});
print w3[r=rownum];

Frequency analysis can suggest additional guesses. The symbols H, P, and W appear frequently in the enciphered quote. We also see that SH (=eH) is a frequent bigram. From the bigram frequency analysis, we might guess that H=n or H=s. By looking at the partially deciphered quote, it looks like W=n because the 5th word is probably 'parenthood', so we will guess H=s.

By looking at where the symbol P appears in the quote, you might guess that P is a vowel. The remaining vowel that occurs frequently is 'i'. This guess makes sense because the trigram PWJ appears twice, and this trigram could be 'ing'.
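Repeated trigrams like PWJ are easy to find programmatically. A short Python sketch (illustrative; the symbols in the usage example are toy data, not the puzzle's):

```python
from collections import Counter

def repeated_trigrams(text):
    """Return three-symbol sequences that occur more than once within words."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return [(tri, n) for tri, n in counts.most_common() if n > 1]

print(repeated_trigrams("XABCY ZABCQ"))   # [('ABC', 2)]
```

A repeated trigram in an English cryptogram is often a common word ending such as 'ing' or 'ion', which is exactly the reasoning used above.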

The following statements make these substitutions:

w4 = ApplySubs(w3, {H W G P J}, {s n d i g});
print w4[r=rownum];

Click this link to see the partially deciphered text that results from these substitutions. At this point, frequency analysis has revealed all but seven of the symbols in the enciphered quote. The remaining symbols are easily deduced by looking at the partially deciphered words and guessing the remaining letters. The deciphered quote is:

There are times when parenthood seems nothing but feeding the mouth that bites you.
     ~ Peter de Vries

Frequency analysis enabled us to guess values for certain frequently occurring symbols. By correctly guessing these key symbols, the puzzle was solved. Notice that the frequency of bigrams was essential in this process. If you use only the frequency of single letters, you might be able to guess the 'e' and 't', but the other letters are indistinguishable by looking at the empirical frequencies. However, by using bigrams, you can guess more symbols. For example, in this puzzle the bigrams helped us to discover the symbols for 'h', 'o', 's', and 'n'.

Of course, this process is not always linear. Sometimes you will make a guess that turns out to be wrong. In that case, eliminate the wrong guess and try another potential substitution.

To conclude this article, I will leave you with another Cryptoquote. I will generate the frequency analysis for you, in case you want to solve the puzzle by hand. You can download the SAS/IML program that deciphers this quote. The program contains comments that explain how frequency analysis is used to solve the puzzle.



Video: Using heat maps to visualize matrices

One of my presentations at SAS Global Forum 2014 was about the new heat map functions in SAS/IML 13.1. Over the summer I created a short video of my presentation, which gives an overview of visualizing matrices with heat maps, and describes how to choose colors for heat maps:

If your browser does not support embedded video, you can go directly to the video on YouTube.

If you prefer to read articles about heat maps so that you can study the concepts and cut and paste examples, here are a few recent blog posts that are based on my SAS Global Forum presentation:

For a fun application of discrete heat maps, see my article about implementing a one-dimensional cellular automata in SAS.
