How to choose colors for maps and heat maps

Have you ever looked at a statistical graph that uses bright, garish colors and thought, "Why in the world did that guy choose those awful colors?" Don't be "that guy"! Your choice of colors for a graph can make a huge difference in how well your visualization is perceived by the reader. This is especially true for heat maps and choropleth maps, in which the colors of small area elements must be distinguishable. This article describes how you can choose good color schemes for heat maps and choropleth maps. Most of the article is language-independent, but at the end I show how to create graphs with good color schemes in SAS.

All color schemes are not created equal

I first became aware of the importance of color design in statistical graphics when I looked at the incredibly beautiful Atlas of United States Mortality by Linda Pickle and her colleagues (1997). An example from the Atlas is shown to the left (image from Pickle et al., 1997, Atlas of United States Mortality). One of the reasons that the Atlas is so successful at conveying complex statistical data on maps is that it uses carefully designed color schemes and visualization strategies that were the result of pioneering research by Cynthia Brewer, Dan Carr, and many others. The color schemes were designed so that no single color visually dominates the others. Bright reds and yellows are nowhere to be found, and fully saturated colors are avoided in favor of slightly muted colors. The result is that graphs in the Atlas enable you to see variations in the data without being biased by visual artifacts that can result from poor color choices.

Choosing good colors for maps and graphs is not easy, which is why every statistician and graphics designer should know about Cynthia Brewer's web site. The web site enables you to choose from dozens of color schemes that were carefully chosen to be aesthetically pleasing as well as statistically unbiased.

Sequential, diverging, and qualitative schemes


Brewer's color schemes are implemented in SAS/IML 13.1 software through the PALETTE function. (Other statistical languages also provide access to these schemes, such as via the RColorBrewer package in R.) The PALETTE function enables you to get an array of colors by specifying the name of a color scheme. Some of the available color schemes are shown at the left. Brewer's "BLUES" color scheme is a single-hue progression that is similar to the two-color ramp in the SAS HTMLBlue ODS style. The "YLORRD" color scheme is a yellow-orange-red sequential color ramp that shows a blended progression from yellow to red.

The color ramps at the left are appropriate for visualizing data when it is important to differentiate high values from low values. For these so-called sequential color schemes, the reference level is the lowest value.

There are other possible color schemes. When the reference value is in the middle of the data range (such as zero or an average value), you should use a diverging color scheme, which uses a neutral color for the reference value. Low values are represented by using one hue and high values by using a different hue.


Examples of diverging color schemes are shown to the right. The first diverging color scheme ("BRBG" for brown-blue-green) is the default color ramp that is used to construct heat maps of discrete data in SAS/IML software. For example, a five-color version of that color ramp was used to visualize a binned correlation matrix in my recent article about how to create heat maps with a discrete color ramp. The "RDBU" scheme (a red-blue bipolar progression) is often used when one extreme is "good" (blue) and the other extreme is "bad" (red). The "SPECTRAL" color scheme is a useful alternative to the often-seen "rainbow" color ramp, which is notoriously bad for visualization and should be avoided. A spectral progression is often used for weather-related intensities, such as storm severity.


Finally, there are qualitative color schemes, as shown to the left. These schemes are useful for visualizing data that have no inherent ordering, such as ethnicities of people, different diseases, types of music, and other nominal categories.

Using color schemes with heat maps in SAS

Although these color schemes were designed for choropleth maps, the same principles are useful for designing heat maps with a small number of discrete categories. You can use the PALETTE function in SAS/IML software to choose effective colors for heat maps and other statistical graphics.

The PALETTE function is easy to use: Specify the name of the color scheme and the number of colors that you want to generate. These schemes are designed for a small number of discrete categories, so the schemes support between eight and 12 distinct and distinguishable colors.

As an example, suppose that you have computed a binned correlation matrix, but you want to override the default color scheme. For your application, you want negative values to be red and positive to be blue, so you decide on the bipolar "RDBU" color scheme. The following SAS/IML statements continue the example from my previous post:

/* The disCorr matrix is defined in the previous example. */
ramp = palette("RDBU", 5);                 /* create diverging 5-color ramp */
call HeatmapDisc(disCorr) colorramp=ramp   /* use COLORRAMP= option */
     title="Binned Correlations" xvalues=varNames yvalues=varNames;

As a second example, the heat map at the left uses a discrete set of colors to visualize the most frequently appearing bigrams (two-letter combinations) in an English corpus. The values of the matrix have been binned into five categories. Red cells indicate bigrams that appear more than 2% of the time. Dark orange cells indicate bigrams that appear between 1% and 2% of the time. Light orange represents between 0.5% and 1%; light yellow represents less than 0.5%. Gray cells represent bigrams that appear rarely or not at all. You can download the SAS program that computes this discrete heat map of bigram frequency.

Incidentally, you can also use the PALETTE function as an easy way to generate a continuous color ramp. You can use the PALETTE function to obtain three or five anchor colors and then interpolate between those colors to visualize intermediate values. In my article about hexagonal bin plots, the continuous color ramp used for the density map of ZIP codes was created by interpolating among the colors returned by palette("YLORRD", 5).
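The interpolation step is language-independent. Here is a minimal sketch in Python (not SAS; the anchor colors below are illustrative placeholders, not Brewer's exact values) that expands a few RGB anchor colors into a longer ramp by piecewise-linear interpolation:

```python
def interpolate_ramp(anchors, n):
    """Expand a short list of RGB anchor colors into n colors by
    piecewise-linear interpolation along the ramp."""
    if n == 1:
        return [anchors[0]]
    colors = []
    for i in range(n):
        t = i * (len(anchors) - 1) / (n - 1)   # position along the ramp
        k = min(int(t), len(anchors) - 2)      # index of the left anchor
        f = t - k                              # fraction toward the right anchor
        c0, c1 = anchors[k], anchors[k + 1]
        colors.append(tuple(round(a + f * (b - a)) for a, b in zip(c0, c1)))
    return colors

# Illustrative yellow-to-red anchors (placeholders, not Brewer's values)
ramp = interpolate_ramp([(255, 255, 178), (253, 141, 60), (189, 0, 38)], 5)
```

Most plotting libraries build continuous ramps from a short palette in essentially this way, which is why a handful of well-chosen anchor colors is enough.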

What method do you use to choose colors for maps or heat maps that display categories? Do you have a favorite color scheme? Leave a comment.


Create discrete heat maps in SAS/IML

In a previous article I introduced the HEATMAPCONT subroutine in SAS/IML 13.1, which makes it easy to visualize matrices by using heat maps with continuous color ramps. This article introduces a companion subroutine, HEATMAPDISC, which also requires SAS/IML 13.1 and is designed to visualize matrices that have a small number of unique values. This includes zero-one matrices, design matrices, and matrices of binned data.

I have previously described how to use the Graph Template Language (GTL) in SAS to construct a heat map with a discrete color ramp. The HEATMAPDISC subroutine simplifies this process by providing an easy-to-use interface for creating a heat map of a SAS/IML matrix. The HEATMAPDISC routine supports a subset of the features that the GTL provides.

A two-value matrix: The Hadamard design matrix

In my previous article about discrete heat maps, I used a Hadamard matrix as an example. The Hadamard matrix is used to make orthogonal array experimental designs. The following SAS/IML statement creates a 64 x 64 matrix that contains the values 1 and –1 and calls the HEATMAPDISC subroutine to visualize the matrix:

proc iml;
h = hadamard(64);                      /* 64 x 64 Hadamard matrix */
ods graphics / width=600px height=600px;  /* make heat map square */
run HeatmapDisc(h);

The resulting heat map is shown to the left. (Click to enlarge.) You can see by inspection that the matrix is symmetric. Furthermore, with the exception of the first row and column, all rows and columns have an equal number of +1 and –1 values. There are also certain rows and columns (related to powers of 2) whose structure is extremely regular: a repeating sequence of +1s followed by a sequence of the same number of –1s. Clearly the heat map enables you to examine the patterns in the Hadamard matrix better than staring at a printed array of numbers.
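If you want to verify these structural claims programmatically, here is a short sketch (in Python rather than SAS/IML) that builds a Hadamard matrix by the Sylvester construction and checks the symmetry, orthogonality, and balance properties described above:

```python
def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix (n must be a power of 2) by the
    Sylvester construction: H_{2k} = [[H_k, H_k], [H_k, -H_k]]."""
    H = [[1]]
    while len(H) < n:
        H = ([row + row for row in H] +
             [row + [-x for x in row] for row in H])
    return H

H = sylvester_hadamard(64)
row_dot = sum(a * b for a, b in zip(H[1], H[2]))  # distinct rows are orthogonal
row_sum = sum(H[5])   # rows other than the first balance +1s and -1s
```

Note that SAS's HADAMARD function may order the rows differently than this construction, but the structural properties (symmetry, orthogonality, balanced rows) are the same.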

A binned correlation matrix

In my previous article about how to create a heat map, I used a correlation matrix as an example of a matrix that researchers might want to visualize by using a heat map with a continuous color ramp. However, sometimes the exact values in the matrix are unimportant. You might not care that one pair of variables has a correlation of 0.78 whereas another pair has a correlation of 0.82. Instead, you might want to know which variables are positively correlated, which are uncorrelated, and which are negatively correlated.

To accomplish this task, you can bin the correlations. Suppose that you decide to bin the correlations into the following five categories:

  1. Very Negative: Correlations in the range [–1, –0.6).
  2. Negative: Correlations in the range [–0.6, –0.2).
  3. Neutral: Correlations in the range [–0.2, 0.2).
  4. Positive: Correlations in the range [0.2, 0.6).
  5. Very Positive: Correlations in the range [0.6, 1].

You can use the BIN function in SAS/IML to discover which elements of a correlation matrix belong to each category. You can then create a new matrix whose values are the categories. The following statements create a binned matrix from the correlations of 10 numerical variables in the Sashelp.Cars data set. The HEATMAPDISC subroutine creates a heat map of the result:

use Sashelp.Cars;
read all var _NUM_ into Y[c=varNames];
corr = corr(Y);
/* bin correlations into five categories */
Labl = {'1:V. Neg','2:Neg','3:Neutral','4:Pos','5:V. Pos'};        /* labels */
bins= bin(corr, {-1, -0.6, -0.2, 0.2, 0.6, 1});           /* BIN returns 1-5 */
disCorr = shape(Labl[bins], nrow(corr));               /* categorical matrix */
call HeatmapDisc(disCorr) title="Binned Correlations"
     xvalues=varNames yvalues=varNames;

The heat map of the binned correlations is shown to the left. The fuel economy variables are very negatively correlated (dark brown) with the variables that represent the size and power of the engine. The fuel economy variables are moderately negatively correlated (light brown) with the vehicle price and the physical size of the vehicles. The size of the vehicles and the price are practically uncorrelated (white). The light and dark shades of blue-green indicate pairs of variables that are moderately and strongly positively correlated.

The HEATMAPDISC subroutine is a powerful tool because it enables you to easily create a heat map for matrices that contain a small number of values. Do you have an application in mind for creating a heat map with a discrete color ramp? Let me know your story in the comments.

You might be wondering how the colors in the heat maps were chosen and whether you can control the colors in the color ramp. My next blog post will discuss choosing colors for a discrete color ramp.


The frequency of bigrams in an English corpus

In last week's article about the distribution of letters in an English corpus, I presented research results by Peter Norvig who used Google's digitized library and tabulated the frequency of each letter. Norvig also tabulated the frequency of bigrams, which are pairs of letters that appear consecutively within a word. This article uses SAS to visualize the distribution of bigram frequencies.

Cryptanalysts use the frequency of bigrams as part of a statistical analysis of an encrypted message. Certain pairs of letters appear more frequently than others. For example, TH, HE, IN, and ER are the pairs of letters that appear most frequently in the 744 billion words that Norvig analyzed.
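Tabulating bigrams is straightforward in any language. As an illustration (this is a sketch of the idea, not Norvig's code), here is a Python function that counts consecutive letter pairs within words:

```python
from collections import Counter

def bigram_counts(text):
    """Count pairs of consecutive letters within each word
    (bigrams do not span blanks or punctuation)."""
    counts = Counter()
    for word in text.upper().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts

counts = bigram_counts("the theory of the thing")
```

Running the function on a large corpus instead of a toy phrase produces the kind of frequency table that Norvig published.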

You can download the SAS program that contains the data and creates all the graphs for this article.

The frequency distribution of bigrams

Norvig tabulated the frequencies of each bigram in the corpus. I created a SAS data set named Bigrams that contains the data. The following SAS/IML statements read the data into a 26 x 26 matrix, which is transposed so that I can visualize it later by using a heat map. In order to distinguish bigrams that occur rarely (FC in briefcase; YF in boyfriend) from bigrams that never or almost never appear (JT, QW, ZB,...), I assign a missing value to any bigram that appeared in the corpus less than 0.001% of the time. I then create an index vector (idx) that sorts the bigrams in descending order. Finally, I call the SCATTER subroutine to visualize the frequency distribution of the bigrams:

proc iml;
use Bigrams; read all var {"Prop"}; close;
M = T( shape(Prop, 26, 26) );      /* reshape and transpose data into matrix */
M[loc(M=0)] = .;                   /* replace structural zeros with missing */
V = colvec(M);                     /* vector of proportions */
call sortndx(idx, V, 1, 1);        /* idx sorts V in descending order */
/* scatter plot of all frequencies */
title "Relative Frequency of All Bigrams";
ods graphics / width=800 height=500;
N = countN(M);                    /* count of nonmissing proportions */
call scatter(1:N, V[idx[1:N]]) grid={x y} label={"Rank" "Proportion"};

The scatter plot shows the relative frequencies of 495 bigrams that appear in the corpus. There are 23 bigrams that appear more than 1% of the time. The top 100 bigrams are responsible for about 76% of the bigram frequency. The distribution has a long tail. Bigrams like OX (number 300, 0.019%) and DT (number 400, 0.003%) do not appear in many words, but they appear often enough to make the list.

To a cryptanalyst, the important part of the plot is that there are a small number of bigrams that appear more frequently than others. That means that if you are trying to decrypt a coded message (or solve the daily Cryptoquote!), you can use frequency analysis of bigrams to help crack a simple substitution cipher. The following statements create a scatter plot of the top 23 bigrams and label each marker by the bigram it represents:

Letters = "A":"Z";                             /* labels */
Pairs = rowcat(expandgrid(Letters, Letters));  /* all 26 x 26 pairs */
/* scatter plot of top bigrams */
TopN = ncol(loc(M>=0.01));            /* count of proportion > 0.01 */
TopIdx = idx[1:TopN];
TopPairs = Pairs[TopIdx];
title "Bigrams with Relative Frequency > 1%";
call scatter(1:TopN, V[TopIdx]) grid={x y} 
            label={"Rank" "Proportion"} datalabel=TopPairs;
print (T(V[TopIdx]))[c=TopPairs L="Top Bigrams" F=percent7.2];

The top bigrams are shown in the scatter plot to the left. Click to enlarge the graph. The bigram TH is by far the most common bigram, accounting for 3.5% of the total bigrams in the corpus. The bigram HE, which is the second half of the common word THE, is the next most frequent. The next most frequently occurring bigrams are IN, ER, AN, RE, and ON.


A continuous heat map of the proportions of bigrams

Creating a plot or printing out the top bigrams is useful, but as a statistician I am also interested in visualizing the frequency distribution of all 26 x 26 bigrams in a way that makes it clear whether a particular bigram is common, rare, or impossible. The best concise visualization of that information is a heat map of the 26 x 26 matrix of proportions.

By using the new HEATMAPCONT subroutine in SAS/IML 13.1, you can create heat maps that use a continuous color ramp to show values in a matrix. The following statement creates a heat map of the bigram proportions:

/* heat map of bigram frequency with continuous color ramp */
ods graphics / width = 900 height=800;
call HeatmapCont(M) xValues=Letters yValues=Letters 
          xaxistop=1 legendtitle="Proportion";

The relative frequencies are encoded by using the default two-color color ramp. The grey cells are bigrams that were not found in the corpus or were extremely rare. To find a particular bigram, trace down the heat map to find the row that corresponds to the first letter, then trace over until you reach the column of the second letter. Here are some observations about the relative frequencies of bigrams:

  • The vowels associate with almost all letters. When a vowel is the first letter of a bigram, only IY, UQ, and UW were not found in the corpus. When a vowel is the second letter, only QA, QE, QI, and QO were not found.
  • Vowels begin more than half of the most common bigrams.
  • The consonants N and R start many bigrams. All possible bigrams that begin with these consonants were found in the corpus. The consonants D and S are also frequently found at the beginning of a bigram. The consonants L, H, R, S, and T are often found as the second letter in a bigram. It surprised me that there are so many bigrams with H as the second letter. I can't think of many words that contain BH (abhor), DH (adhere), or HH (hitchhiker).
  • The letters J, Q, V, X, and Z have only a few friends that they hang out with. There are few bigrams that begin or end with those letters.
  • Some letters like to lead, others like to follow. The letter K is a leader: it begins bigrams frequently but is less likely to follow another letter. In contrast, the letter F is happy to follow other letters but is reluctant to take the lead.

The heat map visually emphasizes the most frequent and the impossible bigrams. If you want to see the very rare bigrams, create a heat map of the log-counts. You can also bin the values in the matrix and use a discrete set of colors to visualize the data. Both of these ideas are implemented in the SAS program for this article.

What interesting facts about bigrams can you find in the heat map? Are there any common or rare bigrams that surprise you? Leave a comment.


Designing a quantile bin plot

While at JSM 2014 in Boston, a statistician asked me whether it was possible to create a "customized bin plot" in SAS. When I asked for more information, she told me that she has a large data set. She wants to visualize the data, but a scatter plot is not very useful because of severe overplotting. I suggested several standard visualization options.

Unfortunately, she rejected these standard visualizations. She and her colleagues had decided that they needed a new type of binned plot that is constructed as follows:

  1. Divide the X and Y variables into quantiles. This creates a nonuniform grid in (X,Y) space. Each vertical strip and each horizontal strip contain about the same number of points. In particular, she wanted to use deciles, so that each strip contains about 10% of the total number of observations.
  2. The resulting grid has rectangles that vary in size, but you can still count the number of points in each rectangular bin.
  3. For each bin, put a marker at the mean X and mean Y value of the observations within the bin.
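The three steps above can be sketched in just a few lines. The following Python version (the function name and the toy example are mine, not the statistician's; the SAS/IML implementation follows below) bins points by the quantiles of each coordinate and computes the count and mean position within each occupied bin:

```python
from statistics import mean, quantiles

def quantile_bin_means(xs, ys, k=10):
    """Bin (x, y) points by the k-quantiles of each coordinate.
    Returns {(bx, by): (count, mean_x, mean_y)} for occupied bins."""
    cut_x = quantiles(xs, n=k, method="inclusive")  # k-1 interior cut points
    cut_y = quantiles(ys, n=k, method="inclusive")

    def bin_of(v, cuts):                 # bin index 0 .. k-1
        return sum(v > c for c in cuts)

    bins = {}
    for x, y in zip(xs, ys):
        bins.setdefault((bin_of(x, cut_x), bin_of(y, cut_y)), []).append((x, y))
    return {b: (len(p), mean(q[0] for q in p), mean(q[1] for q in p))
            for b, p in bins.items()}

# Tiny example: perfectly correlated data occupies only the diagonal bins
result = quantile_bin_means(list(range(100)), list(range(100)), k=10)
```

With k=10 each vertical and horizontal strip contains about 10% of the observations, which is the decile grid that she requested.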

This is exactly the sort of computation that SAS/IML is intended for: a custom algorithm for multivariate data that is not built into any SAS procedure. From previous blog posts, I already have many of the building blocks for this algorithm.

The following program reads the Sashelp.bweight data and loads the birth weight of 50,000 babies and the relative weight gain of their mothers. The subsequent SAS/IML statements carry out the algorithm that the statistician requested:

proc iml;
use sashelp.bweight;                     /* read birthweight data */
   read all var {momwtgain weight} into Z;
start bin2D(u, cutX, cutY);         /* define 2-D bin function */
   bX = bin(u[,1], cutX);           /* bins in X direction: 1,2,...,kx */
   bY = bin(u[,2], cutY);           /* bins in Y direction: 1,2,...,ky */
   bin = bX + (ncol(cutX)-1)*(bY-1);      /* assign bins 1,2,...,kx*ky */
   return( bin );
finish;
k = 10;                             /* divide vars into deciles  */
prob = (1:k-1)/k;                   /* vector 0.1, 0.2,..., 0.9  */
call qntl(qX, Z[,1], prob);         /* empirical quantiles for X */
call qntl(qY, Z[,2], prob);         /*    ...and Y               */
/* divide 2-D data into the k*k bins specified by the quantiles  */
cutX = .M || qX` || .P;
cutY = .M || qY` || .P;
b = bin2D(Z, cutX, cutY);          /* bin numbers 1-k^2*/
u = unique(b);                     /* which bins are occupied?   */
means = j(ncol(u), 2, .);          /* allocate matrix for means  */
count = j(ncol(u), 1, .);          /* allocate matrix for counts */
do i = 1 to ncol(u);               /* for each bin:              */
   idx = loc(b = u[i]);            /*   find obs in the i_th bin */
   count[i] = ncol(idx);           /*   how many obs in bin?     */
   means[i,] = mean( Z[idx,] );    /*   mean position within bin */
end;
title "Scatter Plot of Mean Values in Each Bin";
call scatter(means[,1], means[,2]) xvalues=qX` yvalues=qY` grid={x y};

The scatter plot shows the location of the mean value for each of the 100 bins. For most bins, the markers are not centered in the rectangle but are offset, which indicates that the distribution of points within the rectangles is not uniform. For example, the mean values in the fourth vertical column are all far to the left in their bins.

Sometimes you cannot predict the advantages and disadvantages of a new graph type until you begin using it with real data. For this quantile bin plot, there is one glaring weakness: the graph does not indicate how many observations are in each bin. The next section addresses this weakness.

Improving the quantile bin plot

For these data, the number of points in the bins varies greatly. One bin contains only 184 observations (0.3% of the 50,000 observations), whereas another has 972 observations (1.9%). The median number of observations in a bin is 465. Yet none of that information is available on the graph.

It seems like it would be useful to know how many observations are in each bin. Even if the exact numbers are not needed, it would be nice to know which bins have more observations than others. I propose adding a fourth requirement for the quantile bin plot: a way to visualize the number of points in each bin.

One approach is to represent the bin count by using the area of the mean marker. Another is to overlay the scatter plot on a heat map of the counts. I implemented both of these methods. You can download the SAS program that creates all the graphs in this article.


In SAS software, you can use the BUBBLE statement in the SGPLOT procedure to create markers whose areas are proportional to the bin counts. This is shown to the left. (Click to enlarge.) I also used data labels to indicate the exact count. Notice that the area of the smallest bubble (located about halfway across the bottom row) is about five times smaller than the area of the largest bubble (located near the upper right corner). If you have SAS 9.4m2, you can even use the new COLORRESPONSE= and COLORMODEL= options to draw each bubble in a color that reflects its count (not shown).
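Because it is the bubble areas (not the radii) that encode the counts, the radius must grow like the square root of the count; otherwise the size differences are visually exaggerated. A sketch of that scaling in Python (the maximum radius is an arbitrary choice of mine, and the counts are the ones quoted above):

```python
import math

def bubble_radii(counts, r_max=12.0):
    """Radii proportional to sqrt(count), so that bubble AREA is
    proportional to the count; the largest bubble gets r_max."""
    biggest = max(counts)
    return [r_max * math.sqrt(c / biggest) for c in counts]

radii = bubble_radii([184, 465, 972])   # smallest, median, and largest bins
```

The area ratio of the smallest to the largest bubble then equals 184/972, which matches the roughly fivefold difference noted above.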

I think this bubble plot is an improvement over the original design. You can quickly see the bin counts in two ways, which is an advantage. However, this bubble plot makes it harder to determine the mean within each bin, which is a disadvantage. I don't know how important the mean positions are to the person who proposed this graph.


Another visualization option is to overlay the scatter plot of mean positions on a heat map of the counts in each bin. This is shown to the left. Notice that this graph shows the full range of the data, whereas the range for the scatter plots was determined only by the mean values. This plot indicates that the two variables are positively correlated: the bins in the upper left and lower right have fewer observations than the bins in the lower left and upper right. The graph also shows that the third, sixth, and ninth vertical strips contain more observations than the other strips. This is because many of the original maternal weight gains were rounded to the nearest 5 pounds. Histograms and other bin plots often show artifacts when applied to data that are themselves rounded or binned.

I'd like some feedback on this design. Can you think of an application for this plot? Do you like the design choices I made? Download the SAS program and modify it to plot your own data. Leave a comment to let me know what you think.


Skew this

The skewness of a distribution indicates whether a distribution is symmetric or not. A distribution that is symmetric about its mean has zero skewness. In contrast, if the right tail of a unimodal distribution has more mass than the left tail, then the distribution is said to be "right skewed" or to have positive skewness. Similarly, negatively skewed distributions have long (or heavy) left tails.
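For concreteness, the adjusted sample skewness that most statistical packages report can be sketched as follows (a Python illustration of the standard adjusted Fisher-Pearson formula; check your software's documentation for its exact definition):

```python
import math

def sample_skewness(x):
    """Adjusted Fisher-Pearson sample skewness:
    n/((n-1)(n-2)) * sum(((x_i - mean)/s)^3), where s is the sample
    standard deviation (divisor n-1)."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in x)

sym = sample_skewness([1, 2, 3, 4, 5])   # symmetric data: skewness 0
pos = sample_skewness([1, 1, 1, 10])     # long right tail: positive skewness
```

The cubed standardized deviations are what make the statistic sensitive to tail asymmetry: a single large positive deviation contributes far more than several small negative ones.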

Many of us were taught that positive skewness implies that the mean value is to the right of the median. This rule-of-thumb holds for many familiar continuous distributions such as the exponential, lognormal, and chi-square distributions. However, not everyone is aware that this relationship does not hold in general, as discussed in an easy-to-read paper by P. von Hippel (2005, J. Stat. Ed.).

In SAS software, the formula for the skewness of a sample is given in the "Simple Statistics" section of the documentation. It is easy to compute the sample skewness in Base SAS and in the SAS/IML language. This article shows how.

Use PROC MEANS to compute skewness

The MEANS procedure supports many univariate descriptive statistics and is my go-to procedure for computing skewness. The following statements compute the mean, median, and skewness for two variables in the Sashelp.Cars data set. The variables are the length of the vehicles and the average fuel economy (miles per gallon) for the vehicles in city driving:

proc means data=Sashelp.Cars N mean median skew;
var Length MPG_City;
run;

The MEANS procedure does not create graphs, but the following box plots show the distributions of the two variables:


The Length variable has very small skewness (0.18). In the box plot, the diamond symbol represents the mean and the vertical line that divides the box is the median. The box plot indicates that the distribution of the vehicle lengths is approximately symmetric, and that the mean and median are approximately equal.

In contrast, the MPG_City variable has large positive skewness (2.78). The box plot indicates that the data distribution has a short left tail and a long right tail. The mean is to the right of the median, as is often the case for right-skewed distributions.

Computing the skewness in SAS/IML software

The SAS/IML 13.1 release introduced the SKEWNESS function, which makes it easy to compute the skewness for each column of a matrix. The following SAS/IML statements read the Length and MPG_City variables from the Sashelp.Cars data into a matrix. The SKEWNESS function computes the skewness of each column. The results are identical to the skewness values that were reported by PROC MEANS, and are not shown:

proc iml;
varNames = {"Length" "MPG_City"};
use Sashelp.Cars;
   read all var varNames into X;
close;
skew = skewness(X);
print skew[c=varNames];

I've always thought that it is curious that many "natural" distributions are positively skewed. Many variables that we care about (income, time spent waiting, the size of things,...) are distributions of positive quantities. As such, they are bounded below by zero, but the maximum value is often unbounded. This results in many variables that are positively skewed, but few variables that are negatively skewed. For example, there are 10 numerical variables in the Sashelp.Cars data, and all 10 have positive skewness. Similarly, most of the classical distributions that we learn about in statistics courses are either symmetric or positively skewed.

A canonical example of a negatively skewed distribution is the distribution of grades in a class—especially at schools that have grade inflation. The following data presents hypothetical grades for 30 students who took a test. The test might have been too easy because the median grade is 94.5:

scores = {100 97 67 84 98 100 100 96 87 72 90 100 95 96 89
          100 90 98 85 95 100  96 92 83 90 90 100 66 92 94};
skew = skewness(scores`);
print skew;
title "Class Test Scores";
title2 "Skewness = -1.52";
ods graphics / width=400 height=250;
call box(scores) type="HBox";

The skewness for this data is -1.52. The box plot indicates that a few students did not score well on the test, although three-quarters of the students received high marks. The box plot shows outliers in the left tail, and the mean is less than the median value.

Why you should care about skewness

Variables that have large skewness (positive or negative) might need special treatment during visualization and modeling. A large skewness value can indicate that the data vary over several orders of magnitude. Some statistics (such as confidence intervals and standard errors) assume that data is approximately normally distributed. These statistics might not be valid for small samples of heavily skewed data.

If you do encounter heavily skewed data, some statisticians recommend applying a normalizing transformation prior to analyzing the data. I have done this in several previous posts, such as the logarithmic transformation of the number of comments made on SAS blogs. In that analysis, the "Number of Responses" variable had a skewness of 4.12 and varied over several orders of magnitude. After applying a log transformation, the new variable had a skewness of 0.5. As a consequence, it was much easier to visualize the transformed data.

I'll end this article by challenging you to examine a mathematical brain teaser. Suppose that you have data for a positive variable X that exhibits large positive skewness. You can reduce the skewness by applying a log transform to the data. However, there are three common log transformations: you can form Y2 = log2(X), Ye = ln(X), or Y10 = log10(X). Which transformed variable will have the smallest skewness? Mathematically oriented readers can determine the answer by looking at the definition of the sample skewness in the "Simple Statistics" section of the documentation. Readers who prefer programming can use the DATA step to apply the three transformations to some skewed data and use PROC MEANS to determine which transformation leads to the smallest skewness. Good luck!
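If you prefer a quick empirical check outside of SAS, the following Python sketch applies the three log transformations to some positively skewed data (the values are arbitrary) and computes the sample skewness of each transformed variable; comparing the three numbers reveals the answer:

```python
import math

def sample_skewness(x):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in x)

data = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]     # arbitrary right-skewed values
s2  = sample_skewness([math.log2(v)  for v in data])
se  = sample_skewness([math.log(v)   for v in data])
s10 = sample_skewness([math.log10(v) for v in data])
```

(Hint: think about how the sample skewness formula behaves when you rescale a variable by a positive constant.)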


The frequency of letters in an English corpus

It's time for another blog post about ciphers. As I indicated in my previous blog post about substitution ciphers, the classical substitution cipher is no longer used to encrypt ultra-secret messages because the enciphered text is prone to a type of statistical attack known as frequency analysis.

At the root of frequency analysis is the fact that certain letters in English prose appear more frequently than other letters. In a simple substitution cipher, each letter is disguised by replacing it with some other symbol. However, a leopard cannot change its spots, and the frequencies of the symbols in an enciphered text can reveal which symbols correspond to certain letters. In a sufficiently long message, the symbols for common letters such as E, T, and A will appear often in the ciphertext, whereas the symbols for uncommon letters such as J, Q, and Z will occur rarely in the ciphertext.
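The frequency-counting idea is easy to prototype in any language. Here is an illustrative Python sketch (a SAS/IML version of the same computation appears at the end of this post) that tabulates the relative frequency of each letter in a string:

```python
from collections import Counter
from string import ascii_uppercase

def letter_frequencies(text):
    """Relative frequency of each letter A-Z in a string;
    blanks, punctuation, and digits are ignored."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return {ch: counts[ch] / n for ch in ascii_uppercase}

freq = letter_frequencies("ATTACK AT DAWN")
```

For a long ciphertext, sorting this dictionary by frequency and comparing it with the corpus frequencies below is the first step of a frequency-analysis attack.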

The Wikipedia article on frequency analysis contains a nice example of how frequency analysis can be used to decrypt simple ciphers. This type of frequency analysis is usually one step in solving recreational puzzles, such as the popular Cryptoquote puzzle that appears in many newspapers.

The frequency of letters in English text

The first task for a budding cryptanalyst is to determine the frequency distribution of letters in the language that he is deciphering. This is done by counting the number of times that each letter appears in a corpus: a large set of text (articles, books, web documents,...) that is representative of the written language. In today's world of digitized texts and rapid computing, it is easier than ever to compute the distribution of letters in a large corpus. This was done recently by Peter Norvig, who used Google's digitized library as a corpus. I highly recommend Norvig's fascinating presentation of the frequency distribution of letters, words, and N-grams.

Since Norvig did the hard work, it is simple to enter the relative frequencies of each letter into SAS and visualize the results, as follows:

/* frequency of letters in English corpus (from Google digital library) */
data Letters;
length Letter $1;
informat Prop PERCENT8.;
input Letter $ Prop @@;
datalines;
E  12.49% T   9.28% A   8.04% O   7.64% I   7.57% N   7.23% 
S   6.51% R   6.28% H   5.05% L   4.07% D   3.82% C   3.34% 
U   2.73% M   2.51% F   2.40% P   2.14% G   1.87% W   1.68% 
Y   1.66% B   1.48% V   1.05% K   0.54% X   0.23% J   0.16% 
Q   0.12% Z   0.09%
;
run;

title "Frequency of Letters in an English Corpus";
footnote H=8pt J=L "Data from P. Norvig:";
ods graphics / height=900 width=600;
proc sgplot data=Letters;
  format Prop percent8.2;
  xaxis grid label="Frequency" offsetmax=0.15;
  yaxis grid discreteorder=data reverse; 
  scatter y=Letter x=Prop / datalabel=Prop datalabelpos=right;
run;
title; footnote;

The graph shows that the letter E is the most common letter in the English language (12.5%), followed by T and A (9.3% and 8.0%, respectively). After the Big Three, the letters O, I, and N appear with similar frequencies. The letters S and R round out the list of the most frequent letters.

At the other end of the list, we see that X, J, Q, and Z are the least common letters in the corpus.

Although a simple substitution cipher can disguise the letters, it does not disguise the frequencies. Read on to see how this fact helps a cryptanalyst decipher messages that use a simple substitution cipher.

The distribution of letters in a short text

Suppose that I try to send a secret message by using a simple substitution cipher. In case the message is intercepted, I strip out all blanks and punctuation to make the ciphertext even harder to decipher. Here is the ciphertext, presented as an array of long character strings in SAS/IML:

proc iml;

The message looks formidable, but it isn't secure. Statistics—in the form of frequency analysis—can be used to suggest possible values for the letters in the text that occur most often. The following SAS/IML function counts the number of appearances of each letter in the string. It uses the SUBSTR function to convert the string into a SAS/IML vector, and then uses the LOC function to count the number of elements in the vector that match each letter. The function returns a 26-element array whose first element is the proportion of the string that matches 'A', whose second element is the proportion that matches 'B', and so on:

start FreqLetters(_msg);
   msg = upcase(rowcat( shape(_msg,1) )); /* create one long string */
   m = substr(msg,1:nleng(msg),1);        /* break into array */
   letters = "A":"Z";
   freq = j(1,ncol(letters),0);
   do i = 1 to ncol(letters);
      freq[i] = ncol(loc(m=letters[i]));
   end;
   return(freq / sum(freq));   /* standardize as proportion */
finish;

p = FreqLetters(msg);          /* proportion of A, B, C, ..., Z in message */

You can print out the raw numbers, but it is easier to visualize the proportions if you sort the data and make a graph. The following SAS/IML statements create the same kind of graph that appeared earlier, but this time the frequencies are for this particular short encrypted text:

/* compute frequencies; sort in descending order */
call sortndx(ndx, p`, 1, 1); 
pct = p[,ndx]; let = ("A":"Z")[,ndx];
print pct[c=let format=percent7.2];
title "Frequency of Letters in Cipher Text";
ods graphics / height=900 width=600;
call scatter(pct, let) grid={x} label="Frequency" datalabel=Pct
     option="datalabelpos=right markerattrs=(symbol=CircleFilled)"
     Other="format pct DataLabel PERCENT7.1;"+
           "yaxis grid discreteorder=data reverse "+
           "label='Cipher Letter';";

First, notice that SAS/IML 12.3 enables you to create this ODS graph entirely within SAS/IML. Notice that the distribution of (enciphered) letters in the short message looks similar to the previous graph (the distribution of letters in a corpus). You can think of the ciphertext as a sample of English text. As such, the frequency of letters in the cipher text is expected to be close to (but not identical to) the frequency of letters in the corpus. The frequencies in the small sample suggest that the most common letter, plaintext 'e', is probably being disguised as the ciphertext 'J', 'D', or 'U'. Similarly, the plaintext 't' and 'a' are probably disguised as 'J', 'D', or 'U', although they could also be 'B' or 'I'.

The reason frequency analysis is successful is not because it identifies certain letters with certainty—it doesn't! However, it indicates the probable identities of certain letters, and this greatly simplifies the number of possibilities that the cryptanalysts needs to check.
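The mechanics of forming those initial guesses can be sketched in a few lines of Python (the blog's own code is SAS/IML, so treat this as a language-independent illustration). The function pairs each ciphertext symbol with the English letter of the same frequency rank, using the corpus ordering from the table earlier in this post:

```python
from collections import Counter

ENGLISH_BY_FREQ = "ETAOINSRHLDCUMFPGWYBVKXJQZ"  # order from the corpus table

def initial_guesses(ciphertext):
    """Pair each ciphertext symbol with the English letter of the same
    frequency rank. These are starting guesses, not certainties."""
    ranked = [s for s, _ in Counter(c for c in ciphertext.upper()
                                    if c.isalpha()).most_common()]
    return {s: ENGLISH_BY_FREQ[i].lower() for i, s in enumerate(ranked)}
```

For example, on a ciphertext in which 'J' is the most common symbol, `initial_guesses` proposes 'J'='e' as the first substitution to try.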

Using frequency analysis as a method of cryptanalysis

For the purpose of solving a recreational puzzle such as the Cryptoquote, this kind of single-letter frequency analysis is often sufficient to begin breaking the code. The puzzle solver uses the frequency analysis to guess one or two of the substitutions (perhaps 'J'='e' and 'D'='t'). If the message still looks like gibberish after the guess, the guess was probably wrong, and the puzzler can make a different guess (perhaps 'J'='t' and 'U'='e'). Usually only a few steps of this guessing game are necessary until a substitution results in a partially deciphered text that looks like it might contain recognizable words. Then, like pulling a string on a sweater, the puzzle unravels as the puzzler uses existing knowledge to guide subsequent substitutions.

However, sometimes a single-letter frequency analysis is only the first step in attacking a cipher. A more sophisticated attack relies not only on the distribution of single letters, but also the frequency distribution of bigrams—pairs of adjacent letters that appear together. In my next blog post about ciphers, I will discuss bigrams and visualize the distribution of bigrams in SAS.
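Counting bigrams is no harder than counting single letters. Here is a small Python sketch (not the SAS/IML implementation of the upcoming post) that tallies adjacent letter pairs while ignoring case and non-letters:

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent letter pairs, ignoring case and non-letter characters."""
    letters = [c for c in text.upper() if c.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))

counts = bigram_counts("The theory of the thing")
print(counts.most_common(3))
```

In English text, pairs such as TH, HE, and IN dominate, which gives the cryptanalyst a second, independent statistical fingerprint to exploit.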


Read from one data set and write to another with SAS/IML

Many people know that the SAS/IML language enables you to read data from and write results to multiple SAS data sets. When you open a new data set, it is a good programming practice to close the previous data set. But did you know that you can have two data sets open simultaneously, one for reading and one for writing? The trick is to use the SETIN and SETOUT statements to specify the data sets that are used for reading and writing. Among other advantages, this technique enables you to use SAS/IML to analyze data that might be larger than can fit into the RAM on your computer.

This article is motivated by a question posted to the SAS/IML Support Community. The SAS/IML programmer was essentially running a BY-group analysis in SAS/IML. The user wanted to use the WHERE clause on the READ statement to read the data for each BY-group.

To illustrate the technique, consider the following BY-group regression analysis of the vehicles in the Sashelp.Cars data set. For each value of the Make variable (Acura, Audi, BMW,...), PROC REG performs an ordinary least-squares (OLS) analysis that uses the Wheelbase and Length variables to model the Weight of a vehicle. The parameter estimates are saved to the OutEst data set:

proc reg data=Sashelp.Cars noprint outest=OutEst;
   by Make;
   model Weight = Wheelbase Length;
run;

Let's create a similar analysis in the SAS/IML language. Although SAS/IML is often used to perform analyses that are not available from any SAS procedure, this simple and familiar analysis will enable us to concentrate on how to read and write data sets simultaneously. The following SAS/IML statements define a module that computes the parameter estimates for an OLS model. Because some of the BY groups result in singular linear systems, the GINV function is used to solve for the parameter estimates, rather than the more efficient SOLVE function.

proc iml;
start OLSParamEst(Y, _X);
   X = j(nrow(_X),1,1) || _X;   /* add intercept column to design matrix */
   XPX = X` * X;                /* cross-products      */
   /* handle singular systems by using GINV rather than SOLVE */
   Estimate = ginv(XPX) * X`*Y; /* parameter estimates */
   return( T(Estimate) );       /* return as row vector */
finish;

The BY-group analysis will be carried out one category at a time. The following statements open the input data set and read the BY-group categories. The output data set (which will contain the parameter estimates) is also opened for writing. Notice that you need to specify the names and types of the output variables on the CREATE statement, including the length of any character variables. Lastly, the SETIN and SETOUT statements are used to specify that the data sets that will be used for subsequent reading and writing operations:

use Sashelp.Cars;                   /* input data */
read all var {"Make"};              /* read BY var */
byGroup = unique( Make );           /* compute unique levels */
ByVal = BlankStr(nleng(Make));      /* set length of ByVal variable (12.3) */
OutVarNames = {"Intercept" "Wheelbase" "Length"}; 
Estimate = {. . .};                 /* output variables are numeric */
create RegOut from Estimate[r=ByVal c=OutVarNames];
setin Sashelp.Cars;                 /* make current for reading */
setout RegOut;                      /* make current for writing */

The SETIN statement means "every time you want to read, use the Sashelp.Cars data set." The SETOUT statement means "every time you want to write, use the RegOut data set." The following statements loop over the unique categories of the Make variable, and perform an OLS analysis for the vehicles that have a common make. You can use the INDEX statement to speed up reading the BY groups.

InVarNames = {"Weight" "Wheelbase" "Length"};    /* input variables */
index Make;                           /* index by BY group variable */
do i = 1 to ncol(byGroup);
   ByVal = byGroup[i];
   read all var InVarNames where(Make=(ByVal));  /* read from Cars */ 
   Estimate = OLSParamEst(Weight, Wheelbase||Length);
   append from Estimate[r=ByVal];                /* write to RegOut */
end;
close Sashelp.Cars RegOut;

The program computes 38 sets of parameter estimates, one for each unique value of the Make variable. Because these data will fit entirely into RAM, you could also perform this analysis by using the UNIQUE-LOC or UNIQUEBY techniques.

In summary, use the SETIN and SETOUT statements when you want to have two SAS data sets open simultaneously, one for reading and one for writing. By the way, you can also use the SETOUT statement to write to several data sets, such as writing predicted and residual values to one while writing parameter estimates to another.
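The read-one/write-one streaming pattern is not unique to SAS/IML. For readers who think in other languages, here is a hypothetical Python analogue that likewise assumes the input is sorted by the BY variable, reads one group at a time, and writes one summary row per group without holding the whole data set in memory:

```python
import csv
import io
from itertools import groupby
from statistics import mean

# Hypothetical input stream: rows sorted by Make, as Sashelp.Cars is.
src = io.StringIO("Make,Weight\nAcura,3500\nAcura,3900\nBMW,3400\n")
dst = io.StringIO()

reader = csv.DictReader(src)     # one stream open for reading...
writer = csv.writer(dst)         # ...and a second stream open for writing
writer.writerow(["Make", "MeanWeight"])
for make, rows in groupby(reader, key=lambda r: r["Make"]):
    weights = [float(r["Weight"]) for r in rows]
    writer.writerow([make, mean(weights)])   # one output row per BY group

print(dst.getvalue())
```

In a real program the `io.StringIO` buffers would be files opened with `open(...)`, so only one BY group is ever resident in RAM.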


Handling run-time errors in user-defined modules

I received the following email from a SAS/IML programmer:

I am getting an error in a PROC IML module that I wrote. The SAS Log says
NOTE: Paused in module NAME
When I submit other commands, PROC IML doesn't seem to understand them. How can I continue the program? The only thing that works is to QUIT and resubmit the entire PROC IML program.

I'm glad you asked this question. Handling run-time errors that occur in user-defined modules can be frustrating unless you understand what the "Paused in module" message means. If you haven't already read my article on scope, local, and global variables in SAS/IML, you might want to read it now. It explains about local scope (inside a module) versus main scope (outside of any module). Also, familiarize yourself with the difference between a syntax error and a run-time error.

A run-time error in a module causes PROC IML to pause

Remember that the "I" in "IML" stands for interactive. When there is a run-time error in a user-defined module, PROC IML calls the PAUSE statement. This causes the program execution to pause while still inside the module. While the program is paused, you have the opportunity to debug the module, primarily by using PRINT statements to display the values of the local variables in the module. When you "submit other commands," those statements are being executed from within the module, rather than at main scope, which is probably why you say that PROC IML "doesn't seem to understand" the commands.

When a run-time error occurs in a user-defined module in PROC IML, which includes modules in the IMLMLIB library, you have three options:

  • Fix the error and resubmit the entire PROC IML program.
  • Debug the problem and try to fix it from inside the module. For example, suppose the error occurs while trying to compute some quantity, Z, at Line 2 in the module. While the module is paused, you can assign Z a value and submit the RESUME statement. The program will continue executing from the next statement in the module. That is, from Line 3.
  • Submit the STOP statement to leave all modules and return to the main scope of the program. This is especially useful if the cause of the error was a bad parameter value. You can immediately call the module again with different parameter values.

The first option is easy, but if the program has just finished a long-running computation, you might want to try to salvage those computations, rather than restart the program.

Debug, fix, and RESUME

The following statements define a function that evaluates the quantity sqrt(x – 1). If you call the function with a value of x that is less than 1, a run-time error will occur. Let's intentionally cause an error so that the program pauses in the SQRTM1 module:

proc iml;
start Sqrtm1(x);
   y = x - 1;    /* Line 1: y is defined inside the module */
   z = sqrt(y);  /* Line 2 */
   s = 17;       /* Line 3 */
   return( z );  /* Line 4 */
finish;

A = 10;         /* A is defined outside the module */
B = Sqrtm1(0);  /* create run-time error on Line 2 of module */

The SAS Log shows that an error occurred on Line 2 of the module. The Log also says NOTE: Paused in module SQRTM1. The module has executed Line 2 but has not executed Line 3, so you can submit statements that refer to the local variables x and y, which were defined before the error occurred. The statements cannot refer to the local variable s, because s has not yet been assigned. The statements also cannot refer to the variable A, because that variable exists only at main scope; the symbol is not known from within the module.

You can convince yourself that the program is paused within the module by submitting the following PRINT statements:

/* the program is paused inside the module */
print x y;   /* you can print local vars inside the module */
print A;     /* ERROR: this var is not defined inside the module */

If you want, you can assign a value to z and then resume execution of the program. The RESUME statement causes the program to resume execution beginning with Line 3 of the module. If the rest of the program executes without error, the SAS/IML environment returns to main scope, as shown by the following statements:

z = .;        /* within module: assign local variable z */
resume;       /* resume execution of program at Line 3  */
print A B;    /* now at main scope: A and B are defined */

STOP the program and return to main scope

Often it is impractical to "fix" the error from within a module. For example, you might have called a module that called a second module that called a third module, and the error occurred in the third module. Sometimes the wisest course of action is to get out of all the modules and return to the main calling environment. You can use the STOP statement to "pop" the stack of modules and return to main scope. The STOP statement is like a super-duper RETURN statement: it not only returns from the module with the error, but it returns from the complete chain of modules that were being executed. The STOP statement does not exit PROC IML, so use STOP when you want to retain the results from a prior long-running computation.

For example, the following statements once again create a run-time error in the SQRTM1 module. The module execution pauses after the error. If you execute the STOP statement (which supports an optional message in SAS/IML 13.1), you will return to the main scope of the program.

B = Sqrtm1(0);                  /* create run-time error on Line 2 */
stop "Return from the module";  /* SAS/IML 13.1 supports optional message */
print A;                        /* now at main scope: A is defined */

Debugging in SAS/IML Studio

Errors in the SAS/IML Studio application are handled differently. SAS/IML Studio contains point-and-click features for finding and fixing run-time errors. For an example, see the section "Run-Time Error" in the article "How to find and fix programming errors."


This article describes how to deal with errors in user-defined modules in PROC IML. When a run-time error occurs in a module, the program pauses execution, but does not exit PROC IML. In fact, the program is paused inside the module that experienced the error. You have three choices to handle the error: you can use the QUIT statement to exit PROC IML, you can debug and fix the problem and then use the RESUME statement to continue the program, or you can use the STOP statement to pop out of all modules and return to the main scope of the program.


An exploratory technique for visualizing the distributions of 100 variables

In a previous blog post I showed how to order a set of variables by a statistic. After reshaping data, you can create a graph that contains box plots for many variables. Ordering the variables by some statistic (mean, median, variance,...) helps to differentiate and distinguish the variables.

You can use this as an exploratory technique. Suppose that you are given a new set of data that has 100 variables. Before you begin to analyze these variables, it is useful to visualize their distributions. Are the data normal or skewed? Are there outliers for some variables? You could use PROC UNIVARIATE to create 100 histograms and 100 sets of descriptive statistics, but it would be tedious to slog through hundreds of pages of outputs, and difficult to use the histograms to compare the variables.

In contrast, you can fit 100 box plots on a single graph. The resulting graph enables you to see all 100 distributions at a glance and to compare the distributions to each other. If the variables are measured on vastly different scales, you might want to standardize the variables so that the box plots are comparable in scale.

Creating 100 variables

As an example, I will simulate 100 variables with 1,000 observations in each variable. The following SAS/IML statements loop over each column of a matrix X. The SAMPLE function randomly chooses a distribution (normal, lognormal, exponential, or uniform) for the data in the column. I use the RANDFUN function to generate the data for each column. The 100 variables are then written to a SAS data set and given the names X1–X100.

/* create 100 variables with random data from a set of distributions */
proc iml;
N = 1000;  p = 100;
distrib = {"Normal" "Lognormal" "Exponential" "Uniform"};
call randseed(1);
/* each column of X is from a random distribution */
X = j(N,p);
do i = 1 to p;
   X[,i] = randfun(N, sample(distrib,1));
end;
varNames = "x1":("x"+strip(char(p)));
create ManyVars from X[colname=varNames]; append from X; close;
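For readers who want to experiment outside SAS, here is a rough Python sketch of the same scheme. The distribution parameters are my own choices (the RANDFUN defaults may differ), and each column's distribution is again chosen at random:

```python
import random
random.seed(1)

# Stand-ins for the four distributions sampled in the program above
dists = {
    "Normal":      lambda: random.gauss(0, 1),
    "Lognormal":   lambda: random.lognormvariate(0, 1),
    "Exponential": lambda: random.expovariate(1),
    "Uniform":     random.random,
}
N, p = 1000, 100
names = list(dists)
columns = []
for _ in range(p):                      # each column: one randomly chosen distribution
    draw = dists[random.choice(names)]
    columns.append([draw() for _ in range(N)])
```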

Visualizing the 100 distributions

Suppose that you are asked to visualize the distributions in the ManyVars data set. Previously I showed how to use Base SAS to standardize the variables, reshape the data, and create the box plots. The following statements show how to implement the technique by using SAS/IML software. For no particular reason, I order these variables by the third quartile, rather than by the median:

proc iml;
use ManyVars;
read all var _NUM_ into X[colname=varNames];
close ManyVars;
N = nrow(X);
MinX = X[><, ]; MaxX = X[<>, ];  /* column minima and maxima */
X = (X-MinX)/ (MaxX-MinX);       /* standardize vars into [0,1] */
call qntl(q3, X, 0.75);          /* compute 3rd quartile */
r = rank(q3);                    /* sort variables by that statistic */
temp = X;        X[,r] = temp;
temp = varNames; varNames[,r] = temp;
_Value_ = shapecol(X,0,1);      /* convert from wide form to long form */
varName = shapecol(repeat(varNames, N),0,1);
title "Variables ordered by Q3";          /* plot the distributions */
ods graphics / width =2400 height=600;
call box(_Value_) category=varName 
     other="xaxis discreteorder=data display=(nolabel);";

The result is shown above; click to enlarge. You can see the four different distributional shapes, although you have to look closely to differentiate the lognormal variables (which end with variable X83) and the exponential variables (which begin with variable X41). The changes in shape between the normal data (X69) and the uniform data (X99) are apparent. Also, notice that the skewed distributions have mean larger than the median, whereas for the symmetric distributions the sample means and the medians are approximately equal. The graph provides a quick-and-dirty view of the data, which is always useful when you encounter new data.

I standardized the variables into the interval [0,1], but other standardizations are possible. For example, to standardize the variable to mean zero and unit variance, use the statement X = (X-mean(X))/ std(X).
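Both standardizations are one-liners in any language. Here they are sketched in Python, for concreteness (the SAS/IML versions are the expressions quoted above):

```python
from statistics import mean, stdev

def minmax_scale(xs):
    """Standardize into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def zscore_scale(xs):
    """Standardize to mean zero and unit (sample) standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

print(minmax_scale([2, 4, 6]))   # [0.0, 0.5, 1.0]
```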

The SAS/IML program is remarkable for its compactness, especially when compared to the Base SAS implementation. Three statements are used to standardize the data, but I could have inlined the computations into a single statement. One statement is used to compute the quantiles. Reordering the variables is accomplished by using "the rank trick". The SHAPECOL function makes it easy to convert the data from wide form to long form. Finally, the BOX call makes it easy to create the box plot without even leaving the SAS/IML environment. These statements could be easily encapsulated into a single SAS/IML module that could be reused to visualize other data sets.
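For readers unfamiliar with "the rank trick," here is the idea in a short Python sketch (the variable names are hypothetical, and ties are assumed away for simplicity): if r[i] is the rank of the i-th statistic, then sending item i to slot r[i] sorts the items in ascending order of the statistic.

```python
def rank(values):
    """Rank of each element (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for pos, idx in enumerate(order):
        r[idx] = pos + 1
    return r

q3 = [0.9, 0.2, 0.5]             # third quartile of each variable
r = rank(q3)                     # [3, 1, 2]
cols = ["A", "B", "C"]
sorted_cols = [None] * len(cols)
for i, ri in enumerate(r):
    sorted_cols[ri - 1] = cols[i]   # the rank trick: item i goes to slot r[i]
print(sorted_cols)               # ['B', 'C', 'A']
```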

I think the visualization technique is powerful, although it is certainly not a panacea. When confronted with new data, this technique can help you to quickly learn about the distribution of the variables.

What do you think? Would you use this visualization as one step in an exploratory analysis? What are some of the strengths and weaknesses of this technique? Leave a comment.


Order variables by values of a statistic


When I create a graph of data that contains a categorical variable, I rarely want to display the categories in alphabetical order. For example, the box plot to the left is a plot of 10 standardized variables where the variables are ordered by their median value. The ordering makes it apparent that the variables on the left (Invoice, MSRP) are highly skewed with many outliers, whereas the rightmost variable (length) appears to be symmetric and perhaps normally distributed.

When you use the SG procedures in SAS, there are several ways to change the order in which categories appear in a graph. One extremely powerful method is to use the DISCRETEORDER=DATA option on the XAXIS statement of the SGPLOT procedure. The DISCRETEORDER=DATA option tells the SGPLOT procedure to display the categories in the order in which they appear in the data set. In other words, if you can figure out some way to order the observations (DATA step, PROC SORT, PROC SQL,....), then you will see that same order when you create a graph of the data.

To illustrate this idea, let's create the box plot at the top of this article. The data are standardized versions of the numerical variables in the Sashelp.Cars data set. The statistic of interest is the median. The following statements create the data and compute the medians:

%let DSName = Sashelp.Cars;
%let Stat = Median; /* or mean, stddev, qrange, skew, etc */
/* standardize variables to [0,1] */
proc stdize method=range data=&DSName out=MyData; run;
proc means data=MyData &Stat STACKODSOUTPUT; 
ods output Summary=StatOut;
run;

The ODS OUTPUT statement is used to create a data set with the median statistic for each variable. The STACKODSOUTPUT option is used so that the output data set has the same structure as the printed table.

The next step is to get the list of variables ordered by the median statistic. My favorite method is to use PROC SQL's useful SELECT INTO :MACROVAR syntax. I thank my colleague Gordon for reminding me that I do not need PROC SORT to sort the variables; I can use the ORDER BY clause in PROC SQL instead.

/* put ordered variable names into macro variable */
proc sql noprint;                              
 select Variable into :varlist separated by ' '
 from StatOut order by &Stat;
quit;

The macro variable &VARLIST now contains a list of the variables, ordered by their medians. In order to create a single graph that has multiple box plots, transpose the data from wide form (many variables) to long form (two variables that specify the variable name and value). The following statements add an "observation number" variable to the data and use the RETAIN statement to reorder the variables. Then PROC TRANSPOSE converts the data set from wide form to long form. In the Long data set, the variable names are ordered by the value of the statistic. (Alternatively, you could merge the statistics and the original data and then sort the merged data by the statistic and the observation number.)

data Wide / view=Wide;     
retain &varlist;           /* reorder vars by statistic */
set MyData;
obsNum = _N_;              /* add ID for subject (observation) */
keep obsNum &varlist;
run;

/* transpose from wide to long data format; VARNAME is a categorical var */
proc transpose data=Wide name=VarName 
               out=Long(rename=(Col1=_Value_) drop=_LABEL_);
by obsNum;
run;

Data in this form is ideal for graphing by using SGPLOT statements that support a GROUP= or CATEGORY= option. To create the graph at the beginning of this article, use the VBOX statement with the CATEGORY= option. If you also use the DISCRETEORDER=DATA option on the XAXIS statement, the box plots are ordered by their median values, as shown in the following:

title "Box Plots of Standardized Variables, Ordered by &Stat";
proc sgplot data=Long;
   label _Value_ = "Standardized Value" VarName="Variable";
   vbox _Value_ / category=VarName;
   xaxis discreteorder=data display=(nolabel);        /* respect ordering */
run;

Obviously, you can use this same technique to create box plots that are ordered by other statistics.
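The order-by-a-statistic idea itself is language-independent. Here is a tiny Python sketch with hypothetical data: sort the variable names by their medians, then emit (name, value) pairs in that order, exactly the structure the long-form data set gives to SGPLOT:

```python
from statistics import median

# Hypothetical wide data: one list of values per variable
wide = {"Invoice": [5, 40, 90], "Length": [8, 9, 10], "MSRP": [4, 50, 80]}

order = sorted(wide, key=lambda name: median(wide[name]))    # order by statistic
long_rows = [(name, value) for name in order for value in wide[name]]
print(order)
```

Swapping `median` for `mean`, `stdev`, or any other summary function reorders the categories by that statistic instead.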

What technique do you use to order groups or categories in a graph? Can you think of applications for which this technique would be useful? Leave a comment.
