Skew this

The skewness of a distribution indicates whether a distribution is symmetric or not. A distribution that is symmetric about its mean has zero skewness. In contrast, if the right tail of a unimodal distribution has more mass than the left tail, then the distribution is said to be "right skewed" or to have positive skewness. Similarly, negatively skewed distributions have long (or heavy) left tails.

Many of us were taught that positive skewness implies that the mean value is to the right of the median. This rule-of-thumb holds for many familiar continuous distributions such as the exponential, lognormal, and chi-square distributions. However, not everyone is aware that this relationship does not hold in general, as discussed in an easy-to-read paper by P. Hipple (2005, J. Stat. Ed.).

In SAS software, the formula for the skewness of a sample is given in the "Simple Statistics" section of the documentation. It is easy to compute the sample skewness Base SAS and in the SAS/IML language. This article shows how.

Use PROC MEANS to compute skewness

The MEANS procedure supports many univariate descriptive statistics and is my go-to procedure for computing skewness. The following statements compute the mean, median, and skewness for two variables in the Sashelp.Cars data set. The variables are the length of the vehicles and the average fuel economy (miles per gallon) for the vehicles in city driving:

proc means data=sashelp.cars N mean median skew;
var Length MPG_City;
run;
t_skewness

The MEANS procedure does not create graphs, but the following box plots show a distribution of the two variables:

skewness

The Length variable has very small skewness (0.18). In the box plot, the diamond symbol represents the mean and the vertical line that divides the box is the median. The box plot indicates that the distribution of the vehicle lengths is approximately symmetric, and that the mean and median are approximately equal.

In contrast, the MPG_City variable has large positive skewness (2.78). The box plot indicates that the data distribution has a short left tail and a long right tail. The mean is to the right of the median, as is often the case for right-skewed distributions.

Computing the skewness in SAS/IML software

The SAS/IML 13.1 release introduced the SKEWNESS function, which makes it easy to compute the skewness for each column of a matrix. The following SAS/IML statements read in the Length and MPG_City variables from the Sashelp.Cars data into a matrix. The SKEWNESS function compute the skewness of each column. The results are identical to the skewness values that were reported by PROC MEANS, and are not shown:

proc iml;
varNames = {"Length" "MPG_City"};
use sashelp.cars;
   read all var varNames into X;
close;
 
skew = skewness(X);
print skew[c=varNames];

I've always thought that it is curious that many "natural" distributions are positively skewed. Many variables that we care about (income, time spent waiting, the size of things,...) are distributions of positive quantities. As such, they are bounded below by zero, but the maximum value is often unbounded. This results in many variables that are positively skewed, but few variables that are negatively skewed. For example, there are 10 numerical variables in the Sashelp.Cars data, and all 10 have positive skewness. Similarly, most of the classical distributions that we learn about in statistics courses are either symmetric or positively skewed.

A canonical example of a negatively skewed distribution is the distribution of grades in a class—especially at schools that have grade inflation. The following data presents hypothetical grades for 30 students who took a test. The test might have been too easy because the median grade is 94.5:

scores = {100 97 67 84 98 100 100 96 87 72 90 100 95 96 89
          100 90 98 85 95 100  96 92 83 90 90 100 66 92 94};
skew = skewness(scores`);
print skew;
 
title "Class Test Scores";
title2 "Skewness = -1.52";
ods graphics / width=400 height=250;
call box(scores) type="HBox";
skewness2

The skewness for this data is -1.52. The box plot indicates that a few students did not score well on the test, although three-quarters of the students received high marks. The box plot shows outliers in the left tail, and the mean is less than the median value.

Why you should care about skewness

Variables that have large skewness (positive or negative) might need special treatment during visualization and modeling. A large skewness value can indicate that the data vary over several orders of magnitude. Some statistics (such as confidence intervals and standard errors) assume that data is approximately normally distributed. These statistics might not be valid for small samples of heavily skewed data.

If you do encounter heavily skewed data, some statisticians recommend applying a normalizing transformation prior to analyzing the data. I have done this in several previous posts, such as the logarithmic transformation of the number of comments made on SAS blogs. In that analysis, the "Number of Responses" variable had a skewness of 4.12 and varied over several orders of magnitude. After applying a log transformation, the new variable had a skewness of 0.5. As a consequence, it was much easier to visualize the transformed data.

I'll end this article by challenging you to examine a mathematical brain teaser. Suppose that you have a data for a positive variable X that exhibits large positive skewness. You can reduce the skewness by applying a log transform to the data. However, there are three common log transformations: You can form Y2 = log2(X), Ye = ln(X), or Y10 = log10(X). Which transformed variable will have the smallest skewness? Mathematically oriented readers can determine the answer by looking at the definition of the sample skewness in the "Simple Statistics" section of the documentation. Readers who prefer programming can use the DATA step to apply the three transformations to some skewed data and use PROC MEANS to determine which transformation leads to the smallest skewness. Good luck!

Post a Comment

The frequency of letters in an English corpus

It's time for another blog post about ciphers. As I indicated in my previous blog post about substitution ciphers, the classical substitution cipher is no longer used to encrypt ultra-secret messages because the enciphered text is prone to a type of statistical attack known as frequency analysis.

At the root of frequency analysis is the fact that certain letters in English prose appear more frequently than other letters. In a simple substitution cipher, each letters is disguised by replacing it with some other symbol. However, a leopard cannot change its spots, and the frequencies of the symbols in an enciphered text can reveal which symbols correspond to certain letters. In a sufficiently long message, the symbols for common letters such as E, T, and A will appear often in the ciphertext, whereas the symbols for uncommon letters such as J, Q, and Z will occur rarely in the ciphertext.

The Wikipedia article on frequency analysis contains a nice example of how frequency analysis can be used to decrypt simple ciphers. This type of frequency analysis is usually one step in solving recreational puzzles, such the popular Cryptoquote puzzle that appears in many newspapers.

The frequency of letters in English text

The first task for a budding cryptanalyst is to determine the frequency distribution of letters in the language that he is deciphering. This is done by counting the number of times that each letter appears in a corpus: a large set of text (articles, books, web documents,...) that is representative of the written language. In today's world of digitized texts and rapid computing, it is easier than ever to compute the distribution of letters in a large corpus. This was done recently by Peter Norvig who used Google's digitized library as a corpus. I highly recommend Norvig's fascinating presentation of the frequency distribution of letters, words, and N-grams.

Since Norvig did the hard work, it is simple to enter the relative frequencies of each letter into SAS and visualize the results, as follows:

/* frequency of letters in English corpus (from Google digital library) */
data Letters;
length Letter $1;
informat Prop PERCENT8.;
input Letter $ Prop @@;
datalines;
E  12.49% T   9.28% A   8.04% O   7.64% I   7.57% N   7.23% 
S   6.51% R   6.28% H   5.05% L   4.07% D   3.82% C   3.34% 
U   2.73% M   2.51% F   2.40% P   2.14% G   1.87% W   1.68% 
Y   1.66% B   1.48% V   1.05% K   0.54% X   0.23% J   0.16% 
Q   0.12% Z   0.09%
;
 
title "Frequency of Letters in an English Corpus";
footnote H=8pt J=L "Data from P. Norvig: http://norvig.com/mayzner.html";
ods graphics / height=900 width=600;
proc sgplot data=Letters;
  format Prop percent8.2;
  xaxis grid label="Frequency" offsetmax=0.15;
  yaxis grid discreteorder=data reverse; 
  scatter y=Letter x=Prop / datalabel=Prop datalabelpos=right 
                            markerattrs=(symbol=CircleFilled);
run;
title; footnote;
freqletterscorpus

The graph shows that the letter E is the most common letter in the English language (12.5%), followed by T and A (9.3% and 8.0%, respectively). After the Big Three, the letters O, I, and N appear with similar frequencies. The letters S and R round out the list of the most frequent letters.

At the other end of the list, we see that X, J, Q, and Z are the least common letters in the corpus.

Although a simple substitution cipher can disguise the letters, it does not disguise the frequencies. Read on to see how this fact helps a cryptanalyst decipher messages that use a simple substitution cipher.

The distribution of letters in a short text

Suppose that I try to send a secret message by using a simple substitution cipher. In case the message is intercepted, I strip out all blanks and punctuation to make the ciphertext even harder to decipher. Here is the ciphertext presented as an array of long character string in SAS/IML:

proc iml;
msg ={"JWUAHDOHVJWBICHHKBIJHFEHQBSUJBFIJUZWPBTYUIDPSUXDLFOUIVHEUVVBZB",
      "UPJONIBLYODJBPASDJDBPIDIIHVJRDEUSDJDIBLYODJBHPBIDVYPSDLUPJDOJ",
      "UZWPBTYUBPIJDJBIJBZDOFEHAEDLLBPADPSEUIUDEZWJHUQDOYDJUIJDJBIJBZ",
      "DOLUJWHSINHYHVJUPPUUSJHZEUDJUSDJDRBJWKPHRPFEHFUEJBUICHJWEDPSHL",
      "DPSPHPEDPSHLJWBICHHKZHPJDBPILHEUJWDPHPUWYPSEUSDPPHJDJUSFEHAEDL",
      "IJWDJIBLYODJUSDJDRBJWIFUZBVBUSZWDEDZJUEBIJBZINHYZDPYIUIBLYODJU",
      "SSDJDJHUIJBLDJUJWUFEHCDCBOBJNHVDPUQUPJJHUIJBLDJUJWUIDLFOBPASBI",
      "JEBCYJBHPHVDIJDJBIJBZJHUIJBLDJUJWUZHQUEDAUFEHCDCBOBJBUIHVZHPVB",
      "SUPZUBPJUEQDOIDPSJHUQDOYDJUJWUEHCYIJPUIIHVDIJDJBIJBZDOJUIJIHLU",
      "FEHAEDLIDEUFEUIUPJUSBPJRHVHELIVBEIJYIBPAJWUSDJDIJUFDPSJWUPDADB",
      "PYIBPAJWUIDIBLOODPAYDAUCNFEUIUPJBPAJWUIDLUDOAHEBJWLBPJRHSBVVUE",
      "UPJRDNIJWUPHQBZUIDIBLOFEHAEDLLUEZDPOUDEPWHRJHREBJUIBLYODJBHPIB",
      "PJWUIDIBLOODPAYDAUODJUEZWDFJUEIJWDJSBIZYIILYOJBQDEBDJUIBLYODJB",
      "HPYIUJWUIDIBLOODPAYDAUWUDQBONBVNHYDEUIUEBHYIDCHYJIBLYODJBHPNHY",
      "IWHYOSBPQUIJJWUJBLUJHOUDEPWHRJHYIUJWUIDIBLOODPAYDAUUVVBZBUPJON"};

The message looks formidable, but it isn't secure. Statistics—in the form of frequency analysis—can be used to suggest possible values for the letters in the text that occur most often. The following SAS/IML function counts the number of appearances of each letter in the string. It uses the SUBSTR function to convert the string into a SAS/IML vector, and then uses the LOC function to count the number of elements in the vector that matches each letter. The function returns a 26-element array whose first element is the proportion of the string that matches 'A', whose second element is the proportion that matches 'B', and so on:

start FreqLetters(_msg);
   msg = upcase(rowcat( shape(_msg,1) )); /* create one long string */
   m = substr(msg,1:nleng(msg),1);        /* break into array */
   letters = "A":"Z";
   freq = j(1,ncol(letters),0);
   do i = 1 to ncol(letters);
      freq[i] = ncol(loc(m=letters[i]));
   end;
   return(freq / sum(freq));   /* standardize as proportion */
finish;
 
p = FreqLetters(msg);          /* proportion of A, B, C, ..., Z in message */

You can print out the raw numbers, but it is easier to visualize the proportions if you sort the data and make a graph. The following SAS/IML statements create the same kind of graph that appeared earlier, but this time the frequencies are for this particular short encrypted text:

/* compute frequencies; sort in descending order */
call sortndx(ndx, p`, 1, 1); 
pct = p[,ndx]; let = ("A":"Z")[,ndx];
print pct[c=let format=percent7.2];
 
title "Frequency of Letters in Cipher Text";
ods graphics / height=900 width=600;
call scatter(pct, let) grid={x} label="Frequency" datalabel=Pct
     option="datalabelpos=right markerattrs=(symbol=CircleFilled)"
     Other="format pct DataLabel PERCENT7.1;"+
           "yaxis grid discreteorder=data reverse "+
           "label='Cipher Letter';";
freqlettersmsg

First, notice that SAS/IML 12.3 enables you to create this ODS graph entirely within SAS/IML. Notice that the distribution of (enciphered) letters in the short message looks similar to the previous graph (the distribution of letters in a corpus). You can think of the ciphertext as a sample of English text. As such, the frequency of letters in the cipher text is expected to be close to (but not identical to) the frequency of letters in the corpus. The frequencies in the small sample suggest that the most common letter, plaintext 'e', is probably being disguised as the ciphertext 'J', 'D', or 'U'. Similarly, the plaintext 't' and 'a' are probably disguised as 'J', 'D', or 'U', although they could also be 'B' or 'I'.

The reason frequency analysis is successful is not because it identifies certain letters with certainty—it doesn't! However, it indicates the probable identities of certain letters, and this greatly simplifies the number of possibilities that the cryptanalysts needs to check.

Using frequency analysis as a method of cryptanalysis

For the purpose of solving a recreational puzzle such as the Cryptoquote, this kind of single-letter frequency analysis is often sufficienct to begin breaking the code. The puzzle solver uses the frequency analysis to guess one or two of the substitutions (perhaps 'J'='e' and 'D'='t'). If the message still looks like gibberish after the guess, the guess was probably wrong, and the puzzler can make a different guess (perhaps 'J'='t' and 'U'='e'). Usually only a few steps of this guessing game are necessary until a substitution results in a partially deciphered text that looks like it might contain recognizable words. Then, like pulling a string on a sweater, the puzzle unravels as the puzzler uses existing knowledge to guide subsequent substitutions.

However, sometimes a single-letter frequency analysis is only the first step in attacking a cipher. A more sophisticated attack relies not only on the distribution of single letters, but also the frequency distribution of bigrams—pairs of adjacent letters that appear together. In my next blog post about ciphers, I will discuss bigrams and visualize the distribution of bigrams in SAS.

Post a Comment

Read from one data set and write to another with SAS/IML

Many people know that the SAS/IML language enables you to read data from and write results to multiple SAS data sets. When you open a new data set, it is a good programming practice to close the previous data set. But did you know that you can have two data sets open simultaneously, one for reading and one for writing? The trick is to use the SETIN and SETOUT statements to specify the data sets that are used for reading and writing. Among other advantages, this technique enables you to use SAS/IML to analyze data that might be larger than can fit into the RAM on your computer.

This article is motivated by a question posted to the SAS/IML Support Community. The SAS/IML programmer was essentially running a BY-group analysis in SAS/IML. The user wanted to use the WHERE clause on the READ statement to read the data for each BY-group.

To illustrate the technique, consider the following BY-group regression analysis of the vehicles in the Sashelp.Cars data set. For each value of the Make variable (Acura, Audi, BMW,...), PROC REG performs an ordinary least-squares (OLS) analysis that uses the Wheelbase and Length variables to model the Weight of a vehicle. The parameter estimates are saved to the OutEst data set:

proc reg data=sashelp.cars noprint outest=OutEst;
   by Make;
   model Weight = Wheelbase Length;
run;

Let's create a similar analysis in the SAS/IML language. Although SAS/IML is often used to perform analyses that are not available from any SAS procedure, this simple and familiar analysis will enable us to concentrate on how to read and write data sets simultaneously. The following SAS/IML statements define a module that computes the parameter estimates for an OLS model. Because some of the BY groups result in singular linear systems, the GINV function is used to solve for the parameter estimates, rather than the more efficient SOLVE function.

proc iml;
start OLSParamEst(Y, _X);
   X = j(nrow(_X),1,1) || _X;   /* add intercept column to design matrix */
   XPX = X` * X;                /* cross-products      */
   /* handle singular systems by using GINV rather than SOLVE */
   Estimate = ginv(XPX) * X`*Y; /* parameter estimates */
   return( T(Estimate) );       /* return as row vector */
finish;

The BY-group analysis will be carried out one category at a time. The following statements open the input data set and read the BY-group categories. The output data set (which will contain the parameter estimates) is also opened for writing. Notice that you need to specify the names and types of the output variables on the CREATE statement, including the length of any character variables. Lastly, the SETIN and SETOUT statements are used to specify that the data sets that will be used for subsequent reading and writing operations:

use Sashelp.cars;                   /* input data */
read all var {"Make"};              /* read BY var */
byGroup = unique( Make );           /* compute unique levels */
 
ByVal = BlankStr(nleng(Make));      /* set length of ByVal variable (12.3) */
OutVarNames = {"Intercept" "Wheelbase" "Length"}; 
Estimate = {. . .};                 /* output variables are numeric */
create RegOut from Estimate[r=ByVal c=OutVarNames];
 
setin Sashelp.Cars;                 /* make current for reading */
setout RegOut;                      /* make current for writing */

The SETIN statement means "every time you want to read, use the Sashelp.Cars data set." The SETOUT statement means "every time you want to write, use the RegOut data set." The following statements loop over the unique categories of the Make variable, and perform an OLS analysis for the vehicles that have a common make. You can use the INDEX statement to speed up reading the BY groups.

InVarNames = {"Weight" "Wheelbase" "Length"};    /* input variables */
index Make;                           /* index by BY group variable */
do i = 1 to ncol(byGroup);
   ByVal = byGroup[i];
   read all var InVarNames where(Make=(ByVal));  /* read from Cars */ 
   Estimate = OLSParamEst(Weight, Wheelbase||Length);
   append from Estimate[r=ByVal];                /* write to RegOut */
end;
 
close Sashelp.Cars RegOut;

The program performs 38 parameter estimates, one for each unique value of the Make variable. Because these data will fit entirely into RAM, you could also perform this analysis by using the UNIQUE-LOC or UNIQUEBY techniques.

In summary, use the SETIN and SETOUT statements when you want to have two SAS data sets open simultaneously, one for reading and one for writing. By the way, you can also use the SETOUT statement to write to several data sets, such as writing predicted and residual values to one while writing parameter estimates to another.

Post a Comment

Handling run-time errors in user-defined modules

I received the following email from a SAS/IML programmer:

I am getting an error in a PROC IML module that I wrote. The SAS Log says
NOTE: Paused in module NAME
When I submit other commands, PROC IML doesn't seem to understand them. How can I continue the program? The only thing that works is to QUIT and resubmit the entire PROC IML program.

I'm glad you asked this question. Handling run-time errors that occur in user-defined modules can be frustrating unless you understand what the "Paused in module" message means. If you haven't already read my article on scope, local, and global variables in SAS/IML, you might want to read it now. It explains about local scope (inside a module) versus main scope (outside of any module). Also, familiarize yourself with the difference between a syntax error and a run-time error.

A run-time error in a module causes PROC IML to pause

Remember that the "I" in "IML" stands for interactive. When there is a run-time error in a user-defined module, PROC IML calls the PAUSE statement. This causes the program execution to pause while still inside the module. While the program is paused, you have the opportunity to debug the module, primarily by using PRINT statements to display the values of the local variables in the module. When you "submit other commands," those statements are being executed from within the module, rather than at main scope, which is probably why you say that PROC IML "doesn't seem to understand" the commands.

When a run-time error occurs in a user-defined module in PROC IML, which includes modules in the IMLMLIB library, you have three options:

  • Fix the error and resubmit the entire PROC IML program.
  • Debug the problem and try to fix it from inside the module. For example, suppose the error occurs while trying to compute some quantity, Z, at Line 2 in the module. While the module is paused, you can assign Z a value and submit the RESUME statement. The program will continue executing from the next statement in the module. That is, from Line 3.
  • Submit the STOP statement to leave all modules and return to the main scope of the program. This is especially useful if the cause of the error was a bad parameter value. You can immediately call the module again with different parameter values.

The first option is easy, but if the program has just finished a long-running computation, you might want to try to salvage those computations, rather than restart the program.

Debug, fix, and RESUME

The following statements define a function that evaluates the quantity sqrt(x – 1). If you call the function with a value of x that is less than 1, a run-time error will occur. Let's intentionally cause an error so that the program pauses in the SQRTM1 module:

proc iml;
start Sqrtm1(x);
   y = x - 1;    /* Line 1: y is defined inside the module */
   z = sqrt(y);  /* Line 2 */
   s = 17;       /* Line 3 */
   return( z );  /* Line 4 */
finish;
 
A = 10;         /* A is defined outside the module */
B = Sqrtm1(0);  /* create run-time error on Line 2 of module */

The SAS Log shows that an error occurred on Line 2 of the module. The Log also says NOTE: Paused in module SQRTM1. The module has executed Line 2 but has not executed Line 3, so you can submit statements that refer to the local variables x and y, which were define before the error occurred. The statements cannot refer to the local variable s, because s has not yet been assigned. The statements also cannot refer to the variable A, because that variable exists only at main scope; the symbol is not known from within the module.

You can convince yourself that the program is paused within the module by submitting the following PRINT statements:

/* the program is paused inside the module */
print x y;   /* you can print local vars inside the module */
print A;     /* ERROR: this var is not defined inside the module */

If you want, you can assign a value to z and then resume execution of the program. The RESUME statement causes the program to resume execution beginning with Line 3 of the module. If the rest of the program executes without error, the SAS/IML environment returns to main scope, as shown by the following statements:

z = .;        /* within module: assign local variable z */
resume;       /* resume execution of program at Line 3  */
print A B;    /* now at main scope: A and B are defined */

STOP the program and return to main scope

Often it is impractical to "fix" the error from within a module. For example, you might have called a module that called a second module that called a third module, and the error occurred in the third module. Sometimes the wisest course of action is to get out of all the modules and return to the main calling environment. You can use the STOP statement to "pop" the stack of modules and return to main scope. The STOP statement is like a super-duper RETURN statement: it not only returns from the module with the error, but it returns from the complete chain of modules that were being executed. The STOP statement does not exit PROC IML, so use STOP when you want to retain the results from a prior long-running computation.

For example, the following statements once again create a run-time error in the SQRTM1 module. The module execution pauses after the error. If you execute the STOP statement (which supports an optional message statement in SAS/IML 13.1), you will return to the main scope of the program.

B = Sqrtm1(0);                  /* create run-time error on Line 2 */
stop "Return from the module";  /* SAS/IML 13.1 supports optional message */
print A;                        /* now at main scope: A is defined */

Debugging in SAS/IML Studio

Errors in the SAS/IML Studio application are handled differently. SAS/IML Studio contains point-and-click features for finding and fixing run-time errors. For an example, see the section "Run-Time Error" in the article "How to find and fix programming errors."

Summary

This article describes how to deal with errors in user-defined modules in PROC IML. When a run-time error occurs in a module, the program pauses execution, but does not exit PROC IML. In fact, the program is paused inside the module that experienced the error. You have three choices to handle the error: you can use the QUIT statement to exit PROC IML, you can debug and fix the problem and then use the RESUME statement to continue the program, or you can use the STOP statement to pop out of all modules and return to the main scope of the program.

Post a Comment

An exploratory technique for visualizing the distributions of 100 variables

In a previous blog post I showed how to order a set of variables by a statistic. After reshaping data, you can create a graph that contains box plots for many variables. Ordering the variables by some statistic (mean, median, variance,...) helps to differentiate and distinguish the variables.

You can use this as an exploratory technique. Suppose that you are given a new set of data that has 100 variables. Before you begin to analyze these variables, it is useful to visualize their distributions. Are the data normal or skewed? Are there outliers for some variables? You could use PROC UNIVARIATE to create 100 histograms and 100 sets of descriptive statistics, but it would be tedious to slog through hundreds of pages of outputs, and difficult to use the histograms to compare the variables.

In contrast, you can fit 100 box plots on a single graph. The resulting graph enables you to see all 100 distributions at a glance and to compare the distributions to each other. If the variables are measured on vastly different scales, you might want to standardize the variables so that the box plots are comparable in scale.

Creating 100 variables

As an example, I will simulate 100 variables with 1,000 observations in each variable. The following SAS/IML statements loop over each column of a matrix X. The SAMPLE function randomly chooses a distribution (normal, lognormal, exponential, or uniform) for the data in the column. I use the RANDFUN function to generate the data for each column. The 100 variables are then written to a SAS data set and given the names X1–X100.

/* create 100 variables with random data from a set of distributions */
proc iml;
N = 1000;  p = 100;
distrib = {"Normal" "Lognormal" "Exponential" "Uniform"};
call randseed(1);
 
/* each column of X is from a random distribution */
X = j(N,p);
do i = 1 to p;
   X[,i] = randfun(N, sample(distrib,1));
end;
 
varNames = "x1":("x"+strip(char(p)));
create ManyVars from X[colname=varNames]; append from X; close;
quit;

Visualizing the 100 distributions

Suppose that you are asked to visualize the distributions in the ManyVars data set. Previously I showed how to use Base SAS to standardize the variables, reshape the data, and create the box plots. The following statement show how to implement the technique by using SAS/IML software. For no particular reason, I order these variables by the third quartile, rather than by the median:

proc iml;
use ManyVars;
read all var _NUM_ into X[colname=varNames];
close ManyVars;
N = nrow(X);
 
MinX = X[><, ]; MaxX = X[<>, ];
X = (X-MinX)/ (MaxX-MinX);       /* standardize vars into [0,1] */
call qntl(q3, X, 0.75);          /* compute 3rd quartile */
 
r = rank(q3);                    /* sort variables by that statistic */
temp = X;        X[,r] = temp;
temp = varNames; varNames[,r] = temp;
 
_Value_ = shapecol(X,0,1);      /* convert from wide form to long form */
varName = shapecol(repeat(varNames, N),0,1);
 
title "Variables ordered by Q3";          /* plot the distributions */
ods graphics / width =2400 height=600;
call box(_Value_) category=varName 
     other="xaxis discreteorder=data display=(nolabel);";
viz100

The result is shown above; click to enlarge. You can see the four different distributional shapes, although you have to look closely to differentiate the lognormal variables (which end with variable X83) and the exponential variables (which begin with variable X41). The changes in shape between the normal data (X69) and the uniform data (X99) are apparent. Also, notice that the skewed distributions have mean larger than the median, whereas for the symmetric distributions the sample means and the medians are approximately equal. The graph provides a quick-and-dirty view of the data, which is always useful when you encounter new data.

I standardized the variables into the interval [0,1], but other standardizations are possible. For example, to standardize the variable to mean zero and unit variance, use the statement X = (X-mean(X))/ std(X).

The SAS/IML program is remarkable for its compactness, especially when compared to the Base SAS implementation. Three statements are used to standardize the data, but I could have inlined the computations into a single statement. One statement is used to compute the quantiles. Reordering the variables is accomplished by using "the rank trick". The SHAPECOL function makes it easy to convert the data from wide form to long form. Finally, the BOX call makes it easy to create the box plot without even leaving the SAS/IML environment. These statements could be easily encapsulated into a single SAS/IML module that could be reused to visualize other data sets.

I think the visualization technique is powerful, although it is certainly not a panacea. When confronted with new data, this technique can help you to quickly learn about the distribution of the variables.

What do you think? Would you use this visualization as one step in an exploratory analysis? What are some of the strengths and weaknesses of this technique? Leave a comment.

Post a Comment

Order variables by values of a statistic

ordervars1

When I create a graph of data that contains a categorical variable, I rarely want to display the categories in alphabetical order. For example, the box plot to the left is a plot of 10 standardized variables where the variables are ordered by their median value. The ordering makes it apparent that the variables on the left (Invoice, MSRP) are highly skewed with many outliers, whereas the rightmost variable (length) appears to be symmetric and perhaps normally distributed.

When you use the SG procedures in SAS, there are several ways to change the order in which categories appear in a graph. One extremely powerful method is to use the DISCRETEORDER=DATA option on the XAXIS statement of the SGPLOT procedure. The DISCRETEORDER=DATA option tells the SGPLOT procedure to display the categories in the order in which they appear in the data set. In other words, if you can figure out some way to order the observations (DATA step, PROC SORT, PROC SQL,....), then you will see that same order when you create a graph of the data.

To illustrate this idea, let's create the box plot at the top of this article. The data are standardized versions of the numerical variables in the Sashelp.Cars data set. The statistic of interest is the median. The following statements create the data and compute the medians:

%let DSName = sashelp.cars;
%let Stat = Median; /* or mean, stddev, qrange, skew, etc */
/* standardize variables to [0,1] */
proc stdize method=range data=&DSName out=MyData; run;
 
proc means data=MyData &Stat STACKODSOUTPUT; 
ods output Summary=StatOut;
run;
t_ordervar1

The ODS OUTPUT statement is used to create a data set with the median statistic for each variable. The STACKODSOUTPUT option is used so that the output data set has the same structure as the printed table.

The next step is to get the list of variables ordered by the median statistic. My favorite method is to use PROC SQL's useful SELECT INTO :MACROVAR syntax. I thanks my colleague Gordon for reminding me I do not need to use PROC SORT to sort the variables, I can use the ORDER BY clause in PROC SQL instead.

/* put ordered variable names into macro variable */
proc sql noprint;                              
 select Variable into :varlist separated by ' '
 from StatOut order by &Stat;
quit;

The macro variable &VARLIST now contains a list of the variables, ordered by their medians. In order to create a single graph that has multiple box plots, transpose the data from wide form (many varables) to long form (two variables that specify the variable name and value). The following statements add an "observation number" variable to the data and use the RETAIN statement to reorder the variables. Then PROC TRANSPOSE converts the data set from wide form to long form. In the Long data set, the variable names are ordered by the value of the statistic. (Alternatively, you could merge the statistics and the original data and then sort the merged data by the statistic and the observation number.)

data Wide / view=Wide;     
retain &varlist;           /* reorder vars by statistic */
set MyData;
obsNum = _N_;              /* add ID for subject (observation) */
keep obsNum &varlist;
run;
 
/* transpose from wide to long data format; VARNAME is a categorical var */
proc transpose data=Wide name=VarName 
               out=Long(rename=(Col1=_Value_) drop=_LABEL_);
by obsNum;
run;
t_ordervar2

Data in this form is ideal for graphing by using SGPLOT statements that support a GROUP= or CATEGORY= option. To create the graph at the beginning of this article, use the VBOX statement with the CATEGORY= option. If you also use the DISCRETEORDER=DATA option on the XAXIS statement, the box plots are ordered by their median values, as shown in the following:

title "Box Plots of Standardized Variables, Ordered by &Stat";
proc sgplot data=Long;
   label _Value_ = "Standardized Value" VarName="Variable";
   vbox _Value_ / category=VarName;
   xaxis discreteorder=data display=(nolabel);        /* respect ordering */
run;

Obviously, you can use this same technique to create box plots that are ordered by other statistics.

What technique do you use to order groups or categories in a graph? Can you think of applications for which this technique would be useful? Leave a comment.

Post a Comment

Ciphers, keys, and cryptoquotes

Today is my fourth blog-iversary: the anniversary of my first blog post in 2010. To celebrate, I am going to write a series of fun posts based on The Code Book by Simon Singh, a fascinating account of the history of cryptography from ancient times until the present. While reading the book, I was struck by the number of statistical concepts that are involved in cryptography.

I will blog about cryptography once or twice a month (Tag: Ciphers). My focus will be statistical ideas in classical cryptography, from ancient Rome through World War II. Topics will include substitution ciphers, frequency analysis of letters and pairs of letters, and how to construct and solve recreational puzzles, such the popular Cryptoquote puzzle that appears in many newspapers.

As usual, each article will be accompanied by a SAS program that demonstrates a technique in statistical programming. Even though most of us will never encounter secret messages as part of our professional work, these posts will demonstrate useful techniques for writing statistical programs in SAS/IML language.

But enough talk! Let's get started by defining some basic terms:

  • A substitution cipher is a permutation of an alphabet. Each letter of the alphabet is transformed under the permutation into another (not necessarily different) letter, and the ordered set of permuted letters is the cipher. Enciphering is the process of replacing each letter in a text by its counterpart in the cipher. Deciphering is the process of applying the inverse permutation to an enciphered message to recover the original message.
  • The message prior to encryption is called the plaintext message. The alphabet in its natural order is the plaintext alphabet. Traditionally, plaintext is written in all lowercase letters.
  • The encrypted message is called the ciphertext. Traditionally, ciphertext is written in all uppercase letters.

Substitution ciphers and keys

The substitution cipher has been used to encipher messages since the time of Julius Caesar. Its simplicity kept it in use until the Renaissance, even though Arab mathematicians developed frequency analysis in the ninth century and the Europeans knew how to break substitution ciphers for long messages in the 1500s. Today the substitution cipher is used mainly to construct recreational puzzles.

A substitution cipher is determined by a key that, in conjunction with some algorithm, transforms a plaintext alphabet into a cipher alphabet. A special case of the substitution cipher is the simple Caesar cipher (or shift cipher), which has only 25 possible keys and is therefore susceptible to a brute force attack (just check all possible shifts). However, there are 26! = 4 x 1026 permutations of 26 letters, so the general substitution cipher is resistant to a naive brute force attack.

Before the modern computer, general substitution ciphers were not used often because the agents would need to remember the complete 26-element permutation, and the process of encipherment by hand was slow and subject to error. However, if you have access to a programming language that includes a random number generator, the key can be any easily remembered seed for the generator.

Here's how a programming language like SAS/IML makes it simple to encipher a message. First, choose a random number seed as the encryption key. Then create a cipher by using the SAMPLE function to sample without replacement ("WOR") from the plaintext alphabet. For example, I might choose 432013 as the secret key because I can remember that my book Simulating Data with SAS was published on April 3, 2013. The following SAS/IML program constructs the cipher:

proc iml;
key = 432013;            
call randseed(key, 1);   /* reset random number stream; initialize with key */
alphabet = "a":"z";      /* plaintext alphabet */
cipher = upcase(sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
print (alphabet//cipher)[r={"alphabet" "cipher"}];
t_subcipher1

The output shows the cipher alphabet beneath the plaintext alphabet. For the key 432013, the letter 'a' is enciphered as 'D', 'b' goes to 'C', 'c' goes to 'Z', and so on. For this permutation, 'x' is enciphered as 'X', so not every letter is changed by the encipherment.

The program uses a trick that you might not have seen before. The RANDSEED subroutine in SAS/IML has an optional second argument. If the second argument has the value 1, then the random number generator is reset. By calling the RANDSEED subroutine in this way, you can ensure that the encipherment does not depend on the current state of the random number generator.

You can use the TRANSLATE function in Base SAS to replace every element in the plaintext alphabet by the associated letter in the enciphered alphabet. The ROWCAT function in SAS/IML concatenates the 26-element arrays into a single character string with 26 letters. The following statements enciphers one of my favorite quotes by Isaac Newton:

msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
plaintext = lowcase(msg);
ciphertext = translate(plaintext, rowcat(cipher), rowcat(alphabet));
print ciphertext;
t_subcipher3

The output of the program is an enciphered message that is in the spirit of the Cryptoquote puzzle. Use this technique to construct puzzles for your puzzle-loving friends!

Modules to encipher and decipher messages

If you and your friend want to exchange messages by using a substitution cipher, you need to agree on a key and an algorithm for constructing a permutation of the alphabet. The following SAS/IML modules use the SAMPLE function in SAS/IML to construct a cipher from the key. The Encipher module applies the permutation to a message, thus encrypting it. The Decipher module applies the inverse permutation, thus decrypting the enciphered text.

start Encipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";            
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   ciphertext = translate(lowcase(msg), rowcat(cipher), rowcat(alphabet));
   return (ciphertext);
finish;
 
start Decipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";           
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   plaintext = translate(upcase(msg), rowcat(alphabet), rowcat(cipher));
   return (plaintext);
finish;
 
msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
cryptogram = Encipher(key, msg);
soln = Decipher(key, cryptogram);
print soln;
t_subcipher2

Now you and your friends can exchange enciphered messages such as "Want to meet for lunch today at eleven forty-five?" Try it out and let me know what you think. But remember not to say nasty things about your boss: the substitution cipher is not a secure means of communication, as I will demonstrate in a future blog post.

Post a Comment

How to create a hexagonal bin plot in SAS

While I was working on my recent blog post about two-dimensional binning, a colleague asked whether I would be discussing "the new hexagonal binning method that was added to the SURVEYREG procedure in SAS/STAT 13.2." I was intrigued: I was not aware that hexagonal binning had been added to a SAS procedure!

Hexagonal binning is an alternate to the usual rectangular binning, which I blogged about last week. It has been promoted in the statistical community by Daniel Carr (JASA, 1987), who recommends it for visualizing the density of bivariate data in a scatter plot. Some research suggests that using hexagonal bins results in less visual bias than rectangular bins, especially for clouds of points that are roughly elliptical. A hexagonal bin plot is created by covering the data range with a regular array of hexagons and coloring each hexagon according to the number of observations it covers. As with all bin plots, the hex-binned plots are good for visualizing large data sets for which a scatter plot would suffer from overplotting. The bin counts estimate the density of the observations. (In SAS, you can also use PROC KDE to compute a kernel density estimate.)

When I looked at the documentation for the SURVEYREG procedure in SAS/STAT 13.2, I found the feature that my friend mentioned. The PLOTS= option on the PROC SURVEYREG statement supports creating a plot that overlays a regression line on a hex-binned heat map of two-dimensional data. Although there is no option in PROC SURVEYREG to remove the regression line, you can still use the procedure to output the counts in each hexagonal bin.

Creating a heat map of counts for hexagonal bins

You can obtain the counts and the vertices of the hexagonal bins by using a trick that I blogged about a few years ago: Use ODS to create a SAS data set that contains the data underlying the graph. You can then use the POLYGON statement in PROC SGPLOT to create a hex-binned plot of the counts.

Because I want to create several hexagonal heat maps, I will wrap these two steps into a single macro call:

/* Use hexagonal binning to estimate the bivariate density.
   Requires SAS/STAT 13.2 (SAS 9.4m2) */
%macro HexBin(dsName, xName, yName, nBins=20, colorramp=TwoColorRamp);
  ods select none;
  ods output fitplot=_HexMap;  /* write graph data to a data set */
  proc surveyreg data=&dsname plots(nbins=&nBins weight=heatmap)=fit(shape=hex);
     model &yName = &xName;
  run;
  ods select all;
 
  proc sgplot data=_HexMap;
     polygon x=XVar y=YVar ID=hID / colorresponse=WVar fill 
                                    colormodel=&colorramp;
  run;
%mend;

The %HexBin macro calls PROC SURVEYREG to create the heat map with hexagonal binning. However, the graph is not displayed on the screen, but is redirected to a data set. The data set is in a format that can be directly rendered by using the POLYGON statement in PROC SGPLOT.

To run the macro, you have to specify a data set name, the name of an X variable, and the name of a Y variable. By default, the data are binned into approximately 20 bins in both directions. You can control the number of bins by using the NBINS= option. Unlike for rectangular bins, you don't get 400 hexagons when you specify NBINS=20. Instead, the procedure computes the size of the hexagons so that they "have approximately the same area as the same number of rectangular bins would have." However, as you would expect, a large value for the NBINS= option results in many small bins, whereas a small value results in a few large bins.

By default, a two-color color ramp is used to visualize the counts, but you can specify other color ramps by using the COLORRAMP= option. Inside the macro, I chose not to use the OUTLINE option on the POLYGON statement. If you prefer to see outlines for the hexagons, feel free to modify the macro (or just tack it to the end of the COLORRAMP= option).

The following statement calls the %HexBin macro to create a hexagonal bin plot of the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set. My previous blog post shows a heat map of the same data by using rectangular bins.

title "Birth Weight vs. Mother's Weight gain";
%HexBin(sashelp.bweight, MomWtGain, Weight);
hexbin1

The hex-binned heat map reminds me of the effect of using transparency to overcome overplotting. You see a faint light-blue haze where the density of points is low. Dark colors indicate regions for which the density of points is high (more than 3,000 points per bin, for this example).

A fun set of data to visualize is the sashelp.Zipcode data because if you plot the longitude and latitude variables, the hexagons create a pixelated version of the US! Bins with high counts indicate regions for which there are many zip codes packed closely together, such as major metropolitan areas. Rather than use the default color ramp, the following example uses the PALETTE function in SAS/IML software to create a yellow-orange-red color ramp. Those colors are pasted into the COLORRAMP= option; be sure to use parentheses if you specify your own colors for a color ramp. I thought it would be appropriate to use hex values to specify the colors for the hexagons, but you can also define a color ramp by using color names such as colorramp=(lightblue white lightred red).

proc iml;
   c = palette("YLORRD", 5); print c;
quit;
 
data zips;
set sashelp.Zipcode(where=(State<=56 & State^=2 & State^=15 & X<0));
run;
 
title "Density of US Zip Codes";
%HexBin(zips, X, Y, nBins=51,
        colorramp=(CXFFFFB2 CXFECC5C CXFD8D3C CXF03B20 CXBD0026));
hexbin2

The heat map of hex-binned counts shows "hot spots" of density along the East Coast (Washington/Baltimore, Philadelphia, New York), along the West Coast (Los Angeles, the Bay Area, Seattle), and at various other major cities (Denver, Dallas, Chicago, ...). Chicago is also visible near (X,Y)=(-88, 42), although at this scale of binning Lake Michigan is not visible as a collection of empty hexagons, which makes it hard to locate midwest cities.

This hex-binned heat map can be a useful alternative to a scatter plot when the scatter plot suffers from overplotting or when you want to estimate the density of the observations. It requires SAS/STAT 13.2. Between hexagonal binning, the rectangular bin plot from my previous blog post, and PROC KDE, there are now multiple ways in SAS to visualize the density of bivariate data.

Post a Comment

Counting observations in two-dimensional bins

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of observations in each bin. The BIN function associates a bin to each observation, and the TABULATE subroutine computes the count for each bin. I have previously blogged about how to use this technique to count the observations in bins of equal or unequal width.

bin2d2

You can also count the number of observations in two-dimensional bins. Specifically, you can divide the X direction into kx bins, divide the Y direction into ky bins, and count the number of observations in each of the kx x ky rectangles. In a previous article, I described how to bin 2-D data by using the SAS/IML language and produced the scatter plot at the left. The graph displays the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set.

I recently remembered that you can also count observations in 2-D bins by using the KDE procedure. The BIVAR statement in the KDE procedure supports an OUT= option, which writes a SAS data set that contains the counts in each bin. You can specify the number of bins in each direction by using the NGRID= option on the BIVAR statement, as shown by the following statements:

ods select none;
proc kde data=sashelp.bweight;
   bivar MomWtGain(ngrid=11) Weight(ngrid=7) / out=KDEOut;
run;
ods select all;
 
proc print data=KDEOut(obs=10 drop=density);
run;
hist2bin1

As shown in the output, the the VALUE1 and VALUE2 variables contain the coordinates of the center of each bin. The COUNT variable contains the bin count. You can read the counts into SAS/IML and print it out as a matrix. In the data set, the Y variable changes the fastest as you go down the rows. Consequently, you need to use the SHAPECOL function rather than the SHAPE function to form the 7 x 11 matrix of counts:

proc iml;
kX = 11; kY = 7;
use KDEOut; read all var {Value1 Value2 Count}; close;
M = shapecol(Count, kY, kX);       /* reshape into kY x kX matrix */
/* create labels for cells in X and Y directions */
idx = do(1,nrow(Value1),kY);        /* location of X labels */
labelX = putn(Value1[idx], "5.1");
labelY = putn(Value2[1:kY], "5.0"); 
print M[c=labelX r=labelY];
hist2bin2

As I explained in Monday's blog post, there are two standard ways to choose bins for a histogram: the "midpoints" method and the "endpoints" method. If you compare the bin counts from PROC KDE to the bin counts from my SAS/IML program, you will see small differences. That is because the KDE procedure uses the "midpoints" algorithm for subdividing the data range, whereas I used the "endpoints" algorithm for my SAS/IML program.

Last week I showed how to use heat maps to visualize matrices in SAS/IML. This matrix of counts is begging to be visualized with a heat map. Because the counts vary across several orders of magnitude (from 100 to more than 104), a linear color ramp will not be an effective way to visualize the raw counts. Instead, transform the counts to a log scale and create a heat map of the log-counts:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="ThreeColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";
hist2bin3

The heat map shows the log-count of each bin. If you prefer a light-to-dark heatmap for the density, use the "TwoColor" value for the COLORRAMP= option. The heat map is a great way to see the distribution of counts at a glance. It enables you to see the approximate values of most cells, and you can easily determine cells where there are many or few observations. Of course, if you want to know the exact value of the count in each rectangular cell, look at the tabular output.

Post a Comment

Choosing bins for histograms in SAS

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing "round numbers" for the location and width of bins. Common "round" bin widths include 1, 2, 2.5, and 5, as well as these numbers multiplied by a power of 10.

The default bin width and locations tend to work well for 95% of the data that I plot, but sometimes I decide to override the default choices. This article describes how to set the width and location of bins in histograms that are created by the UNIVARIATE and SGPLOT procedures in SAS.

Why override the default bin locations?

The most common reason to override the default bin locations is because the data have special properties. For example, sometimes the data are measured in units for which the common "round numbers" are not optimal:

  • For a histogram of time measured in minutes, a bin width of 60 is a better choice than a width of 50. Bin widths of 15 and 30 are also useful.
  • For a histogram of time measured in hours, 6, 12, and 24 are good bin widths.
  • For days, a bin width of 7 is a good choice.
  • For a histogram of age (or other values that are rounded to integers), the bins should align with integers.

You might also want to override the default bin locations when you know that the data come from a bounded distribution. If you are plotting a positive quantity, you might want to force the histogram to use 0 as the leftmost endpoint. If you are plotting percentages, you might want to force the histogram to choose 100 as the rightmost endpoint.

To illustrate these situations, let's manufacture some data with special properties. The following DATA step creates two variables. The T variable represents time measured in minutes. The program generates times that are normally distributed with a mean of 120 minutes, then rounds these times to the nearest five-minute mark. The U variable represents a proportion between 0 and 1; it is uniformly distributed and rounded to two decimal places.

data Hist(drop=i);
label T = "Time (minutes)" U = "Uniform";
call streaminit(1);
do i = 1 to 100;
   T = rand("Normal", 120, 30); /* normal with mean 120  */
   T = round(T, 5);             /* round to nearest five-minute mark */
   U = rand("Uniform");         /* uniform on (0,1) */
   U = floor(100*U) / 100;      /* round down to nearest 0.01 */
   output;
end;
run;

How do we control the location of histogram bins in SAS? Read on!

Custom bins with PROC UNIVARIATE: An example of a time variable

I create histograms with PROC UNIVARIATE when I am interested in also computing descriptive statistics such as means and quantiles, or when I want to fit a parametric distribution to the data. The following statements create the default histogram for the time variable, T:

title "Time Data (N=100)";
ods select histogram(PERSIST);  /* show ONLY the histogram until further notice */
proc univariate data=Hist;
histogram T / odstitle=title odstitle2="Default Bins";
run;
histbin1

The default bin width is 20 minutes, which is not horrible, but not as convenient as 15 or 30 minutes. The first bin is centered at 70 minutes; a better choice would be 60 minutes.

The HISTOGRAM statement in PROC UNIVARIATE supports two options for specifying the locations of bins. The ENDPOINTS= option specifies the endpoints of the bins; the MIDPOINTS= option specifies the midpoints of the bins. The following statements use these options to create two customize histograms for which the bin widths are 30 minutes:

proc univariate data=Hist;
histogram T / midpoints=(60 to 210 by 30)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram T / endpoints=(60 to 210 by 30)  odstitle=title odstitle2="Endpoints";
run;
histbin2

The histogram on the left has bins that are centered at 30-minute intervals. This histogram makes it easy to estimate that about 40 observations are approximately 120 minutes. The counts for other half-hour increments are similarly easy to estimate. In contrast, the histogram on the right has bins whose endpoints are 60, 90, 120,... minutes. With this histogram, it easy to see that about 35 observations have times that are between 90 and 120 minutes. Similarly, you can estimate the number of observations that are greater than three hours or less than 90 minutes.

Both histograms are equally correct. The one you choose should depend on the questions that you want to ask about the data. Use midpoints if you want to know how many observations have a value; use endpoints if you want to know how many observations are between two values.

If you run the SAS statements that create the histogram on the right, you will see the warning message
WARNING: The ENDPOINTS= list was extended to accommodate the data.
This message informs you that you specified the last endpoint as 210, but that additional bins were created to display all of the data.

Custom bins for a bounded variable

As mentioned earlier, if you know that values are constrained within some interval, you might want to choose histogram bins that incorporate that knowledge. The U variable has values that are in the interval [0,1), but of course PROC UNIVARIATE does not know that. The following statement create a histogram of the U variable with the default bin locations:

title "Bounded Data (N=100)";
proc univariate data=Hist;
histogram U / odstitle=title odstitle2="Default Bins";
run;
histbin3

The default histogram shows seven bins with a bin width of 0.15. From a statistical point of view, this is an adequate histogram. The histogram indicates that the data are uniformly distributed and, although it is not obvious, the left endpoint of the first bin is at 0. However, from a "human readable" perspective, this histogram can be improved. The following statements use the MIDPOINTS= and ENDPOINTS= options to create histograms that have bin widths of 0.2 units:

proc univariate data=Hist;
histogram U / midpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram U / endpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Endpoints";
run;
ods select all;  /* turn off PERSITS; restore normal output */
histbin4

The histogram on the left is not optimal for these data. Because we created uniformly distributed data in [0,1], we know that the expected count in the leftmost bin (which is centered at 0) is half the expected count of an inner bin. Similarly, the expected count in the rightmost bin (which is centered at 1) is half the count of an inner bins because no value can exceed 1. Consequently, this choice of midpoints is not very good. For these data, the histogram on the right is better at revealing that the data are uniformly distributed and are within the interval [0,1).

Custom bins with PROC SGPLOT

If you do not need the statistical power of the UNIVARIATE procedure, you might choose to create histograms with PROC SGPLOT. The SGPLOT procedure supports the BINWIDTH= and BINSTART= options on the HISTOGRAM statement. The BINWIDTH= option specifies the width for the bins. The BINSTART= option specifies the center of the first bin.

I recommend that you specify both the BINWIDTH= and BINSTART= options, and that you choose the bin width first. Be aware that not all specifications result a valid histogram. If you make a mistake when specifying the bins, you might get the following error
WARNING: The specified BINWIDTH= value will be ignored in order to accommodate the data.
That message usually means that the minimum value of the data was not contained in a bin. For a bin width of h, the BINSTART= value must be less than xmin + h/2, where xmin is the minimum value of the data.

By default, the axis does not show a tick mark for every bin, but you can force that behavior by using the SHOWBINS option. The following statements call the SGPLOT procedure to create histograms for the time-like variable, T. The results are again similar to the custom histograms that are shown in the previous section:

title "Time Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=60 showbins; /* center first bin at 60 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=52.5;  /* 52.5 = 45 + h/2 */
xaxis values=(45 to 180 by 15);           /* alternative way to place ticks */
run;

The following statements call the SGPLOT procedure to create histograms for the bounded variable, U. The results are similar to those created by the UNIVARIATE procedure:

title "Bounded Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0 showbins;  /* center first bin at 0 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.1;      /* center first bin at h/2 */
xaxis values=(0 to 1 by 0.2);       
run;

In summary, for most data the default bin width and location result in a histogram that is both statistically useful and easy to read. However, the default choices can lead to a less-than-optimal visualization if the data have special properties, such as being time intervals or being bounded. In those cases, it makes sense to choose a bin width and a location of the first bin such that reveals your data's special properties. For the UNIVARIATE procedure, use the MIDPOINTS= or ENDPOINTS= options on the HISTOGRAM statement. For the SGPLOT procedure, use the BINWIDTH= and BINSTART= options to create a histogram with custom bins.

Post a Comment