Read from one data set and write to another with SAS/IML

Many people know that the SAS/IML language enables you to read data from and write results to multiple SAS data sets. When you open a new data set, it is a good programming practice to close the previous data set. But did you know that you can have two data sets open simultaneously, one for reading and one for writing? The trick is to use the SETIN and SETOUT statements to specify the data sets that are used for reading and writing. Among other advantages, this technique enables you to use SAS/IML to analyze data that might be larger than can fit into the RAM on your computer.

This article is motivated by a question posted to the SAS/IML Support Community. The SAS/IML programmer was essentially running a BY-group analysis in SAS/IML. The user wanted to use the WHERE clause on the READ statement to read the data for each BY-group.

To illustrate the technique, consider the following BY-group regression analysis of the vehicles in the Sashelp.Cars data set. For each value of the Make variable (Acura, Audi, BMW,...), PROC REG performs an ordinary least-squares (OLS) analysis that uses the Wheelbase and Length variables to model the Weight of a vehicle. The parameter estimates are saved to the OutEst data set:

proc reg data=sashelp.cars noprint outest=OutEst;
   by Make;
   model Weight = Wheelbase Length;
run;

Let's create a similar analysis in the SAS/IML language. Although SAS/IML is often used to perform analyses that are not available from any SAS procedure, this simple and familiar analysis will enable us to concentrate on how to read and write data sets simultaneously. The following SAS/IML statements define a module that computes the parameter estimates for an OLS model. Because some of the BY groups result in singular linear systems, the GINV function is used to solve for the parameter estimates, rather than the more efficient SOLVE function.

proc iml;
start OLSParamEst(Y, _X);
   X = j(nrow(_X),1,1) || _X;   /* add intercept column to design matrix */
   XPX = X` * X;                /* cross-products      */
   /* handle singular systems by using GINV rather than SOLVE */
   Estimate = ginv(XPX) * X`*Y; /* parameter estimates */
   return( T(Estimate) );       /* return as row vector */
finish;

The BY-group analysis will be carried out one category at a time. The following statements open the input data set and read the BY-group categories. The output data set (which will contain the parameter estimates) is also opened for writing. Notice that you need to specify the names and types of the output variables on the CREATE statement, including the length of any character variables. Lastly, the SETIN and SETOUT statements specify the data sets that will be used for subsequent reading and writing operations:

use Sashelp.cars;                   /* input data */
read all var {"Make"};              /* read BY var */
byGroup = unique( Make );           /* compute unique levels */
 
ByVal = BlankStr(nleng(Make));      /* set length of ByVal variable (BLANKSTR requires SAS/IML 12.3) */
OutVarNames = {"Intercept" "Wheelbase" "Length"}; 
Estimate = {. . .};                 /* output variables are numeric */
create RegOut from Estimate[r=ByVal c=OutVarNames];
 
setin Sashelp.Cars;                 /* make current for reading */
setout RegOut;                      /* make current for writing */

The SETIN statement means "every time you want to read, use the Sashelp.Cars data set." The SETOUT statement means "every time you want to write, use the RegOut data set." The following statements loop over the unique categories of the Make variable, and perform an OLS analysis for the vehicles that have a common make. You can use the INDEX statement to speed up reading the BY groups.

InVarNames = {"Weight" "Wheelbase" "Length"};    /* input variables */
index Make;                           /* index by BY group variable */
do i = 1 to ncol(byGroup);
   ByVal = byGroup[i];
   read all var InVarNames where(Make=(ByVal));  /* read from Cars */ 
   Estimate = OLSParamEst(Weight, Wheelbase||Length);
   append from Estimate[r=ByVal];                /* write to RegOut */
end;
 
close Sashelp.Cars RegOut;

The program computes 38 sets of parameter estimates, one for each unique value of the Make variable. Because these data fit entirely into RAM, you could also perform this analysis by using the UNIQUE-LOC or UNIQUEBY techniques, as sketched below.
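
For comparison, here is a minimal sketch of the UNIQUE-LOC approach for these data. It assumes the same PROC IML session (so that the OLSParamEst module is defined) and omits writing the estimates to a data set:

/* sketch of the UNIQUE-LOC technique; the data must fit in RAM */
use Sashelp.Cars;
read all var {"Make" "Weight" "Wheelbase" "Length"};
close Sashelp.Cars;
u = unique(Make);                  /* unique BY-group values */
do i = 1 to ncol(u);
   idx = loc(Make = u[i]);         /* rows that belong to this BY group */
   Estimate = OLSParamEst(Weight[idx], Wheelbase[idx]||Length[idx]);
   /* ... append Estimate to an output data set, as before ... */
end;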

In summary, use the SETIN and SETOUT statements when you want to have two SAS data sets open simultaneously, one for reading and one for writing. By the way, you can also use the SETOUT statement to write to several data sets, such as writing predicted and residual values to one while writing parameter estimates to another.
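
For example, a minimal sketch of the two-output idea might look like the following. It reuses the Estimate, OutVarNames, and byGroup variables from the program above; the FitOut data set and the Pred and Resid variables are hypothetical, and the computations inside the loop are elided:

/* hypothetical sketch: switch the current output data set inside a loop */
create ParamOut from Estimate[c=OutVarNames];  /* holds parameter estimates   */
create FitOut var {"Pred" "Resid"};            /* holds predictions, residuals */
do i = 1 to ncol(byGroup);
   /* ... compute Estimate, Pred, and Resid for the i_th BY group ... */
   setout ParamOut;  append from Estimate;     /* write the estimates */
   setout FitOut;    append;                   /* write Pred and Resid */
end;
close ParamOut FitOut;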


Handling run-time errors in user-defined modules

I received the following email from a SAS/IML programmer:

I am getting an error in a PROC IML module that I wrote. The SAS Log says
NOTE: Paused in module NAME
When I submit other commands, PROC IML doesn't seem to understand them. How can I continue the program? The only thing that works is to QUIT and resubmit the entire PROC IML program.

I'm glad you asked this question. Handling run-time errors that occur in user-defined modules can be frustrating unless you understand what the "Paused in module" message means. If you haven't already read my article on scope, local, and global variables in SAS/IML, you might want to read it now. It explains the difference between local scope (inside a module) and main scope (outside of any module). Also, familiarize yourself with the difference between a syntax error and a run-time error.

A run-time error in a module causes PROC IML to pause

Remember that the "I" in "IML" stands for interactive. When there is a run-time error in a user-defined module, PROC IML calls the PAUSE statement. This causes the program execution to pause while still inside the module. While the program is paused, you have the opportunity to debug the module, primarily by using PRINT statements to display the values of the local variables in the module. When you "submit other commands," those statements are being executed from within the module, rather than at main scope, which is probably why you say that PROC IML "doesn't seem to understand" the commands.

When a run-time error occurs in a user-defined module in PROC IML, which includes modules in the IMLMLIB library, you have three options:

  • Fix the error and resubmit the entire PROC IML program.
  • Debug the problem and try to fix it from inside the module. For example, suppose the error occurs while trying to compute some quantity, Z, at Line 2 in the module. While the module is paused, you can assign Z a value and submit the RESUME statement. The program will continue executing from the next statement in the module. That is, from Line 3.
  • Submit the STOP statement to leave all modules and return to the main scope of the program. This is especially useful if the cause of the error was a bad parameter value. You can immediately call the module again with different parameter values.

The first option is easy, but if the program has just finished a long-running computation, you might want to try to salvage those computations, rather than restart the program.

Debug, fix, and RESUME

The following statements define a function that evaluates the quantity sqrt(x - 1). If you call the function with a value of x that is less than 1, a run-time error will occur. Let's intentionally cause an error so that the program pauses in the SQRTM1 module:

proc iml;
start Sqrtm1(x);
   y = x - 1;    /* Line 1: y is defined inside the module */
   z = sqrt(y);  /* Line 2 */
   s = 17;       /* Line 3 */
   return( z );  /* Line 4 */
finish;
 
A = 10;         /* A is defined outside the module */
B = Sqrtm1(0);  /* create run-time error on Line 2 of module */

The SAS Log shows that an error occurred on Line 2 of the module. The Log also says NOTE: Paused in module SQRTM1. The module has executed Line 2 but has not executed Line 3, so you can submit statements that refer to the local variables x and y, which were defined before the error occurred. The statements cannot refer to the local variable s, because s has not yet been assigned. The statements also cannot refer to the variable A, because that variable exists only at main scope; the symbol is not known from within the module.

You can convince yourself that the program is paused within the module by submitting the following PRINT statements:

/* the program is paused inside the module */
print x y;   /* you can print local vars inside the module */
print A;     /* ERROR: this var is not defined inside the module */

If you want, you can assign a value to z and then resume execution of the program. The RESUME statement causes the program to resume execution beginning with Line 3 of the module. If the rest of the program executes without error, the SAS/IML environment returns to main scope, as shown by the following statements:

z = .;        /* within module: assign local variable z */
resume;       /* resume execution of program at Line 3  */
print A B;    /* now at main scope: A and B are defined */

STOP the program and return to main scope

Often it is impractical to "fix" the error from within a module. For example, you might have called a module that called a second module that called a third module, and the error occurred in the third module. Sometimes the wisest course of action is to get out of all the modules and return to the main calling environment. You can use the STOP statement to "pop" the stack of modules and return to main scope. The STOP statement is like a super-duper RETURN statement: it not only returns from the module with the error, but it returns from the complete chain of modules that were being executed. The STOP statement does not exit PROC IML, so use STOP when you want to retain the results from a prior long-running computation.

For example, the following statements once again create a run-time error in the SQRTM1 module. The module execution pauses after the error. If you execute the STOP statement (which, as of SAS/IML 13.1, supports an optional message), you will return to the main scope of the program.

B = Sqrtm1(0);                  /* create run-time error on Line 2 */
stop "Return from the module";  /* SAS/IML 13.1 supports optional message */
print A;                        /* now at main scope: A is defined */

Debugging in SAS/IML Studio

Errors in the SAS/IML Studio application are handled differently. SAS/IML Studio contains point-and-click features for finding and fixing run-time errors. For an example, see the section "Run-Time Error" in the article "How to find and fix programming errors."

Summary

This article describes how to deal with errors in user-defined modules in PROC IML. When a run-time error occurs in a module, the program pauses execution, but does not exit PROC IML. In fact, the program is paused inside the module that experienced the error. You have three choices to handle the error: you can use the QUIT statement to exit PROC IML, you can debug and fix the problem and then use the RESUME statement to continue the program, or you can use the STOP statement to pop out of all modules and return to the main scope of the program.


An exploratory technique for visualizing the distributions of 100 variables

In a previous blog post I showed how to order a set of variables by a statistic. After reshaping data, you can create a graph that contains box plots for many variables. Ordering the variables by some statistic (mean, median, variance,...) helps to differentiate and distinguish the variables.

You can use this as an exploratory technique. Suppose that you are given a new set of data that has 100 variables. Before you begin to analyze these variables, it is useful to visualize their distributions. Are the data normal or skewed? Are there outliers for some variables? You could use PROC UNIVARIATE to create 100 histograms and 100 sets of descriptive statistics, but it would be tedious to slog through hundreds of pages of output, and difficult to use the histograms to compare the variables.

In contrast, you can fit 100 box plots on a single graph. The resulting graph enables you to see all 100 distributions at a glance and to compare the distributions to each other. If the variables are measured on vastly different scales, you might want to standardize the variables so that the box plots are comparable in scale.

Creating 100 variables

As an example, I will simulate 100 variables with 1,000 observations in each variable. The following SAS/IML statements loop over each column of a matrix X. The SAMPLE function randomly chooses a distribution (normal, lognormal, exponential, or uniform) for the data in the column. I use the RANDFUN function to generate the data for each column. The 100 variables are then written to a SAS data set and given the names X1–X100.

/* create 100 variables with random data from a set of distributions */
proc iml;
N = 1000;  p = 100;
distrib = {"Normal" "Lognormal" "Exponential" "Uniform"};
call randseed(1);
 
/* each column of X is from a random distribution */
X = j(N,p);
do i = 1 to p;
   X[,i] = randfun(N, sample(distrib,1));
end;
 
varNames = "x1":("x"+strip(char(p)));
create ManyVars from X[colname=varNames]; append from X; close;
quit;

Visualizing the 100 distributions

Suppose that you are asked to visualize the distributions in the ManyVars data set. Previously I showed how to use Base SAS to standardize the variables, reshape the data, and create the box plots. The following statements show how to implement the technique by using SAS/IML software. For no particular reason, I order these variables by the third quartile, rather than by the median:

proc iml;
use ManyVars;
read all var _NUM_ into X[colname=varNames];
close ManyVars;
N = nrow(X);
 
MinX = X[><, ]; MaxX = X[<>, ];
X = (X-MinX)/ (MaxX-MinX);       /* standardize vars into [0,1] */
call qntl(q3, X, 0.75);          /* compute 3rd quartile */
 
r = rank(q3);                    /* sort variables by that statistic */
temp = X;        X[,r] = temp;
temp = varNames; varNames[,r] = temp;
 
_Value_ = shapecol(X,0,1);      /* convert from wide form to long form */
varName = shapecol(repeat(varNames, N),0,1);
 
title "Variables ordered by Q3";          /* plot the distributions */
ods graphics / width=2400 height=600;
call box(_Value_) category=varName 
     other="xaxis discreteorder=data display=(nolabel);";

In the resulting graph, you can see the four different distributional shapes, although you have to look closely to differentiate the lognormal variables (which end with variable X83) and the exponential variables (which begin with variable X41). The changes in shape between the normal data (X69) and the uniform data (X99) are apparent. Also, notice that the skewed distributions have means larger than their medians, whereas for the symmetric distributions the sample means and medians are approximately equal. The graph provides a quick-and-dirty view of the data, which is always useful when you encounter new data.

I standardized the variables into the interval [0,1], but other standardizations are possible. For example, to standardize the variable to mean zero and unit variance, use the statement X = (X-mean(X))/ std(X).

The SAS/IML program is remarkable for its compactness, especially when compared to the Base SAS implementation. Three statements are used to standardize the data, but I could have inlined the computations into a single statement. One statement is used to compute the quantiles. Reordering the variables is accomplished by using "the rank trick". The SHAPECOL function makes it easy to convert the data from wide form to long form. Finally, the BOX call makes it easy to create the box plot without even leaving the SAS/IML environment. These statements could be easily encapsulated into a single SAS/IML module that could be reused to visualize other data sets.
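
For example, such a module might look like the following sketch. The module name is mine, and it hard-codes the third quartile as the ordering statistic; a more general version could accept the statistic as an argument:

/* a sketch of one possible encapsulation; the details are mine */
start BoxPlotByQ3(Y, labels);
   X = Y;  names = labels;              /* local copies; don't modify arguments */
   MinX = X[><,];  MaxX = X[<>,];
   X = (X - MinX) / (MaxX - MinX);      /* standardize vars into [0,1] */
   call qntl(q3, X, 0.75);              /* ordering statistic */
   r = rank(q3);                        /* the "rank trick" */
   temp = X;      X[,r] = temp;
   temp = names;  names[,r] = temp;
   _Value_ = shapecol(X, 0, 1);         /* convert from wide form to long form */
   varName = shapecol(repeat(names, nrow(X)), 0, 1);
   call box(_Value_) category=varName
        other="xaxis discreteorder=data display=(nolabel);";
finish;
 
run BoxPlotByQ3(X, varNames);           /* usage: data matrix and variable names */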

I think the visualization technique is powerful, although it is certainly not a panacea. When confronted with new data, this technique can help you to quickly learn about the distribution of the variables.

What do you think? Would you use this visualization as one step in an exploratory analysis? What are some of the strengths and weaknesses of this technique? Leave a comment.


Order variables by values of a statistic


When I create a graph of data that contains a categorical variable, I rarely want to display the categories in alphabetical order. For example, the box plot at the top of this article is a plot of 10 standardized variables where the variables are ordered by their median value. The ordering makes it apparent that the variables on the left (Invoice, MSRP) are highly skewed with many outliers, whereas the rightmost variable (Length) appears to be symmetric and perhaps normally distributed.

When you use the SG procedures in SAS, there are several ways to change the order in which categories appear in a graph. One extremely powerful method is to use the DISCRETEORDER=DATA option on the XAXIS statement of the SGPLOT procedure. The DISCRETEORDER=DATA option tells the SGPLOT procedure to display the categories in the order in which they appear in the data set. In other words, if you can figure out some way to order the observations (DATA step, PROC SORT, PROC SQL,....), then you will see that same order when you create a graph of the data.

To illustrate this idea, let's create the box plot at the top of this article. The data are standardized versions of the numerical variables in the Sashelp.Cars data set. The statistic of interest is the median. The following statements create the data and compute the medians:

%let DSName = sashelp.cars;
%let Stat = Median; /* or mean, stddev, qrange, skew, etc */
/* standardize variables to [0,1] */
proc stdize method=range data=&DSName out=MyData; run;
 
proc means data=MyData &Stat STACKODSOUTPUT; 
ods output Summary=StatOut;
run;

The ODS OUTPUT statement is used to create a data set with the median statistic for each variable. The STACKODSOUTPUT option is used so that the output data set has the same structure as the printed table.

The next step is to get the list of variables, ordered by the median statistic. My favorite method is to use the SELECT INTO :MACROVAR syntax in PROC SQL. I thank my colleague Gordon for reminding me that I do not need PROC SORT to sort the variables; I can use the ORDER BY clause in PROC SQL instead.

/* put ordered variable names into macro variable */
proc sql noprint;                              
 select Variable into :varlist separated by ' '
 from StatOut order by &Stat;
quit;

The macro variable &VARLIST now contains a list of the variables, ordered by their medians. In order to create a single graph that has multiple box plots, transpose the data from wide form (many variables) to long form (two variables that specify the variable name and value). The following statements add an "observation number" variable to the data and use the RETAIN statement to reorder the variables. Then PROC TRANSPOSE converts the data set from wide form to long form. In the Long data set, the variable names are ordered by the value of the statistic. (Alternatively, you could merge the statistics and the original data and then sort the merged data by the statistic and the observation number; a sketch of that alternative appears after the PROC TRANSPOSE step.)

data Wide / view=Wide;     
retain &varlist;           /* reorder vars by statistic */
set MyData;
obsNum = _N_;              /* add ID for subject (observation) */
keep obsNum &varlist;
run;
 
/* transpose from wide to long data format; VARNAME is a categorical var */
proc transpose data=Wide name=VarName 
               out=Long(rename=(Col1=_Value_) drop=_LABEL_);
by obsNum;
run;
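
If you prefer the merge-and-sort alternative, the following PROC SQL sketch (mine, and untested) joins the statistic onto the transposed data and sorts by the statistic and the observation number:

/* sketch: merge the statistic onto the long data, then sort */
proc sql;
   create table Long2 as
   select t.obsNum, t.VarName, t._Value_
   from Long as t inner join StatOut as s
        on t.VarName = s.Variable
   order by s.&Stat, t.obsNum;
quit;

Either way, the long-form data are ordered by the value of the statistic.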

Data in this form is ideal for graphing by using SGPLOT statements that support a GROUP= or CATEGORY= option. To create the graph at the beginning of this article, use the VBOX statement with the CATEGORY= option. If you also use the DISCRETEORDER=DATA option on the XAXIS statement, the box plots are ordered by their median values, as shown in the following:

title "Box Plots of Standardized Variables, Ordered by &Stat";
proc sgplot data=Long;
   label _Value_ = "Standardized Value" VarName="Variable";
   vbox _Value_ / category=VarName;
   xaxis discreteorder=data display=(nolabel);        /* respect ordering */
run;

Obviously, you can use this same technique to create box plots that are ordered by other statistics.

What technique do you use to order groups or categories in a graph? Can you think of applications for which this technique would be useful? Leave a comment.


Ciphers, keys, and cryptoquotes

Today is my fourth blog-iversary: the anniversary of my first blog post in 2010. To celebrate, I am going to write a series of fun posts based on The Code Book by Simon Singh, a fascinating account of the history of cryptography from ancient times until the present. While reading the book, I was struck by the number of statistical concepts that are involved in cryptography.

I will blog about cryptography once or twice a month (Tag: Ciphers). My focus will be statistical ideas in classical cryptography, from ancient Rome through World War II. Topics will include substitution ciphers, frequency analysis of letters and pairs of letters, and how to construct and solve recreational puzzles, such as the popular Cryptoquote puzzle that appears in many newspapers.

As usual, each article will be accompanied by a SAS program that demonstrates a technique in statistical programming. Even though most of us will never encounter secret messages as part of our professional work, these posts will demonstrate useful techniques for writing statistical programs in the SAS/IML language.

But enough talk! Let's get started by defining some basic terms:

  • A substitution cipher is a permutation of an alphabet. Each letter of the alphabet is transformed under the permutation into another (not necessarily different) letter, and the ordered set of permuted letters is the cipher. Enciphering is the process of replacing each letter in a text by its counterpart in the cipher. Deciphering is the process of applying the inverse permutation to an enciphered message to recover the original message.
  • The message prior to encryption is called the plaintext message. The alphabet in its natural order is the plaintext alphabet. Traditionally, plaintext is written in all lowercase letters.
  • The encrypted message is called the ciphertext. Traditionally, ciphertext is written in all uppercase letters.

Substitution ciphers and keys

The substitution cipher has been used to encipher messages since the time of Julius Caesar. Its simplicity kept it in use until the Renaissance, even though Arab mathematicians developed frequency analysis in the ninth century and the Europeans knew how to break substitution ciphers for long messages in the 1500s. Today the substitution cipher is used mainly to construct recreational puzzles.

A substitution cipher is determined by a key that, in conjunction with some algorithm, transforms a plaintext alphabet into a cipher alphabet. A special case of the substitution cipher is the simple Caesar cipher (or shift cipher), which has only 25 possible keys and is therefore susceptible to a brute force attack (just check all possible shifts). However, there are 26! ≈ 4 x 10^26 permutations of 26 letters, so the general substitution cipher is resistant to a naive brute force attack.
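
As a quick check of that count, you can compute the factorial directly by using the FACT function. This computation is mine, not part of the original program:

proc iml;
nKeys = fact(26);           /* number of permutations of a 26-letter alphabet */
print nKeys[format=E11.];   /* about 4.03E26 */
quit;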

Before the modern computer, general substitution ciphers were not used often because the agents would need to remember the complete 26-element permutation, and the process of encipherment by hand was slow and subject to error. However, if you have access to a programming language that includes a random number generator, the key can be any easily remembered seed for the generator.

Here's how a programming language like SAS/IML makes it simple to encipher a message. First, choose a random number seed as the encryption key. Then create a cipher by using the SAMPLE function to sample without replacement ("WOR") from the plaintext alphabet. For example, I might choose 432013 as the secret key because I can remember that my book Simulating Data with SAS was published on April 3, 2013. The following SAS/IML program constructs the cipher:

proc iml;
key = 432013;            
call randseed(key, 1);   /* reset random number stream; initialize with key */
alphabet = "a":"z";      /* plaintext alphabet */
cipher = upcase(sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
print (alphabet//cipher)[r={"alphabet" "cipher"}];

The output shows the cipher alphabet beneath the plaintext alphabet. For the key 432013, the letter 'a' is enciphered as 'D', 'b' goes to 'C', 'c' goes to 'Z', and so on. For this permutation, 'x' is enciphered as 'X', so not every letter is changed by the encipherment.

The program uses a trick that you might not have seen before. The RANDSEED subroutine in SAS/IML has an optional second argument. If the second argument has the value 1, then the random number generator is reset. By calling the RANDSEED subroutine in this way, you can ensure that the encipherment does not depend on the current state of the random number generator.
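
The following sketch (mine, not part of the original program) demonstrates the reset behavior: initializing the stream twice with the same key reproduces the same cipher:

call randseed(432013, 1);                  /* reset stream; initialize with key */
c1 = upcase(sample("a":"z", 26, "WOR"));
call randseed(432013, 1);                  /* reset again with the same key */
c2 = upcase(sample("a":"z", 26, "WOR"));
print (all(c1=c2))[label="Same cipher?"];  /* prints 1 */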

You can use the TRANSLATE function in Base SAS to replace every element in the plaintext alphabet by the associated letter in the enciphered alphabet. The ROWCAT function in SAS/IML concatenates the 26-element arrays into a single character string with 26 letters. The following statements encipher one of my favorite quotes by Isaac Newton:

msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
plaintext = lowcase(msg);
ciphertext = translate(plaintext, rowcat(cipher), rowcat(alphabet));
print ciphertext;

The output of the program is an enciphered message that is in the spirit of the Cryptoquote puzzle. Use this technique to construct puzzles for your puzzle-loving friends!

Modules to encipher and decipher messages

If you and your friend want to exchange messages by using a substitution cipher, you need to agree on a key and an algorithm for constructing a permutation of the alphabet. The following SAS/IML modules use the SAMPLE function in SAS/IML to construct a cipher from the key. The Encipher module applies the permutation to a message, thus encrypting it. The Decipher module applies the inverse permutation, thus decrypting the enciphered text.

start Encipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";            
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   ciphertext = translate(lowcase(msg), rowcat(cipher), rowcat(alphabet));
   return (ciphertext);
finish;
 
start Decipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";           
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   plaintext = translate(upcase(msg), rowcat(alphabet), rowcat(cipher));
   return (plaintext);
finish;
 
msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
cryptogram = Encipher(key, msg);
soln = Decipher(key, cryptogram);
print soln;

Now you and your friends can exchange enciphered messages such as "Want to meet for lunch today at eleven forty-five?" Try it out and let me know what you think. But remember not to say nasty things about your boss: the substitution cipher is not a secure means of communication, as I will demonstrate in a future blog post.


How to create a hexagonal bin plot in SAS

While I was working on my recent blog post about two-dimensional binning, a colleague asked whether I would be discussing "the new hexagonal binning method that was added to the SURVEYREG procedure in SAS/STAT 13.2." I was intrigued: I was not aware that hexagonal binning had been added to a SAS procedure!

Hexagonal binning is an alternative to the usual rectangular binning, which I blogged about last week. It has been promoted in the statistical community by Daniel Carr (JASA, 1987), who recommends it for visualizing the density of bivariate data in a scatter plot. Some research suggests that using hexagonal bins results in less visual bias than rectangular bins, especially for clouds of points that are roughly elliptical. A hexagonal bin plot is created by covering the data range with a regular array of hexagons and coloring each hexagon according to the number of observations it covers. As with all bin plots, the hex-binned plots are good for visualizing large data sets for which a scatter plot would suffer from overplotting. The bin counts estimate the density of the observations. (In SAS, you can also use PROC KDE to compute a kernel density estimate.)

When I looked at the documentation for the SURVEYREG procedure in SAS/STAT 13.2, I found the feature that my friend mentioned. The PLOTS= option on the PROC SURVEYREG statement supports creating a plot that overlays a regression line on a hex-binned heat map of two-dimensional data. Although there is no option in PROC SURVEYREG to remove the regression line, you can still use the procedure to output the counts in each hexagonal bin.

Creating a heat map of counts for hexagonal bins

You can obtain the counts and the vertices of the hexagonal bins by using a trick that I blogged about a few years ago: Use ODS to create a SAS data set that contains the data underlying the graph. You can then use the POLYGON statement in PROC SGPLOT to create a hex-binned plot of the counts.

Because I want to create several hexagonal heat maps, I will wrap these two steps into a single macro call:

/* Use hexagonal binning to estimate the bivariate density.
   Requires SAS/STAT 13.2 (SAS 9.4m2) */
%macro HexBin(dsName, xName, yName, nBins=20, colorramp=TwoColorRamp);
  ods select none;
  ods output fitplot=_HexMap;  /* write graph data to a data set */
  proc surveyreg data=&dsname plots(nbins=&nBins weight=heatmap)=fit(shape=hex);
     model &yName = &xName;
  run;
  ods select all;
 
  proc sgplot data=_HexMap;
     polygon x=XVar y=YVar ID=hID / colorresponse=WVar fill 
                                    colormodel=&colorramp;
  run;
%mend;

The %HexBin macro calls PROC SURVEYREG to create the heat map with hexagonal binning. However, the graph is not displayed on the screen, but is redirected to a data set. The data set is in a format that can be directly rendered by using the POLYGON statement in PROC SGPLOT.

To run the macro, you have to specify a data set name, the name of an X variable, and the name of a Y variable. By default, the data are binned into approximately 20 bins in both directions. You can control the number of bins by using the NBINS= option. Unlike for rectangular bins, you don't get 400 hexagons when you specify NBINS=20. Instead, the procedure computes the size of the hexagons so that they "have approximately the same area as the same number of rectangular bins would have." However, as you would expect, a large value for the NBINS= option results in many small bins, whereas a small value results in a few large bins.

By default, a two-color color ramp is used to visualize the counts, but you can specify other color ramps by using the COLORRAMP= option. Inside the macro, I chose not to use the OUTLINE option on the POLYGON statement. If you prefer to see outlines for the hexagons, feel free to modify the macro (or just tack the OUTLINE keyword onto the end of the COLORRAMP= argument, which is expanded inside the POLYGON statement).
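
For example, with outlines the POLYGON statement inside the macro would read as follows (a one-keyword change):

polygon x=XVar y=YVar ID=hID / colorresponse=WVar fill outline
                               colormodel=&colorramp;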

The following statement calls the %HexBin macro to create a hexagonal bin plot of the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set. My previous blog post shows a heat map of the same data by using rectangular bins.

title "Birth Weight vs. Mother's Weight gain";
%HexBin(sashelp.bweight, MomWtGain, Weight);

The hex-binned heat map reminds me of the effect of using transparency to overcome overplotting. You see a faint light-blue haze where the density of points is low. Dark colors indicate regions for which the density of points is high (more than 3,000 points per bin, for this example).

A fun set of data to visualize is the sashelp.Zipcode data because if you plot the longitude and latitude variables, the hexagons create a pixelated version of the US! Bins with high counts indicate regions for which there are many zip codes packed closely together, such as major metropolitan areas. Rather than use the default color ramp, the following example uses the PALETTE function in SAS/IML software to create a yellow-orange-red color ramp. Those colors are pasted into the COLORRAMP= option; be sure to use parentheses if you specify your own colors for a color ramp. I thought it would be appropriate to use hex values to specify the colors for the hexagons, but you can also define a color ramp by using color names such as colorramp=(lightblue white lightred red).

proc iml;
   c = palette("YLORRD", 5); print c;
quit;
 
data zips;
set sashelp.Zipcode(where=(State<=56 & State^=2 & State^=15 & X<0));
run;
 
title "Density of US Zip Codes";
%HexBin(zips, X, Y, nBins=51,
        colorramp=(CXFFFFB2 CXFECC5C CXFD8D3C CXF03B20 CXBD0026));

The heat map of hex-binned counts shows "hot spots" of density along the East Coast (Washington/Baltimore, Philadelphia, New York), along the West Coast (Los Angeles, the Bay Area, Seattle), and at various other major cities (Denver, Dallas, Chicago, ...). Chicago is also visible near (X,Y)=(-88, 42), although at this scale of binning Lake Michigan is not visible as a collection of empty hexagons, which makes it hard to locate midwest cities.

This hex-binned heat map can be a useful alternative to a scatter plot when the scatter plot suffers from overplotting or when you want to estimate the density of the observations. It requires SAS/STAT 13.2. Between hexagonal binning, the rectangular bin plot from my previous blog post, and PROC KDE, there are now multiple ways in SAS to visualize the density of bivariate data.


Counting observations in two-dimensional bins

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of observations in each bin. The BIN function associates a bin to each observation, and the TABULATE subroutine computes the count for each bin. I have previously blogged about how to use this technique to count the observations in bins of equal or unequal width.
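
For example, here is a minimal sketch of the 1-D technique; the data and cutpoints are hypothetical:

proc iml;
x = {0.5, 1.2, 2.2, 2.7, 3.1, 3.4, 4.8};  /* hypothetical data */
cutpoints = do(0, 5, 1);                  /* bins [0,1), [1,2), ..., [4,5] */
b = bin(x, cutpoints);                    /* bin number for each observation */
call tabulate(binNum, counts, b);         /* count observations in each bin */
print (binNum // counts)[r={"Bin" "Count"}];
quit;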


You can also count the number of observations in two-dimensional bins. Specifically, you can divide the X direction into kx bins, divide the Y direction into ky bins, and count the number of observations in each of the kx × ky rectangles. In a previous article, I described how to bin 2-D data by using the SAS/IML language and produced a scatter plot of the binned data. The graph displays the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set.

I recently remembered that you can also count observations in 2-D bins by using the KDE procedure. The BIVAR statement in the KDE procedure supports an OUT= option, which writes a SAS data set that contains the counts in each bin. You can specify the number of bins in each direction by using the NGRID= option on the BIVAR statement, as shown by the following statements:

ods select none;
proc kde data=sashelp.bweight;
   bivar MomWtGain(ngrid=11) Weight(ngrid=7) / out=KDEOut;
run;
ods select all;
 
proc print data=KDEOut(obs=10 drop=density);
run;

As shown in the output, the VALUE1 and VALUE2 variables contain the coordinates of the center of each bin. The COUNT variable contains the bin count. You can read the counts into SAS/IML and print them as a matrix. In the data set, the Y variable changes the fastest as you go down the rows. Consequently, you need to use the SHAPECOL function rather than the SHAPE function to form the 7 x 11 matrix of counts:

proc iml;
kX = 11; kY = 7;
use KDEOut; read all var {Value1 Value2 Count}; close;
M = shapecol(Count, kY, kX);       /* reshape into kY x kX matrix */
/* create labels for cells in X and Y directions */
idx = do(1,nrow(Value1),kY);        /* location of X labels */
labelX = putn(Value1[idx], "5.1");
labelY = putn(Value2[1:kY], "5.0"); 
print M[c=labelX r=labelY];

As I explained in Monday's blog post, there are two standard ways to choose bins for a histogram: the "midpoints" method and the "endpoints" method. If you compare the bin counts from PROC KDE to the bin counts from my SAS/IML program, you will see small differences. That is because the KDE procedure uses the "midpoints" algorithm for subdividing the data range, whereas I used the "endpoints" algorithm for my SAS/IML program.

Last week I showed how to use heat maps to visualize matrices in SAS/IML. This matrix of counts is begging to be visualized with a heat map. Because the counts vary across several orders of magnitude (from 10^0 to more than 10^4), a linear color ramp will not be an effective way to visualize the raw counts. Instead, transform the counts to a log scale and create a heat map of the log-counts:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="ThreeColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";

The heat map shows the log-count of each bin. If you prefer a light-to-dark heat map for the density, use the "TwoColor" value for the COLORRAMP= option, as shown below. The heat map is a great way to see the distribution of counts at a glance. It enables you to see the approximate values of most cells, and you can easily determine cells where there are many or few observations. Of course, if you want to know the exact value of the count in each rectangular cell, look at the tabular output.
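
For example, the following call is the same statement with the color ramp swapped:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="TwoColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";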


Choosing bins for histograms in SAS

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing "round numbers" for the location and width of bins. Common "round" bin widths include 1, 2, 2.5, and 5, as well as these numbers multiplied by a power of 10.

The default bin width and locations tend to work well for 95% of the data that I plot, but sometimes I decide to override the default choices. This article describes how to set the width and location of bins in histograms that are created by the UNIVARIATE and SGPLOT procedures in SAS.

Why override the default bin locations?

The most common reason to override the default bin locations is because the data have special properties. For example, sometimes the data are measured in units for which the common "round numbers" are not optimal:

  • For a histogram of time measured in minutes, a bin width of 60 is a better choice than a width of 50. Bin widths of 15 and 30 are also useful.
  • For a histogram of time measured in hours, 6, 12, and 24 are good bin widths.
  • For days, a bin width of 7 is a good choice.
  • For a histogram of age (or other values that are rounded to integers), the bins should align with integers.

You might also want to override the default bin locations when you know that the data come from a bounded distribution. If you are plotting a positive quantity, you might want to force the histogram to use 0 as the leftmost endpoint. If you are plotting percentages, you might want to force the histogram to choose 100 as the rightmost endpoint.

To illustrate these situations, let's manufacture some data with special properties. The following DATA step creates two variables. The T variable represents time measured in minutes. The program generates times that are normally distributed with a mean of 120 minutes, then rounds these times to the nearest five-minute mark. The U variable represents a proportion between 0 and 1; it is uniformly distributed and rounded to two decimal places.

data Hist(drop=i);
label T = "Time (minutes)" U = "Uniform";
call streaminit(1);
do i = 1 to 100;
   T = rand("Normal", 120, 30); /* normal with mean 120  */
   T = round(T, 5);             /* round to nearest five-minute mark */
   U = rand("Uniform");         /* uniform on (0,1) */
   U = floor(100*U) / 100;      /* round down to nearest 0.01 */
   output;
end;
run;

How do we control the location of histogram bins in SAS? Read on!

Custom bins with PROC UNIVARIATE: An example of a time variable

I create histograms with PROC UNIVARIATE when I am interested in also computing descriptive statistics such as means and quantiles, or when I want to fit a parametric distribution to the data. The following statements create the default histogram for the time variable, T:

title "Time Data (N=100)";
ods select histogram(PERSIST);  /* show ONLY the histogram until further notice */
proc univariate data=Hist;
histogram T / odstitle=title odstitle2="Default Bins";
run;

The default bin width is 20 minutes, which is not horrible, but not as convenient as 15 or 30 minutes. The first bin is centered at 70 minutes; a better choice would be 60 minutes.

The HISTOGRAM statement in PROC UNIVARIATE supports two options for specifying the locations of bins. The ENDPOINTS= option specifies the endpoints of the bins; the MIDPOINTS= option specifies the midpoints of the bins. The following statements use these options to create two customized histograms for which the bin widths are 30 minutes:

proc univariate data=Hist;
histogram T / midpoints=(60 to 210 by 30)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram T / endpoints=(60 to 210 by 30)  odstitle=title odstitle2="Endpoints";
run;

The histogram on the left has bins that are centered at 30-minute intervals. This histogram makes it easy to estimate that about 40 observations have times of approximately 120 minutes. The counts for other half-hour increments are similarly easy to estimate. In contrast, the histogram on the right has bins whose endpoints are 60, 90, 120,... minutes. With this histogram, it is easy to see that about 35 observations have times that are between 90 and 120 minutes. Similarly, you can estimate the number of observations that are greater than three hours or less than 90 minutes.

Both histograms are equally correct. The one you choose should depend on the questions that you want to ask about the data. Use midpoints if you want to know how many observations have a value; use endpoints if you want to know how many observations are between two values.

If you run the SAS statements that create the histogram on the right, you will see the warning message
WARNING: The ENDPOINTS= list was extended to accommodate the data.
This message informs you that you specified the last endpoint as 210, but that additional bins were created to display all of the data.

Custom bins for a bounded variable

As mentioned earlier, if you know that values are constrained within some interval, you might want to choose histogram bins that incorporate that knowledge. The U variable has values that are in the interval [0,1), but of course PROC UNIVARIATE does not know that. The following statements create a histogram of the U variable with the default bin locations:

title "Bounded Data (N=100)";
proc univariate data=Hist;
histogram U / odstitle=title odstitle2="Default Bins";
run;

The default histogram shows seven bins with a bin width of 0.15. From a statistical point of view, this is an adequate histogram. The histogram indicates that the data are uniformly distributed and, although it is not obvious, the left endpoint of the first bin is at 0. However, from a "human readable" perspective, this histogram can be improved. The following statements use the MIDPOINTS= and ENDPOINTS= options to create histograms that have bin widths of 0.2 units:

proc univariate data=Hist;
histogram U / midpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram U / endpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Endpoints";
run;
ods select all;  /* turn off PERSIST; restore normal output */

The histogram on the left is not optimal for these data. Because we created uniformly distributed data in [0,1], we know that the expected count in the leftmost bin (which is centered at 0) is half the expected count of an inner bin. Similarly, the expected count in the rightmost bin (which is centered at 1) is half the count of an inner bin because no value can exceed 1. Consequently, this choice of midpoints is not very good. For these data, the histogram on the right is better at revealing that the data are uniformly distributed and are within the interval [0,1).

Custom bins with PROC SGPLOT

If you do not need the statistical power of the UNIVARIATE procedure, you might choose to create histograms with PROC SGPLOT. The SGPLOT procedure supports the BINWIDTH= and BINSTART= options on the HISTOGRAM statement. The BINWIDTH= option specifies the width for the bins. The BINSTART= option specifies the center of the first bin.

I recommend that you specify both the BINWIDTH= and BINSTART= options, and that you choose the bin width first. Be aware that not all specifications result in a valid histogram. If you make a mistake when specifying the bins, you might get the following warning:
WARNING: The specified BINWIDTH= value will be ignored in order to accommodate the data.
That message usually means that the minimum value of the data was not contained in a bin. For a bin width of h, the BINSTART= value must be less than xmin + h/2, where xmin is the minimum value of the data.

By default, the axis does not show a tick mark for every bin, but you can force that behavior by using the SHOWBINS option. The following statements call the SGPLOT procedure to create histograms for the time-like variable, T. The results are again similar to the custom histograms that are shown in the previous section:

title "Time Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=60 showbins; /* center first bin at 60 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=52.5;  /* 52.5 = 45 + h/2 */
xaxis values=(45 to 180 by 15);           /* alternative way to place ticks */
run;

The following statements call the SGPLOT procedure to create histograms for the bounded variable, U. The results are similar to those created by the UNIVARIATE procedure:

title "Bounded Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0 showbins;  /* center first bin at 0 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.1;      /* center first bin at h/2 */
xaxis values=(0 to 1 by 0.2);       
run;

In summary, for most data the default bin width and location result in a histogram that is both statistically useful and easy to read. However, the default choices can lead to a less-than-optimal visualization if the data have special properties, such as being time intervals or being bounded. In those cases, it makes sense to choose a bin width and a location for the first bin that reveal your data's special properties. For the UNIVARIATE procedure, use the MIDPOINTS= or ENDPOINTS= options on the HISTOGRAM statement. For the SGPLOT procedure, use the BINWIDTH= and BINSTART= options to create a histogram with custom bins.


Analyzing activity-tracker data: How many steps per day do YOU take?

My wife got one of those electronic activity trackers a few months ago and has been diligently walking every day since then. At the end of the day she sometimes reads off how many steps she walked, as measured by her activity tracker. I am always impressed at how many steps she racks up during the day through a combination of regular activity and scheduled walks.

There has been a lot written about the accuracy of electronic activity trackers. Personally, I don't worry about accuracy as long as the errors are in the 10% range. The purpose of these tools is to give people an idea of how much they are moving and to encourage them to get off the couch. Whether someone walked 4.2 miles or 3.8 isn't as important as the fact that she walked about 4 miles.

Because my wife's daily numbers seem so impressive, I decided to download some data from other people who use the same device. The device can be linked to a person's Twitter account and programmed to tweet a summary of the person's activity level each day. The Tweets all have a common hashtag and format, so it is easy to download a few hundred tweets and prepare them for data analysis. In an act of statistical voyeurism, I present a descriptive analysis of the activity level of 231 random people who use a particular brand of activity tracker, as reported by their tracker for Sunday, August 17, 2014. You can download the SAS program that analyzes these data.

Distribution of steps

The tracker records steps. The histogram shows the distribution of the number of steps taken by the 231 subjects. A kernel density estimate is overlaid, and various quantiles are displayed in the inset. The histogram shows that about 25% of the people in the sample walked fewer than 4,000 steps, and about half of the people walked fewer than 7,100 steps. About a quarter of the people walked more than 11,600 steps, and the I-am-very-active award goes to those people who walked more than 18,000 steps—they are the upper 95th percentile of this sample.

The tail of the distribution falls off rapidly, which means that there is a low probability of finding someone who walks more than 30,000 steps per day. In other words, this is not a fat-tailed distribution. On the contrary, a person in the upper percentiles is working her tail off!

How many steps does it take to walk a mile?


The tracker also reports distance in miles. The basic device uses an algorithm to convert steps into an approximate distance so, as they say, your distance may vary (nyuck-nyuck!). The device does not know your exact stride length.

The scatter plot shows the relationship between distance and steps. For people who take many steps, there is substantial variation in the reported distance. Probably some of those people were running or jogging, which changes the length of the stride. The line indicates the predicted distance for a given number of steps, as calculated by a robust regression algorithm.

These data can help to answer the question, "How many steps does it take to walk a mile?" Based on these data, it takes an average person 2,272 steps to walk a mile. Of course, shorter people will require more steps and taller people fewer, and there is the whole debate about how accurate these trackers are at estimating distance. Nevertheless, 2,272 steps is a good estimate for a mile. For a simpler number, you can estimate that 10,000 steps is about four miles.

These data also enable you to estimate the length of the average stride: 5,280 feet per mile divided by 2,272 steps per mile gives 2.32 feet, or about 71 centimeters per step.

What do you think of these data? If you use a fitness tracking device, how many steps do you take each day? Leave a comment.


Creating heat maps in SAS/IML

In a previous blog post, I showed how to use the graph template language (GTL) in SAS to create heat maps with a continuous color ramp. SAS/IML 13.1 includes the HEATMAPCONT subroutine, which makes it easy to create heat maps with continuous color ramps from SAS/IML matrices. Typical usage includes creating heat maps of data matrices, correlation matrices, or covariance matrices. A more advanced usage is using matrices to visualize the results of computational algorithms or computer experiments.

Data matrices

Heat maps provide a convenient way to visualize many variables and/or many observations at a single glance. For example, suppose that you want to compare and contrast the 24 trucks in the Sashelp.Cars data by using the 10 numerical variables in the data set. The following SAS/IML statements read the data into a matrix and sort the matrix according to the value of the Invoice variable (that is, by price). The Model variable, which identifies each truck, is sorted by using the same criterion:

proc iml;
use Sashelp.Cars where(type="Truck");
read all var _NUM_ into Trucks[c=varNames r=Model];
close Sashelp.Cars;
 
call sortndx(idx, Trucks[,"Invoice"]);     /* find indices that sort the data */
Trucks = Trucks[idx,]; Model = Model[idx]; /* sort the data and the models    */

The original variables are all measured on different scales. For example, the Invoice variable has a mean value of $23,000, whereas the EngineSize variable has a mean value of 4 liters. Most of the values in the matrix are less than 300. Consequently, a heat map of the raw values would not be illuminating: The price variables would be displayed as dark cells (high values) and the rest of the matrix would be almost uniformly light (small values, relative to the prices). Fortunately, the HEATMAPCONT subroutine supports the SCALE="Col" option to standardize each variable to have mean 0 and unit variance. With that option, a heat map reveals the high, middle, and low values of each variable:

call HeatmapCont(Trucks) scale="Col"; /* standardize cols */

The HEATMAPCONT output shows the following default values:

  • The default color ramp is the 'TwoColor' color ramp. With my ODS style, white is used to display small values and dark blue is used to display large values. The color ramp indicates that the standardized values are approximately in the range [-2, 3], which probably indicates that several variables have positive skewness.
  • The axes are labeled by using the row and column numbers of the 24 x 10 matrix.
  • The Y axis "points down." In other words, the heat map displays the matrix as it would be printed: the top of the heat map displays the first few rows and the bottom of the heat map displays the last few rows.

A problem with this initial heat map is that we don't know which trucks correspond to the rows, nor do we know which variables correspond to the columns. You can use the XVALUES= and YVALUES= options to add that information to the heat map. Also, let's use a three-color color ramp and center the range of the color ramp so that white represents the mean value for each variable and red (blue) represents positive (negative) deviations from the mean. The following statements create the new heat map:

call HeatmapCont(Trucks) scale="Col"           /* standardize cols    */
     xvalues=varNames yvalues=Model            /* label rows and cols */
     colorramp="ThreeColor" range={-3 3}       /* color range [-3,3]  */
     legendtitle = "Standardized values" title="Truck Data";

This visualization of the data enables us to draw several conclusions. In general, expensive trucks are powerful (large values of EngineSize, Cylinders, and Horsepower), large in size (large values of Weight, Wheelbase, and Length), and get poor fuel economy (small values of MPG_City and MPG_Highway). However, two vehicles seem unusual. Compared with similarly priced trucks, the Baja is smaller and more fuel efficient. Compared with similarly priced trucks, the Tundra is more powerful and larger in size.

Correlation matrices

Because the data matrix is sorted by price, it looks like price variables are positively correlated with all variables except for the fuel economy variables. Let's see if that is the case by computing the correlation matrix and displaying it as a heat map:

corr = corr(Trucks);
call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data";

The correlation matrix confirms our initial impression. The price variables are strongly correlated with the variables that indicate the power and size of the trucks. The price is negatively correlated with the fuel efficiency variables.

One of the advantages of computing a correlation analysis in the SAS/IML language is that you can easily change the order of the variables. For example, the following statements move the fuel-efficiency variables to the end, so that all of the positively correlated variables are grouped together:

newOrder = (1:5) || (8:10) || (6:7);
corr = corr[newOrder, newOrder];
varNames = varNames[newOrder];
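
You can then redisplay the heat map of the reordered matrix by reusing the earlier call (only the title, which is mine, differs):

call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data, Reordered";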

Can you think of other applications of visualizing matrices by using the HEATMAPCONT subroutine? Leave a comment.
