An exploratory technique for visualizing the distributions of 100 variables

In a previous blog post I showed how to order a set of variables by a statistic. After reshaping the data, you can create a graph that contains box plots for many variables. Ordering the variables by some statistic (mean, median, variance, ...) makes it easier to compare and distinguish the variables.

You can use this as an exploratory technique. Suppose that you are given a new set of data that has 100 variables. Before you begin to analyze these variables, it is useful to visualize their distributions. Are the data normal or skewed? Are there outliers for some variables? You could use PROC UNIVARIATE to create 100 histograms and 100 sets of descriptive statistics, but it would be tedious to slog through hundreds of pages of output, and difficult to use the histograms to compare the variables.

In contrast, you can fit 100 box plots on a single graph. The resulting graph enables you to see all 100 distributions at a glance and to compare the distributions to each other. If the variables are measured on vastly different scales, you might want to standardize the variables so that the box plots are comparable in scale.

Creating 100 variables

As an example, I will simulate 100 variables with 1,000 observations in each variable. The following SAS/IML statements loop over each column of a matrix X. The SAMPLE function randomly chooses a distribution (normal, lognormal, exponential, or uniform) for the data in the column. I use the RANDFUN function to generate the data for each column. The 100 variables are then written to a SAS data set and given the names X1–X100.

/* create 100 variables with random data from a set of distributions */
proc iml;
N = 1000;  p = 100;
distrib = {"Normal" "Lognormal" "Exponential" "Uniform"};
call randseed(1);
 
/* each column of X is from a random distribution */
X = j(N,p);
do i = 1 to p;
   X[,i] = randfun(N, sample(distrib,1));
end;
 
varNames = "x1":("x"+strip(char(p)));
create ManyVars from X[colname=varNames]; append from X; close;
quit;

Visualizing the 100 distributions

Suppose that you are asked to visualize the distributions in the ManyVars data set. Previously I showed how to use Base SAS to standardize the variables, reshape the data, and create the box plots. The following statements show how to implement the technique by using SAS/IML software. For no particular reason, I order these variables by the third quartile, rather than by the median:

proc iml;
use ManyVars;
read all var _NUM_ into X[colname=varNames];
close ManyVars;
N = nrow(X);
 
MinX = X[><, ]; MaxX = X[<>, ];
X = (X-MinX)/ (MaxX-MinX);       /* standardize vars into [0,1] */
call qntl(q3, X, 0.75);          /* compute 3rd quartile */
 
r = rank(q3);                    /* sort variables by that statistic */
temp = X;        X[,r] = temp;
temp = varNames; varNames[,r] = temp;
 
_Value_ = shapecol(X,0,1);      /* convert from wide form to long form */
varName = shapecol(repeat(varNames, N),0,1);
 
title "Variables ordered by Q3";          /* plot the distributions */
ods graphics / width=2400 height=600;
call box(_Value_) category=varName 
     other="xaxis discreteorder=data display=(nolabel);";
[Figure: box plots of the 100 standardized variables, ordered by the third quartile]

The result is shown above; click to enlarge. You can see the four different distributional shapes, although you have to look closely to differentiate the lognormal variables (which end with variable X83) and the exponential variables (which begin with variable X41). The changes in shape between the normal data (X69) and the uniform data (X99) are apparent. Also, notice that the skewed distributions have mean larger than the median, whereas for the symmetric distributions the sample means and the medians are approximately equal. The graph provides a quick-and-dirty view of the data, which is always useful when you encounter new data.

I standardized the variables into the interval [0,1], but other standardizations are possible. For example, to standardize the variable to mean zero and unit variance, use the statement X = (X-mean(X))/ std(X).

The SAS/IML program is remarkable for its compactness, especially when compared to the Base SAS implementation. Three statements are used to standardize the data, but I could have inlined the computations into a single statement. One statement is used to compute the quantiles. Reordering the variables is accomplished by using "the rank trick". The SHAPECOL function makes it easy to convert the data from wide form to long form. Finally, the BOX call makes it easy to create the box plot without even leaving the SAS/IML environment. These statements could be easily encapsulated into a single SAS/IML module that could be reused to visualize other data sets.
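
For example, a module might look something like the following. This is a sketch of one possible encapsulation, not a packaged version; it assumes a data matrix X and a 1 x p row vector of variable names, and it reuses only statements from the program above:

proc iml;
start BoxPlotByQ3(X, varNames);
   Z = (X - X[><,]) / (X[<>,] - X[><,]);   /* standardize columns into [0,1] */
   call qntl(q3, Z, 0.75);                 /* compute the ordering statistic */
   r = rank(q3);                           /* reorder columns by "the rank trick" */
   temp = Z;         Z[,r] = temp;
   names = varNames; names[,r] = varNames;
   _Value_ = shapecol(Z, 0, 1);            /* convert from wide form to long form */
   varName = shapecol(repeat(names, nrow(Z)), 0, 1);
   call box(_Value_) category=varName
        other="xaxis discreteorder=data display=(nolabel);";
finish;
 
/* usage, assuming X and varNames are already defined:
   run BoxPlotByQ3(X, varNames);  */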

I think the visualization technique is powerful, although it is certainly not a panacea. When confronted with new data, this technique can help you to quickly learn about the distribution of the variables.

What do you think? Would you use this visualization as one step in an exploratory analysis? What are some of the strengths and weaknesses of this technique? Leave a comment.


Order variables by values of a statistic

[Figure: box plots of 10 standardized variables, ordered by median]

When I create a graph of data that contains a categorical variable, I rarely want to display the categories in alphabetical order. For example, the box plot to the left is a plot of 10 standardized variables where the variables are ordered by their median value. The ordering makes it apparent that the variables on the left (Invoice, MSRP) are highly skewed with many outliers, whereas the rightmost variable (Length) appears to be symmetric and perhaps normally distributed.

When you use the SG procedures in SAS, there are several ways to change the order in which categories appear in a graph. One extremely powerful method is to use the DISCRETEORDER=DATA option on the XAXIS statement of the SGPLOT procedure. The DISCRETEORDER=DATA option tells the SGPLOT procedure to display the categories in the order in which they appear in the data set. In other words, if you can figure out some way to order the observations (DATA step, PROC SORT, PROC SQL, ...), then you will see that same order when you create a graph of the data.

To illustrate this idea, let's create the box plot at the top of this article. The data are standardized versions of the numerical variables in the Sashelp.Cars data set. The statistic of interest is the median. The following statements create the data and compute the medians:

%let DSName = sashelp.cars;
%let Stat = Median; /* or mean, stddev, qrange, skew, etc */
/* standardize variables to [0,1] */
proc stdize method=range data=&DSName out=MyData; run;
 
proc means data=MyData &Stat STACKODSOUTPUT; 
ods output Summary=StatOut;
run;
[Table: PROC MEANS output showing the median of each variable]

The ODS OUTPUT statement is used to create a data set with the median statistic for each variable. The STACKODSOUTPUT option is used so that the output data set has the same structure as the printed table.

The next step is to get the list of variables ordered by the median statistic. My favorite method is to use PROC SQL's useful SELECT INTO :MACROVAR syntax. I thank my colleague Gordon for reminding me that I do not need to use PROC SORT to sort the variables; I can use the ORDER BY clause in PROC SQL instead.

/* put ordered variable names into macro variable */
proc sql noprint;                              
 select Variable into :varlist separated by ' '
 from StatOut order by &Stat;
quit;

The macro variable &VARLIST now contains a list of the variables, ordered by their medians. In order to create a single graph that has multiple box plots, transpose the data from wide form (many variables) to long form (two variables that specify the variable name and value). The following statements add an "observation number" variable to the data and use the RETAIN statement to reorder the variables. Then PROC TRANSPOSE converts the data set from wide form to long form. In the Long data set, the variable names are ordered by the value of the statistic. (Alternatively, you could merge the statistics and the original data and then sort the merged data by the statistic and the observation number; a sketch of that alternative appears after the following steps.)

data Wide / view=Wide;     
retain &varlist;           /* reorder vars by statistic */
set MyData;
obsNum = _N_;              /* add ID for subject (observation) */
keep obsNum &varlist;
run;
 
/* transpose from wide to long data format; VARNAME is a categorical var */
proc transpose data=Wide name=VarName 
               out=Long(rename=(Col1=_Value_) drop=_LABEL_);
by obsNum;
run;
[Table: the first observations of the Long data set]

Data in this form is ideal for graphing by using SGPLOT statements that support a GROUP= or CATEGORY= option. To create the graph at the beginning of this article, use the VBOX statement with the CATEGORY= option. If you also use the DISCRETEORDER=DATA option on the XAXIS statement, the box plots are ordered by their median values, as shown in the following:

title "Box Plots of Standardized Variables, Ordered by &Stat";
proc sgplot data=Long;
   label _Value_ = "Standardized Value" VarName="Variable";
   vbox _Value_ / category=VarName;
   xaxis discreteorder=data display=(nolabel);        /* respect ordering */
run;
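
For completeness, here is a sketch of the alternative approach that was mentioned earlier: join the statistic onto the long data and sort by the statistic and the observation number. (The Long2 data set and the StatVal variable are names that I made up for this sketch.)

proc sql;
create table Long2 as            /* long data, ordered by the statistic */
select L.obsNum, L.VarName, L._Value_, S.&Stat as StatVal
from Long as L inner join StatOut as S
     on L.VarName = S.Variable
order by StatVal, L.obsNum;
quit;

You can then use Long2 in place of Long in the PROC SGPLOT step.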

Obviously, you can use this same technique to create box plots that are ordered by other statistics.
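
For example, to order the box plots by the interquartile range, you might need to change only the macro variable at the top of the program (a sketch; QRANGE is the PROC MEANS keyword for the interquartile range, and the column name in the StatOut data set should match it):

%let Stat = QRange;     /* order by interquartile range instead of median */

Then rerun the remaining steps without any other changes.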

What technique do you use to order groups or categories in a graph? Can you think of applications for which this technique would be useful? Leave a comment.


Ciphers, keys, and cryptoquotes

Today is my fourth blog-iversary: the anniversary of my first blog post in 2010. To celebrate, I am going to write a series of fun posts based on The Code Book by Simon Singh, a fascinating account of the history of cryptography from ancient times until the present. While reading the book, I was struck by the number of statistical concepts that are involved in cryptography.

I will blog about cryptography once or twice a month (Tag: Ciphers). My focus will be statistical ideas in classical cryptography, from ancient Rome through World War II. Topics will include substitution ciphers, frequency analysis of letters and pairs of letters, and how to construct and solve recreational puzzles, such as the popular Cryptoquote puzzle that appears in many newspapers.

As usual, each article will be accompanied by a SAS program that demonstrates a technique in statistical programming. Even though most of us will never encounter secret messages as part of our professional work, these posts will demonstrate useful techniques for writing statistical programs in the SAS/IML language.

But enough talk! Let's get started by defining some basic terms:

  • A substitution cipher is a permutation of an alphabet. Each letter of the alphabet is transformed under the permutation into another (not necessarily different) letter, and the ordered set of permuted letters is the cipher. Enciphering is the process of replacing each letter in a text by its counterpart in the cipher. Deciphering is the process of applying the inverse permutation to an enciphered message to recover the original message.
  • The message prior to encryption is called the plaintext message. The alphabet in its natural order is the plaintext alphabet. Traditionally, plaintext is written in all lowercase letters.
  • The encrypted message is called the ciphertext. Traditionally, ciphertext is written in all uppercase letters.

Substitution ciphers and keys

The substitution cipher has been used to encipher messages since the time of Julius Caesar. Its simplicity kept it in use until the Renaissance, even though Arab mathematicians developed frequency analysis in the ninth century and the Europeans knew how to break substitution ciphers for long messages in the 1500s. Today the substitution cipher is used mainly to construct recreational puzzles.

A substitution cipher is determined by a key that, in conjunction with some algorithm, transforms a plaintext alphabet into a cipher alphabet. A special case of the substitution cipher is the simple Caesar cipher (or shift cipher), which has only 25 possible keys and is therefore susceptible to a brute-force attack (just check all possible shifts). However, there are 26! ≈ 4 x 10^26 permutations of 26 letters, so the general substitution cipher is resistant to a naive brute-force attack.
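
To make the shift cipher concrete, here is a minimal sketch in SAS/IML (my own illustration; a shift of 3 is the key traditionally attributed to Caesar):

proc iml;
plain = "abcdefghijklmnopqrstuvwxyz";
shift = 3;                       /* the key: one of 25 possible nonzero shifts */
/* rotate the alphabet by the shift; '+' concatenates strings in SAS/IML */
cipher = upcase( substr(plain, shift+1, 26-shift) + substr(plain, 1, shift) );
msg = "attack at dawn";
ciphertext = translate(msg, cipher, plain);
print ciphertext;
quit;

A brute-force attack simply tries all 25 nonzero shifts and inspects each candidate decipherment for readable text.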

Before the modern computer, general substitution ciphers were not used often because the agents would need to remember the complete 26-element permutation, and the process of encipherment by hand was slow and subject to error. However, if you have access to a programming language that includes a random number generator, the key can be any easily remembered seed for the generator.

Here's how a programming language like SAS/IML makes it simple to encipher a message. First, choose a random number seed as the encryption key. Then create a cipher by using the SAMPLE function to sample without replacement ("WOR") from the plaintext alphabet. For example, I might choose 432013 as the secret key because I can remember that my book Simulating Data with SAS was published on April 3, 2013. The following SAS/IML program constructs the cipher:

proc iml;
key = 432013;            
call randseed(key, 1);   /* reset random number stream; initialize with key */
alphabet = "a":"z";      /* plaintext alphabet */
cipher = upcase(sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
print (alphabet//cipher)[r={"alphabet" "cipher"}];
[Table: the plaintext alphabet and the cipher alphabet]

The output shows the cipher alphabet beneath the plaintext alphabet. For the key 432013, the letter 'a' is enciphered as 'D', 'b' goes to 'C', 'c' goes to 'Z', and so on. For this permutation, 'x' is enciphered as 'X', so not every letter is changed by the encipherment.

The program uses a trick that you might not have seen before. The RANDSEED subroutine in SAS/IML has an optional second argument. If the second argument has the value 1, then the random number generator is reset. By calling the RANDSEED subroutine in this way, you can ensure that the encipherment does not depend on the current state of the random number generator.

You can use the TRANSLATE function in Base SAS to replace every letter in a message by the associated letter in the cipher alphabet. The ROWCAT function in SAS/IML concatenates each 26-element array into a single character string with 26 letters. The following statements encipher one of my favorite quotes by Isaac Newton:

msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
plaintext = lowcase(msg);
ciphertext = translate(plaintext, rowcat(cipher), rowcat(alphabet));
print ciphertext;
[Output: the enciphered quote]

The output of the program is an enciphered message that is in the spirit of the Cryptoquote puzzle. Use this technique to construct puzzles for your puzzle-loving friends!

Modules to encipher and decipher messages

If you and your friend want to exchange messages by using a substitution cipher, you need to agree on a key and an algorithm for constructing a permutation of the alphabet. The following SAS/IML modules use the SAMPLE function in SAS/IML to construct a cipher from the key. The Encipher module applies the permutation to a message, thus encrypting it. The Decipher module applies the inverse permutation, thus decrypting the enciphered text.

start Encipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";            
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   ciphertext = translate(lowcase(msg), rowcat(cipher), rowcat(alphabet));
   return (ciphertext);
finish;
 
start Decipher(key, msg);
   call randseed(key, 1);         /* reset and initialize random number stream */
   alphabet = "a":"z";           
   cipher = upcase( sample(alphabet, 26, "WOR") );  /* permutation of alphabet */
   plaintext = translate(upcase(msg), rowcat(alphabet), rowcat(cipher));
   return (plaintext);
finish;
 
msg = {"If I have seen further than others, ",
       "it is by standing upon the ",
       "shoulders of giants. ~Isaac Newton"};
cryptogram = Encipher(key, msg);
soln = Decipher(key, cryptogram);
print soln;
[Output: the deciphered message]

Now you and your friends can exchange enciphered messages such as "Want to meet for lunch today at eleven forty-five?" Try it out and let me know what you think. But remember not to say nasty things about your boss: the substitution cipher is not a secure means of communication, as I will demonstrate in a future blog post.


How to create a hexagonal bin plot in SAS

While I was working on my recent blog post about two-dimensional binning, a colleague asked whether I would be discussing "the new hexagonal binning method that was added to the SURVEYREG procedure in SAS/STAT 13.2." I was intrigued: I was not aware that hexagonal binning had been added to a SAS procedure!

Hexagonal binning is an alternative to the usual rectangular binning, which I blogged about last week. It has been promoted in the statistical community by Daniel Carr (JASA, 1987), who recommends it for visualizing the density of bivariate data in a scatter plot. Some research suggests that using hexagonal bins results in less visual bias than rectangular bins, especially for clouds of points that are roughly elliptical. A hexagonal bin plot is created by covering the data range with a regular array of hexagons and coloring each hexagon according to the number of observations it covers. As with all bin plots, hex-binned plots are good for visualizing large data sets for which a scatter plot would suffer from overplotting. The bin counts estimate the density of the observations. (In SAS, you can also use PROC KDE to compute a kernel density estimate.)

When I looked at the documentation for the SURVEYREG procedure in SAS/STAT 13.2, I found the feature that my friend mentioned. The PLOTS= option on the PROC SURVEYREG statement supports creating a plot that overlays a regression line on a hex-binned heat map of two-dimensional data. Although there is no option in PROC SURVEYREG to remove the regression line, you can still use the procedure to output the counts in each hexagonal bin.

Creating a heat map of counts for hexagonal bins

You can obtain the counts and the vertices of the hexagonal bins by using a trick that I blogged about a few years ago: Use ODS to create a SAS data set that contains the data underlying the graph. You can then use the POLYGON statement in PROC SGPLOT to create a hex-binned plot of the counts.

Because I want to create several hexagonal heat maps, I will wrap these two steps into a single macro call:

/* Use hexagonal binning to estimate the bivariate density.
   Requires SAS/STAT 13.2 (SAS 9.4m2) */
%macro HexBin(dsName, xName, yName, nBins=20, colorramp=TwoColorRamp);
  ods select none;
  ods output fitplot=_HexMap;  /* write graph data to a data set */
  proc surveyreg data=&dsname plots(nbins=&nBins weight=heatmap)=fit(shape=hex);
     model &yName = &xName;
  run;
  ods select all;
 
  proc sgplot data=_HexMap;
     polygon x=XVar y=YVar ID=hID / colorresponse=WVar fill 
                                    colormodel=&colorramp;
  run;
%mend;

The %HexBin macro calls PROC SURVEYREG to create the heat map with hexagonal binning. However, the graph is not displayed on the screen, but is redirected to a data set. The data set is in a format that can be directly rendered by using the POLYGON statement in PROC SGPLOT.

To run the macro, you have to specify a data set name, the name of an X variable, and the name of a Y variable. By default, the data are binned into approximately 20 bins in both directions. You can control the number of bins by using the NBINS= option. Unlike for rectangular bins, you don't get 400 hexagons when you specify NBINS=20. Instead, the procedure computes the size of the hexagons so that they "have approximately the same area as the same number of rectangular bins would have." However, as you would expect, a large value for the NBINS= option results in many small bins, whereas a small value results in a few large bins.

By default, a two-color color ramp is used to visualize the counts, but you can specify other color ramps by using the COLORRAMP= option. Inside the macro, I chose not to use the OUTLINE option on the POLYGON statement. If you prefer to see outlines for the hexagons, feel free to modify the macro (or just tack OUTLINE onto the end of the value that you pass for the COLORRAMP= argument, because that text is substituted into the list of POLYGON options), as sketched below.
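
For example, the modified POLYGON statement inside the macro might look like the following sketch, where the OUTLINE option draws the border of each hexagon:

proc sgplot data=_HexMap;
   polygon x=XVar y=YVar ID=hID / colorresponse=WVar fill outline
                                  colormodel=&colorramp;
run;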

The following statement calls the %HexBin macro to create a hexagonal bin plot of the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set. My previous blog post shows a heat map of the same data by using rectangular bins.

title "Birth Weight vs. Mother's Weight gain";
%HexBin(sashelp.bweight, MomWtGain, Weight);
[Figure: hexagonal bin plot of birth weight versus mother's weight gain]

The hex-binned heat map reminds me of the effect of using transparency to overcome overplotting. You see a faint light-blue haze where the density of points is low. Dark colors indicate regions for which the density of points is high (more than 3,000 points per bin, for this example).

A fun set of data to visualize is the sashelp.Zipcode data because if you plot the longitude and latitude variables, the hexagons create a pixelated version of the US! Bins with high counts indicate regions for which there are many zip codes packed closely together, such as major metropolitan areas. Rather than use the default color ramp, the following example uses the PALETTE function in SAS/IML software to create a yellow-orange-red color ramp. Those colors are pasted into the COLORRAMP= option; be sure to use parentheses if you specify your own colors for a color ramp. I thought it would be appropriate to use hex values to specify the colors for the hexagons, but you can also define a color ramp by using color names such as colorramp=(lightblue white lightred red).

proc iml;
   c = palette("YLORRD", 5); print c;
quit;
 
data zips;
set sashelp.Zipcode(where=(State<=56 & State^=2 & State^=15 & X<0));
run;
 
title "Density of US Zip Codes";
%HexBin(zips, X, Y, nBins=51,
        colorramp=(CXFFFFB2 CXFECC5C CXFD8D3C CXF03B20 CXBD0026));
[Figure: hexagonal bin plot of the density of US zip codes]

The heat map of hex-binned counts shows "hot spots" of density along the East Coast (Washington/Baltimore, Philadelphia, New York), along the West Coast (Los Angeles, the Bay Area, Seattle), and at various other major cities (Denver, Dallas, Chicago, ...). Chicago, for example, is visible near (X,Y)=(-88, 42), although at this scale of binning Lake Michigan does not show up as a collection of empty hexagons, which makes midwestern cities harder to locate.

This hex-binned heat map can be a useful alternative to a scatter plot when the scatter plot suffers from overplotting or when you want to estimate the density of the observations. It requires SAS/STAT 13.2. Between hexagonal binning, the rectangular bin plot from my previous blog post, and PROC KDE, there are now multiple ways in SAS to visualize the density of bivariate data.


Counting observations in two-dimensional bins

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of observations in each bin. The BIN function associates a bin to each observation, and the TABULATE subroutine computes the count for each bin. I have previously blogged about how to use this technique to count the observations in bins of equal or unequal width.
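
As a reminder, here is a minimal sketch of the one-dimensional technique (the data values are mine):

proc iml;
x = {0.2, 1.5, 2.7, 0.8, 1.1, 3.6, 2.2, 1.9};   /* example data */
cutpts = do(0, 4, 1);                  /* bin endpoints: 0, 1, 2, 3, 4 */
b = bin(x, cutpts);                    /* bin number for each observation */
call tabulate(levels, counts, b);      /* count observations in each bin */
print (levels // counts)[r={"bin" "count"}];
quit;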

[Figure: scatter plot of birth weight versus mother's weight gain, with 2-D bin counts]

You can also count the number of observations in two-dimensional bins. Specifically, you can divide the X direction into kx bins, divide the Y direction into ky bins, and count the number of observations in each of the kx x ky rectangles. In a previous article, I described how to bin 2-D data by using the SAS/IML language and produced the scatter plot at the left. The graph displays the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set.

I recently remembered that you can also count observations in 2-D bins by using the KDE procedure. The BIVAR statement in the KDE procedure supports an OUT= option, which writes a SAS data set that contains the counts in each bin. You can specify the number of bins in each direction by using the NGRID= option on the BIVAR statement, as shown by the following statements:

ods select none;
proc kde data=sashelp.bweight;
   bivar MomWtGain(ngrid=11) Weight(ngrid=7) / out=KDEOut;
run;
ods select all;
 
proc print data=KDEOut(obs=10 drop=density);
run;
[Table: the first 10 observations of the KDEOut data set]

As shown in the output, the VALUE1 and VALUE2 variables contain the coordinates of the center of each bin. The COUNT variable contains the bin count. You can read the counts into SAS/IML and print them as a matrix. In the data set, the Y variable changes the fastest as you go down the rows. Consequently, you need to use the SHAPECOL function rather than the SHAPE function to form the 7 x 11 matrix of counts:

proc iml;
kX = 11; kY = 7;
use KDEOut; read all var {Value1 Value2 Count}; close;
M = shapecol(Count, kY, kX);       /* reshape into kY x kX matrix */
/* create labels for cells in X and Y directions */
idx = do(1,nrow(Value1),kY);        /* location of X labels */
labelX = putn(Value1[idx], "5.1");
labelY = putn(Value2[1:kY], "5.0"); 
print M[c=labelX r=labelY];
[Table: the 7 x 11 matrix of bin counts]

As I explained in Monday's blog post, there are two standard ways to choose bins for a histogram: the "midpoints" method and the "endpoints" method. If you compare the bin counts from PROC KDE to the bin counts from my SAS/IML program, you will see small differences. That is because the KDE procedure uses the "midpoints" algorithm for subdividing the data range, whereas I used the "endpoints" algorithm for my SAS/IML program.

Last week I showed how to use heat maps to visualize matrices in SAS/IML. This matrix of counts is begging to be visualized with a heat map. Because the counts vary across several orders of magnitude (from 10^0 to more than 10^4), a linear color ramp will not be an effective way to visualize the raw counts. Instead, transform the counts to a log scale and create a heat map of the log-counts:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="ThreeColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";
[Figure: heat map of the log-counts]

The heat map shows the log-count of each bin. If you prefer a light-to-dark heatmap for the density, use the "TwoColor" value for the COLORRAMP= option. The heat map is a great way to see the distribution of counts at a glance. It enables you to see the approximate values of most cells, and you can easily determine cells where there are many or few observations. Of course, if you want to know the exact value of the count in each rectangular cell, look at the tabular output.


Choosing bins for histograms in SAS

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing "round numbers" for the location and width of bins. Common "round" bin widths include 1, 2, 2.5, and 5, as well as these numbers multiplied by a power of 10.

The default bin width and locations tend to work well for 95% of the data that I plot, but sometimes I decide to override the default choices. This article describes how to set the width and location of bins in histograms that are created by the UNIVARIATE and SGPLOT procedures in SAS.

Why override the default bin locations?

The most common reason to override the default bin locations is because the data have special properties. For example, sometimes the data are measured in units for which the common "round numbers" are not optimal:

  • For a histogram of time measured in minutes, a bin width of 60 is a better choice than a width of 50. Bin widths of 15 and 30 are also useful.
  • For a histogram of time measured in hours, 6, 12, and 24 are good bin widths.
  • For days, a bin width of 7 is a good choice.
  • For a histogram of age (or other values that are rounded to integers), the bins should align with integers.

You might also want to override the default bin locations when you know that the data come from a bounded distribution. If you are plotting a positive quantity, you might want to force the histogram to use 0 as the leftmost endpoint. If you are plotting percentages, you might want to force the histogram to choose 100 as the rightmost endpoint.

To illustrate these situations, let's manufacture some data with special properties. The following DATA step creates two variables. The T variable represents time measured in minutes. The program generates times that are normally distributed with a mean of 120 minutes, then rounds these times to the nearest five-minute mark. The U variable represents a proportion between 0 and 1; it is uniformly distributed and rounded to two decimal places.

data Hist(drop=i);
label T = "Time (minutes)" U = "Uniform";
call streaminit(1);
do i = 1 to 100;
   T = rand("Normal", 120, 30); /* normal with mean 120  */
   T = round(T, 5);             /* round to nearest five-minute mark */
   U = rand("Uniform");         /* uniform on (0,1) */
   U = floor(100*U) / 100;      /* round down to nearest 0.01 */
   output;
end;
run;

How do we control the location of histogram bins in SAS? Read on!

Custom bins with PROC UNIVARIATE: An example of a time variable

I create histograms with PROC UNIVARIATE when I am interested in also computing descriptive statistics such as means and quantiles, or when I want to fit a parametric distribution to the data. The following statements create the default histogram for the time variable, T:

title "Time Data (N=100)";
ods select histogram(PERSIST);  /* show ONLY the histogram until further notice */
proc univariate data=Hist;
histogram T / odstitle=title odstitle2="Default Bins";
run;
[Figure: histogram of the time variable with default bins]

The default bin width is 20 minutes, which is not horrible, but not as convenient as 15 or 30 minutes. The first bin is centered at 70 minutes; a better choice would be 60 minutes.

The HISTOGRAM statement in PROC UNIVARIATE supports two options for specifying the locations of bins. The ENDPOINTS= option specifies the endpoints of the bins; the MIDPOINTS= option specifies the midpoints of the bins. The following statements use these options to create two customized histograms for which the bin widths are 30 minutes:

proc univariate data=Hist;
histogram T / midpoints=(60 to 210 by 30)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram T / endpoints=(60 to 210 by 30)  odstitle=title odstitle2="Endpoints";
run;
[Figure: histograms of the time variable with MIDPOINTS= (left) and ENDPOINTS= (right) bins]

The histogram on the left has bins that are centered at 30-minute intervals. This histogram makes it easy to estimate that about 40 observations have times of approximately 120 minutes. The counts for other half-hour increments are similarly easy to estimate. In contrast, the histogram on the right has bins whose endpoints are 60, 90, 120, ... minutes. With this histogram, it is easy to see that about 35 observations have times that are between 90 and 120 minutes. Similarly, you can estimate the number of observations that are greater than three hours or less than 90 minutes.

Both histograms are equally correct. The one you choose should depend on the questions that you want to ask about the data. Use midpoints if you want to know how many observations have a value; use endpoints if you want to know how many observations are between two values.

If you run the SAS statements that create the histogram on the right, you will see the warning message
WARNING: The ENDPOINTS= list was extended to accommodate the data.
This message informs you that you specified the last endpoint as 210, but that additional bins were created to display all of the data.

Custom bins for a bounded variable

As mentioned earlier, if you know that values are constrained within some interval, you might want to choose histogram bins that incorporate that knowledge. The U variable has values that are in the interval [0,1), but of course PROC UNIVARIATE does not know that. The following statements create a histogram of the U variable with the default bin locations:

title "Bounded Data (N=100)";
proc univariate data=Hist;
histogram U / odstitle=title odstitle2="Default Bins";
run;
[Figure: histogram of U with default bins]

The default histogram shows seven bins with a bin width of 0.15. From a statistical point of view, this is an adequate histogram. The histogram indicates that the data are uniformly distributed and, although it is not obvious, the left endpoint of the first bin is at 0. However, from a "human readable" perspective, this histogram can be improved. The following statements use the MIDPOINTS= and ENDPOINTS= options to create histograms that have bin widths of 0.2 units:

proc univariate data=Hist;
histogram U / midpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram U / endpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Endpoints";
run;
ods select all;  /* turn off PERSIST; restore normal output */
[Figure: histograms of U with MIDPOINTS= (left) and ENDPOINTS= (right) bins]

The histogram on the left is not optimal for these data. Because we created uniformly distributed data in [0,1], we know that the expected count in the leftmost bin (which is centered at 0) is half the expected count of an inner bin. Similarly, the expected count in the rightmost bin (which is centered at 1) is half the count of an inner bin because no value can exceed 1. Consequently, this choice of midpoints is not very good. For these data, the histogram on the right is better at revealing that the data are uniformly distributed and are within the interval [0,1).

Custom bins with PROC SGPLOT

If you do not need the statistical power of the UNIVARIATE procedure, you might choose to create histograms with PROC SGPLOT. The SGPLOT procedure supports the BINWIDTH= and BINSTART= options on the HISTOGRAM statement. The BINWIDTH= option specifies the width for the bins. The BINSTART= option specifies the center of the first bin.

I recommend that you specify both the BINWIDTH= and BINSTART= options, and that you choose the bin width first. Be aware that not all specifications result in a valid histogram. If you make a mistake when specifying the bins, you might see the following warning:
WARNING: The specified BINWIDTH= value will be ignored in order to accommodate the data.
That message usually means that the minimum value of the data was not contained in a bin. For a bin width of h, the BINSTART= value must be less than xmin + h/2, where xmin is the minimum value of the data.

By default, the axis does not show a tick mark for every bin, but you can force that behavior by using the SHOWBINS option. The following statements call the SGPLOT procedure to create histograms for the time-like variable, T. The results are again similar to the custom histograms that are shown in the previous section:

title "Time Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=60 showbins; /* center first bin at 60 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=52.5;  /* 52.5 = 45 + h/2 */
xaxis values=(45 to 180 by 15);           /* alternative way to place ticks */
run;

The following statements call the SGPLOT procedure to create histograms for the bounded variable, U. The results are similar to those created by the UNIVARIATE procedure:

title "Bounded Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0 showbins;  /* center first bin at 0 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.1;      /* center first bin at h/2 */
xaxis values=(0 to 1 by 0.2);       
run;

In summary, for most data the default bin width and location result in a histogram that is both statistically useful and easy to read. However, the default choices can lead to a less-than-optimal visualization if the data have special properties, such as being time intervals or being bounded. In those cases, it makes sense to choose a bin width and a location for the first bin that reveal your data's special properties. For the UNIVARIATE procedure, use the MIDPOINTS= or ENDPOINTS= options on the HISTOGRAM statement. For the SGPLOT procedure, use the BINWIDTH= and BINSTART= options to create a histogram with custom bins.


Analyzing activity-tracker data: How many steps per day do YOU take?

My wife got one of those electronic activity trackers a few months ago and has been diligently walking every day since then. At the end of the day she sometimes reads off how many steps she walked, as measured by her activity tracker. I am always impressed at how many steps she racks up during the day through a combination of regular activity and scheduled walks.

There has been a lot written about the accuracy of electronic activity trackers. Personally, I don't worry about accuracy as long as the errors are in the 10% range. The purpose of these tools is to give people an idea of how much they are moving and to encourage them to get off the couch. Whether someone walked 4.2 miles or 3.8 isn't as important as the fact that she walked about 4 miles.

Because my wife's daily numbers seem so impressive, I decided to download some data from other people who use the same device. The device can be linked to a person's Twitter account and programmed to tweet a summary of the person's activity level each day. The Tweets all have a common hashtag and format, so it is easy to download a few hundred tweets and prepare them for data analysis. In an act of statistical voyeurism, I present a descriptive analysis of the activity level of 231 random people who use a particular brand of activity tracker, as reported by their tracker for Sunday, August 17, 2014. You can download the SAS program that analyzes these data.

Distribution of steps

[Figure: histogram of daily step counts, with a kernel density estimate and quantile inset]

The tracker records steps. The histogram at the left shows the distribution of the number of steps taken by the 231 subjects. (Click to enlarge.) A kernel density estimate is overlaid, and various quantiles are displayed in the inset. The histogram shows that about 25% of the people in the sample walked fewer than 4,000 steps, and about half of the people walked fewer than 7,100 steps. About a quarter of the people walked more than 11,600 steps, and the I-am-very-active award goes to those people who walked more than 18,000 steps—they are above the 95th percentile of this sample.

The tail of the distribution falls off rapidly, which means that there is a low probability of finding someone who walks more than 30,000 steps per day. In other words, this is not a fat-tailed distribution. On the contrary, a person in the upper percentiles is working her tail off!

How many steps does it take to walk a mile?

[Figure: scatter plot of distance versus steps, with a robust regression line]

The tracker also reports distance in miles. The basic device uses an algorithm to convert those steps into an approximate distance so, as they say, your distance may vary (nyuck-nyuck!). The device does not know your exact stride length.

The scatter plot to the left shows the relationship between distance and steps. For people who take many steps, there is substantial variation in the reported distance. Probably some of those people were running or jogging, which changes the length of the stride. The line indicates predicted distance for a given number of steps, as calculated by a robust regression algorithm.
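
The post links to the full program. As a rough sketch of how you might compute such a fit, you could use PROC ROBUSTREG; the Tracker data set and its Steps and Distance variables are hypothetical names, and the original analysis might use a different robust method:

proc robustreg data=Tracker method=mm;  /* MM estimation downweights outliers */
   model Distance = Steps;
run;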

These data can help to answer the question, "How many steps does it take to walk a mile?" Based on these data, it takes an average person 2,272 steps to walk a mile. Of course, shorter people will require more steps and taller people fewer, and there is the whole debate about how accurate these trackers are at estimating distance. Nevertheless, 2,272 steps is a good estimate for a mile. For a simpler number, you can estimate that 10,000 steps is about four miles.

These data also enable you to estimate the length of the average stride, which is 2.32 feet, or about 71 centimeters per step.
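
As a quick back-of-the-envelope check of those conversions (a sketch):

data _null_;
   stepsPerMile = 2272;
   strideFt = 5280 / stepsPerMile;     /* feet per step: about 2.32 */
   strideCm = 30.48 * strideFt;        /* centimeters per step: about 71 */
   put strideFt= strideCm=;
run;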

What do you think of these data? If you use a fitness tracking device, how many steps do you take each day? Leave a comment.


Creating heat maps in SAS/IML

In a previous blog post, I showed how to use the graph template language (GTL) in SAS to create heat maps with a continuous color ramp. SAS/IML 13.1 includes the HEATMAPCONT subroutine, which makes it easy to create heat maps with continuous color ramps from SAS/IML matrices. Typical usage includes creating heat maps of data matrices, correlation matrices, or covariance matrices. A more advanced usage is using matrices to visualize the results of computational algorithms or computer experiments.

Data matrices

Heat maps provide a convenient way to visualize many variables and/or many observations in a single glance. For example, suppose that you want to compare and contrast the 24 trucks in the Sashelp.Cars data by using the 10 numerical variables in the data set. The following SAS/IML statements read the data into a matrix and sort the matrix according to the value of the Invoice variable (that is, by price). The Model variable, which identifies each truck, is sorted by using the same criterion:

proc iml;
use Sashelp.Cars where(type="Truck");
read all var _NUM_ into Trucks[c=varNames r=Model];
close Sashelp.Cars;
 
call sortndx(idx, Trucks[,"Invoice"]);     /* find indices that sort the data */
Trucks = Trucks[idx,]; Model = Model[idx]; /* sort the data and the models    */

The original variables are all measured on different scales. For example, the Invoice variable has a mean value of $23,000, whereas the EngineSize variable has a mean value of 4 liters. Most of the values in the matrix are less than 300. Consequently, a heat map of the raw values would not be illuminating: The price variables would be displayed as dark cells (high values) and the rest of the matrix would be almost uniformly light (small values, relative to the prices). Fortunately, the HEATMAPCONT subroutine supports the SCALE="Col" option to standardize each variable to have mean 0 and unit variance. With that option, a heat map reveals the high, middle, and low values of each variable:

call HeatmapCont(Trucks) scale="Col"; /* standardize cols */
[Figure: heat map of the standardized truck data]

The HEATMAPCONT output shows the following default values:

  • The default color ramp is the 'TwoColor' color ramp. With my ODS style, white is used to display small values and dark blue is used to display large values. The color ramp indicates that the standardized values are approximately in the range [-2, 3], which probably indicates that several variables have positive skewness.
  • The axes are labeled by using the row and column numbers of the 24 x 10 matrix.
  • The Y axis "points down." In other words, the heat map displays the matrix as it would be printed: the top of the heat map displays the first few rows and the bottom of the heat map displays the last few rows.

A problem with this initial heat map is that we don't know which trucks correspond to the rows, nor do we know which variables correspond to the columns. You can use the XVALUES= and YVALUES= options to add that information to the heat map. Also, let's use a three-color color ramp and center the range of the color ramp so that white represents the mean value for each variable and red (blue) represents positive (negative) deviations from the mean. The resulting image is shown below (click to enlarge):

call HeatmapCont(Trucks) scale="Col"           /* standardize cols    */
     xvalues=varNames yvalues=Model            /* label rows and cols */
     colorramp="ThreeColor" range={-3 3}       /* color range [-3,3]  */
     legendtitle = "Standardized values" title="Truck Data";
[Figure: heat map of the truck data with labeled rows and columns]

This visualization of the data enables us to draw several conclusions about the data. In general, expensive cars are powerful (large values of EngineSize, Cylinders, and Horsepower), large in size (large values of Weight, Wheelbase, and Length), and get poor fuel economy (small values of MPG_City and MPG_Highway). However, two vehicles seem unusual. Compared with similarly priced trucks, the Baja is smaller and more fuel efficient. Compared with similarly priced trucks, the Tundra is more powerful and larger in size.

Correlation matrices

Because the data matrix is sorted by price, it looks like the price variables are positively correlated with all variables except for the fuel economy variables. Let's see if that is the case by computing the correlation matrix and displaying it as a heat map:

corr = corr(Trucks);
call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data";
[Figure: heat map of the correlation matrix]

The correlation matrix confirms our initial impression. The price variables are strongly correlated with the variables that indicate the power and size of the trucks. The price is negatively correlated with the fuel efficiency variables.

One of the advantages of computing correlation analysis in the SAS/IML language is that you can easily change the order of the variables. For example, the following statements move the fuel efficiency variables to the end, so that all of the positively correlated variables are grouped together:

newOrder = (1:5) || (8:10) || (6:7);
corr = corr[newOrder, newOrder];
varNames = varNames[newOrder];
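
You can then redisplay the reordered matrix by reusing the earlier call (a sketch):

call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data, Reordered";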

Can you think of other applications of visualizing matrices by using the HEATMAPCONT subroutine? Leave a comment.


Creating a basic heat map in SAS

[Figure: heat map of a cubic function of two variables]

Heat maps have many uses. In a previous article, I showed how to use heat maps with a discrete color ramp to visualize matrices that have a small number of unique values, such as certain covariance matrices and sparse matrices. You can also use heat maps with a continuous color ramp to visualize correlation matrices or data matrices.

This article shows how to use a heat map with a continuous color ramp to visualize a function of two variables. A heat map of a simple cubic function is shown to the left. In statistical applications the function is often a quantity—such as the mean or variance—that varies as a function of two parameters. Heat maps are like contour plots without the contours; you can use a heat map to display smooth or nonsmooth quantities.

You can create a heat map by using the HEATMAPPARM statement in the graph template language (GTL). The following template defines a simple heat map:

proc template;              /* basic heat map with continuous color ramp */
define statgraph heatmap;
dynamic _X _Y _Z _T;        /* dynamic variables */
 begingraph;
 entrytitle _T;             /* specify title at run time (optional) */
  layout overlay;
    heatmapparm x=_X y=_Y colorresponse=_Z /  /* specify variables at run time */
       name="heatmap" primary=true
       xbinaxis=false ybinaxis=false;  /* tick marks based on range of X and Y */
    continuouslegend "heatmap";
  endlayout;
endgraph;
end;
run;

The template is rendered into a graphic when you call the SGRENDER procedure and provide a source of values (the data). The data should contain three variables: two coordinate variables that define a two-dimensional grid and one response variable that provides a value at each grid point. I have used dynamic variables in the template to specify the names of variables and the title of the graph. Dynamic variables are specified at run time by using PROC SGRENDER. If you use dynamic variables, you can use the template on many data sets.

The following DATA step creates X and Y variables on a uniform two-dimensional grid. The value of the cubic function at each grid point is stored in the Z variable. These data are visualized by calling the SGRENDER procedure and specifying the variable names by using the DYNAMIC statement. The image is shown at the top of this article.

/* sample data: a cubic function of two variables */
%let Step = 0.04;
data A;
do x = -1 to 1 by &Step;
   do y = -1 to 1 by &Step;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;
 
proc sgrender data=A template=Heatmap; 
   dynamic _X='X' _Y='Y' _Z='Z' _T="Basic Heat Map";
run;

The GTL template uses default values for most options. The default color ramp is a continuous three-color ramp. For the ODS style I am using, the default ramp uses blue to display low values, white to display middle values, and red to display high values. The XBINAXIS=false and YBINAXIS=false options override the default placement of tick marks on the axes. By default, the tick marks would be placed at the data values of the X and Y variables, which are multiples of 0.04 for these data. The following image shows part of the heat map that results if I omit the XBINAXIS= and YBINAXIS= options. I prefer my tick labels to depend on the data range, rather than the particular value of &Step that I used to generate the data.

[Figure: detail of the heat map axes when the XBINAXIS= and YBINAXIS= options are omitted]

The HEATMAPPARM statement has many options that you can use to control features of the heat map. Because the heat map is an important tool in visualizing matrices, SAS/IML 13.1 supports functions for creating heat maps directly from matrices. I will discuss one of these functions in my next blog post.


Guiding numerical integration: The PEAK= option in the SAS/IML QUAD subroutine

One of the things I enjoy about blogging is that I often learn something new. Last week I wrote about how to optimize a function that is defined in terms of an integral. While developing the program in the article, I made some mistakes that generated SAS/IML error messages. By understanding my errors, I learned something new about the default options for the QUAD subroutine in SAS/IML software.

The QUAD subroutine supports the PEAK= option, but until last week I had never used the option. (Obviously the default value works well most of the time, since I have been using the QUAD routine successfully for many years!) I discovered that the PEAK= option can be important to ensuring that a numerical integration problem converges to an accurate answer.

The default value of the PEAK= option is 1. But what does that mean? Here's what I gleaned from the documentation:

  • The PEAK= option is used to scale the integrand and to determine whether the adaptive integration algorithm needs to perform additional iterations.
  • The option specifies a value of x where the magnitude of the integrand f(x) is relatively large. For integrations on infinite and semi-infinite intervals, the value of the PEAK= option specifies where the function has a lot of "mass." For example, the standard normal probability density function is centered at x=0, so you can specify PEAK=0 to integrate that function on (–∞, ∞), as shown in the sketch that follows this list. This example illustrates why the option is named "PEAK."
  • For best results, choose the PEAK= option to be in the interior of the interval of integration. Don't choose a value that is too close to the boundary. For example, on a finite domain [a, b], the midpoint of the interval is often a good choice: PEAK=(a + b)/2.
  • The PEAK= value should not be a location where the integrand is zero. That is, if PEAK=v, then make sure that f(v) ≠ 0.
  • In spite of its name, the value of the PEAK= option does not need to be the location of the maximum value of the integrand. For integration on a finite domain you can specify almost any value in the interval, subject to the previous rules.
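
Here is a minimal sketch of the normal-density example from the list above. In the vector of limits, the special missing values .M and .P denote –∞ and +∞, respectively:

proc iml;
start NormPDF(x);
   return( exp(-x##2/2) / sqrt(2*constant("pi")) );  /* standard normal density */
finish;
 
call quad(w, "NormPDF", {.M .P}) peak=0;  /* integrate over (-infinity, infinity) */
print w;                                  /* w should be approximately 1 */
quit;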

The remainder of this article uses the quadratic function f(x) = 1 – x^2 to illustrate how you can control the QUAD subroutine by using the PEAK= option.

Avoid zeros of the integrand

The following SAS/IML statements define the function to integrate. The integrand has a zero at the value x = 1, which is the default value for the PEAK= option. If you attempt to integrate this function on the interval [0, 2], you see the following scary-looking error message:

proc iml;
start Func(x);
   return( 1 - x##2 );         /* note that Func(1) = 0 */
finish;
 
call quad(w, "Func", {0 2});  /* default: PEAK=1 */
Convergence could not be attained over the subinterval (0 , 2 )
 
The function might have one of the following:
     1) A slowly convergent or a divergent integral.
     2) Strong oscillations.
     3) A small scale in the integrand: in this case you can change the 
        independent variable in the integrand, or
        provide the optional vector describing roughly the mean and 
        the standard deviation of the integrand.
     4) Points of discontinuities in the interior:
        in this case, you can provide a vector containing
        the points of discontinuity and try again.

The message says that the integration did not succeed, and it identifies possible problems and offers some suggestions. For this example, the integral is not divergent, oscillatory, or discontinuous, so we should look at the third item in the list. The message seems to assume that the integrand is a statistical probability distribution, which is not true for our example. For general functions, the advice could be rewritten as "... provide values for the PEAK= and SCALE= options. The SCALE= value should provide a horizontal estimate of scale; the integrand evaluated at the PEAK= value should provide a vertical estimate of scale."

For the example, –3 ≤ f(x) ≤ 1 when x is in the interval [0, 2], so the integrand has unit scale in the vertical direction. You can choose almost any value of x for the PEAK= option, but you should avoid values where the integrand is very small. The following statement uses PEAK=0.5, but you could also use PEAK=1.5 or PEAK=0.1234.

call quad(w, "Func", {0 2}) peak=0.5;  
print w;
[Output: the value of the integral, w]

In a similar way, you should make sure that the integrand is defined at the PEAK= value. Avoid singular points and points of discontinuity.

Avoid the boundaries of the interval

Here's another helpful tip: choose the PEAK= option so that it is sufficiently inside the interval of integration. Avoid choosing a value that is too close to the lower or upper limit of integration. The following statement chooses a value for the PEAK= option that is just barely inside the interval [0, 2]. A warning message results:

val = 2 - 1e-7;     /* value is barely inside interval [0,2] */
call quad(w, "Func", {0 2}) peak=val;
Due to the roundoff encountered in computing FUNC
      near the point     2.00000E+00
      convergence can be attained, but only
      with an estimated error     0.00000E+00

The warning message says that a problem occurs near the point x=2. Although the routine will continue and "convergence can be attained," the routine is letting you know that it cannot perform the integration as well as it would like to. If you move the PEAK= value farther away from the limits of integration, the warning goes away and the subroutine can integrate the function more efficiently.

So there you have it. Although the default value of the PEAK= option works well for most integrals, sometimes the programmer needs to provide the QUAD routine with additional knowledge of the integrand. When you provide the PEAK= option, be sure to avoid zeros and discontinuities of the integrand, and stay away from the boundaries of the integration interval.
