Lexicographic combinations in SAS

[Table: combinations of 5 items taken 3 at a time, as generated by the ALLCOMB function]

In a previous blog post, I described how to generate combinations in SAS by using the ALLCOMB function in SAS/IML software. The ALLCOMB function in Base SAS is the equivalent function for DATA step programmers.

Recall that a combination is a unique arrangement of k elements chosen from a set of n elements. As such, the combination can always be written in a standard order where the elements increase from left to right and no element is larger than n. The output for the ALLCOMB function is shown to the left for n = 5 and k = 3.

As you scan down the rows, you might ask yourself, "In what order are these combinations generated?" The rows are not in the order that I would generate with pencil and paper. I would start with combination {1 2 3}, then proceed to increment the rightmost digit to obtain {1 2 4} and {1 2 5}. From there, I'd move over to the second-to-last column, increment that digit, reset the last column, and generate {1 3 4} and {1 3 5}. I would iteratively continue that process until all combinations were generated.

The pencil-and-paper method generates all of the combinations that begin with item 1, then generates the combinations that begin with item 2, and ends with the combination {3 4 5}. A list of combinations in "dictionary order" is called the lexicographic ordering. Clearly, the ALLCOMB function does not generate combinations in lexicographic order. Instead, the function generates the combinations in "minimal change order." Each combination is formed from the previous combination by removing one item and inserting another. Consequently, each row has k – 1 items in common with the preceding row.

SAS also provides a way to generate the combinations in lexicographic order. The following DATA step shows how to use the COMB function and LEXCOMBI function to generate the lexicographic ordering:

%let n=5;              /* total number of items */          
%let k=3;              /* number of items per row */
/* generate combinations of n items taken k at a time */
data Comb(keep=c1-c&k);
array c[&k] (&k.*0);   /* initialize array to 0 */
ncomb = comb(&n, &k);  /* number of combinations */
do j = 1 to ncomb;
   rc = lexcombi(&n, &k, of c[*]);
   output;
end;
run;
[Table: combinations of 5 items taken 3 at a time, in lexicographic order]

The output is shown in the table on the right. Although it is not included in the output, the LEXCOMBI function returns the index of the column that was changed, which can be useful information for some applications.

One of the great things about SAS software is that there are multiple ways to accomplish most tasks. The preceding DATA step is only one way to generate combinations in lexicographic order.

Programming the lexicographic combination algorithm

Most SAS programmers enjoy calling pre-written functions or procedures in SAS. However, SAS software provides the DATA step and the SAS/IML language for those occasions when you want to program an algorithm yourself.

This fact is especially pertinent for students and adult learners. With the launch of the free SAS University Edition, the next generation of SAS programmers has free access to the DATA step and the SAS/IML language. With a working knowledge of SAS programming, you can prepare data for analysis and implement new analyses that are not built into SAS.

The algorithm to "generate all combinations of n items taken k at a time" is a great programming exercise. Programming the algorithm without calling the LEXCOMBI function enables a programmer to gain experience processing DATA step arrays with k elements. It requires some subtle programming techniques, such as looping forward and backward through the indices of an array.

Would you like to try your hand at writing such a DATA step program? If so, stop reading now. Spoiler ahead!


The following DATA step was sent to me by a senior executive at SAS who still enjoys finding time to program. I found it easy to mentally trace through the program's logic for n = 5 and k = 3 to see how the program works. It keeps track of the number of combinations that it generates and prints the value of "n choose k" when the program terminates.

/* Jim Goodnight's program to generate combinations 
   (in lexicographic order) of n items taken k at a time */
data Comb(keep=c1-c&k);
array c[&k];                /* array to hold combinations      */
n=&n; k=&k;
do i=1 to k; c[i]=i; end;   /* initialize array to 1--k        */
output;
 
count=1;                    /* enumerate combinations          */
do while(count);            /* find next combination           */
   if c[k]<n then c[k]+1;   /* increment value in last column  */
   else do; 
      n1=n; found=0;        /* search earlier in array         */ 
      do i=k-1 to 1 by -1 while(found=0); 
         n1=n1-1; 
         if c[i]<n1 then found=i; /* found column to increment */
      end;
      if found=0 then goto term;    /* none found ... finished */
      c[found]+1;           /* increment value in found column */
      do j=found+1 to k; 
         c[j]=c[j-1]+1;     /* reset values of subsequent cols */
      end; 
   end; 
   output;
   count+1;
end;
term: put count=;           /* at end, print count = comb(n,k) */
run;
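For readers who want to check their own implementation, here is the same lexicographic-successor logic sketched in Python. This is an illustration only (the function name is mine, not part of any SAS product); the backward search mirrors the `do i=k-1 to 1 by -1` loop in the DATA step, and the reset loop mirrors the `c[j]=c[j-1]+1` statement:

```python
from math import comb

def lex_combinations(n, k):
    """Yield combinations of {1, ..., n} taken k at a time in lexicographic order."""
    c = list(range(1, k + 1))               # start with {1, 2, ..., k}
    while True:
        yield tuple(c)
        # search from the right for a column that can still be incremented;
        # column j can hold at most n - (k - 1 - j)
        i = next((j for j in range(k - 1, -1, -1)
                  if c[j] < n - (k - 1 - j)), None)
        if i is None:
            return                          # {n-k+1, ..., n} was the last one
        c[i] += 1                           # increment the found column
        for j in range(i + 1, k):
            c[j] = c[j - 1] + 1             # reset subsequent columns

combos = list(lex_combinations(5, 3))       # first is (1,2,3), last is (3,4,5)
assert len(combos) == comb(5, 3)            # count equals "n choose k" = 10
```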

Can you modify the program to generate permutations in lexicographic order (see the LEXPERM function)? How would you vectorize this algorithm for the SAS/IML language? Do SAS programmers at your company implement algorithms like this, or do they mostly use the DATA step to extract, merge, transform, and prepare data for analysis? Leave a comment.


Computing prediction ellipses from a covariance matrix

In a previous blog post, I showed how to overlay a prediction ellipse on a scatter plot in SAS by using the ELLIPSE statement in PROC SGPLOT. The ELLIPSE statement draws the ellipse by using a standard technique that assumes the sample is bivariate normal. Today's article describes the technique and shows how to use SAS/IML to compute a prediction ellipse from a 2 x 2 covariance matrix and a mean vector. This can be useful for plotting ellipses for subgroups, ellipses that correspond to robust covariance estimates, or an ellipse for a population (rather than for a sample).

SAS/IML routines for computing prediction ellipses have been around for a long time. A module named CONTOUR appeared in the Version 6 (1989) documentation for SAS/IML. In Version 6.12, the module was used to compare prediction ellipses for robust and classical covariance matrices. A module also appears in Michael Friendly's 1991 book The SAS System for Statistical Graphics, and in several of his papers, including a 1990 SUGI article. Friendly continues to distribute the %ELLIPSES macro for displaying ellipses on scatter plots. The SAS/IML function in this article is similar to these earlier modules.

Prediction ellipses: The main ideas

You can compute a prediction ellipse for sample data if you provide the following information:

  • m: A vector for the center of the ellipse.
  • S: A covariance matrix. This can be a classical covariance matrix or a robust covariance matrix.
  • n: The number of nonmissing observations in the sample.
  • p: The confidence level for the prediction ellipse. Equivalently, you could specify a significance level, α, which corresponds to a 1 – α confidence level.

The implicit formula for the prediction ellipse is given in the documentation for the CORR procedure as the set of points that satisfy a quadratic equation. However, to draw the ellipse, you should parameterize the ellipse explicitly. For example, when the axes of the ellipse are aligned with the coordinate axes, the equation of an ellipse with center (c,d) and with radii a and b is defined implicitly as the set of points (x,y) that satisfies the equation
      (x-c)^2 / a^2 + (y-d)^2 / b^2 = 1.
However, if you want to draw the ellipse, the parametric form is more useful:
      x(t) = c + a cos(t)
      y(t) = d + b sin(t)
for t in the interval [0, 2π].

An algorithm to draw a prediction ellipse

I can think of two ways to draw prediction ellipses. One way is to use the geometry of Mahalanobis distance. The second way, which is used by the classical SAS/IML functions, is to use ideas from principal components analysis to plot the ellipse based on the eigendecomposition of the covariance matrix:

  1. Find the eigenvalues (λ1 and λ2) and eigenvectors (e1 and e2) of the covariance matrix, S. The eigenvectors form an orthogonal basis for the coordinate system for plotting the ellipse. The first eigenvector (e1) points in the direction of the greatest variance and defines the major axis for the prediction ellipse. The second eigenvector (e2) points in the direction of the minor axis.
  2. As mentioned previously, sines and cosines parameterize an ellipse whose axes are aligned with the standard coordinate directions. It is just as simple to parameterize an ellipse in the coordinates defined by the eigenvectors:
    • The eigenvectors have unit length, so a circle is formed by the linear combination cos(t)*e1 + sin(t)*e2 for t in the interval [0, 2π].
    • The ellipse sqrt(λ1)cos(t)*e1 + sqrt(λ2)sin(t)*e2 is a "standardized ellipse" in the sense that the standard deviations along the principal directions are used to scale the semi-major and semi-minor axes.
  3. To get a prediction ellipse, scale the standardized ellipse by a factor that depends on quantiles of the F(2, n–2) distribution, the confidence level, and an adjustment factor that depends on the sample size n. The SAS/IML module contains the details.
  4. Translate the prediction ellipse by adding the vector m.
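The four steps above can be sketched in Python with NumPy. This is a translation for illustration; the function name is mine, and I pass the F quantile in as an argument (in SAS it is computed by quantile("F", p, 2, n-2)) so that the sketch needs only NumPy:

```python
import numpy as np

def pred_ellipse(m, S, n, f_quantile, n_pts=128):
    """Points on a prediction ellipse with center m, 2x2 covariance S, and
    sample size n. f_quantile is the F(2, n-2) quantile for the desired
    confidence level."""
    lam, evec = np.linalg.eigh(S)              # eigenvalues in ascending order
    lam, evec = lam[::-1], evec[:, ::-1]       # largest eigenvalue first
    t = 2 * np.pi * np.arange(n_pts) / n_pts
    xy = np.vstack([np.sqrt(lam[0]) * np.cos(t),    # standardized ellipse in
                    np.sqrt(lam[1]) * np.sin(t)])   # eigenvector coordinates
    std_ellipse = (evec @ xy).T                # rotate into data coordinates
    c = 2 * (n - 1) / n * (n + 1) / (n - 2)    # finite-sample adjustment
    return np.sqrt(c * f_quantile) * std_ellipse + np.asarray(m)
```

For the identity covariance matrix the standardized ellipse is a circle, so every returned point lies at distance sqrt(c * f_quantile) from the center, which is a handy sanity check.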

A SAS/IML module for computing a prediction ellipse

The following module accepts a vector of k confidence levels. The module returns a matrix with three columns. The meaning of each column is described in the comments.

proc iml;
start PredEllipseFromCov(m, S, n, ConfLevel=0.95, nPts=128);
/* Compute prediction ellipses centered at m from the 2x2 covariance matrix S.
   The return matrix is a matrix with three columns.
   Col 1: The X coordinate of the ellipse for the confidence level.
   Col 2: The Y coordinate of the ellipse for the confidence level.
   Col 3: The confidence level. Use this column to extract a single ellipse.
 
   Input parameters:
   m           a 1x2 vector that specifies the center of the ellipse. 
               This can be the sample mean or median.
   S           a 2x2 symmetric positive definite matrix. This can be the 
               sample covariance matrix or a robust estimate of the covariance.
   n           The number of nonmissing observations in the data. 
   ConfLevel   a 1 x k vector of (1-alpha) confidence levels that determine the
               ellipses. Each entry is 0 < ConfLevel[i] < 1.  For example,
               0.95 produces the 95% confidence ellipse for alpha=0.05.
   nPts        the number of points used to draw the ellipse. The default is 128.
*/
 
   /* parameterize standard ellipse in coords of eigenvectors */
   call eigen(lambda, evec, S);   /* eigenvectors are columns of evec */
   t = 2*constant("Pi") * (0:nPts-1) / nPts;
   xy  = sqrt(lambda[1])*cos(t) // sqrt(lambda[2])*sin(t);
   stdEllipse = T( evec * xy );   /* transpose for convenience */
 
   /* scale the ellipse; see documentation for PROC CORR */
   c = 2 * (n-1)/n * (n+1)/(n-2);          /* adjust for finite sample size */
   p = rowvec(ConfLevel);                  /* row vector of confidence levels */
   F = sqrt(c * quantile("F", p, 2, n-2)); /* p = 1-alpha */
 
   ellipse = j(ncol(p)*nPts, 3);  /* 3 cols: x y p */
   startIdx = 1;                  /* starting index for next ellipse */
   do i = 1 to ncol(p);           /* scale and translate */
      idx = startIdx:startIdx+nPts-1;
      ellipse[idx, 1:2] = F[i] # stdEllipse + m; 
      ellipse[idx, 3] = p[i];     /* use confidence level as ID */
      startIdx = startIdx + nPts;
   end;           
   return( ellipse );
finish;

Compare a classical and robust prediction ellipse

The SAS/IML documentation includes an example in which a classical prediction ellipse is compared with a robust prediction ellipse for three-dimensional data that contain outliers. The following SAS/IML statements define the classical and robust estimates of location and scatter for two of the variables. The PredEllipseFromCov function is called twice: once for the classical estimates, which are based on all 21 observations, and once for the robust estimates, which are based on 17 observations:

/* Stackloss data example from SAS/IML doc */
/* classical mean and covariance for stackloss data */
/*      Rate      AcidConcentration */
N = 21;
mean = {60.428571 86.285714}; 
cov  = {84.057143 24.571429,
        24.571429 28.714286};
classicalEllipse = PredEllipseFromCov(mean, cov, N, 0.9);
 
/* robust estimates of location and scatter for stackloss data */
/*         Rate      AcidConcentration */
N = 17;
robMean = {56.7      85.5}; 
robCov  = {23.5     16.1,
           16.1      32.4};
robEllipse = PredEllipseFromCov(robMean, robCov, N, 0.9);
 
/* write the ellipse data to a SAS data set */
E = classicalEllipse || robEllipse[,1:2];
create Ellipse from E[c={"classX" "classY" "ConfLevel" "robX" "robY"}];
append from E;
close Ellipse;
quit;

The following SAS statements merge the data and the coordinates for the prediction ellipses. The POLYGON statement in the SGPLOT procedure is used to overlay the ellipses on a scatter plot of the data. The POLYGON statement is available in SAS 9.4M1. Notice that the PredEllipseFromCov function returns a matrix with three columns. The third column (the confidence level) is used as the ID= variable for the POLYGON statement:

data Stackloss;
input Rate Temperature AcidCon @@;
datalines;
80  27  89  80  27  88  75  25  90  62  24  87  62  22  87
62  23  87  62  24  93  62  24  93  58  23  87  58  18  80
58  18  89  58  17  88  58  18  82  58  19  93  50  18  89
50  18  86  50  19  72  50  19  79  50  20  80  56  20  82
70  20  91
;
 
data All;
set Stackloss Ellipse;
run;
 
title "Classical and Robust Prediction Ellipses";
proc sgplot data=All;
scatter x=Rate y=AcidCon / transparency=0.5 markerattrs=(symbol=CircleFilled size=12);
polygon x=classX y=classY id=ConfLevel / 
        lineattrs=GraphData1 name="classical" legendlabel="Classical 90% Ellipse";
polygon x=robX y=robY id=ConfLevel / 
        lineattrs=GraphData2 name="robust" legendlabel="Robust 90% Ellipse";
xaxis grid; yaxis grid; 
keylegend "classical" "robust";
run;
[Figure: scatter plot of the stackloss data with classical and robust 90% prediction ellipses]

The classical prediction ellipse is based on all 21 observations. The robust estimation method classified four observations as outliers, so the robust ellipse is based on 17 observations. The four outliers are the markers that are outside of the robust ellipse.

Plotting prediction ellipses for subgroups

You can also use this module to overlay prediction ellipses for subgroups of the data. For example, you can compute the mean and covariance matrix for each of the three species of flower in the sashelp.iris data. You can download the complete program that computes the prediction ellipses and overlays them on a scatter plot of the data. The following graph shows the result:

[Figure: prediction ellipses for the three species in the sashelp.iris data]

In summary, by using the SAS/IML language, you can write a short function that computes prediction ellipses from four quantities: a center, a covariance matrix, the sample size, and the confidence level. You can use the function to compute prediction ellipses for classical estimates, robust estimates, and subgroups of the data. You can use the POLYGON statement in PROC SGPLOT to overlay these ellipses on a scatter plot of the data.


Add a prediction ellipse to a scatter plot in SAS

It is common in statistical graphics to overlay a prediction ellipse on a scatter plot. This article describes two easy ways to overlay prediction ellipses on a scatter plot by using SAS software. It also describes how to overlay multiple prediction ellipses for subpopulations.

What is a prediction ellipse?

A prediction ellipse is a region for predicting the location of a new observation under the assumption that the population is bivariate normal. For example, an 80% prediction ellipse indicates a region that would contain about 80% of a new sample that is drawn from a bivariate normal population with mean and covariance matrices that are equal to the sample estimates.

A prediction ellipse is helpful for detecting deviation from normality. If you overlay a prediction ellipse on your sample and the sample does not "fill" the ellipse in the way that bivariate normal data would, then you have a graphical indication that the data are not bivariate normal.

Because the center of the ellipse is the sample mean, a prediction ellipse can give a visual indication of skewness and outliers in the data. A prediction ellipse also displays linear correlation: If you standardize both variables, a skinny ellipse indicates highly correlated variables, whereas an ellipse that is nearly circular indicates little correlation.

How to draw a prediction ellipse in SAS

SAS provides two easy ways to overlay a prediction ellipse on a scatter plot. The SGPLOT procedure supports the SCATTER statement, which creates a scatter plot, and the ELLIPSE statement, which overlays a bivariate normal prediction ellipse, computed by using the sample mean and covariance. The following statements overlay a 95% prediction ellipse on 50 measurements of the width and length of petals for a particular species of flower:

title "95% Prediction Ellipse";
title2 "iris Versicolor";
proc sgplot data=sashelp.iris noautolegend;
   where Species='Versicolor';
   scatter x=PetalLength y=PetalWidth / jitter;  /* JITTER is optional    */
   ellipse x=PetalLength y=PetalWidth;           /* default is ALPHA=0.05 */
run;
[Figure: 95% prediction ellipse for the iris Versicolor petal data]

The JITTER option (which requires SAS 9.4) is used to slightly displace some of the observations. Without the option, some markers overlap because the data are rounded to the nearest millimeter. By default, the ELLIPSE statement computes and displays a 95% prediction ellipse, which tends to surround most of the data. However, you can use the ALPHA= option to display a 100(1-α)% prediction ellipse. Some researchers recommend overlaying a 68% prediction ellipse (ALPHA=0.32) because 68% is the probability of observing univariate normal data that is within one standard deviation of the mean. See the article by Michael Friendly for examples and a discussion.

Because a prediction ellipse gives a graphical indication of the linear correlation between variables, you can create a prediction ellipse as part of a correlation analysis in SAS. The following call to the CORR procedure uses the PLOTS=SCATTER option to overlay a 95% prediction ellipse. The output (not shown) is similar to the previous graph.

proc corr data=sashelp.iris plots=scatter(ellipse=prediction);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;

Draw prediction ellipses for each group

Occasionally, you might want to overlay prediction ellipses for subsets of the data that correspond to subpopulations. For example, if your data contain both male and female subjects, it might be that the means and covariance for the variables are different between males and females. In that case, overlaying a prediction ellipse for each subpopulation helps you to visualize how characteristics vary between the groups.

There are several ways to draw prediction ellipses for groups within the data. Michael Friendly provides a macro that uses SAS/GRAPH to overlay ellipses. It supports a grouping variable, as shown in his 2006 paper in J. Stat. Soft. (This macro has been used in some examples by Robert Allison.)

If you want to use ODS statistical graphics to display the multiple ellipses, you need to use a little trick. Because the ELLIPSE statement in PROC SGPLOT does not support a GROUP= option as of SAS 9.4M2, you have to reshape the data so that each group becomes a new variable. This is equivalent to transposing the data from a "long form" to a "wide form." From my previous blog post, here is one way to create six variables that represent the petal length and width variables for each of the three species of iris in the sashelp.iris data set:

data Wide;
/*   |-- PetalLength --| |--- PetalWidth ---|  */
keep L_Set L_Vers L_Virg W_Set W_Vers W_Virg;  /* names of new variables */
merge sashelp.iris(where=(Species="Setosa") 
              rename=(PetalLength=L_Set PetalWidth=W_Set))
      sashelp.iris(where=(Species="Versicolor") 
              rename=(PetalLength=L_Vers PetalWidth=W_Vers))
      sashelp.iris(where=(Species="Virginica") 
              rename=(PetalLength=L_Virg PetalWidth=W_Virg));
run;
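The same long-to-wide reshaping idea can be sketched with pandas (a Python analogy with made-up toy values, not part of the SAS workflow): number the observations within each group, then pivot so that each group becomes its own pair of columns, just as the MERGE/WHERE/RENAME trick does above.

```python
import pandas as pd

# toy long-form data: one row per observation, with a group variable
long = pd.DataFrame({
    "Species":     ["Setosa", "Setosa", "Versicolor", "Versicolor"],
    "PetalLength": [14, 13, 47, 45],
    "PetalWidth":  [2, 2, 14, 13],
})
# number observations within each group, then pivot to wide form
long["obs"] = long.groupby("Species").cumcount()
wide = long.pivot(index="obs", columns="Species",
                  values=["PetalLength", "PetalWidth"])
# wide now has one (variable, species) column per group, aligned by row number
```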

Now that the data are in separate variables, call PROC SGPLOT to overlay three scatter plots and three 95% prediction ellipses:

title "95% Prediction Ellipses for Each Group";
proc sgplot data=Wide;
  scatter x=L_Set  y=W_Set  / jitter name="Set"  legendlabel="Setosa";
  scatter x=L_Virg y=W_Virg / jitter name="Virg" legendlabel="Virginica";
  scatter x=L_Vers y=W_Vers / jitter name="Vers" legendlabel="Versicolor";
  ellipse x=L_Set  y=W_Set  / lineattrs=GraphData1;
  ellipse x=L_Virg y=W_Virg / lineattrs=GraphData2;
  ellipse x=L_Vers y=W_Vers / lineattrs=GraphData3;
  keylegend "Set" "Virg" "Vers" / title="Species:";
  xaxis label="Petal Length (mm)";
  yaxis label="Petal Width (mm)";
run;
[Figure: 95% prediction ellipses for each iris species]

Notice that I used a trick to make the color of each prediction ellipse match the color of the data to which it is associated. The colors for markers in the scatter plots are assigned automatically according to the current ODS style. The style elements used for the markers are named GraphData1, GraphData2, and GraphData3. When I create the ellipses, I use the LINEATTRS= option to set the style attributes of the ellipse to match the corresponding attributes of the data.

The graph visualizes the relationship between the petal length and the petal width variables in these three species. At a glance, three facts are evident:

  • The means of the variables (the centers of the ellipses) are different across the groups. This is a "graphical MANOVA test."
  • The Versicolor ellipse is "thinner" than the others, which indicates that the correlation between petal length and width is greater in that species.
  • The Virginica ellipse is larger than the others, which indicates greater variance within that species.

In summary, it is easy to use the ELLIPSE statement in PROC SGPLOT to add a prediction ellipse to a scatter plot. If you want to add an ellipse for subgroups of the data, use the trick from my previous article to reshape the data. Plotting ellipses for subgroups enables you to visually inspect covariance and correlation within groups and to compare those quantities across groups.


How to create and detect an empty matrix

An empty matrix is a matrix that has zero rows and zero columns. At first "empty matrix" sounds like an oxymoron, but when programming in a matrix language such as SAS/IML, empty matrices arise surprisingly often.

Sometimes empty matrices occur because of a typographical error in your program. If you try to print or compute with a matrix that has not been defined, the SAS/IML program will usually display an error message:

proc iml;
print UndefinedMatrix;
 
            ERROR: (execution) Matrix has not been set to a value.

However, sometimes an empty matrix is the correct result of a valid computation. For example:

  • If you use the LOC function to find elements in a vector that satisfy a criterion, the empty matrix results when no elements satisfy the criterion:
    x = {2 4 6};
    y = loc(x < 0);         /* y is empty matrix */
  • If you use the XSECT function to find the intersection of two sets, the empty matrix results when the intersection is empty:
    y = xsect({1}, {2});    /* empty intersection of sets */
  • If you use the REMOVE function to remove elements of a matrix, the empty matrix results when you remove all elements:
    y = remove({2}, 1);     /* remove 1st element from 1x1 matrix */

In my book Statistical Programming with SAS/IML Software, I mention that you can determine whether a matrix is empty by using one of three functions: TYPE, NROW, or NCOL, as follows:

print (type(y))[L="type"] (nrow(y))[L="nrow"] (ncol(y))[L="ncol"];
 
if ncol(y)=0 then print "y is empty";
else              print "y is not empty";
[Table: TYPE, NROW, and NCOL values for an empty matrix]

The output shows that the "type" of an empty matrix is 'U' (for "undefined"). In my book I use the NCOL function to test for empty matrices, but the other functions work just as well.
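The same situations arise in other matrix languages. As a rough NumPy analogy (Python, not SAS/IML syntax; the variable names are mine), filtering a vector with no matching elements produces an empty array, and testing its size plays the role of the NCOL test:

```python
import numpy as np

x = np.array([2, 4, 6])
y = np.where(x < 0)[0]       # indices where x < 0; plays the role of loc(x < 0)
is_empty = (y.size == 0)     # y has no elements, like an empty SAS/IML matrix
```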

New function for detecting empty matrices

Recent releases of SAS/IML software support new ways to create and detect empty matrices:

  • In SAS/IML 12.1, you can use curly braces to define an empty matrix:
    y = {};                 /* y is an empty matrix */
  • In SAS/IML 12.3, you can use the ISEMPTY function to test whether a matrix is empty:
    if IsEmpty(y) then print "y is empty";  
    else               print "y is not empty";

I think that the new syntax produces SAS/IML programs that are easier to read and understand. Have you encountered empty matrices in your SAS/IML programs? Where? How did you handle them? Leave a comment.


The SAS/IML File Exchange is open

[Image: SAS/IML File Exchange]

Have you written a SAS/IML program that you think is particularly clever? Are you the proud author of SAS/IML functions that extend the functionality of SAS software?

You've worked hard to develop, debug, and test your program, so why not share it with others? There is now a central location for SAS/IML programmers to upload data, programs, and documentation. It is called the SAS/IML File Exchange. The File Exchange is a Web site where authors can upload programs that they have written and where SAS/IML programmers can search for useful programs.

SAS programmers have a long history of sharing programs.

The File Exchange is a subcommunity of the popular SAS/IML Support Community, where programmers discuss algorithms, ask questions, and get help writing, debugging, and improving their programs. In 2013, there were more than 500 posts to the SAS/IML Support Community.

The new SAS/IML File Exchange provides a much-needed repository for ready-to-use programs so that programmers can spend less time reinventing the wheel and more time writing amazing programs. To participate in the File Exchange, do the following:

  • Search your files for that awesome SAS/IML program that you wrote last year. Add comments to the program so that others can appreciate how clever you are! :-)
  • Read the article "How to Contribute to the SAS/IML File Exchange," which describes how to upload files to the File Exchange.
  • If you do not yet have a SAS profile, go to the SAS Support Communities and establish a profile. If you already have a SAS profile, log in by using your email address.
  • Go to the SAS/IML File Exchange and upload your files individually or as a ZIP file.

The community of SAS/IML programmers has been growing rapidly. Because the free SAS University Edition includes the SAS/IML product, I expect the community to grow even faster in the future. I hope that programmers of all skill levels will use the SAS/IML File Exchange to post programs that everyone can use. Why not post your program today?


A log transformation of positive and negative values

In my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about how to log-transform the data.

Today I describe a transformation that is useful when a variable ranges over several orders of magnitude in both the positive and negative directions. This situation comes up when you measure the net change in some quantity over a period of time. Examples include the net profit at companies, net change in the price of stocks, net change in jobs, and net change in population.

A state-by-state look at net population change in California

[Figure: net change in California population vs. state population, original scale]

Here's a typical example. The US Census Bureau tracks state-to-state migration flows. (For a great interactive visualization of internal migration, see the Governing Data Web site.) The adjacent scatter plot shows the net number of people who moved to California from some other US state in 2011, plotted against the population of the state. (Click to enlarge.) For example, 35,650 people moved to California from Arizona, whereas 49,635 moved to Arizona from California, so Arizona was responsible for a net change of –13,985 to the California population. The population of Arizona is just under 5 million, so the marker for Arizona appears in the lower left portion of the graph.

Most states account for a net change in the range [–5000, 5000] and most states have populations less than 5 million. When plotted on the scale of the data, these markers are jammed into a tiny portion of the graph. Most of the graph is empty because of states such as Texas and Illinois that have large populations and are responsible for large changes to California's population.

As discussed in my blog post about using log-scale axes to visualize variables, when a variable ranges over several orders of magnitude, it is often effective to use a log transformation to see large and small values on the same graph. The population variable is a classic example of a variable that can be log-transformed. However, 80% of the values for the net change in population are negative, which rules out the standard log transformation for that variable.

The log-modulus transformation

[Figure: graph of the log-modulus transformation]

A modification of the log transformation can help spread out the magnitude of the data while preserving the sign of the data. It is called the log-modulus transformation (John and Draper, 1980). The transformation takes the logarithm of the absolute value of the variable plus 1. If the original value was negative, "put back" the sign of the data by multiplying by –1. In symbols,
L(x) = sign(x) * log10(|x| + 1)

The graph of the log-modulus transformation is shown to the left. The transformation preserves zero: a value that is 0 in the original scale is also 0 in the transformed scale. The function acts like the log (base 10) function when x > 0. Notice that L(10) ≈ 1, L(100) ≈ 2, and L(1000) ≈ 3. This property makes it easy to interpret values of the transformed data in terms of the scale of the original data. Negative values are transformed similarly.
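In Python, the transformation is a one-liner (for illustration; the function name is mine):

```python
from math import copysign, log10

def log_modulus(x):
    """Log-modulus transform: sign(x) * log10(|x| + 1); preserves 0 and sign."""
    return copysign(log10(abs(x) + 1), x)
```

For example, log_modulus(999) = log10(1000) = 3 and log_modulus(-999) = -3, which matches the property that L(1000) ≈ 3 in the original scale.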

Applying the log-modulus transformation

[Figure: net change in California population vs. state population, log-modulus scale]

Let's see how the log-modulus transformation helps to visualize the state-by-state net change in California's population in 2011. You can download the data for this example and the SAS program that creates the graphs. A previous article showed how to use PROC SGPLOT to display a log axis on a scatter plot, and I have also discussed how to create custom tick marks for log axes.

The scatter plot to the left shows the data after using the log-modulus transformation on the net values. The state populations have been transformed by using a standard log (base 10) transformation. The log-modulus transformation divides the data into two groups: those states that contributed to a net influx of people to California, and those states that reduced the California population. It is now easy to determine which states are in which group: The states that fed California's population were states in New England, the Rust Belt, and Alaska. It is also evident that size matters: among states that lure Californians away, the bigger states tend to attract more.

The main effect of the log-modulus transformation is to spread apart markers that are near the origin and to pull in markers that are relatively far from the origin. By using the transformation, you can visualize variables that span several orders of magnitude in both the positive and negative directions. The resulting graph is easy to interpret if you are familiar with logarithms and powers of 10.


Create custom tick marks for axes on the log scale

In my previous blog post, I showed how to use log axes on a scatter plot in SAS to better visualize data that range over several orders of magnitude. Because the data contained counts (some of which were zero), I used a custom transformation x → log10(x+1) to visualize the data. You can download the data and the SAS program.

I left one problem unresolved at the end of my last post: The tick marks on the axes were labeled on the log scale so that, for example, a marker positioned at the tick mark labeled '2' actually represents a value of 10^2 = 100. If the graph is intended for people who do not use logarithms on a regular basis, this log-scale axis might hinder them from properly interpreting the graph.

Fortunately, the SGPLOT procedure in SAS supports custom tick marks. In the XAXIS and YAXIS statements, you can use the VALUES= option to specify the location of tick marks and you can use the VALUESDISPLAY= option to specify the label for each tick mark.

Determine the tick locations

The goal is to place tick marks on the plot for the transformed data, but label those ticks by using the original untransformed counts. For example, suppose that you decide to display tick marks that correspond to the following counts: 0, 5, 10, 25, 50, 100, 250, and 500. The following DATA step computes the log-x-plus-one transformation for those values:

data TickMarks;
input Count @@;
LogCountP1 = log10(Count+1);
datalines;
0 5 10 25 50 100 250 500
;
run;
 
proc print data=TickMarks noobs; run;

The numbers in the second column are the locations of the tick marks in the scatter plot of the transformed data. Put those numbers on the VALUES= option. The numbers in the first column are the corresponding labels that we want to display with those tick marks. Put those numbers (as text strings) on the VALUESDISPLAY= option, as follows:

ods graphics / width=750 height=1000;
title "Custom Axes on Log Scale";
proc sgplot data=LogComments noautolegend;
   label logCommentP1="Number of Original Comments" logResponseP1="Number of Responses";
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName;
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00)              /* tick locations   */
         valuesdisplay = ("0" "5" "10" "25" "50" "100");  /* labels displayed */
   yaxis grid offsetmin=0.05 offsetmax=0.1
         values=(0 0.78 1.04 1.41 1.71 2.00 2.40 2.70)
         valuesdisplay = ("0" "5" "10" "25" "50" "100" "250" "500");
run;
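As a quick sanity check, the numbers in the VALUES= lists are just the log-x-plus-one transform of the intended counts. The computation can be verified in a couple of lines (sketched here in Python/NumPy for illustration; the document's own version is the DATA step above):

```python
import numpy as np

counts = np.array([0, 5, 10, 25, 50, 100, 250, 500])  # desired tick labels
ticks = np.log10(counts + 1)                          # tick locations on the transformed axis
rounded = np.round(ticks, 2)   # 0, 0.78, 1.04, 1.41, 1.71, 2.0, 2.4, 2.7
```

The rounded values match the VALUES= option on the YAXIS statement, and the first six match the XAXIS statement.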

This plot shows the custom tick marks for the axes. The data are plotted on a log scale, but the labels on the tick marks show the original scale of the data. It is easy to estimate the number of comments and responses for each individual. For example, Robert has 25 original comments and fewer than 250 responses. John has fewer than 10 original comments and 50 responses.

Of course, you still have to be careful when reading graphs that have nonlinear axes. For one thing, you can't compare distances between points. On the plot, it looks like Tricia and Michelle are about the same distance apart as Rick and Chris, but that is not true. Tricia and Michelle differ by 25 comments, whereas Rick and Chris differ by more than 150.

Automating the tick locations and labels

I have one final remark. When creating the plot, I used the DATA step to compute the locations for a selected set of tick marks, but then I entered those values by hand on the VALUES= and VALUESDISPLAY= options in PROC SGPLOT. The fancier approach is to pack information about the tick marks into SAS macro variables and use the macro variables in PROC SGPLOT. You can use the SYMPUTX routine and string concatenation routines to carry out this task. The following SAS/IML program shows how to assign macro variables in PROC IML. I will leave the analogous DATA step program as an exercise for an ambitious SAS programmer:

proc iml;
Count = {0 5 10 25 50 100 250 500};
d = ' "' + char(Count,3) + '"';
v = " " + putn(log10(Count+1), "Best6.");
call symputx("D2", rowcat(d));
call symputx("V2", rowcat(v));
quit;
 
%put _user_;

The values of the macro variables are displayed in the SAS Log. You can now use these macro variables in the PROC SGPLOT statements as follows:

proc sgplot data=LogComments noautolegend;
   ...
   yaxis grid offsetmin=0.05 offsetmax=0.1 values=(&V2) valuesdisplay=(&D2);
run;

Scatter plots with logarithmic axes...and how to handle zeros in the data

If you are trying to visualize numerical data that range over several orders of magnitude, conventional wisdom says that a log transformation of the data can often result in a better visualization. This article shows several ways to create a scatter plot with logarithmic axes in SAS and discusses some of the advantages and disadvantages of using logarithmic transformations. It also discusses a common problem: How do you transform data that range over several orders of magnitude but that also contain zeros? (Recall that the logarithm of zero is undefined!)

Let's look at an example. My colleague, Chris Hemedinger, wrote a SAS program that collects data about comments on the blogs.sas.com Web site. For each comment, he recorded the name of the commenter and whether the comment was an original comment or a response to a previous comment. For example, "This is a great article. Thanks!" is classified as a comment, whereas "You're welcome. Glad you liked it!" is classified as a response. You can download the program that creates the data and the graphs in this article. I consider only commenters who have posted more than ten comments.

The following call to the SGPLOT procedure creates a scatter plot of these data:

title "Comments and Responses on blogs.sas.com";
proc sgplot data=Comments noautolegend;
   scatter x=Comment y=Response / datalabel=TruncName;
   lineparm x=0 y=0 slope=1; 
   yaxis grid offsetmin=0.05;
   xaxis grid;
run;

The scatter plot shows the number of comments and responses for 50 people. Those who have commented more than 30 times are labeled, and a line is drawn with unit slope. The line, which was drawn by using the LINEPARM statement, enables you to see who has initiated many comments and who has posted many responses. For example, Michelle and Tricia (lower right) often comment on blogs, but few of their comments are in response to others. In contrast, Sanjay and Robert are regular SAS bloggers and most of their remarks are responses to other people's comments. The big outlier is Chris, who has initiated almost 100 original comments while also posting more than 500 responses to comments on his popular blog.

The scatter plot is an excellent visualization of the data... provided that your goal is to highlight the people who post the most comments and classify them into "commenter" or "responder." The plot enables you to identify about a dozen of the 50 people in the data set. But what about the other nameless markers near the origin? You can see that there are many people who have posted between 10 and 30 comments, but the current plot makes it difficult to find out who they are. To visualize those observations (while not losing information about Chris and Michelle) requires some sort of transformation that distributes the data more uniformly within the plot.

Logarithmic transformations

The standard visualization technique to use in this situation is the logarithmic transformation of data. When you have data whose range spans several orders of magnitude, you should consider whether a log transform will enhance the visualization. A log transformation preserves the order of the observations while making outliers less extreme. (It also spreads out values that are in [0,1], but that doesn't apply for these data.) For these data, both the X and the Y variables span two orders of magnitude, so let's try a log transform on both variables.

The XAXIS and YAXIS statements in the SGPLOT procedure support the TYPE=LOG option, which specifies that an axis should use a logarithmic scale. However, if you use that option on these data, the following message is printed to the SAS Log:

NOTE: Log axis cannot support zero or negative values in the data range.
      The axis type will be changed to LINEAR.

The note informs you that some people (for example, Mike) have never responded to a comment. Other people (Bradley, label not shown) have only posted responses, but have never initiated a comment. Because the logarithm of 0 is undefined, the plot refuses to use a logarithmic scale.

If you are willing to leave out these individuals, you can use a WHERE clause to subset the data:

title "Automatic Log Transformation";
title2 "Comment>0 and Response>0";
proc sgplot data=Comments;
   where Comment > 0  &  Response > 0;
   scatter x=Comment y=Response / datalabel=NickName;
   xaxis grid type=log minor offsetmin=0.01;
   yaxis grid type=log minor offsetmin=0.05;
run;
title2;

The graph shows all individuals who have initiated and responded at least once. The log transformation has spread out the data so that it is possible to label all markers by using first names. The tick marks on the axes show counts in the original scale of the data. It is easy to see who has about 10 or about 100 responses. With a little more effort, the minor tick marks enable you to discover who has 3 or 50 responses. I used the OFFSETMIN= option to add a little extra space for the data labels.

The SGPLOT procedure does not support using the LINEPARM statement with logarithmic axes, so there is no diagonal line. However, the grid lines enable you to see at a glance that Michael, Jim, and Shelly initiate comments as often as they respond. Individuals who appear in the lower right grid square are those who initiate more than they respond. (If you really want a diagonal line, use the VECTOR statement.)

Although the log transformation has successfully spread out the data, this graph does not show the seven people who were dropped by the WHERE clause. It is undesirable to omit certain observations just because the log scale is restricted to positive counts.

The log-of-x-plus-one transformation

There is an alternative: Rather than using the automatic log scale that PROC SGPLOT provides, you can write your own data transformation. Within the DATA step, you have complete control over the transformation and you can handle zero counts in any mathematically consistent way. I have previously written about how to use a log transformation on data that contain zero or negative values. The idea is simple: instead of the standard log transformation, use the modified transformation x → log(x+1). In this transformation, the value 0 is transformed into 0. The transformed data will be spread out but will show all observations.

The drawback of the "log-of-x-plus-one" transformation is that it is harder to read the values of the observations from the tick marks on the axes. For example, under the standard log transformation, a transformed value of 1 represents an individual that has 10 comments, since log(10) = 1. Under the transformation x → log(x+1), a transformed value of 1 represents an individual that has 9 comments. You can use the IMAGEMAP option on the ODS GRAPHICS statement to add tooltips to the graph, but of course that won't help someone who is trying to read a printed (static) version of the graph. Nevertheless, let's carry out this nonstandard log transformation:

data LogComments;
set Comments;
label logCommentP1  = "log10(1 + Number of Original Comments)"
      logResponseP1 = "log10(1 + Number of Responses)";
logCommentP1 = log10(1 + Comment);
logResponseP1 = log10(1 + Response);
run;
 
ods graphics / imagemap=ON;
title "Custom Log Transformation";
proc sgplot data=LogComments noautolegend;
   scatter x=logCommentP1 y=logResponseP1 / datalabel=NickName 
                                            tip=(Comment Response Total);
   lineparm x=0 y=0 slope=1; 
   xaxis grid offsetmin=0.01;
   yaxis grid offsetmin=0.05 offsetmax=0.05;
run;

This graph is pretty good: the observations are spread out and all the data are displayed. You can easily show the diagonal reference line by using the LINEPARM statement. A disadvantage of this plot is that it is harder to determine the original counts for the individuals, in part because the tick marks on the axes are displayed on the log scale. Although many analytical professionals will have no problem recalling that the value 2.0 corresponds to a count of 10^2 = 100, the label on the axes might confuse those who are less facile with logarithms. In my next blog post I will show how to customize the tick marks to show counts on the original scale of the data.
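To make the back-conversion concrete: an axis position y on this plot corresponds to 10^y − 1 original counts. A small numerical sketch (Python, for illustration only) of the transform and its inverse:

```python
import numpy as np

def to_log_scale(count):
    """Forward transform used for the plot axes: count -> log10(1 + count)."""
    return np.log10(1 + np.asarray(count, dtype=float))

def from_log_scale(y):
    """Inverse transform: recover the original count from an axis position."""
    return 10.0 ** np.asarray(y, dtype=float) - 1

# Under x -> log10(x+1), an axis position of 1 means 9 counts (not 10),
# and a position of 2.0 means 99 counts (approximately 100).
```

This is exactly why a value of 1 on these axes represents 9 comments rather than 10.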

The point of this article is that the log transformation can help you to visualize data that span several orders of magnitude. However, the log function is properly restricted to positive data, which means that it is more complicated to create and interpret a log transformation on non-positive data.

Do you have suggestions for how to visualize these data? Download the data and let me know what you come up with.


An easy way to expand data by using frequencies

A few years ago I blogged about how to expand a data set by using a frequency variable. The DATA step in the article was simple, but the SAS/IML function was somewhat complicated and used a DO loop to expand the data. (Although a reader later showed how to avoid the DO loop.) Consequently, I am happy that the REPEAT function in SAS/IML 12.3 (which shipped with SAS 9.4) supports expanding data by frequency values. Goodbye, complicated function!

To make the situation clear, here is an example from my earlier blog post:

proc iml;
values={A B C D E};         /* categories */
freq = {2 1 3 0 4};         /* nonnegative frequencies */

The vector values contains five unique categories. The vector freq is the same size and contains the frequency of each category. The goal is to expand the categories by the frequencies to create a new vector that has 10 elements (the sum of the frequencies). The new vector should contain two copies of 'A', one copy of 'B', three copies of 'C', no copies of 'D', and four copies of 'E'. The new syntax for the REPEAT function makes this easy:

y = repeat(values, freq);    /* SAS/IML 12.3 */
print y;

The REPEAT function does not support negative or missing frequencies, but you can use the LOC function to remove those invalid values:

values={A B C D E  F G};
freq = {2 1 3 0 4 -2 .}; 
posIdx = loc(freq>0);         /* select positive frequency values */
y = repeat(values[posIdx], freq[posIdx]);
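The same expand-by-frequency idea exists in other matrix languages. For comparison, here is an equivalent sketch in Python/NumPy (an illustration, not part of the original program): np.repeat likewise requires nonnegative integer frequencies, so the invalid values must be filtered out first.

```python
import numpy as np

values = np.array(["A", "B", "C", "D", "E", "F", "G"])
freq = np.array([2, 1, 3, 0, 4, -2, np.nan])

ok = freq > 0                  # NaN > 0 is False, so missing values are dropped too
y = np.repeat(values[ok], freq[ok].astype(int))
# y contains A A B C C C E E E E
```

Just as with the LOC trick in SAS/IML, the boolean mask removes the negative and missing frequencies before the expansion.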

Pairwise comparisons of a data vector

A SAS customer showed me a SAS/IML program that he had obtained from a book. The program was taking a long time to run on his data, which was somewhat large. He was wondering if I could identify any inefficiencies in the program.

The first thing I did was to "profile" the program to identify trouble spots. (See Ch. 15, "Timing Computations and the Performance of Algorithms," in Statistical Programming with SAS/IML Software.) I discovered one function that was taking about 80% of the total time. The inefficient function (if I may oversimplify the problem) essentially took a vector x with N elements and returned an N x N indicator matrix, A, whose (i, j)th element is 1 if x[i] ≥ x[j] and is 0 otherwise. A small example of the x vector is used in the following statements. The inefficient code looked something like this:

proc iml;
x = {16, 15, 6, 6, 20, 5, 6, 12, 5, 12};  /* data for pairwise comparison */
N = nrow(x);
 
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] then 
         A[i,j] = 1;       /* A[i,j] = 1 if x[i] >= x[j]; 0 otherwise */
   end;
end;

The 10 x 10 indicator matrix for this example is shown. When N is large, this double DO loop approach can be slow. I have written about similar computations before. For example, I've written about how to compute pairwise differences for a vector of values, which is almost the same problem. The key is to eliminate nested DO loops and replace them with vector or matrix operations, a process known as vectorization.

Pairwise comparisons: An exercise in vectorization

To implement a vector-based computation, think about the jth column of A. What is the jth column? It is the indicator vector obtained by comparing x to the jth value of x. The following statements rewrite the computation by using a single DO loop that iterates over each column of the resulting matrix:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]);
end;

Equivalently, you could loop over rows of the indicator matrix by using B[i,] = (x` <= x[i]) (but notice the direction of the inequality!). Either method produces a big speedup in the computation.

A matrix approach is less intuitive, but can be more efficient unless the matrices are huge. The idea is to form a matrix with N columns, each column being a copy of the data vector x. If you compare the elements of this matrix to the elements of its transpose, you obtain the desired indicator matrix:

xx = repeat(x,  1, N);    /* the i_th row is x[i]     */
C = (xx >= xx`);          /* indicator matrix         */

The matrices A, B, and C are equal.
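The vectorization lesson is not specific to SAS/IML. Here is the same computation sketched in Python/NumPy (for illustration only), where broadcasting plays the role of the REPEAT trick:

```python
import numpy as np

x = np.array([16, 15, 6, 6, 20, 5, 6, 12, 5, 12])
N = len(x)

# double-loop version, analogous to the nested DO loops
A = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(N):
        if x[i] >= x[j]:
            A[i, j] = 1

# vectorized version: broadcasting compares every pair at once
C = (x[:, None] >= x[None, :]).astype(int)   # C[i,j] = 1 iff x[i] >= x[j]
```

As in the IML program, the loop-free version produces a matrix identical to the loop version, but without the N² scalar comparisons in interpreted code.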

A slight twist

In the previous section, I oversimplified the problem. The real problem evaluated several complicated logical expressions (by using IF-THEN statements) within the doubly-nested DO loops. The result of those logical expressions affected the values of the indicator matrix. To simplify matters, I'll illustrate the situation by using a binary vector and a simple logical AND operator. Assume that the IF-THEN statements resolve to either 1 or 0 (true or false) for each value of the column index, j, as follows:

s = {0, 1, 1, 0, 0, 1, 0, 1, 1, 0};   /* result of logical expressions, j=1 to N */
 
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] & s[j]=1 then   /* more complicated IF-THEN logic: */
         A[i,j] = 1;                  /* A[i,j] = 1 if x[i] >= x[j] AND s[j]=1 */
   end;
end;

Although this new situation is more complicated, the lessons of the previous section remain relevant. Here's how you can compute the indicator matrix by using vector operations:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]) & (s[j]=1);
end;

Here are the equivalent matrix operations:

xx = repeat(x, 1, N);
ss = repeat(s, 1, N);
C = (xx >= xx`) & (ss`=1);

If the logic within the doubly-nested loops depends on both rows and columns (i and j), then the variable s becomes a matrix and the computations are modified accordingly.
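Continuing the NumPy analogy (again, an illustrative sketch rather than part of the original program), the extra column-wise condition becomes one more broadcast term in the elementwise expression:

```python
import numpy as np

x = np.array([16, 15, 6, 6, 20, 5, 6, 12, 5, 12])
s = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 0])   # result of the logical expressions, by column j

# C[i,j] = 1 if x[i] >= x[j] AND s[j] = 1
C = ((x[:, None] >= x[None, :]) & (s[None, :] == 1)).astype(int)
```

Because s enters only through the column index, every column j with s[j] = 0 is identically zero, matching the behavior of the IF-THEN logic in the double loop.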

In summary, many SAS/IML programmers know that looping over rows and columns of a matrix is inefficient and that vectorizing the computations leads to performance improvements. This article applies the same principles to pairwise comparisons of data in a vector. By avoiding scalar comparisons, you can improve the performance of the computation.
