Create a map with PROC SGPLOT


Did you know that you can use the POLYGON statement in PROC SGPLOT to draw a map? The graph at the left shows the 48 contiguous states of the US, overlaid with markers that indicate the locations of major cities. The plot was created by using the POLYGON statement, which was introduced in SAS 9.4m1.

The POLYGON statement is a versatile and powerful statement. My colleague Sanjay Matange has shown how to use the POLYGON statement to create custom plots such as a variable-width bar chart. He has also shown how to use the POLYGONPLOT statement in the Graph Template Language (GTL) to create panels of maps, and he mentioned that the POLYGON statement in PROC SGPLOT acts similarly.

Identifying polygons

You need to look carefully at Sanjay's program to see how he constructs an ID variable that identifies each polygon. For map data, it is common for regions to be defined in a nested hierarchical manner. For example, the MAPS.US data set that is distributed with SAS/GRAPH software has two variables (STATE and SEGMENT) that together define the polygons that make up the map of US states. (For example, Michigan has two segments that correspond to the Lower Peninsula and the Upper Peninsula.) Sanjay uses a standard trick that concatenates multiple variable values into a single ID variable, and uses the concatenated values to identify each polygon.

Because the MAPS.US data set is sorted by the STATE and SEGMENT variables, you can generate an ID variable without using concatenation. Instead, you can use BY-group processing to generate an ID variable for each polygonal segment. The following DATA step defines the PolyID variable and increments its value every time a new segment is encountered. The PolyID variable is specified in the ID= option in the POLYGON statement in order to create a graph of the regions in the data set:

data US;
   set maps.US;
   where StateCode NOT IN ("AK", "HI", "PR");
   by State Segment;
   if first.Segment then PolyID+1;    /* create ID variable for polygons */
run;

ods graphics / antialiasmax=1500 labelmax=1500; /* enable anti-aliasing and labels */
title "Outline of 48 US States";
proc sgplot data=US;
   polygon x=x y=y ID=PolyID / fill outline;
run;

Notice that the MAPS.US data set uses projected radians as the coordinates for the state boundaries. You can multiply these values by 180/π to convert to degrees. You can use the GPROJECT procedure in SAS/GRAPH software to project data sets that provide longitude and latitude variables.
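For example, the following DATA step sketches the conversion from radians to degrees. (The output data set name, USDegrees, is arbitrary; the X and Y variables are the projected coordinates in MAPS.US.)

```sas
/* Sketch: convert the projected radians in MAPS.US to degrees.
   The output data set name (USDegrees) is arbitrary. */
data USDegrees;
   set maps.US;
   x = x * 180 / constant('PI');   /* radians --> degrees */
   y = y * 180 / constant('PI');
run;
```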

Overlays on polygon plots

You can overlay data on a polygon plot. In particular, you can use the BUBBLE, TEXT, SCATTER, and VECTOR statements, as well as additional POLYGON statements to overlay additional spatial information on a map. The following DATA step reads the coordinates of a few large cities from the MAPS.USCITY data set and concatenates that information with the state-boundary data. The SCATTER statement overlays the cities and labels them.

data All; 
   set maps.uscity(where=(statecode NOT IN ("AK" "HI" "PR") & pop > 5e5)
                   rename=(x=cityX y=cityY))
       US;                       /* concatenate with the state-boundary data */
run;

title "US States with Some Large Cities";
proc sgplot data=All noautolegend;
   polygon x=x y=y ID=PolyID / fill outline transparency=0.75 group=StateCode;
   scatter x=cityX y=cityY / datalabel=city;
   xaxis display=none;
   yaxis display=none;
run;

The graph is shown at the top of this article. Notice that the GROUP= option in the POLYGON statement is used to color each state.

With additional work, you can use the GROUP= option to create a choropleth map. You can use the DATTRMAP= option on the PROC SGPLOT statement to specify colors for the various group values. Alternatively, you can use the STYLEATTRS statement to assign colors for groups.
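The following sketch shows the idea behind the DATTRMAP= approach. The attribute-map data set, the ATTRID value, and the color assignments are hypothetical examples; the Value column must contain values of the GROUP= variable (StateCode):

```sas
/* Sketch of a discrete attribute map. The data set name, ATTRID value,
   and color assignments are hypothetical, not real data. */
data AttrMap;
   length ID $10 Value $2 FillColor $10;
   input ID $ Value $ FillColor $;
datalines;
StateColor NC lightblue
StateColor VA lightgreen
;

proc sgplot data=All dattrmap=AttrMap noautolegend;
   polygon x=x y=y ID=PolyID / fill outline group=StateCode attrid=StateColor;
run;
```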

The SGPLOT procedure does not replicate the functionality of SAS/GRAPH software. The mapping procedures in SAS/GRAPH software (which include GREDUCE, GPROJECT, GMAP, and MAPIMPORT) enable you to manipulate and visualize map data in more complex ways. However, the POLYGON statement in PROC SGPLOT enables you to create some simple maps that visualize countries, states, and counties, and to overlay graphical elements by using other SGPLOT statements.


Label markers in graphs by using the values of several variables

In many procedures, the ID statement is used to identify observations by specifying an identifying variable, such as a name or a patient ID. In many regression procedures, you can specify multiple ID variables, and all variables are copied into output data sets that contain observation-wise statistics such as predicted values and residuals.

Values of an ID variable can also be used to label markers on a graph. For example, the following call to PROC REG models the weights of 19 students as a linear function of their heights. The PLOTS=COOKSD(LABEL) option requests a plot of Cook's D statistic, which gives information about observations that appear to have a large influence on the parameter estimates. The LABEL option requests that extreme values of the Cook's D statistic be labeled by using the ID variable. Let's see what happens if you specify two variables on the ID statement:

proc reg data=sashelp.class plots(only)=cooksd(label);
model weight = height;
id name age;             /* specify TWO ID variables */
ods select CooksDPlot;
run; quit;

I've added a red rectangle that surrounds the labeled marker. Although two variables (NAME and AGE) were specified in the ID statement, the graph clearly shows that only the first variable (NAME) was used as a label. The documentation for the PLOTS= option tells us that this is the expected behavior: "If you specify one or more ID variables..., then the first ID variable [that] you specify is used for the labeling."

Most procedures exhibit a similar behavior, although a few (LOGISTIC, TRANSREG, ...) provide tooltips or other support for multiple ID variables.

For the procedures that support labeling by only one variable, there is an easy workaround: create a new ID variable by concatenating several other variables.

Concatenating multiple ID variables

You can use the concatenation operator (||) in the DATA step to concatenate two variables together. The following statements concatenate values of the NAME and AGE variables. The TRIM function can be used to remove trailing blanks from the NAME variable. In many cases, the DATA step will automatically convert the numeric AGE variable to a character value, but the following statements use the PUT function to explicitly convert a numeric variable to a character value prior to concatenation:

/* concatenate NAME and AGE to make new ID variable */
data class;
set sashelp.class;
length labl $12;    /* NAME holds 8 characters. Allow for delimiter and age */
labl = trim(Name) || ": " || put(age, 2.);  /* convert two-digit age to character */
run;

proc reg data=class plots(only)=cooksd(label);
model weight = height;
id labl;               /* label by using new variable */
ods select CooksDPlot;
run; quit;

With this change, each extreme point in the graph is labeled with the values of the NAME and AGE variables.

Labeling markers in ODS graphics

You can use the same trick to label a marker in a graph created by PROC SGPLOT. The canonical example is a scatter plot in which you want to label all observations. Because the DATALABEL= option supports only a single variable, you have to use concatenation if you want to see the values of multiple variables:

proc sgplot data=class;
scatter x=height y=weight / datalabel = labl;
run;

A drawback of the "concatenation technique" is that most graphs do not have room to display long labels for many points. The DATALABEL= option uses a collision-avoidance algorithm to avoid overplotting labels, but if you have hundreds or thousands of markers, it is best to use another mechanism for labeling markers.

One option to display multiple labels is to label only certain observations in the graph, such as observations that have extreme values. Another option, which works if you are going to examine the graph on your computer, is to use the TIP= option to create a tooltip for the scatter plot. With a tooltip, you can hover a pointer over a marker and view a pop-up window that displays the values of multiple variables.
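A minimal sketch of the tooltip approach follows. Tooltips require HTML output with the IMAGEMAP option enabled:

```sas
/* Sketch: hover over a marker to see the NAME and AGE values.
   Tooltips require HTML output and the IMAGEMAP option. */
ods graphics / imagemap=on;
proc sgplot data=sashelp.class;
   scatter x=height y=weight / tip=(name age);
run;
```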


Guessing games, ensemble averages, and the wisdom of the crowd


How much does this big pumpkin weigh? One of the cafeterias at SAS invited patrons to post their guesses on an internal social network. There was no prize for the correct guess; it was just a fun Halloween-week activity.

I recognized this as an opportunity to apply the "wisdom of the crowd" to create a potentially winning guess. Because the guesses were being posted to an online community, I could see all the guesses. My hypothesis was that the average of other people's guesses would be an accurate prediction for the pumpkin's weight. I decided to wait five days before submitting my prediction.

The following SAS data set shows the complete set of guesses for the pumpkin's weight. The first line after the DATALINES statement shows the 10 guesses that were submitted during the first five days. The second line shows my guess, which was 36 pounds. The third line shows subsequent guesses. The call to PROC MEANS shows the analysis of the first 10 guesses:

data guess;
N = _N_;
input Weight @@;
sum + Weight;
label Mean = "Running Mean"
      N = "Patron ID";
if N > 2 then Mean = sum / N;
datalines;
37.5 35 47.5 14.5 25.5 36.6 17.2 51 32.4 37.5
39.7 42 43.6 46 35.2 36.75 28.7 42.4 40.2 48
;

proc means data=guess(obs=10) N mean median min max;
var Weight;
run;

The output from PROC MEANS shows that the mean value of the first 10 guesses was 33.47. Because the range of guesses included values that I considered to be extreme, I decided instead to use the median value (35.8), which I rounded to 36 pounds. That was my guess.

Over the next few days 10 additional guesses were posted. Finally, the pumpkin's weight was revealed: it was 35 pounds. My prediction, which was based on the "wisdom" of 10 previous guesses, was within 3% of the true value. However, I did not win the contest because in fact the true weight of the pumpkin was guessed by the second person to enter the contest!

The following graph shows the sequence of guesses. My guess is marked with a star. The true weight of the pumpkin is indicated by a horizontal line. The running mean is shown as a jagged line. You can see that the running mean is very close to the true value after only a few guesses. The running mean was closer to the true value than most of the guesses.

data Augment;
set guess;
if N = 11 then Rick = 36; else Rick = .;
label Rick = "Rick's Guess";
run;

proc sgplot data=Augment;
scatter x=N y=Weight / legendlabel="Guess";
scatter x=N y=Rick / markerattrs=(symbol=starfilled size=12 color=red);
series x=N y=Mean;
xaxis grid; yaxis grid;
refline 35 / axis=y lineattrs=(color=blue);
run;

Using the "wisdom of the crowd" to estimate a weight has a long history that pre-dates this pumpkin-guessing contest. In 1906 Francis Galton, who introduced many important concepts in statistics, was at a county fair when he noticed a contest that encouraged fairgoers to guess the weight of an ox. After the contest ended, he obtained the paper guesses for almost 800 villagers. He computed the median guess and noticed that it was within 1% of the true weight. He called this phenomenon vox populi or "the voice of the people."

Statistically speaking, there is a difference between Galton's procedure and my own. In the pumpkin-guessing contest, everyone could see the other guesses, which means that the guesses were not independent. Because others had guessed in the range 15–50 pounds, it was unlikely that someone new would guess an extreme value such as 10 or 100 pounds, which introduced bias. Furthermore, because only the first person to correctly guess the weight would be the winner, participants presumably modified their guesses and chose a weight that had not yet been submitted, thus preventing duplicate guesses.

Nevertheless, this simple experiment shows that combining multiple predictions can often result in a prediction that is superior to the original predictions. In statistics this is called an ensemble model or a consensus model. Ensemble models are used to predict the path and intensity of hurricanes and even to predict which movies you will enjoy watching on Netflix.

Taking a mean or median is a very simple kind of ensemble model. The GLMSELECT procedure in SAS/STAT software enables you to perform model averaging as part of constructing a parsimonious linear regression model. SAS Enterprise Miner contains several sophisticated predictive algorithms that use ensemble models; see Maldonado, et al. (2014), "Leveraging Ensemble Models in SAS Enterprise Miner."


Point/Counterpoint: Symbolic versus mnemonic logical operators in SAS


In SAS, the DATA step and PROC SQL support mnemonic logical operators. The Boolean operators AND, OR, and NOT are used for evaluating logical expressions. The comparison operators are EQ (equal), NE (not equal), GT (greater than), LT (less than), GE (greater than or equal), and LE (less than or equal). These character-based operators are called mnemonic because their names make it easy to remember what the operator does.


Each mnemonic operator in SAS has an equivalent symbolic operator. The Boolean operators are & (AND), | (OR), and ^ (NOT). The comparison operators are = (EQ), ^= (NE), > (GT), < (LT), >= (GE), and <= (LE). The symbol for the NOT and NE operators can vary according to the computer that you use, and the tilde character (~) can be used in place of the caret (^).
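The following trivial DATA step demonstrates that the two forms of an expression evaluate to the same Boolean value:

```sas
/* The mnemonic and symbolic forms are interchangeable */
data _null_;
   x = 5;  y = 10;
   b1 = (x lt y) and not (x eq y);   /* mnemonic operators   */
   b2 = (x <  y)  &  ^(x = y);       /* symbolic equivalents */
   put b1= b2=;                      /* both are 1 (true)    */
run;
```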

Mnemonic operators tend to appear in older languages like FORTRAN, whereas symbolic operators are common in more recent languages like C/C++, although some relatively recent scripting languages like Perl, PHP, and Windows PowerShell also support mnemonic operators. SAS software has supported both kinds of operators in the DATA step since the very earliest days, but the SAS/IML language, which is more mathematically oriented, supports only the symbolic operators.

Functionally, the operators in SAS are equivalent, so which ones you use is largely a matter of personal preference. Because consistency and standards are essential when writing computer programs, which operators should you choose?

The following sections present arguments for using each type of operator. The argument for using the mnemonic operators is summarized by Mnemonic Norman. The argument for using symbols is summarized by Symbolic Sybil. Finally, there is a rejoinder by Practical Priya. Thanks to participants on the SAS-L discussion forum and several colleagues at SAS for sharing their thoughts on this matter. Hopefully Norman, Sybil, and Priya represent your views fairly and faithfully.

Use the mnemonic operators

Hi, I'm Mnemonic Norman, and I've been programming in SAS for more than 30 years. I write a lot of DATA step, SQL, and macro code. I exclusively use the mnemonic operators for the following reasons:

  1. Easy to type. I can touch-type the main alphabet, but I've never mastered typing symbols without looking down at my fingers. In addition, exotic symbols like | (OR) are not usually located in an easy-to-reach location on my keyboard. By using the mnemonic operators, I can avoid hitting the SHIFT key and can write programs faster.
  2. Easy to read. Even complex comparisons are easy to read because they form a sentence in English:
    if x gt 0 AND sex eq "MALE" then ...
  3. Easy to remember. There is a reason why these are called mnemonic operators! I program in several different languages, and each one uses a different NE operator. In SAS it is ^=. In Lua the NE operator is ~=, in Java it is !=, and the ANSI standard for SQL is <>. I use NE so I don't have to remember the correct symbol.
  4. Easy to communicate. My boss and clients are not statisticians. They can understand the mnemonic operators better than abstract symbols.
  5. Easy to see. I don't want to emphasize my age, but statistics show that most people's eyesight begins to diminish after age 40. I find the symbols | and ^ particularly difficult to see.
  6. Easy to distinguish assignment from comparison. I like to distinguish between assignment and logical comparison with equality, but SAS uses the = symbol for both. Therefore I use the equal sign for assignment and use EQ for logical comparison. For example, in the statement
    b = x EQ y;
    it is easy to see that b is a variable that holds a Boolean expression. The equivalent statement
    b = x = y;
    looks strange. (Furthermore, in the C language, this expression assigns the value of y to both b and x.)
  7. Easy to use macro variables. I reserve the ampersand for macro variables. If I see an expression like x&n, I immediately assume that the expression resolves to a name like x1 or x17. To avoid confusion with macro variables, I type x AND n when that is what I intend.
  8. Easy to cut and paste. Because the less-than and greater-than symbols are used to delimit tags in markup languages such as HTML and XML, they can disappear when used in Web pages. In fact, I dare you to try to post this comment to Rick's blog: "I use the expression 0 < x and y > 1." This is what you'll get: "I use the expression 0 1."

Use the symbolic operators

Hi, I'm Symbolic Sybil, and I've been programming in SAS for a few years. In school I studied math, statistics, and computer science. In addition to SAS, I program in C/C++ and R. I use symbolic operators exclusively, and here are reasons why:

  1. Consistent with mathematics. When a text book or journal presents an algorithm, the algorithm uses mathematical symbols. If you study Boolean logic, you use symbols. Symbols are a compact mechanism for representing complex logical conditions. Programs that implement mathematical ideas should use mathematical notation.
  2. Consistent with other modern languages. I don't use FORTRAN or SQL. I might write a DATA step to prepare data, then jump into PROC IML to write an analysis. Sometimes I call a package in R or a library in C++. I use symbols because all the languages that I use support them.
  3. Distinguish variables from operators. Symbols are not valid variable names, so it is easy to see which tokens are operators and which are variables. Although Norman claims that symbols are hard to see, I argue that they stand out! If a data set has variables named EQ and LT, the expression EQ > LT is more readable than the equivalent expression EQ GT LT.
  4. Enforce coding discipline. Some of Norman's arguments are the result of lazy programming habits. The only reason he can't remember symbols is because he doesn't use them regularly. If you put spaces around your operators, you will never confuse x&n and x & n. As to remembering which operators are supported by which programming language, that is an occupational hazard. We are highly paid professionals, so learn to live with it. I don't think the solution is to use even more operators!
  5. Easy to communicate. I disagree with Norman's claim that his non-statistical boss and clients will understand character-based operators easier. How patronizing! Did they drop out of school in the third grade? Furthermore, in the modern world, we need to be inclusive and respectful of different cultures. The character-based operators are Anglocentric and might not be easy to remember if your client is not a native English speaker. In Spanish, "greater than" is "mayor que" and "equal" is "igual". In contrast, mathematical symbols are universal.

Use them both, but be consistent

Hi, I'm Practical Priya. There is no need to start a flame war or to make this an either/or debate. As a famous computer scientist wrote, "the nice thing about standards is that you have so many to choose from."

I used symbols exclusively until I consulted on a project where the client insisted that we use mnemonic operators. Eventually I gained an appreciation for mnemonic operators. I think they are easier to see and are more readable for experts and non-experts alike.

Today I use a combination of symbols and mnemonic operators. Like Norman, I find the logical operators AND, OR, and NOT easier to type and to read than the symbols &, |, and ^. For the relational (comparison) operators, I always use <, >, and =. I learned these symbols in school and they are universally understood.

I argue that a hybrid approach is best: In the DATA step I use mnemonic Boolean operators but use symbols for comparison operators. This presents a clear visual separation between clauses that FORM Boolean expressions and clauses that OPERATE ON logical expressions, like this:
if x = 5 AND missing(y) OR y < z then ...

However, I'm embarrassed to admit that I do not consistently use symbols for the comparison operators. I also use NE, which is inconsistent with my scheme but is more readable than ^=. If my keyboard had a "not equals" symbol (≠), I'd use it, but until then I'm sticking with NE.

Your turn: Which logical operators do you use and why?

Norman, Sybil, and Priya have made some good points. Who do you agree with? What rules do you follow so that your SAS programs use logical operators in a readable, consistent manner? Leave a comment, but as Norman said, be careful typing < and >. You might want to use the HTML tags &lt; and &gt;.


Trap and cap: Avoid division-by-zero and domain errors when evaluating functions

Statistical programmers often need to evaluate complicated expressions that contain square roots, logarithms, and other functions whose domain is restricted. Similarly, you might need to evaluate a rational expression in which the denominator of the expression can be zero. In these cases, it is important to avoid evaluating a function at any point that is outside the function's domain.

This article shows a technique that I call "trap and cap." The "trap" part of the technique requires using logical conditions to prevent your software from evaluating an expression that is outside the domain of the function. The "cap" part is useful for visualizing functions that have singularities or whose range spans many orders of magnitude.

This technique is applicable to any computer programming language, and it goes by many names. It is one tool in the larger toolbox of defensive programming techniques.

This article uses a SAS/IML function to illustrate the technique, but the same principles apply to the SAS DATA step and PROC FCMP. For an example that uses FCMP, see my article about defining a log transformation in PROC FCMP.

ERROR: Invalid argument to function

The following SAS/IML function involves a logarithm and a rational expression. If you naively try to evaluate the function on the interval [-2, 3], an error occurs:

proc iml;
/* Original function module: No checks on domain of function */
start Func(x);
   numer = exp(-2*x) # (4*x-3)##3 # log(3*x);       /* not defined for x <= 0 */
   denom = x-1;                                     /* zero when x=1 */
   return( numer / denom );
finish;

x = do(-2, 3, 0.01);
f = Func(x);
ERROR: (execution) Invalid argument to function.

 count     : number of occurrences is 200
 operation : LOG at line 881 column 40

The error says that there were 200 invalid arguments to the LOG function. The program stops running when it encounters the error, so it doesn't encounter the second problem, which is that the expression numer / denom is undefined when denom is zero.

Experienced SAS programmers might know that the DATA step automatically plugs in missing values and issues warnings for the LOG function. However, I don't like warnings any more than I like errors: I want my program to run cleanly!

Trap bad input values

Because the function contains a logarithm, the function is undefined for x ≤ 0. Because the function is a rational expression and the denominator is zero when x = 1, the function is also undefined at that value.

The goal of the "trap" technique is to prevent an expression from being evaluated at points where it is undefined. Accordingly, the best approach is to compute the arguments of any function that has a restricted domain (log, sqrt, arc-trigonometric functions,...) and then use the LOC function to determine the input values that are within the domain of all the relevant functions. The function is only evaluated at input values that pass all domain tests, as shown in the following example:

proc iml;
/* Trap bad domain values; return missing values when x outside domain */
start Func(x);
   f = j(nrow(x), ncol(x), .);    /* initialize return values to missing */
   /* trap bad domain values */
   d1 = 3*x;                      /* argument to LOG */
   d2 = x-1;                      /* denominator */
   domainIdx = loc( d1>0 & d2^=0 );     /* find values inside domain */
   if ncol(domainIdx)=0 then return(f); /* all inputs invalid; return missing */
   t = x[domainIdx];              /* elements of t are valid */
   numer = exp(-2*t) # (4*t-3)##3 # log(3*t);
   denom = t-1;
   f[domainIdx] = numer / denom;  /* evaluate f(t) */
   return( f );
finish;

x = do(-2, 3, 0.01);
f = Func(x);

This call does not encounter any errors. The function returns missing values for input values that are outside its domain.

In practice, I advise that you avoid testing floating-point numbers for equality with zero. You can use the ABS function to check that quantities are not close to zero. Consequently, a preferable trapping condition follows:

   domainIdx = loc( d1>0 & abs(d2)>1e-14 );  /* find values inside domain */

I also recommend that you use the CONSTANT function to check for domain values that would result in numerical underflows or overflows, which is especially useful if your expression contains an exponential term.
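For example, the following sketch traps arguments that would cause the EXP function to overflow. The CONSTANT('LOGBIG') value is the natural log of the largest double-precision number:

```sas
/* Sketch: trap arguments that would overflow the EXP function */
proc iml;
logbig = constant('LOGBIG');       /* log of largest double, about 709.78 */
x = {-400, 0, 400};
safeIdx = loc( -2*x < logbig );    /* EXP(-2*x) does not overflow for these */
print logbig, (x[safeIdx])[label="safe values of x"];
```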

Cap it! Graphing functions with singularities

You have successfully evaluated the function even though it has two singularities. Let's see what happens if you naively graph the function. The function has a horizontal asymptote at y=0 and vertical asymptotes at x=0 and x=1, so you can use the REFLINE statement in PROC SGPLOT to draw reference lines that aid the visualization. In SAS/IML, you can use the SCATTER subroutine to create a scatter plot, and pass the REFLINE statements by using the OTHER= option, as follows:

refStmt = "refline 0/axis=y; refline 0 1/axis=x;"; /* add asymptotes and axes */
call scatter(x, f) other=refStmt;

The graph is not very useful (look at the vertical scale!) because the magnitude of the function at one point dwarfs the function values at other points. Consequently, you don't obtain a good visualization of the function's shape. This is often the case when visualizing a function that has singularities.

To obtain a better visualization, manually cap the minimum and maximum value of the graph of the function. In the SGPLOT procedure, you can use the MIN= and MAX= options on the YAXIS statement to cap the view range. If you are using the CALL SCATTER routine in SAS/IML, you can add the YAXIS statement to the OTHER= option, as follows:

/* cap the view at +/-5 */
title "Sketch of Function";
title2 "-5 < y < 5";
capStmt = "yaxis min=-5 max=5;";
call scatter(x,f) other=capStmt + refStmt;

This latest graph provides a good overview of the function's shape and behavior. For intervals on which the function is continuous, you can get a nicer graph if you use the SERIES subroutine to connect the points and hide the markers. The following statements re-evaluate the function on the interval (0, 1) and cap the vertical scale by using -1 < y < 1:

title "Zoom into [0, 1]";
title2 "-1 < y < 1";
x = do(0, 1, 0.01);
f = Func(x);
capStmt = "yaxis min=-1 max=1;";
call series(x,f) other=capStmt + refStmt;

The resulting graph shows that the function has two roots in the interval (0, 1) and a local maximum near x=0.43. A visualization like this is often an important step in finding the roots or local extrema of a function.

In summary, the trap-and-cap technique enables you to evaluate functions that have singularities or that have a restricted domain. For SAS/IML functions, you can use the LOC function to keep only the input values that are in the domain of the function. (In a scalar language, such as the SAS DATA step, you can use IF-THEN/ELSE statements instead.) For functions with singularities, you can use the YAXIS statement in PROC SGPLOT to ensure that a graph of the function reveals interesting features and not just extreme values.


Can't find that data? Search all variables in all data sets

Sometimes I can't remember where I put things. If I lose my glasses or garden tools, I am out of luck. But when I can't remember where I put some data, I have SAS to help me find it.

When I can remember the name of the data set, my computer's operating system has tools to help me find the file. However, sometimes I can remember the name of the variable, but not the data set name. Or I might remember that there are some variables that satisfy a statistical condition, but I can't remember their names or the name of the data set they're in.

In situations like these, you might think that I'd be stuck, but this article shows how to search for variables when you can't remember which data set contains them.

Three useful SAS/IML functions for searching through data sets

Last week I had two occasions where I had to conduct a search for variables that I only dimly remembered.

Base SAS programmers use SQL, SAS dictionaries, the DATA step, and other tools to search through libraries of SAS data sets. For example, you can use these tools to write a 'grep macro' to locate variables (in any data set) that contain a specific value.
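For example, one simple Base SAS approach (a sketch, not the full 'grep macro') queries the DICTIONARY.COLUMNS table with PROC SQL:

```sas
/* Sketch: find all SASHELP data sets that contain a variable
   named WEIGHT by querying the DICTIONARY.COLUMNS table */
proc sql;
   select memname, name
   from dictionary.columns
   where libname = 'SASHELP' and upcase(name) = 'WEIGHT';
quit;
```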

In SAS/IML software there are three functions that enable you to search through multiple data sets:

  1. The DATASETS function returns the names of all SAS data sets in a specified libref.
  2. The CONTENTS function returns the names of all variables in a specified data set.
  3. The USE statement supports opening a data set when the name of the data set is stored in a character variable.

I've previously used the DATASETS function and the USE statement to read hundreds of data sets in a loop. I've used the CONTENTS function to find variable names that are common to multiple data sets.

Find a variable that has a specified name

Suppose you can remember the name of a variable, but you cannot remember the name of the data set that the variable is in. No problem! The following SAS/IML function loops through all data sets in the SASHELP library (or any other libref) and finds all data sets that contain the specified variable. To illustrate the technique, this program searches all data sets for a variable named "weight":

proc iml;
lib = "sashelp";          /* where to look? */
ds = datasets(lib);       /* vector contains names of all data sets */
/* Find data sets that contain a certain variable name */
do i = 1 to nrow(ds);
   varnames = upcase( contents(lib, ds[i]) ); /* var names in i_th data set */
   if any(varnames = "WEIGHT") then 
      print (ds[i]);
end;

The loop is easy and intuitive. This example does not use the USE statement to open the data sets because you are not interested in a variable's data, just its name.

Find a pair of strongly correlated variables

The next example finds variables that have an extreme value of some statistic. Specifically, the goal is to find a pair of variables that have a large positive or negative correlation.

You can write a loop that reads all numerical variables in the SASHELP library and computes all pairwise correlations. There are three tricks in the following program:

  1. Some data sets might not have any numeric variables, and Pearson correlations are meaningless for these data sets. The program uses the _ALL_ keyword on the READ statement to read the data. If a data set contains even one numerical variable, it will be read into the X matrix. If not, the matrix X will be a character matrix, and you can use the TYPE function to check whether X is numeric.
  2. You can restrict your attention to correlations in the lower triangular portion of the correlation matrix. The program uses the StrictLowerTriangular function from a previous post about how to extract the lower triangular elements of a matrix.
  3. There are some data sets that have only two nonmissing observations or that have two variables that are perfectly correlated (1 or -1). The following statements ignore these uninteresting cases.
start StrictLowerTriangular(X);
   if ncol(X)=1 then return ( {} ); /* return empty matrix */
   v = cusum( 1 || (ncol(X):2) );
   return( remove(vech(X), v) );
finish;
/* find data sets variables in SASHELP that have the most extreme correlations */
dsNames = j(2, 1, BlankStr(32));  /* remember data set names for min/max */
corrVals = {1, -1};               /* default values for min and max values */
do i = 1 to nrow(ds);
   dsName = lib + "." + strip(ds[i]);
   use (dsName);
   read all var _ALL_ into x;     /* 1. read all numerical vars in i_th data set */
   close (dsName);
   if type(x)='N' then do;      
      w = StrictLowerTriangular( corr(x, "pearson", "pairwise") );  /* 2 */
      if ^IsEmpty(w) then do;     /* check to see if there is a new min or max */
         if (min(w) > -1) & (min(w) < corrVals[1]) then do;  /* 3 */
            corrVals[1] = min(w);
            dsNames[1] = ds[i];
         end;
         if (max(w) < 1) & (max(w) > corrVals[2]) then do;   /* 3 */
            corrVals[2] = max(w);
            dsNames[2] = ds[i];
         end;
      end;
   end;
end;
print dsNames corrVals[f=best20.];

The output shows that the most negatively correlated variables are in the Sashelp.Yr1001 data set. You can reread that data set to find the names of the variables. In a similar way, the Sashelp.Enso data set contains two variables that are almost perfectly positively correlated because they are measuring the same quantity in different units.

You can use this technique to investigate a wide variety of statistics. Which variables have the largest range? The largest standard deviation? You can also write your own "grep function" to locate the character variables that have specified values. The resulting function is simpler and shorter than the "grep macro" in Base SAS.

The CUSUM-LAG trick in SAS/IML

Every year near Halloween I write a trick-and-treat article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick features two of my favorite functions, the CUSUM function and the LAG function. By using these functions, you can compute the rows of a matrix (or indices of a vector) that correspond to the beginning and end of groups.


The picture at left illustrates the problem to solve. You have data that contains a grouping or ID variable, and you want to find the first and last observation that corresponds to each group. In the picture, the ID variable is named Group. The beginnings of the groups are indicated by the green arrows: rows 1, 3, 6, and 8. The ends of the groups are located at the red arrows: rows 2, 5, 7, and 10.

Often it is the case that you already know the sizes of the groups from some previous calculation. If you know the sizes, the CUSUM and LAG functions enable you to directly construct the vectors that contain the row numbers. (Of course, if the group sizes are all equal, you can easily solve this problem by using the DO function.)

Recall that the CUSUM function computes the cumulative sum of the elements in a vector. I've used the CUSUM function in more than a dozen blog posts over the years. The LAG function shifts a vector of values and is useful for any vectorized operations that involve comparing or manipulating adjacent values.

To see the CUSUM-LAG trick in action, assume that you know the sizes of each group. For the example in the picture, the group sizes are 2, 3, 2, and 3. The following function returns the vector of rows that begin and end each group:

proc iml;
/* Return kx2 matrix that contains the first and last elements 
   for k groups that have sizes s[1], s[2],...,s[k]. The i_th row contains
   the first and last index for the i_th group. */
start ByGroupIndices( s );
   size = colvec(s);              /* make sure you have a column vector */
   endIdx = cusum(size);          /* locations of ending index */
   beginIdx = 1 + lag(endIdx);    /* begin after each ending index ... */
   beginIdx[1] = 1;               /*    ...and at 1  */
   return ( beginIdx || endIdx );
finish;
size = {2, 3, 2, 3};              /* sizes for this example */
idx = ByGroupIndices(size);
print idx[c={"Begin" "End"}];

The CUSUM function computes the end of each group by accumulating the group sizes. The LAG function merely returns a vector of the same size, but with a missing value in the first element and all other elements shifted down one position. To that vector, you add 1 because the next group starts immediately after the previous group ends. Lastly, you just replace the missing value in the first position with 1 because the first group starts at the first row.
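The same computation is easy to express in other vector languages. Here is a short Python sketch of the trick (an illustration, not SAS code) that uses itertools.accumulate for the cumulative sum:

```python
from itertools import accumulate

def by_group_indices(sizes):
    """CUSUM-LAG trick: cumulative sums give the ending index of each
    group; lagging those sums and adding 1 gives the beginning index."""
    end = list(accumulate(sizes))            # CUSUM: ends of the groups
    begin = [1] + [e + 1 for e in end[:-1]]  # LAG + 1, with 1 up front
    return list(zip(begin, end))

print(by_group_indices([2, 3, 2, 3]))  # [(1, 2), (3, 5), (6, 7), (8, 10)]
```

The output matches the green and red arrows in the picture: groups begin at rows 1, 3, 6, and 8 and end at rows 2, 5, 7, and 10.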

I've used the CUSUM-LAG trick many times. It is useful when you are simulating data in which the size of the sample is also being simulated. For example, just last week I used it to simulate contingency tables. Several years ago I used this trick to expand data by using frequencies. In general, you can use it as an alternative to the UNIQUEBY statement to perform "BY-group processing" in the SAS/IML language.

Happy Halloween to all my readers. I hope you find this trick to be a real treat!

Monte Carlo simulation for contingency tables in SAS

The FREQ procedure in SAS supports computing exact p-values for many statistical tests. For small and mid-sized problems, the procedure runs very quickly. However, even though PROC FREQ uses efficient methods to avoid unnecessary computations, the computational time required by exact tests might be prohibitively expensive for certain tables. If so, you can use a Monte Carlo approach to randomly generate tables that satisfy the null hypothesis for the test and evaluate the test statistic on those tables.

This article shows how to generate Monte Carlo estimates for exact tests in SAS. For many tests, you can use the EXACT statement in PROC FREQ, which supports the MC option for computing Monte Carlo estimates. You can also use SAS/IML to simulate many random contingency tables, compute the statistic on each table, and thereby approximate the sampling distribution of the test statistic.

An example of an exact chi-square test in SAS


Let's see how an exact test works for a familiar test like the (Pearson) chi-square test for independence of rows and columns.

The adjacent 3 x 3 table (the nine numbers inside the heavy rectangle) appears in Agresti, Wackerly, and Boyett (1979). If you run a chi-square test in PROC FREQ, the value of the chi-square statistic is 14.81 and the p-value is 0.0051. However, most of the cells in this table have small counts. Consequently, the procedure will also issue a warning: "WARNING: 89% of the cells have expected counts less than 5. (Asymptotic) Chi-Square may not be a valid test." This is a classic example that calls for an exact test, which you can compute by using the EXACT statement, as follows:

proc freq data=TestTable;
weight Count;
table Row*Col / norow nocol nopct;
exact pchi;    /* run an exact chi-square test */
run;

The output tells you that the exact p-value is 0.0038. Therefore you would reject (at the 0.05 significance level) the null hypothesis that the rows and columns are independent.
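If you want to check the asymptotic statistic by hand, the Pearson chi-square computation is short enough to sketch in a few lines of Python. This is a plain reimplementation of the textbook formula, not of PROC FREQ:

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a two-way table of counts:
    sum over cells of (observed - expected)^2 / expected, where
    expected[i][j] = rowsum[i] * colsum[j] / N."""
    r = [sum(row) for row in table]          # row sums
    c = [sum(col) for col in zip(*table)]    # column sums
    N = sum(r)                               # total sample size
    return sum((table[i][j] - r[i]*c[j]/N)**2 / (r[i]*c[j]/N)
               for i in range(len(r)) for j in range(len(c)))

T = [[10, 1, 6],
     [ 3, 5, 0],
     [ 5, 0, 1]]
print(round(chi_square_stat(T), 2))   # 14.81
```

The value agrees with the statistic that PROC FREQ reports for this table.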

Monte Carlo estimates of p-values by using PROC FREQ

The exact test finishes almost instantly because the table is small, both in terms of sample size (N=31) and in terms of dimensions (3 x 3). Monte Carlo sampling (simulation) enables you to handle bigger samples and dimensions. It is easy to get Monte Carlo estimates from PROC FREQ: you merely add the MC option in the EXACT statement, as follows:

proc freq data=TestTable;
weight Count;
table Row*Col / norow nocol nopct;
exact pchi / MC;     /* request Monte Carlo estimate */
run;

Because I did not provide a random number seed, the Monte Carlo simulation is seeded by the time of day, which means that you will get a different answer each time you run the program. By default, the EXACT statement generates 10,000 random tables that have the same row and column sums as the observed table. It evaluates the chi-square statistic on each table and counts how many statistics are equal to or more extreme than the observed value. For this simulation, 36 of the 10,000 random tables satisfied the condition. The proportion of extreme statistics (0.0036) is the Monte Carlo estimate for the p-value of the test. For more precise estimates, you can use the N= suboption to increase the number of samples for the Monte Carlo simulation.

Run the Monte Carlo simulation yourself

As I wrote last week, you can use the SAS/IML language to implement an algorithm that simulates contingency tables from the null distribution, which assumes no association between the row variable and the column variable. Let's see how you can use that algorithm to implement your own Monte Carlo estimate for the chi-square statistic.

You can download the Monte Carlo program, which has the following parts:
  1. Compute the chi-square statistic on the observed table.
  2. Simulate 10,000 random tables from the null distribution and evaluate the test statistic on each.
  3. Compute the p-value as the proportion of statistics that are at least as extreme as the observed statistic.
  4. Optionally, compute a Wald confidence interval for the p-value.
  5. Optionally, create a histogram of the 10,000 statistics, which visualizes the approximate sampling distribution. Mark the value of the observed statistic.

The first step is to define the observed table and compute the test statistic on that table:

proc iml;
load module=(RandContingency);
T = {10 1 6,
      3 5 0,
      5 0 1};
c = T[+, ];  r = T[ ,+];
/* 1. Compute the chi-square statistic on the observed table. */
E = r*c / c[+];         /* expected under null model of independence */
q0 = sum((T-E)##2 / E); /* observed chi-square statistic */
print q0;

The printed value is 14.81, which agrees with the output from PROC FREQ.

The following module implements the second step. The module simulates tables from the null distribution and computes the chi-square statistic on each. A column vector of statistics is returned.

/* 2. Simulate tables from null distribution; evaluate test statistic on each. */
start MCChiSq(tbl, NRep);
   c = tbl[+, ];
   r = tbl[ ,+];
   E = r*c / c[+];   /* expected under null model of independence */
   q = j(NRep,1);
   do i = 1 to nRep;
      A = RandContingency(c, r); /* sample from null distribution */
      q[i] = sum((A-E)##2 / E);  /* evaluate test statistic */
   end;
   return( q );
finish;
call randseed(54321);
q = MCChiSq(T, 10000);   /* simulate 10,000 tables */

After you have computed the statistics, the rest is easy. The following function computes the usual estimate and confidence interval for a binomial proportion:

/* 3 & 4. Compute the p-value and confidence interval */
/* Let x be binary 0/1 vector. Compute estimate for proportion. Use asymptotic 
   standard error to construct two-sided Wald confidence interval. */
start BinomialCI(x, alpha=0.05);
   p = mean(x);                       /* estimate proportion of 1s */
   se = sqrt(p*(1-p) / (nrow(x)-1));  /* standard error */
   z = quantile("Normal", 1-alpha/2); /* two-sided */
   LowerCL = p - z*se;
   UpperCL = p + z*se;
   return( p || max(LowerCL, 0) || min(UpperCL, 1) );   /* CL for proportion is in [0,1] */
finish;
x = (q>=q0);                    /* binary vector */
est = BinomialCI(x, 0.01);      /* compute 99% CL to match PROC FREQ */
print est[L="Binomial 99% CI" c={"Est" "LowerCL" "UpperCL"}];

For this simulation, 40 of the 10,000 random tables had a test statistic that was at least as extreme as the observed statistic. The estimate (0.004) and the 99% confidence limits are consistent with the Monte Carlo estimates from PROC FREQ.
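The Wald interval itself is standard, so the IML module is easy to cross-check. Here is a Python sketch of the same formula that uses the standard library's NormalDist; the 40-out-of-10,000 input vector mirrors the simulation above:

```python
from math import sqrt
from statistics import NormalDist

def wald_binomial_ci(x, alpha=0.05):
    """Estimate the proportion of 1s in a 0/1 list and return a
    two-sided Wald confidence interval, clipped to [0, 1]."""
    p = sum(x) / len(x)
    se = sqrt(p * (1 - p) / (len(x) - 1))   # asymptotic standard error
    z = NormalDist().inv_cdf(1 - alpha / 2) # two-sided critical value
    return p, max(p - z * se, 0.0), min(p + z * se, 1.0)

# 40 "extreme" statistics out of 10,000 simulated tables
x = [1] * 40 + [0] * 9960
p, lo, hi = wald_binomial_ci(x, alpha=0.01)   # 99% CI, as in the text
print(p)   # 0.004
```

The point estimate is 0.004, and the 99% limits bracket it, just as the IML output does.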

Lastly, you might want to use your Monte Carlo computations to visualize the approximate sampling distribution for the statistic. You can create a histogram of the 10,000 statistics and mark the location of the observed test statistic by using a reference line:

/* 5. Optionally, create a histogram and mark the observed statistic. */
reflineStmt = "refline " + char(q0) + " / axis=x;";
call Histogram(q) other=reflineStmt;

The last two steps demonstrate techniques that are useful in other situations. For example, you can use them to visualize a bootstrap distribution and to estimate p-values for permutation tests and other resampling methods.

Exact tests in PROC FREQ: What, when, and how

Did you know that the FREQ procedure in SAS can compute exact p-values for more than 20 statistical tests and statistics that are associated with contingency tables? Mamma mia! That's a veritable smorgasbord of options!

Some of the tests are specifically for one-way tables or 2 x 2 tables, but many apply to two-way tables of any dimension. You can specify exact p-values by using the EXACT statement in PROC FREQ.

The "Details" section of the PROC FREQ documentation lists all of the tests and statistics that support exact p-values or exact confidence intervals. With so many tests and options, it can be challenging to determine which tests are available for which kinds of tables. To better understand the choices, I classified the options according to the dimensions of table: one-way tables, 2 x 2 tables, general two-way tables, and stratified 2 x 2 tables. In the following lists, each description mentions the name of the option on the EXACT statement that specifies the computation.

One-way tables

For one-way tables, PROC FREQ provides exact p-values for the following tests:

  • Binomial proportion tests (BINOMIAL). The same option also provides exact (Clopper-Pearson) confidence limits for binomial proportions.
  • Chi-square goodness-of-fit test (PCHI and LRCHI options). You can test a null hypothesis of equal proportions, or you can specify the proportions.

Two-way tables

For two-way tables, PROC FREQ provides exact p-values for the following tests:

  • Chi-square tests, including the Pearson chi-square test (PCHI), likelihood ratio chi-square test (LRCHI), and Mantel-Haenszel chi-square test (MHCHI), or use the CHISQ option to get all three.
  • Fisher’s exact test (FISHER).
  • Jonckheere-Terpstra test (JT), which assumes that a column represents an ordinal response variable.
  • Cochran-Armitage test for trend (TREND).

There are also exact p-values for the following statistics:

  • Kendall’s tau-b and Stuart’s tau-c statistics (TAUB and TAUC).
  • Somers’s D statistics (SMDCR and SMDRC).
  • Pearson and Spearman correlation coefficients (PCORR and SCORR).
  • Simple and weighted kappa coefficients (KAPPA and WTKAPPA).

2 x 2 tables

In addition to the tests for general two-way tables, for 2 x 2 tables PROC FREQ provides exact p-values for the following tests:

  • McNemar’s exact test for matched pairs (MCNEM).
  • Exact confidence limits for the odds ratio (OR).
  • Unconditional exact test for the proportion difference (BARNARD).
  • Exact unconditional confidence limits for the proportion difference (RISKDIFF) and for the relative risk (RELRISK).

Stratified 2 x 2 tables

For stratified 2 x 2 tables, PROC FREQ provides the following:

  • Zelen’s exact test for equal odds ratios (EQOR).
  • An exact test for the common odds ratio, which also provides exact confidence limits for the common odds ratio (COMOR).

When were these tests first available in SAS?

The previous sections contain a lot of information. Because not everyone is running SAS 9.4, I was curious as to when some of these tests were introduced. I was able to compile the following (non-exhaustive) list by trolling through various "What's New" documents:

  • SAS Version 6: Fisher's exact test was introduced.
  • SAS Version 8: The EXACT statement was introduced. It supported exact p-values for binomial proportions in one-way tables, for many chi-square tests, and for Fisher's exact test.
  • SAS 9.1: Exact confidence limits for the common odds ratio and related tests.
  • SAS 9.2: Exact unconditional confidence limits for the proportion (risk) difference and Zelen’s exact test for equal odds ratios.
  • SAS 9.3: Exact unconditional confidence limits for the relative risk.
  • SAS 9.4: The MIDP option in the EXACT statement produces exact mid p-values for several tests.

How does PROC FREQ compute exact p-values?

I've discussed what exact computations are available and when some of them became available, but what about how the computation works?

Because tables are integer-valued, you can theoretically enumerate all tables that satisfy a particular null hypothesis. For example, in a recent article about 2 x 2 tables, I was able to write down the nine possible tables that satisfied a particular constraint and to assign probabilities for each distinct table. Consequently, if you compute a statistic for each table, you can exactly calculate the probability that the statistic is more extreme than a particular observed value. That probability is the exact p-value. (For a more precise definition, see the "Exact Statistics" section of the documentation.)
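For a 2 x 2 table, the enumeration is simple enough to write out in full. The following Python sketch enumerates every table with the given margins, assigns each its hypergeometric probability, and sums the probabilities of the tables that are at least as extreme as the observed one. This illustrates the idea of an exact test; PROC FREQ's actual algorithms are more sophisticated:

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table by direct
    enumeration: sum the hypergeometric probabilities of every table
    with the same margins whose probability does not exceed that of
    the observed table."""
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d        # row sums
    c1 = a + c                   # first column sum
    N = r1 + r2                  # total count
    def prob(x):                 # P(upper-left cell = x), margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(N, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible upper-left values
    p_obs = prob(a)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs + 1e-12)

print(round(fisher_exact_2x2([[3, 1], [1, 3]]), 4))   # 0.4857
```

For this table there are only five feasible values of the upper-left cell, so the whole null distribution fits in one short loop. The p-value 34/70 agrees with Fisher's exact test.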

In practice, PROC FREQ does not directly enumerate all of the possible tables that satisfy the null hypothesis for each test. For some of the exact tests for a general two-way table, it uses a network algorithm of Mehta and Patel (1983), which is more efficient than direct enumeration. The algorithm enables the procedure to compute exact p-values for many small and moderately sized tables.

Simulate contingency tables with fixed row and column sums in SAS

How do you simulate a contingency table that has a specified row and column sum? Last week I showed how to simulate a random 2 x 2 contingency table when the marginal frequencies are specified. This article generalizes to random r x c frequency tables (also called cross-tabulations) that have the same marginal row and column sums as an observed table of counts.

For example, consider the following table:


The column sums are c = {18 6 7} and the row sums are r = {17, 8, 6}. There are many other tables that have the same row sums and column sums. For example, the following three tables satisfy those conditions:


In a previous blog post, I presented a standard algorithm (see Gentle, 2003, p. 206) that can generate an r x c table one column at a time. The first column is a random draw from the multivariate hypergeometric distribution with a total number of elements equal to the observed count for that column. After generating the first column, you subtract the counts from the marginal row counts. You then iteratively apply the same algorithm to the second column, and so forth.

The hypergeometric algorithm requires some special handling for certain degenerate situations. In contrast, a simpler algorithm due to Agresti, Wackerly, and Boyett (1979) is mathematically equivalent but easier to implement. You can download a SAS/IML program that implements both of these algorithms.

Assume the column sums are given by the vector c = {c1, c2, ..., cm} and the row sums are given by r = {r1, r2, ..., rn}. Think about putting balls of m different colors into an urn. You put in c1 balls of the first color, c2 balls of the second color, and so forth. Mix up the balls. Now pull out r1 balls, count how many are of each color, and write those numbers in the first row of a table. Then pull out r2 balls, record the color counts in the second row of the table, and so forth.

The following SAS/IML module implements the algorithm in about 15 lines of code. For your convenience, the code includes the definition of the TabulateLevels function, which counts the balls of every color, even if some colors have zero counts:

proc iml;
/* Output levels and frequencies for categories in x, including all reference levels. */
start TabulateLevels(OutLevels, OutFreq, x, refSet);
   call tabulate(levels, freq, x);        /* compute the observed frequencies */
   OutLevels = union(refSet, levels);     /* union of data and reference set */ 
   idx = loc(element(OutLevels, levels)); /* find the observed categories */
   OutFreq = j(1, ncol(OutLevels), 0);    /* set all frequencies to 0 */
   OutFreq[idx] = freq;                   /* overwrite observed frequencies */
finish;
/* Algorithm from Agresti, Wackerly, and Boyett (1979), Psychometrika */
start RandContingency(_c, _r);
   c = rowvec(_c);   m = ncol(c);
   r = colvec(_r);   n = nrow(r);
   tbl = j(n, m, 0);
   refSet = 1:m;
   vals = repeat(refSet, c);    /* vector replication: c[j] copies of j */
   perm = sample(vals, ,"WOR"); /* permute all n = sum(c) elements */
   e = cusum(r);                /* ending indices */
   s = 1 // (1+e[1:n-1]);       /* starting indices */
   do j = 1 to n;
      v = perm[ s[j]:e[j] ];    /* pull out r1, r2, etc */
      call TabulateLevels(lev, freq, v, refSet);
      tbl[j, ] = freq;          /* put counts in jth row */
   end;
   return( tbl );
finish;

Every time you call the RandContingency function it generates a random table with the specified row and column sums. The following statements define a 3 x 3 table and generate three random tables, which appear at the beginning of this article. Each random table is a draw from the null distribution that assumes independence of rows and columns:

call randseed(12345);
Table = {10 1 6,
          3 5 0,
          5 0 1};
c = Table[+, ];
r = Table[ ,+];
do i = 1 to 3;
   RandTbl = RandContingency(c, r);
   print RandTbl;
end;

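If you want to sanity-check the urn scheme outside of SAS, here is a Python sketch of the same Agresti-Wackerly-Boyett idea. It is an independent reimplementation for illustration, not a translation of the IML module:

```python
import random

def rand_contingency(col_sums, row_sums, rng=random):
    """Urn scheme: put col_sums[j] balls of color j in an urn, shuffle,
    deal row_sums[i] balls into row i, and count the colors per row."""
    balls = [j for j, cnt in enumerate(col_sums) for _ in range(cnt)]
    rng.shuffle(balls)                     # random permutation of the urn
    table, pos = [], 0
    for r in row_sums:
        chunk = balls[pos:pos + r]         # deal the next r balls
        table.append([chunk.count(j) for j in range(len(col_sums))])
        pos += r
    return table

tbl = rand_contingency([18, 6, 7], [17, 8, 6], rng=random.Random(12345))
```

Every table returned by the function has exactly the requested row and column sums, which is easy to verify by summing the rows and columns of the output.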
The example is small, but the algorithm can handle larger tables, both in terms of the dimension of the table and in terms of the sample size. For example, the following example defines row and column sums for a 500 x 20 table. The sample size for this table is 30,000.

/* try a big 500 x 20 table */
r = j(500,  1,   60);
c = j(  1, 20, 1500);
tbl = RandContingency(c, r);
ods graphics / width=600px height=2000px;
call heatmapcont(tbl);

The table is too big to print in this article, but you can create a heat map that shows the distribution of values. The adjacent image shows the first 40 rows of the heat map. The cells in the table vary between 0 (white) and 11 (dark blue), with about half of the cells having values 3 or 4. For tables of this size, you can create thousands of tables in a few seconds. For smaller tables, you can simulate ten thousand tables in less than a second.

In a future blog post I will show how to use the RandContingency function to simulate random tables and carry out Monte Carlo estimation of exact p-values.
