Size matters: Preserving the aspect ratio of the data in ODS graphics

When creating a statistical graphic such as a line plot or a scatter plot, it is sometimes important to preserve the aspect ratio of the data. For example, if the ranges of the X and Y variables are equal, it can be useful to display the data in a square region. This is important when you want to visualize the distance between points, as in certain multivariate statistics, or to visually compare variances. It is also important if you are plotting polygons and want a square to look like a square.

This article presents two ways to create ODS statistical graphics in SAS in which the scale of the data is preserved by the plot. You can use the ASPECT= option in PROC SGPLOT and the OVERLAYEQUATED layout in the Graph Template Language (GTL).

Data scale versus physical measurements of a graph

Usually the width and height of a graph do not depend on the data. By default, the size of an ODS statistical graphic in SAS is a certain number of pixels on the screen or a certain number of centimeters for graphs written to other ODS destinations (such as PDF or RTF). When you request a scatter plot, the minimum and maximum value of each coordinate is used to determine the range of the axes. However, the physical dimensions of the axes (in pixels or centimeters) depend on the titles, labels, tick marks, legends, margins, font sizes, and many other features.

For example, the following data has two variables. The X and Y variables both have a minimum value of -1 and a maximum value of 1. Therefore the range of each variable is 2. The default graph has a 4:3 ratio of width to height, so when you create a scatter plot, the physical lengths (in pixels) of the X and Y axes are not equal:

data S;                   /* XRange = YRange = [-1, 1] */
input x y @@;
datalines;
-1   -1     -0.75 -0.5   -0.5 -0.75  -0.25 0   0 0.5
0.25 -0.25   0.5   0.75   0.75 0.25   1    1
;
run;

ods graphics / reset;      /* use default width and height */
title "Default Graph: 640 x 480 pixels";
title2 "Aspect Ratio 4:3";
proc sgplot data=S;
   scatter x=x y=y;
   xaxis grid; 
   yaxis grid;
run;
You can click on the graph to see the original size. The graph area occupies 640 x 480 pixels. However, because of labels and titles and such, the region that contains the data (also called the wall area) is about 555 pixels wide and 388 pixels tall, which is obviously not square. You can see that each cell in the grid represents a square with side length 0.5, but the cells do not appear square on the screen because of the aspect ratio of the graph.

Setting the aspect ratio

Prior to SAS 9.4, PROC SGPLOT did not provide an easy way to set the aspect ratio of the wall area. You had to use trial and error to adjust the width of the graph until the wall area looked approximately square. For example, you could start the process by submitting ODS GRAPHICS / WIDTH=400px HEIGHT=400px;.

However, in SAS 9.4 you can use the ASPECT= option on the PROC SGPLOT statement to tell PROC SGPLOT to make the wall area (data region) square, as follows:

title "Graph: 640 x 480 pixels";
title2 "Aspect Ratio 1:1";
proc sgplot data=S aspect=1;     /* set physical dimensions of axes equal */
   scatter x=x y=y;
   xaxis grid; 
   yaxis grid;
run;

Although the graph size has not changed, the wall area (which contains the data) is now square. The wall area is approximately 370 pixels in both directions.

Notice that the graph has a lot of white space to the left and right of the wall area. You can adjust the width of the graph to get rid of the extra space.

This technique also works for other aspect ratios. For example, if the range of the Y variable is 2, you can use ASPECT=2 to set the wall area to be twice as high as it is wide.
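As a sketch of the syntax (the output is not shown, and the S data are reused only for illustration), the following call requests a wall area that is twice as tall as it is wide:

```sas
title "Wall Area Twice as Tall as Wide";
proc sgplot data=S aspect=2;   /* wall height : wall width = 2 : 1 */
   scatter x=x y=y;
   xaxis grid;
   yaxis grid;
run;
```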

The wall area is square because the range of the X variable equals the range of the Y variable, and the margins in the wall area (set by using the OFFSETMIN= and OFFSETMAX= options) are also equal. If your X and Y ranges are not exactly equal, read on.

Setting the range of the axes

In practice, the range of the X axis might not exactly equal the range of the Y axis. In that case, you can use the MIN= and MAX= options on the XAXIS and YAXIS statements to set the ranges of each variable to a common range. For example, in principal component analysis, the principal component scores are often plotted on a common scale. The following call to PROC PRINCOMP creates variables PRIN1, PRIN2, and PRIN3 that contain the principal component scores for numerical variables in the Sashelp.Iris data set:

proc princomp data=Sashelp.Iris N=3 out=OutPCA noprint;
   var SepalWidth SepalLength PetalWidth PetalLength;
run;

proc means data=OutPCA N min max mean std;
   var Prin:;
run;

You can see that the ranges of the three variables are not equal. However, you can use the ASPECT=1 option to display the scores so that one unit in the horizontal direction is the same number of centimeters as one unit in the vertical direction. The MIN= and MAX= options are used so that the ranges of the X and Y variables are equal:

ods graphics / width=480px height=480px;
title "Principal Component Scores";
title2 "Aspect Ratio 1:1";
proc sgplot data=OutPCA aspect=1;
   scatter x=Prin1 y=Prin2 / group=Species;
   xaxis grid min=-2.8 max=3.3; /* values=(-3 to 3) valueshint; */
   yaxis grid min=-2.8 max=3.3; /* values=(-3 to 3) valueshint; */
run;

In spite of titles, legends, and labels, the wall area is a square. The width of the graph was reduced so that there is less blank space to the left and right of the wall area.

Notice the comments in the call to PROC SGPLOT. The comments indicate how you can explicitly set values for the axes, if necessary. You can use the VALUES= option to set the tick values. You can use the VALUESHINT option to tell PROC SGPLOT that these values are merely "hints": the tick values should not be used to extend the length of an axis beyond the range of the data.

Automating the process with GTL

I like PROC SGPLOT, but if you are running a version of SAS prior to 9.4, you can still obtain equated axes by using the GTL and PROC SGRENDER. The trick is to use the OVERLAYEQUATED layout, rather than the usual OVERLAY layout. The OVERLAYEQUATED layout ensures that the physical dimensions of the wall area are proportional to the aspect ratio of the data ranges. The following example uses the output from the PROC PRINCOMP analysis in the previous section:

proc template;                        /* scatter plot with equated axes */
define statgraph ScatterEquateTmplt;
dynamic _X _Y _Title;                 /* dynamic variables */
begingraph;
 entrytitle _Title;                   /* specify title at run time (optional) */
  layout overlayequated /             /* units on X and Y axes are equated */
         xaxisopts=(griddisplay=on)   /* put X axis options here */
         yaxisopts=(griddisplay=on);  /* put Y axis options here */
    scatterplot x=_X y=_Y;            /* specify variables at run time */
  endlayout;
endgraph;
end;
run;

proc sgrender data=outPCA template=ScatterEquateTmplt; 
   dynamic _X='Prin1' _Y='Prin2' _Title="Equated Axes";
run;

The output is not shown, but is similar to the graph in the previous section. The nice thing about using the GTL is that it supports the EQUATETYPE= option, which enables you to specify how to handle axes ranges that are not equal.

In summary, there are two ways to make sure that the physical dimensions of the data area (wall area) of a graph accurately represent distances in the data coordinate system. You can use the GTL and the OVERLAYEQUATED layout, as shown in this section, or you can use the ASPECT= option in PROC SGPLOT if you have SAS 9.4. Although it is not always necessary to equate the X and Y axes, SAS supports it when you need it.


Extracting elements from a matrix: rows, columns, submatrices, and indices

A matrix is a convenient way to store an array of numbers. However, often you need to extract certain elements from a matrix. The SAS/IML language supports two ways to extract elements: by using subscripts or by using indices. Use subscripts when you are extracting a rectangular portion of a matrix, such as a row, a column, or a submatrix. Use indices when you want to extract values from a non-rectangular pattern.

Extracting rows, columns, and submatrices

You can extract a submatrix by using subscripts to specify the rows and columns of a matrix. Use square brackets to specify subscripts. For example, if A is a SAS/IML matrix, the following are submatrices:

  • The expression A[2,1] is a scalar that is formed from the second row and the first column of A.
  • The expression A[2, ] specifies the second row of A. The column subscript is empty, which means “use all columns.”
  • The expression A[ , {1 3}] specifies the first and third columns of A. The row subscript is empty, which means “use all rows.”
  • The expression A[3:4, 1:2] specifies a 2 x 2 submatrix that contains the elements that are in the intersection of the third and fourth rows of A and the first and second columns of A.

The following SAS/IML statements demonstrate extracting submatrices:

proc iml;
A = shape(1:30, 5);     /* 5 x 6 matrix */
scalar = A[2,1];
row = A[2, ];           /* 2nd row */
cols = A[ , {3 1}];     /* 3rd and 1st columns */
matrix =  A[3:4, 1:2];  /* 2 x 2 matrix: (A[3,1] || A[3,2]) // 
                                         (A[4,1] || A[4,2]) */
print A, scalar, row, cols, matrix;

The previous examples were adapted from Wicklin (2013) "Getting Started with the SAS/IML Language", which I recommend for programmers who are starting to learn the SAS/IML language.

Extracting diagonals and triangular elements

Non-rectangular patterns are common in statistical programming. Examples include the matrix diagonal and the lower triangular portion of a square matrix. The SAS/IML language provides special functions for extracting diagonal and triangular regions:

  • The VECDIAG function extracts the main diagonal.
  • The SYMSQR function extracts the lower triangular elements in row-major order.
  • The VECH function extracts the lower triangular elements in column-major order.

For example, the following statements extract the main diagonal, the lower triangular elements in row-major order, and the lower triangular elements in column-major order:

proc iml;
S = shape(1:16, 4);    /* 4 x 4 matrix */
v = vecdiag(S);        /* main diagonal */
L_row = symsqr(S);     /* lower triangle, row-major order */
L_col = vech(S);       /* lower triangle, column-major order */
print S, v, L_row, L_col;

Extracting arbitrary patterns of elements

For the extraction of arbitrary elements, you should use indices. SAS/IML software stores matrices in row-major order, which means that the elements are enumerated as you move across the first row, then across the second row, and so forth. A single subscript, such as A[3], refers to the third element in this enumeration. However, notice that you do not know the (row, column) subscripts for A[3] unless you know the shape of A. If A is a 3 x 3 matrix, A[3] corresponds to A[1,3]. However, if A is a 2 x 2 matrix, A[3] corresponds to A[2,1].
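The following sketch illustrates how the same index refers to different cells in matrices of different shapes (the matrices are hypothetical, chosen only for illustration):

```sas
proc iml;
A33 = shape(1:9, 3);     /* 3 x 3 matrix with values 1-9 */
A22 = shape(1:4, 2);     /* 2 x 2 matrix with values 1-4 */
x = A33[3];              /* third element in row-major order: A33[1,3] */
y = A22[3];              /* third element in row-major order: A22[2,1] */
print x y;
```

Because both matrices store consecutive integers in row-major order, both expressions return the value 3, but the value comes from different (row, column) positions.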

The SUB2NDX function enables you to convert subscript information into the equivalent indices. For example, suppose that B is a 5 x 5 matrix and you want to extract the following elements: B[5,2], B[2,4], B[4,3], B[3,1], and B[1,5]. The following statements convert the subscripts into indices and extract the elements:

proc iml;
B = shape(1:25, 5);   /* 5 x 5 matrix */
subscripts = {5 2,  2 4,  4 3,  3 1,  1 5};  /* five (row,col) subscripts */
ndx = sub2ndx(dimension(B), subscripts);
vals = B[ndx];
print vals;

A powerful advantage of indices is that you can use them to assign values as well as to extract values. For example, if v is a five-element column vector, the expression B[ndx] = v assigns the values v to the elements of B. Notice that this is a vectorized operation. If you do not use indices, you would probably write a DO loop that iterates over the subscripts. In general, vector operations are more efficient than looping operations.
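As a sketch of this vectorized assignment (the values in v are arbitrary), the following statements overwrite the five elements in a single operation:

```sas
proc iml;
B = shape(1:25, 5);                          /* 5 x 5 matrix */
subscripts = {5 2,  2 4,  4 3,  3 1,  1 5};  /* five (row,col) subscripts */
ndx = sub2ndx(dimension(B), subscripts);     /* convert subscripts to indices */
v = {-1, -2, -3, -4, -5};                    /* five new values */
B[ndx] = v;                                  /* assign all five elements at once */
print B;
```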

The ability to access elements in an arbitrary order is a big advantage for SAS/IML programmers. Whereas the DATA step processes one observation at a time, the SAS/IML language enables you to access SAS data in whatever order makes sense for the algorithm that you are writing.


Determine whether a SAS product is licensed

Sometimes you are writing a program that needs to find out whether a particular SAS product (like SAS/ETS, SAS/QC, or SAS/OR) is licensed. I was reminded of this fact when I wrote last week's blog post about how to create a map with PROC SGPLOT. Although the SGPLOT procedure is included with Base SAS, the data set that I used (MAPS.US) is only available at sites that have licensed SAS/GRAPH software.

If you just want to find out what products are licensed at your site, but you don't need this information as part of a program, you can use the SETINIT procedure:

proc setinit;
run;
Product expiration dates:
---Base SAS Software          26OCT2016
---SAS/STAT                   26OCT2016
---SAS/GRAPH                  26OCT2016
---SAS/ETS                    26OCT2016
---SAS/IML                    26OCT2016

The SAS log contains all the licensed products. However, this information is hard to use in a program, which is necessary if you are writing a program that you intend to distribute to many sites. (Perhaps you intend to post it on the SAS/IML File Exchange.)

You can use the SYSPROD function to determine at run time whether a SAS product is licensed. For example, the following SAS/IML program tests whether the SAS/GRAPH product is installed. If not, it aborts the program and prints an error message. If so, it reads the MAPS.USCity data and performs a computation:

proc iml;
if sysprod("graph")=0 then do;
   print "SAS/GRAPH is not licensed at this site";
   abort;               /* exit PROC IML */
end;
else do;
   use maps.USCity;  read all var {City StateCode Pop};  close maps.USCity;
   City = strip(City);
   idx = Pop[<:>];        /* find index of city with largest population */
   print (City[idx])[L="Largest City"]
         (Pop[idx])[L="Population" F=COMMA10.];
   idx = Pop[>:<];        /* find index of city with smallest population */
   print (City[idx])[L="Smallest City"]
         (Pop[idx])[L="Population" F=COMMA10.];
end;
quit;

The output of the program is shown. The program uses the index maximum operator to discover that New York City has the largest population in the data set. Similarly, the index minimum operator is used to return one of the 17 "ghost towns" in the data set.

You can use the SYSPROD function in the DATA step and in SAS/IML. For SAS macro programs, you can use the %SYSPROD macro function. The documentation for %SYSPROD has an example program that tests whether SAS/GRAPH is installed.
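For example, a minimal macro sketch (the macro name and messages are hypothetical; see the %SYSPROD documentation for the official example):

```sas
%macro CheckGraph;                 /* hypothetical macro name */
   %if %sysprod(graph) = 1 %then
      %put NOTE: SAS/GRAPH is licensed at this site.;
   %else
      %put NOTE: SAS/GRAPH is NOT licensed at this site.;
%mend CheckGraph;
%CheckGraph
```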

By the way, for this example you could use the LIBREF function to check whether the MAPS libref exists. However, the SYSPROD function makes it clear that you are checking for the existence of a product, not the existence of a libref.

For a fun exercise, print out the complete set of ghost towns (WHERE POP=0) in the MAPS.USCity data set. Which states have the most ghost towns?


Create a map with PROC SGPLOT


Did you know that you can use the POLYGON statement in PROC SGPLOT to draw a map? The graph at the left shows the 48 contiguous states of the US, overlaid with markers that indicate the locations of major cities. The plot was created by using the POLYGON statement, which was introduced in SAS 9.4m1.

The POLYGON statement is a versatile and powerful statement. My colleague Sanjay Matange has shown how to use the POLYGON statement to create custom plots such as a variable-width bar chart. He has also shown how to use the POLYGONPLOT statement in the Graph Template Language (GTL) to create panels of maps, and he mentioned that the POLYGON statement in PROC SGPLOT acts similarly.

Identifying polygons

You need to look carefully at Sanjay's program to see how he constructs an ID variable that identifies each polygon. For map data, it is common for regions to be defined in a nested hierarchical manner. For example, the MAPS.US data set that is distributed with SAS/GRAPH software has two variables (STATE and SEGMENT) that together define the polygons that make up the map of US states. (For example, Michigan has two segments that correspond to the Lower Peninsula and the Upper Peninsula.) Sanjay uses a standard trick that concatenates multiple variable values into a single ID variable, and uses the concatenated values to identify each polygon.

Because the MAPS.US data set is sorted by STATE and SEGMENT, you can generate an ID variable without using concatenation. Instead, you can use BY-group processing to generate an ID variable for each polygonal segment. The following DATA step defines the PolyID variable and increments its value every time a new segment is encountered. The PolyID variable is specified on the ID= option in the POLYGON statement in order to create a graph of the regions in the data set:

data US;
   set maps.US;
   where StateCode NOT IN ("AK", "HI", "PR");
   by State Segment;
   if first.Segment then PolyID+1;    /* create ID variable for polygons */
run;

ods graphics / antialiasmax=1500 labelmax=1500; /* enable anti-aliasing and labels */
title "Outline of 48 US States";
proc sgplot data=US;
   polygon x=x y=y ID=PolyID / fill outline;
run;

Notice that the MAPS.US data set uses projected radians as the coordinates for the state boundaries. You can multiply these values by 180/π to convert to degrees. You can use the GPROJECT procedure in SAS/GRAPH software to project data sets that provide longitude and latitude variables.
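A sketch of the conversion (the USDeg data set name is arbitrary):

```sas
data USDeg;
   set US;                         /* state boundaries in projected radians */
   x = x * 180 / constant('pi');   /* convert radians to degrees */
   y = y * 180 / constant('pi');
run;
```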

Overlays on polygon plots

You can overlay data on a polygon plot. In particular, you can use the BUBBLE, TEXT, SCATTER, and VECTOR statements, as well as additional POLYGON statements to overlay additional spatial information on a map. The following DATA step reads the coordinates of a few large cities from the MAPS.USCITY data set and concatenates that information with the state-boundary data. The SCATTER statement overlays the cities and labels them.

data All; 
   set maps.uscity(where=(statecode NOT IN ("AK" "HI" "PR") & pop > 5e5)
                   rename=(x=cityX y=cityY))
       US;                   /* append the state-boundary data */
run;

title "US States with Some Large Cities";
proc sgplot data=All noautolegend;
   polygon x=x y=y ID=PolyID / fill outline transparency=0.75 group=StateCode;
   scatter x=cityX y=cityY / datalabel=city;
   xaxis display=none;
   yaxis display=none;
run;

The graph is shown at the top of this article. Notice that the GROUP= option in the POLYGON statement is used to color each state.

With additional work, you can use the GROUP= option to create a choropleth map. You can use the DATTRMAP= option on the PROC SGPLOT statement to specify colors for the various group values. Alternatively, you can use the STYLEATTRS statement to assign colors for groups.
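For example, the following sketch (with arbitrary colors) uses the STYLEATTRS statement; the specified fill colors cycle through the group values in order:

```sas
proc sgplot data=All noautolegend;
   styleattrs datacolors=(lightblue lightgreen lightyellow lightpink);
   polygon x=x y=y ID=PolyID / fill outline group=StateCode;
   xaxis display=none;
   yaxis display=none;
run;
```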

The SGPLOT procedure does not replicate the functionality of the SAS/GRAPH software. The mapping procedures in SAS/GRAPH software (which includes GREDUCE, GPROJECT, GMAP, and MAPIMPORT) enable you to manipulate and visualize map data in more complex ways. However, the POLYGON statement in PROC SGPLOT enables you to create some simple maps that visualize countries, states, and counties, and to overlay graphical elements by using other SGPLOT statements.


Label markers in graphs by using the values of several variables

In many procedures, the ID statement is used to identify observations by specifying an identifying variable, such as a name or a patient ID. In many regression procedures, you can specify multiple ID variables, and all variables are copied into output data sets that contain observation-wise statistics such as predicted values and residuals.

Values of an ID variable can also be used to label markers on a graph. For example, the following call to PROC REG models the weights of 19 students as a linear function of their heights. The PLOTS=COOKSD(LABEL) option requests a plot of Cook's D statistic, which gives information about observations that appear to have a large influence on the parameter estimates. The LABEL option requests that extreme values of the Cook's D statistic be labeled by using the ID variable. Let's see what happens if you specify two variables on the ID statement:

proc reg data=sashelp.class plots(only)=cooksd(label);
model weight = height;
id name age;             /* specify TWO ID variables */
ods select CooksDPlot;
quit;

I've added a red rectangle that surrounds the labeled marker. Although two variables (NAME and AGE) were specified in the ID statement, the graph clearly shows that only the first variable (NAME) was used as a label. The documentation for the PLOTS= option tells us that this is the expected behavior: "If you specify one or more ID variables..., then the first ID variable [that] you specify is used for the labeling."

Most procedures exhibit a similar behavior, although a few (LOGISTIC, TRANSREG, ...) provide tooltips or other support for multiple ID variables.

For the procedures that support labeling by only one variable, there is an easy workaround: create a new ID variable by concatenating several other variables.

Concatenating multiple ID variables

You can use the concatenation operator (||) in the DATA step to concatenate two variables together. The following statements concatenate values of the NAME and AGE variables. The TRIM function can be used to remove trailing blanks from the NAME variable. In many cases, the DATA step will automatically convert the numeric AGE variable to a character value, but the following statements use the PUT function to explicitly convert a numeric variable to a character value prior to concatenation:

/* concatenate NAME and AGE to make new ID variable */
data class;
set sashelp.class;
length labl $12;    /* NAME holds 8 characters. Allow for delimiter and age */
labl = trim(Name) || ": " || put(age, 2.);  /* convert two-digit age to character */
run;

proc reg data=class plots(only)=cooksd(label);
model weight = height;
id labl;               /* label by using new variable */
ods select CooksDPlot;
quit;

With this change, each extreme point in the graph is labeled with the values of the NAME and AGE variables.

Labeling markers in ODS graphics

You can use the same trick to label a marker in a graph created by PROC SGPLOT. The canonical example is a scatter plot in which you want to label all observations. Because the DATALABEL= option supports only a single variable, you have to use concatenation if you want to see the values of multiple variables:

proc sgplot data=class;
scatter x=height y=weight / datalabel=labl;
run;

A drawback of the "concatenation technique" is that most graphs do not have room to display long labels for many points. The DATALABEL= option uses a collision-avoidance algorithm to avoid overplotting labels, but if you have hundreds or thousands of markers, it is best to use another mechanism for labeling markers.

One option to display multiple labels is to label only certain observations in the graph, such as observations that have extreme values. Another option, which works if you are going to examine the graph on your computer, is to use the TIP= option to create a tooltip for the scatter plot. With a tooltip, you can hover a pointer over a marker and view a pop-up window that displays the values of multiple variables.
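For example, the following sketch adds tooltips that display several variables. Tooltips require an HTML destination, and the ODS GRAPHICS statement must enable image maps:

```sas
ods graphics / imagemap=on;     /* enable tooltips in the HTML destination */
proc sgplot data=sashelp.class;
   scatter x=height y=weight / tip=(name age height weight);
run;
```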


Guessing games, ensemble averages, and the wisdom of the crowd


How much does this big pumpkin weigh? One of the cafeterias at SAS invited patrons to post their guesses on an internal social network at SAS. There was no prize for the correct guess; it was just a fun Halloween-week activity.

I recognized this as an opportunity to apply the "wisdom of the crowd" to create a potentially winning guess. Because the guesses were being posted to an online community, I could see all the guesses. My hypothesis was that the average of other people's guesses would be an accurate prediction for the pumpkin's weight. I decided to wait five days before submitting my prediction.

The following SAS data set shows the complete set of guesses for the pumpkin's weight. The first line after the DATALINES statement shows the 10 guesses that were submitted during the first five days. The second line shows my guess, which was 36 pounds. The third line shows subsequent guesses. The call to PROC MEANS shows the analysis of the first 10 guesses:

data guess;
N = _N_;
input Weight @@;
sum + Weight;
label Mean = "Running Mean"
      N = "Patron ID";
if N > 2 then Mean = sum / N;
datalines;
37.5 35 47.5 14.5 25.5 36.6 17.2 51 32.4 37.5
36
39.7 42 43.6 46 35.2 36.75 28.7 42.4 40.2 48
;
run;

proc means data=guess(obs=10) N mean median min max;
var Weight;
run;

The output from PROC MEANS shows that the mean value of the first 10 guesses was 33.47. Because the range of guesses included values that I considered to be extreme, I decided instead to use the median value (35.8), which I rounded to 36 pounds. That was my guess.

Over the next few days 10 additional guesses were posted. Finally, the pumpkin's weight was revealed: it was 35 pounds. My prediction, which was based on the "wisdom" of 10 previous guesses, was within 3% of the true value. However, I did not win the contest because in fact the true weight of the pumpkin was guessed by the second person to enter the contest!

The following graph shows the sequence of guesses. My guess is marked with a star. The true weight of the pumpkin is indicated by a horizontal line. The running mean is shown as a jagged line. You can see that the running mean is very close to the true value after only a few guesses. The running mean was closer to the true value than most of the guesses.

data Augment;
set guess;
if N = 11 then Rick = 36; else Rick = .;
label Rick = "Rick's Guess";
run;

proc sgplot data=Augment;
scatter x=N y=Weight / legendlabel="Guess";
scatter x=N y=Rick / markerattrs=(symbol=starfilled size=12 color=red);
series x=N y=Mean;
xaxis grid; yaxis grid;
refline 35 / axis=y lineattrs=(color=blue);
run;

Using the "wisdom of the crowd" to estimate a weight has a long history that pre-dates this pumpkin-guessing contest. In 1906 Francis Galton, who introduced many important concepts in statistics, was at a county fair when he noticed a contest that encouraged fairgoers to guess the weight of an ox. After the contest ended, he obtained the paper guesses for almost 800 villagers. He computed the median guess and noticed that it was within 1% of the true weight. He called this phenomenon vox populi or "the voice of the people."

Statistically speaking, there is a difference between Galton's procedure and my own. In the pumpkin-guessing contest, everyone could see the other guesses, which means that the guesses were not independent. Because others had guessed in the range 15–50 pounds, it was unlikely that someone new would guess an extreme value such as 10 or 100 pounds, which introduced bias. Furthermore, because only the first person to correctly guess the weight would be the winner, participants presumably modified their guesses and chose a weight that had not yet been submitted, thus preventing duplicate guesses.

Nevertheless, this simple experiment shows that combining multiple predictions can often result in a prediction that is superior to the original predictions. In statistics this is called an ensemble model or a consensus model. Ensemble models are used to predict the path and intensity of hurricanes and even to predict which movies you will enjoy watching on Netflix.

Taking a mean or median is a very simple kind of ensemble model. The GLMSELECT procedure in SAS/STAT software enables you to perform model averaging as part of constructing a parsimonious linear regression model. SAS Enterprise Miner contains several sophisticated predictive algorithms that use ensemble models; see Maldonado, et al. (2014), "Leveraging Ensemble Models in SAS Enterprise Miner."


Point/Counterpoint: Symbolic versus mnemonic logical operators in SAS


In SAS, the DATA step and PROC SQL support mnemonic logical operators. The Boolean operators AND, OR, and NOT are used for evaluating logical expressions. The comparison operators are EQ (equal), NE (not equal), GT (greater than), LT (less than), GE (greater than or equal), and LE (less than or equal). These character-based operators are called mnemonic because their names make it easy to remember what the operator does.


Each mnemonic operator in SAS has an equivalent symbolic operator. The Boolean operators are & (AND), | (OR), and ^ (NOT). The comparison operators are = (EQ), ^= (NE), > (GT), < (LT), >= (GE), and <= (LE). The symbol for the NOT and NE operators can vary according to the computer that you use, and the tilde character (~) can be used in place of the caret (^).
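The following DATA step sketch shows one comparison written both ways; the two flag variables are identical:

```sas
data _null_;
   x = 5;  n = 3;
   b1 = (x gt 0 and n ne 2);   /* mnemonic operators */
   b2 = (x > 0 & n ^= 2);      /* equivalent symbolic operators */
   put b1= b2=;                /* both flags are 1 */
run;
```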

Mnemonic operators tend to appear in older languages like FORTRAN, whereas symbolic operators are common in more recent languages like C/C++, although some relatively recent scripting languages like Perl, PHP, and Windows PowerShell also support mnemonic operators. SAS software has supported both operators in the DATA step since the very earliest days, but the SAS/IML language, which is more mathematically oriented, supports only the symbolic operators.

Functionally, the operators in SAS are equivalent, so which ones you use is largely a matter of personal preference. Since consistency and standards are essential when writing computer programs, which operators should you choose?

The following sections present arguments for using each type of operator. The argument for using the mnemonic operators is summarized by Mnemonic Norman. The argument for using symbols is summarized by Symbolic Sybil. Finally, there is a rejoinder by Practical Priya. Thanks to participants on the SAS-L discussion forum and several colleagues at SAS for sharing their thoughts on this matter. Hopefully Norman, Sybil, and Priya represent your views fairly and faithfully.

Use the mnemonic operators

Hi, I'm Mnemonic Norman, and I've been programming in SAS for more than 30 years. I write a lot of DATA step, SQL, and macro code. I exclusively use the mnemonic operators for the following reasons:

  1. Easy to type. I can touch-type the main alphabet, but I've never mastered typing symbols without looking down at my fingers. In addition, exotic symbols like | (OR) are not usually located in an easy-to-reach location on my keyboard. By using the mnemonic operators, I can avoid hitting the SHIFT key and can write programs faster.
  2. Easy to read. Even complex comparisons are easy to read because they form a sentence in English:
    if x gt 0 AND sex eq "MALE" then ...
  3. Easy to remember. There is a reason why these are called mnemonic operators! I program in several different languages, and each one uses a different NE operator. In SAS it is ^=. In Lua the NE operator is ~=, in Java it is !=, and the ANSI standard for SQL is <>. I use NE so I don't have to remember the correct symbol.
  4. Easy to communicate. My boss and clients are not statisticians. They can understand the mnemonic operators better than abstract symbols.
  5. Easy to see. I don't want to emphasize my age, but statistics show that most people's eyesight begins to diminish after age 40. I find the symbols | and ^ particularly difficult to see.
  6. Easy to distinguish assignment from comparison. I like to distinguish between assignment and logical comparison with equality, but SAS uses the = symbol for both. Therefore I use the equal sign for assignment and use EQ for logical comparison. For example, in the statement
    b = x EQ y;
    it is easy to see that b is a variable that holds a Boolean expression. The equivalent statement
    b = x = y;
    looks strange. (Furthermore, in the C language, this expression assigns the value of y to both b and x.)
  7. Easy to use macro variables. I reserve the ampersand for macro variables. If I see an expression like x&n, I immediately assume that the expression resolves to a name like x1 or x17. To avoid confusion with macro variables, I type x AND n when that is what I intend.
  8. Easy to cut and paste. Because the less-than and greater-than symbols are used to delimit tags in markup languages such as HTML and XML, they can disappear when used in Web pages. In fact, I dare you to try to post this comment to Rick's blog: "I use the expression 0 < x and y > 1." This is what you'll get: "I use the expression 0 1."

Use the symbolic operators

Hi, I'm Symbolic Sybil, and I've been programming in SAS for a few years. In school I studied math, statistics, and computer science. In addition to SAS, I program in C/C++ and R. I use symbolic operators exclusively, and here are my reasons why:

  1. Consistent with mathematics. When a text book or journal presents an algorithm, the algorithm uses mathematical symbols. If you study Boolean logic, you use symbols. Symbols are a compact mechanism for representing complex logical conditions. Programs that implement mathematical ideas should use mathematical notation.
  2. Consistent with other modern languages. I don't use FORTRAN or SQL. I might write a DATA step to prepare data, then jump into PROC IML to write an analysis. Sometimes I call a package in R or a library in C++. I use symbols because all the languages that I use support them.
  3. Distinguish variables from operators. Symbols are not valid variable names, so it is easy to see which tokens are operators and which are variables. Although Norman claims that symbols are hard to see, I argue that they stand out! If a data set has variables named EQ and LT, the expression EQ > LT is more readable than the equivalent expression EQ GT LT.
  4. Enforce coding discipline. Some of Norman's arguments are the result of lazy programming habits. The only reason he can't remember symbols is because he doesn't use them regularly. If you put spaces around your operators, you will never confuse x&n and x & n. As to remembering which operators are supported by which programming language, that is an occupational hazard. We are highly paid professionals, so learn to live with it. I don't think the solution is to use even more operators!
  5. Easy to communicate. I disagree with Norman's claim that his non-statistical boss and clients will understand character-based operators more easily. How patronizing! Did they drop out of school in the third grade? Furthermore, in the modern world, we need to be inclusive and respectful of different cultures. The character-based operators are Anglocentric and might not be easy to remember if your client is not a native English speaker. In Spanish, "greater than" is "mayor que" and "equal" is "igual". In contrast, mathematical symbols are universal.

Use them both, but be consistent

Hi, I'm Practical Priya. There is no need to start a flame war or to make this an either/or debate. As a famous computer scientist wrote, "the nice thing about standards is that you have so many to choose from."

I used symbols exclusively until I consulted on a project where the client insisted that we use mnemonic operators. Eventually I gained an appreciation for mnemonic operators. I think they are easier to see and are more readable for experts and non-experts alike.

Today I use a combination of symbols and mnemonic operators. Like Norman, I find the logical operators AND, OR, and NOT easier to type and to read than the symbols &, |, and ^. For the relational (comparison) operators, I always use <, >, and =. I learned these symbols in school and they are universally understood.

I argue that a hybrid approach is best: In the DATA step I use mnemonic Boolean operators but use symbols for comparison operators. This creates a clear visual separation between clauses that FORM Boolean expressions and clauses that OPERATE ON logical expressions, like this:
if x = 5 AND missing(y) OR y < z then ...

However, I'm embarrassed to admit that I do not consistently use symbols for the comparison operators. I also use NE, which is inconsistent with my scheme but is more readable than ^=. If my keyboard had a "not equals" symbol (≠), I'd use it, but until then I'm sticking with NE.

Your turn: Which logical operators do you use and why?

Norman, Sybil, and Priya have made some good points. Who do you agree with? What rules do you follow so that your SAS programs use logical operators in a readable, consistent manner? Leave a comment, but as Norman said, be careful typing < and >. You might want to use the HTML tags &lt; and &gt;.

Post a Comment

Trap and cap: Avoid division-by-zero and domain errors when evaluating functions

Statistical programmers often need to evaluate complicated expressions that contain square roots, logarithms, and other functions whose domain is restricted. Similarly, you might need to evaluate a rational expression in which the denominator of the expression can be zero. In these cases, it is important to avoid evaluating a function at any point that is outside the function's domain.

This article shows a technique that I call "trap and cap." The "trap" part of the technique requires using logical conditions to prevent your software from evaluating an expression that is outside the domain of the function. The "cap" part is useful for visualizing functions that have singularities or whose range spans many orders of magnitude.

This technique is applicable to any computer programming language, and it goes by many names. It is one tool in the larger toolbox of defensive programming techniques.

This article uses a SAS/IML function to illustrate the technique, but the same principles apply to the SAS DATA step and PROC FCMP. For an example that uses FCMP, see my article about defining a log transformation in PROC FCMP.

ERROR: Invalid argument to function

The following SAS/IML function involves a logarithm and a rational expression. If you naively try to evaluate the function on the interval [-2, 3], an error occurs:

proc iml;
/* Original function module: No checks on domain of function */
start Func(x);
   numer = exp(-2*x) # (4*x-3)##3 # log(3*x);       /* not defined for x <= 0 */
   denom = x-1;                                     /* zero when x=1 */
   return( numer / denom );
finish;

x = do(-2, 3, 0.01);
f = Func(x);
ERROR: (execution) Invalid argument to function.

 count     : number of occurrences is 200
 operation : LOG at line 881 column 40

The error says that there were 200 invalid arguments to the LOG function. The program stops running when it encounters the error, so it doesn't encounter the second problem, which is that the expression numer / denom is undefined when denom is zero.

Experienced SAS programmers might know that the DATA step automatically plugs in missing values and issues warnings for the LOG function. However, I don't like warnings any more than I like errors: I want my program to run cleanly!
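For example, here is a minimal DATA step sketch of that behavior (the exact wording of the note in the log can vary by SAS release):

```sas
data _null_;
do x = -1, 0, 0.5;
   y = log(x);          /* for x <= 0, SAS sets y to missing and writes a note to the log */
   put x= y=;
end;
run;
```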

Trap bad input values

Because the function contains a logarithm, the function is undefined for x ≤ 0. Because the function is a rational expression and the denominator is zero when x = 1, the function is also undefined at that value.

The goal of the "trap" technique is to prevent an expression from being evaluated at points where it is undefined. Accordingly, the best approach is to compute the arguments of any function that has a restricted domain (log, sqrt, arc-trigonometric functions,...) and then use the LOC function to determine the input values that are within the domain of all the relevant functions. The function is only evaluated at input values that pass all domain tests, as shown in the following example:

proc iml;
/* Trap bad domain values; return missing values when x outside domain */
start Func(x);
   f = j(nrow(x), ncol(x), .);    /* initialize return values to missing */
   /* trap bad domain values */
   d1 = 3*x;                      /* argument to LOG */
   d2 = x-1;                      /* denominator */
   domainIdx = loc( d1>0 & d2^=0 );     /* find values inside domain */
   if ncol(domainIdx)=0 then return(f); /* all inputs invalid; return missing */
   t = x[domainIdx];              /* elements of t are valid */
   numer = exp(-2*t) # (4*t-3)##3 # log(3*t);
   denom = t-1;
   f[domainIdx] = numer / denom;  /* evaluate f(t) */
   return( f );
finish;

x = do(-2, 3, 0.01);
f = Func(x);

This call does not encounter any errors. The function returns missing values for input values that are outside its domain.

In practice, I advise that you avoid testing floating-point numbers for equality with zero. You can use the ABS function to check that quantities are not close to zero. Consequently, a preferable trapping condition follows:

   domainIdx = loc( d1>0 & abs(d2)>1e-14 );  /* find values inside domain */

I also recommend that you use the CONSTANT function to check for domain values that would result in numerical underflows or overflows, which is especially useful if your expression contains an exponential term.
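As a sketch of that idea, you could extend the trapping condition with the CONSTANT("LOGBIG") value, which is the natural log of the largest representable double; arguments of EXP that exceed it would overflow:

```sas
   maxLog = constant("LOGBIG");   /* natural log of the largest double */
   /* trap values outside the domain AND values that would overflow EXP */
   domainIdx = loc( d1>0 & abs(d2)>1e-14 & -2*x < maxLog );
```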

Cap it! Graphing functions with singularities

You have successfully evaluated the function even though it has two singularities. Let's see what happens if you naively graph the function. The function has a horizontal asymptote at y=0 and vertical asymptotes at x=0 and x=1, so you can use the REFLINE statement in PROC SGPLOT to draw reference lines that aid the visualization. In SAS/IML, you can use the SCATTER subroutine to create a scatter plot, and pass the REFLINE statements by using the OTHER= option, as follows:

refStmt = "refline 0/axis=y; refline 0 1/axis=x;"; /* add asymptotes and axes */
call scatter(x, f) other=refStmt;

The graph is not very useful (look at the vertical scale!) because the magnitude of the function at one point dwarfs the function values at other points. Consequently, you don't obtain a good visualization of the function's shape. This is often the case when visualizing a function that has singularities.

To obtain a better visualization, manually cap the minimum and maximum value of the graph of the function. In the SGPLOT procedure, you can use the MIN= and MAX= options on the YAXIS statement to cap the view range. If you are using the CALL SCATTER routine in SAS/IML, you can add the YAXIS statement to the OTHER= option, as follows:

/* cap the view at +/-5 */
title "Sketch of Function";
title2 "-5 < y < 5";
capStmt = "yaxis min=-5 max=5;";
call scatter(x,f) other=capStmt + refStmt;

This latest graph provides a good overview of the function's shape and behavior. For intervals on which the function is continuous, you can get a nicer graph if you use the SERIES subroutine to connect the points and hide the markers. The following statements re-evaluate the function on the interval (0, 1) and cap the vertical scale by using -1 < y < 1:

title "Zoom into [0, 1]";
title2 "-1 < y < 1";
x = do(0, 1, 0.01);
f = Func(x);
capStmt = "yaxis min=-1 max=1;";
call series(x,f) other=capStmt + refStmt;

The resulting graph shows that the function has two roots in the interval (0, 1) and a local maximum near x=0.43. A visualization like this is often an important step in finding the roots or local extrema of a function.

In summary, the trap-and-cap technique enables you to evaluate functions that have singularities or that have a restricted domain. For SAS/IML functions, you can use the LOC function to keep only the input values that are in the domain of the function. (In a scalar language, such as the SAS DATA step, you can use IF-THEN/ELSE statements instead.) For functions with singularities, you can use the YAXIS statement in PROC SGPLOT to ensure that a graph of the function reveals interesting features and not just extreme values.
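For comparison, here is a DATA step sketch of the same trap-and-cap computation, in which IF-THEN/ELSE replaces the LOC function and the YAXIS statement caps the view:

```sas
data TrapCap;
do x = -2 to 3 by 0.01;
   if 3*x > 0 and abs(x-1) > 1e-14 then             /* trap bad domain values */
      f = exp(-2*x) * (4*x-3)**3 * log(3*x) / (x-1);
   else f = .;                                      /* missing outside the domain */
   output;
end;
run;

proc sgplot data=TrapCap;
   series x=x y=f;
   refline 0 / axis=y;
   refline 0 1 / axis=x;
   yaxis min=-5 max=5;                              /* cap the view at +/-5 */
run;
```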

Post a Comment

Can't find that data? Search all variables in all data sets

Sometimes I can't remember where I put things. If I lose my glasses or garden tools, I am out of luck. But when I can't remember where I put some data, I have SAS to help me find it.

When I can remember the name of the data set, my computer's operating system has tools to help me find the file. However, sometimes I can remember the name of the variable, but not the data set name. Or I might remember that there are some variables that satisfy a statistical condition, but I can't remember their names or the name of the data set they're in.

In situations like these, you might think that I'd be stuck, but this article shows how to search for variables when you can't remember which data set contains them.

Three useful SAS/IML functions for searching through data sets

Last week I had two occasions where I had to conduct a search for variables that I only dimly remembered.

Base SAS programmers use SQL, SAS dictionaries, the DATA step, and other tools to search through libraries of SAS data sets. For example, you can use these tools to write a 'grep macro' to locate variables (in any data set) that contain a specific value.
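As a sketch of the dictionary approach in Base SAS (searching by variable name rather than by value; note that LIBNAME values are stored in uppercase):

```sas
proc sql;
   select memname, name
   from dictionary.columns
   where libname = 'SASHELP' and upcase(name) = 'WEIGHT';
quit;
```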

In SAS/IML software there are three functions that enable you to search through multiple data sets:

  1. The DATASETS function returns the names of all SAS data sets in a specified libref.
  2. The CONTENTS function returns the names of all variables in a specified data set.
  3. The USE statement supports opening a data set when the name of the data set is stored in a character variable.

I've previously used the DATASETS function and the USE statement to read hundreds of data sets in a loop. I've used the CONTENTS function to find variable names that are common to multiple data sets.

Find a variable that has a specified name

Suppose you can remember the name of a variable, but you cannot remember the name of the data set that the variable is in. No problem! The following SAS/IML function loops through all data sets in the SASHELP library (or any other libref) and finds all data sets that contain the specified variable. To illustrate the technique, this program searches all data sets for a variable named "weight":

proc iml;
lib = "sashelp";          /* where to look? */
ds = datasets(lib);       /* vector contains names of all data sets */
/* Find data sets that contain a certain variable name */
do i = 1 to nrow(ds);
   varnames = upcase( contents(lib, ds[i]) ); /* var names in i_th data set */
   if any(varnames = "WEIGHT") then 
      print (ds[i]);
end;

The loop is easy and intuitive. This example does not use the USE statement to open the data sets because you are not interested in a variable's data, just its name.

Find a pair of strongly correlated variables

The next example finds variables that have an extreme value of some statistic. Specifically, the goal is to find a pair of variables that have a large positive or negative correlation.

You can write a loop that reads all numerical variables in the SASHELP library and computes all pairwise correlations. There are three tricks in the following program:

  1. Some data sets might not have any numeric variables, and Pearson correlations are meaningless for these data sets. The program uses the _ALL_ keyword on the READ statement to read the data. If a data set contains even one numerical variable, it will be read into the X matrix. If not, the matrix X will be a character matrix, and you can use the TYPE function to check whether X is numeric.
  2. You can restrict your attention to correlations in the lower triangular portion of the correlation matrix. The program uses the StrictLowerTriangular function from a previous post about how to extract the lower triangular elements of a matrix.
  3. There are some data sets that have only two nonmissing observations or that have two variables that are perfectly correlated (1 or -1). The following statements ignore these uninteresting cases.
start StrictLowerTriangular(X);
   if ncol(X)=1 then return ( {} ); /* return empty matrix */
   v = cusum( 1 || (ncol(X):2) );
   return( remove(vech(X), v) );
finish;

/* find data set variables in SASHELP that have the most extreme correlations */
dsNames = j(2, 1, BlankStr(32));  /* remember data set names for min/max */
corrVals = {1, -1};               /* default values for min and max values */
do i = 1 to nrow(ds);
   dsName = lib + "." + strip(ds[i]);
   use (dsName);
   read all var _ALL_ into x;     /* 1. read all numerical vars in i_th data set */
   close (dsName);
   if type(x)='N' then do;
      w = StrictLowerTriangular( corr(x, "pearson", "pairwise") );  /* 2 */
      if ^IsEmpty(w) then do;     /* check to see if there is a new min or max */
         if (min(w) > -1) & (min(w) < corrVals[1]) then do;  /* 3 */
            corrVals[1] = min(w);
            dsNames[1] = ds[i];
         end;
         if (max(w) < 1) & (max(w) > corrVals[2]) then do;   /* 3 */
            corrVals[2] = max(w);
            dsNames[2] = ds[i];
         end;
      end;
   end;
end;
print dsNames corrVals[f=best20.];

The output shows that the most negatively correlated variables are in the Sashelp.Yr1001 data set. You can reread that data set to find the names of the variables. In a similar way, the Sashelp.Enso data set contains two variables that are almost perfectly positively correlated because they are measuring the same quantity in different units.

You can use this technique to investigate a wide variety of statistics. Which variables have the largest range? The largest standard deviation? You can also write your own "grep function" to locate the character variables that have specified values. The resulting function is simpler and shorter than the "grep macro" in Base SAS.
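For example, here is a sketch that finds the data set whose variable has the largest standard deviation. It reuses the lib and ds vectors from the loop above and assumes the STD function, which computes the standard deviation of each column:

```sas
maxSD = 0;                         /* largest standard deviation found so far */
sdDS = BlankStr(32);               /* name of the data set that contains it */
do i = 1 to nrow(ds);
   dsName = lib + "." + strip(ds[i]);
   use (dsName);
   read all var _ALL_ into x;      /* read all numerical vars, if any */
   close (dsName);
   if type(x)='N' then do;
      s = std(x);                  /* standard deviation of each column */
      if max(s) > maxSD then do;   /* MAX ignores missing values */
         maxSD = max(s);
         sdDS = ds[i];
      end;
   end;
end;
print sdDS maxSD;
```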

Post a Comment

The CUSUM-LAG trick in SAS/IML

Every year near Halloween I write a trick-and-treat article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick features two of my favorite functions, the CUSUM function and the LAG function. By using these functions, you can compute the rows of a matrix (or indices of a vector) that correspond to the beginning and end of groups.


The picture at left illustrates the problem to solve. You have data that contains a grouping or ID variable, and you want to find the first and last observation for each group. In the picture, the ID variable is named Group. The beginnings of the groups are indicated by the green arrows: rows 1, 3, 6, and 8. The ends of the groups are located at the red arrows: rows 2, 5, 7, and 10.

Often it is the case that you already know the sizes of the groups from some previous calculation. If you know the sizes, the CUSUM and LAG functions enable you to directly construct the vectors that contain the row numbers. (Of course, if the group sizes are all equal, you can easily solve this problem by using the DO function.)

Recall that the CUSUM function computes the cumulative sum of the elements in a vector. I've used the CUSUM function in more than a dozen blog posts over the years. The LAG function shifts a vector of values and is useful for any vectorized operations that involve comparing or manipulating adjacent values.
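Before looking at the function, it can help to see what CUSUM and LAG return for the group sizes in the picture:

```sas
proc iml;
size = {2, 3, 2, 3};
endIdx = cusum(size);   /* {2, 5, 7, 10}: cumulative sums = last row of each group */
shifted = lag(endIdx);  /* {., 2, 5, 7}: each group begins after the previous end */
print size endIdx shifted;
```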

To see the CUSUM-LAG trick in action, assume that you know the sizes of each group. For the example in the picture, the group sizes are 2, 3, 2, and 3. The following function returns the indices of the rows that begin and end each group:

proc iml;
/* Return kx2 matrix that contains the first and last elements 
   for k groups that have sizes s[1], s[2],...,s[k]. The i_th row contains
   the first and last index for the i_th group. */
start ByGroupIndices( s );
   size = colvec(s);              /* make sure you have a column vector */
   endIdx = cusum(size);          /* locations of ending index */
   beginIdx = 1 + lag(endIdx);    /* begin after each ending index ... */
   beginIdx[1] = 1;               /*    ...and at 1  */
   return ( beginIdx || endIdx );
finish;

size = {2, 3, 2, 3};              /* sizes for this example */
idx = ByGroupIndices(size);
print idx[c={"Begin" "End"}];

The CUSUM function computes the end of each group by accumulating the group sizes. The LAG function merely returns a vector of the same size, but with a missing value in the first element and all other elements shifted down one position. To that vector, you add 1 because the next group starts immediately after the previous group ends. Lastly, you just replace the missing value in the first position with 1 because the first group starts at the first row.

I've used the CUSUM-LAG trick many times. It is useful when you are simulating data in which the size of the sample is also being simulated. For example, just last week I used it to simulate contingency tables. Several years ago I used this trick to expand data by using frequencies. In general, you can use it to perform "BY-group processing" in the SAS/IML language.

Happy Halloween to all my readers. I hope you find this trick to be a real treat!

Post a Comment