Sample with replacement and unequal probability in SAS

How do you sample with replacement in SAS when the probability of choosing each observation varies? I was asked this question recently. The programmer thought he could use PROC SURVEYSELECT to generate the samples, but he wasn't sure which sampling technique he should use to sample with unequal probability. This article describes how to sample with replacement and unequal probability in SAS.

Sample with replacement and unequal probability with PROC SURVEYSELECT

The programmer's confusion is completely understandable. The SURVEYSELECT documentation is written for survey statisticians, who use specialized language to describe sampling methods. To a survey statistician, sampling with unequal probability is known as sampling with probability proportional to size (PPS), which is often abbreviated as PPS sampling. The SURVEYSELECT procedure has many methods for PPS sampling. For PPS sampling with replacement, specify the METHOD=PPS_WR option.

Sampling in proportion to size has many applications. For example, if you want to survey people at your company, you could randomly select a certain number of people in each building, where the probability of selection is proportional to the number of people who work in each building. Or you could use PPS sampling to obtain a representative sample across departments.

The following example demonstrates sampling with replacement and with unequal probability. Suppose a small town has five busy intersections. The town planners believe that the probability of an accident at an intersection is proportional to the traffic volume. They want to simulate the locations of 500 accidents by using PPS sampling with replacement, where the relative traffic volumes determine the probability of an accident's location.

The following DATA step defines the annual average daily traffic for each intersection. The call to the SURVEYSELECT procedure uses METHOD=PPS_WR and N=500 to simulate 500 accidents for these intersections. The SIZE statement specifies the relative traffic, which determines the probability of an accident at each intersection.

data Traffic;
label VehiclesPerDay = "Average Annualized Daily Traffic";
input Intersection $21. VehiclesPerDay;
format VehiclesPerDay comma.;
datalines;
Crash Pkwy/Danger Rd  25000
Fast Dr/Danger Rd     20000
Crazy St/Smash Blvd   17000
Crazy St/Collision Dr 14000
Crash Pkwy/Dent Dr    10000
;
/* sample with replacement, probability proportional to size */
proc surveyselect noprint data=Traffic out=Sample
     seed=123 N=500;       /* use OUTHITS option if you want 500 obs */
   size VehiclesPerDay;    /* specify the probability variable */
run;

proc print data=Sample noobs; 
   var Intersection VehiclesPerDay NumberHits ExpectedHits; 
run;

As you can see, the counts of crashes in the simulated sample are close to their expected values. Each time you run a simulation you will get slightly different values. You can use the REPS= option in the PROC SURVEYSELECT statement to generate multiple samples.

I have blogged about this technique before, but my previous focus was on simulating data from a multinomial distribution. The METHOD=PPS_WR option generates counts that follow a multinomial distribution with proportion equal to the standardized "size" variable (which is VehiclesPerDay / sum(VehiclesPerDay)).
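Outside SAS, you can illustrate the same fact with NumPy's multinomial sampler. The following is a sketch, not SAS code; the seed and variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(123)

# Average daily traffic for the five intersections (from the example data)
traffic = np.array([25000, 20000, 17000, 14000, 10000])
p = traffic / traffic.sum()          # standardized "size" variable

# Counts of 500 accidents follow a multinomial distribution with proportions p
counts = rng.multinomial(500, p)
expected = 500 * p                   # expected number of hits per intersection

print(counts, counts.sum())          # the counts always sum to 500
```

Each run with a different seed gives slightly different counts, but the counts always sum to 500 and are close to the expected values.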

If you want a data set that contains all 500 draws, rather than the counts, you can add the OUTHITS option to the PROC SURVEYSELECT statement.

Sample with replacement and unequal probability with PROC IML

For SAS/IML programmers, it might be more convenient to simulate data within PROC IML. The SAS/IML language provides two routines for sampling with replacement and with unequal probability. If you want only the counts (as shown in the previous example) you can use the RANDMULTINOMIAL function and specify the proportion of the traffic that passes through each intersection, as follows:

proc iml;
call randseed(54321);
use Traffic;
   read all var {"Intersection" "VehiclesPerDay"};
Proportion = VehiclesPerDay / sum(VehiclesPerDay);   
Counts = RandMultinomial(1, 500, Proportion);    /* 1 sample of 500 events */

If instead you want the full sample (all 500 values in a random ordering), you can use the SAMPLE function. The third argument to the SAMPLE function enables you to specify whether the sampling is done with or without replacement. The fourth argument enables you to specify the unequal sampling probabilities, as follows:

Sample = sample(Intersection, 500, "Replace", Proportion);
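For comparison, NumPy's `choice` function performs the same kind of sampling: with replacement and with unequal probability. This is a sketch (the seed and names are mine), not a translation of the SAS/IML code:

```python
import numpy as np

rng = np.random.default_rng(54321)

intersections = np.array(["Crash Pkwy/Danger Rd", "Fast Dr/Danger Rd",
                          "Crazy St/Smash Blvd", "Crazy St/Collision Dr",
                          "Crash Pkwy/Dent Dr"])
traffic = np.array([25000, 20000, 17000, 14000, 10000])
p = traffic / traffic.sum()

# Draw 500 accident locations with replacement and unequal probability
sample = rng.choice(intersections, size=500, replace=True, p=p)
print(sample[:5])
```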

In summary, when you want to sample with replacement and with unequal probabilities, use the METHOD=PPS_WR option in PROC SURVEYSELECT or use the SAMPLE function in SAS/IML.


Read data into vectors or into a matrix: Which is better?

In the SAS/IML language, you can read data from a SAS data set into a set of vectors (each with their own name) or into a single matrix. Beginning programmers might wonder about the advantages of each approach. When should you read data into vectors? When should you read data into a matrix?

Read data into SAS/IML vectors

You can specify the names of data set variables in the SAS/IML READ statement, as follows:

proc iml;
use Sashelp.Class;                       /* open the data set */
read all var {"Name" "Height" "Weight"}; /* read 3 vars into vectors */
close Sashelp.Class;                     /* close the data set */

The previous statements create three vectors, whose names are the same as the variable names. You can perform univariate analyses on the vectors, such as descriptive statistics. You can also create new variables from arbitrary transformations of the vectors, such as the following computation of the body mass index:

BMI = weight / height##2 * 703;
print BMI[rowname=Name];

Some of the advantages of reading data into vectors are:

  • Variables are given informative names.
  • You can use a single READ statement to read character variables and numerical variables.

When you load summarized data, you might want to read the variables into vectors. For example, to read the ParameterEstimates table from a regression analysis, you probably want to read the variable names, parameter estimates, standard errors, and p-values into separate vectors.

Read data into a SAS/IML matrix

You can use the INTO clause in the READ statement to read data set variables into a SAS/IML matrix. All the variables have to be the same type, such as numeric. For example, the following statements read three numeric variables into a matrix:

proc iml;
varNames =  {"Age" "Height" "Weight"};   /* these vars have the same type */
use Sashelp.Class;                       /* open the data set */
read all var varNames into X;            /* read 3 vars into matrix */
close Sashelp.Class;                     /* close the data set */

The matrix X contains the raw data. Each column is a variable; each row is an observation. For many descriptive statistics, you can use a single function call to compute statistics across all columns. You can also compute multivariate statistics such as a correlation matrix:

mean = mean(X);                          /* mean of each column */
corr = corr(X);                          /* correlation matrix */
print mean[colname=varNames],
      corr[colname=varNames rowname=varNames];

You can use this technique to create vectors whose names are different from the names of data set variables. For example, in my blog posts I often load data into vectors named x and y to emphasize that the subsequent analysis will work for any data, not just for the example data.

Some of the advantages of reading data into a matrix are:

  • You can compute descriptive statistics for all columns by using a single function call.
  • You can sort, transpose, or reorder columns of the data.
  • You can compute row operations, such as the sum across rows.
  • You can compute multivariate statistics such as finding complete cases or computing a correlation matrix.
  • Many statistical analyses, such as least squares regression, have a natural formulation in terms of matrix operations.
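The same pattern (a numeric matrix whose columns are variables and rows are observations) appears in other matrix languages. A minimal NumPy sketch with made-up data values, not the actual Sashelp.Class data:

```python
import numpy as np

# Hypothetical data matrix: columns are variables (Age, Height, Weight),
# rows are observations, mirroring the READ ... INTO example
X = np.array([[14., 69.0, 112.5],
              [13., 56.5,  84.0],
              [13., 65.3,  98.0],
              [14., 62.8, 102.5]])

col_means = X.mean(axis=0)            # one call gives the mean of every column
corr = np.corrcoef(X, rowvar=False)   # correlation matrix of the columns
print(col_means)
print(corr)
```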

I usually read raw data into a matrix and summarized data into vectors, but as you can see, there are advantages to both approaches.

What technique do you use to read data into SAS/IML?


Rolling statistics in SAS/IML

Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software, and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed in a previous post. However, for complex rolling statistics, the SAS/IML language, which enables you to easily access previous observations (and even future ones!), is a more powerful tool.

This article shows how to implement various rolling statistics in SAS/IML. To keep the explanations and programs simple, the functions assume that there are no missing values in the data. The article "What is a moving average" explains the mathematical formulas used in this post.

A simple moving average function

The key to computing most rolling statistics is to define a rolling window of observations. At each time point, you extract the observations in the rolling window and use them to compute the statistic. You then move on to the next time point and repeat the computation. You might need to perform special computations at the beginning of the time series.

The following SAS/IML program implements a simple moving average. The arguments to the MA function are a column vector, Y, of time series values and a scalar value, k, which indicates the number of values to use for each moving average computation. The program loops over elements in Y. For each element, the program computes the mean of the current element and previous k-1 values.

proc iml;
/* Simple moving average of k data values.
   First k-1 values are assigned the mean of all previous values.
   Inputs:     y     column vector of length N >= k
               k     number of data points to use to construct each average
*/
start MA(y, k);
   MA = j(nrow(y), 1, .);
   do i = 1 to nrow(y);
      idx = max(1,(i-k+1)):i;   /* rolling window of data points */
      MA[i] = mean( y[idx] );   /* compute average */
   end;
   return ( MA );
finish;

The first k-1 values require special handling because they have fewer than k-1 prior observations to average. You could handle these values by using a separate loop. However, I chose to use the expression max(1, (i-k+1)) to select the first element of the rolling window. When i is less than k, this expression returns 1, and the program computes the mean of the first i values. Otherwise, this expression returns i-k+1, and the program computes the mean of k values.

The most important part of this computation is enumerating the time points to use in the computation (for example, idx = (i-k+1):i;) followed by extracting the associated data (for example, y[idx]). With these two expressions, you can compute any rolling statistic. For example, by changing the function call from MEAN to STD, you can compute a rolling standard deviation. The rolling median, rolling minimum, and rolling maximum are also trivial to implement. By changing the time points, you can compute rolling statistics for centered windows. If your data contain several variables, you can compute a rolling correlation.
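To make the window-extraction idea concrete outside SAS, here is a minimal Python sketch (the function `rolling_stat` and the test data are mine). Swapping the `stat` argument changes which rolling statistic you compute, exactly as described above:

```python
import numpy as np

def rolling_stat(y, k, stat=np.mean):
    """Apply `stat` to a trailing window of up to k values at each time point.
    The first k-1 values use all available prior data, as in the MA function."""
    y = np.asarray(y, dtype=float)
    out = np.empty(len(y))
    for i in range(len(y)):
        lo = max(0, i - k + 1)          # rolling window of data points
        out[i] = stat(y[lo:i + 1])
    return out

y = [1, 2, 3, 4, 5, 6]
ma = rolling_stat(y, 3)                  # moving average, k=3
rolling_max = rolling_stat(y, 3, np.max) # swap the statistic: rolling maximum
print(ma)          # [1.  1.5 2.  3.  4.  5. ]
print(rolling_max) # [1. 2. 3. 4. 5. 6.]
```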

A weighted moving average function

The following function computes a weighted moving average. The arguments to the WMA function are a column data vector, Y, and a vector of weights that has k elements. For each time point, wk (the last weight) is the weight for current data value, wk-1 is for the previous data value, and so forth. The function internally standardizes the weights so that they sum to unity. (This ordering was chosen so that the WMA function uses the same syntax as PROC EXPAND.) This function handles the first few values in a separate loop:

/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of weights. wt[k] is the weight for the most 
                     recent data value; wt[1] is for the most distant data value.
                     The function internally standardizes the weights so that 
                     sum(wt)=1.
   Example call: WMA = WMA(y, 1:5);
*/
start WMA(y, wt);
   w = colvec(wt) / sum(wt);       /* standardize weights so that sum(w)=1 */
   k = nrow(w);
   MA = j(nrow(y), 1, .);
   /* handle first k-1 values separately */
   do i = 1 to k-1;
      wIdx = (k-i+1):k;               /* index of the i most recent weights */
      tIdx = 1:i;                     /* index of the first i data values */
      MA[i] = sum(wt[wIdx]#y[tIdx]) / sum(wt[wIdx]);  /* weighted average */
   end;
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;               /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );       /* weighted sum of k data values */
   end;
   return ( MA );
finish;

Notice that the function requires computing a weighted mean, which is described in a previous article.
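The same trailing weighted average can be sketched in Python (the function name `wma` and the sample data are mine; weights follow the same convention, with the last weight applied to the current value):

```python
import numpy as np

def wma(y, wt):
    """Trailing weighted moving average; wt[-1] weights the current value.
    Weights are standardized so they sum to 1, as in the SAS/IML WMA function."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(wt, dtype=float)
    k = len(w)
    out = np.empty(len(y))
    for i in range(len(y)):
        if i < k - 1:                   # first k-1 values: use available data
            ww = w[k - 1 - i:]          # the i+1 most recent weights
            out[i] = np.sum(ww * y[:i + 1]) / ww.sum()
        else:                           # full window of k data values
            out[i] = np.sum(w * y[i - k + 1:i + 1]) / w.sum()
    return out

print(wma([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]))
```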

An exponentially weighted moving average function

An exponentially weighted moving average is defined recursively. The average at time t is a weighted average of the data point at time t and the average from time t-1. The relative weights are determined by the smoothing parameter, α. The following function implements that definition:

/* Exponentially weighted moving average (EWMA) with smoothing parameter alpha.
   Inputs:      y     column vector of length N
                alpha scalar value 0 < alpha < 1
*/
start EWMA(y, alpha);
   MA = j(nrow(y), 1, .);
   MA[1] = y[1];              /* initialize first value of smoother */
   do i = 2 to nrow(y);
      MA[i] = alpha*y[i] + (1-alpha)*MA[i-1];
   end;
   return ( MA );
finish;

The three moving average functions are now defined. You can read the time series data into a vector and call the functions. If necessary, you can write the rolling statistics to a SAS data set:

/* read time series data */
use Sashelp.Air;  
   read all var "date" into t;
   read all var "air" into y;
MA   = MA(y, 5);           /* moving average, k=5 */
WMA  = WMA(y, 1:5);        /* weighted moving average */
EWMA = EWMA(y, 0.3);       /* exponentially WMA, alpha=0.3 */
create out var{t y MA WMA EWMA};  append;  close out;

You can use the SGPLOT procedure to visualize the rolling statistics, as shown in my previous article.

Vectorizing time series computations

The experienced SAS/IML programmer will notice that these functions are not heavily vectorized. The MA and WMA computations use vectors of length k to compute the means and weighted means, respectively. It is possible to write these functions by using a matrix operation, but if the time series has N data points, the transformation matrix is an N x N lower-triangular banded matrix, which requires a lot of memory for large values of N.

Notice that the EWMA function uses scalar quantities inside a loop. For time series computations that use lagged data values, you can sometimes vectorize the time series computations. However, for operations that are defined recursively, such as the EWMA, the effort required to vectorize the computation might exceed the benefit you gain from the vectorization. In many cases, a function that uses a simple loop is fast and easy to maintain.
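As a concrete illustration of the memory trade-off, a trailing moving average can be vectorized with cumulative sums instead of an N x N banded matrix. This Python sketch is my own approach, not a translation of the SAS code:

```python
import numpy as np

def ma_cumsum(y, k):
    """Trailing moving average via cumulative sums: O(N) time and memory.
    Equivalent to the looped MA function, without an N x N banded matrix."""
    y = np.asarray(y, dtype=float)
    c = np.concatenate(([0.0], np.cumsum(y)))  # c[i] = sum of first i values
    n = len(y)
    i = np.arange(n)
    lo = np.maximum(0, i - k + 1)              # first index of each window
    return (c[i + 1] - c[lo]) / (i + 1 - lo)   # window sum / window length

print(ma_cumsum([1, 2, 3, 4, 5, 6], 3))        # [1.  1.5 2.  3.  4.  5. ]
```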


This article presents functions for computing rolling statistics in SAS/IML. Examples included a simple moving average (MA), a weighted moving average (WMA), and an exponentially weighted moving average (EWMA). The article describes how to modify these functions to compute other rolling statistics in SAS.

Computations of rolling statistics might not be easy to vectorize. Even when you can vectorize a computation, a simpler approach might run faster.


Group processing in SAS: The NOTSORTED option

Novice SAS programmers quickly learn the advantages of using PROC SORT to sort data, followed by a BY-group analysis of the sorted data. A typical example is to analyze demographic data by state or by ZIP code. A BY statement enables you to produce multiple analyses from a single procedure call.

In the usual BY-group processing of data, the data are sorted by the BY variables. However, there are situations in which you might not want to sort the data. For example, sorting can be an expensive operation for a huge data set. If your data are already grouped by ZIP code, you might want to analyze the data for each ZIP code in the order that they appear in the data set. You can sort the summarized statistics later, if necessary, which will be much faster than sorting the raw data.

SAS supports BY-group analysis of unsorted data. When you construct the BY statement, use the NOTSORTED option to tell a SAS procedure that groups are to be handled in the order in which they appear in the data.

The NOTSORTED option in the BY statement

The following example uses the NOTSORTED option in the BY statement to analyze data by groups. A teacher has recorded test scores for students in her class. The following DATA step creates 20 observations. There are four observations for each student:

data Grades;
input First $8. Last $10. @;
do Test = 1 to 4;
   input Score @; output;
end;
datalines;
Tim     Albertson  93  89  78  84
Sharad  Gupta     100  95  92  98
Tim     Williams   85  82  70  74
Mandy   Albertson  95  86  90  95
Min     Chen       88  92  85  95
;

The teacher has committed a serious data-quality mistake: she has failed to include a unique identifier (a "student ID") for her students! Nevertheless, provided that adjacent students in the data set do not share the same first name, the teacher can use the NOTSORTED option in the BY statement to analyze the scores without sorting the data:

proc means data=Grades noprint;
   by First notsorted;      /* <== use NOTSORTED option by first names */
   var Score;
   output out=Averages mean=;
run;

proc print data=Averages noobs; 
   var First Score; 
run;

The NOTSORTED option tells the procedure to analyze groups that are defined by the FIRST variable. It does not matter that the first names are not sorted or that there are two students named "Tim." The NOTSORTED option prevents the error message:

ERROR: Data set WORK.GRADES is not sorted in ascending sequence.

For this example, the data includes the last names of the students, so the teacher could sort the data by last name and first name and use the statement BY Last First to analyze the data. However, the NOTSORTED option does not require sorted data, which can be a huge advantage. The NOTSORTED option is supported in almost every SAS procedure and in the DATA step, with the obvious exception of PROC SORT!
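For readers who know Python, itertools.groupby has the same semantics as the NOTSORTED option: consecutive equal key values form a group, and a key value that reappears later in the data starts a new group. A small sketch using some of the same first names (the scores are abbreviated for brevity):

```python
from itertools import groupby
from statistics import mean

# Rows in data-set order: (First, Score). "Tim" appears in two separate runs,
# so he forms two separate groups, just as with the NOTSORTED option.
rows = [("Tim", 93), ("Tim", 89), ("Sharad", 100), ("Sharad", 95),
        ("Tim", 85), ("Tim", 82)]

averages = [(name, mean(score for _, score in grp))
            for name, grp in groupby(rows, key=lambda r: r[0])]
print(averages)    # three groups: Tim, Sharad, Tim
```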

For more examples of using the NOTSORTED option in BY-group analyses, see Samudral and Giddings (2006).

Group processing of unsorted data in SAS/IML

SAS/IML software does not support a BY statement, but you can use various techniques to process groups. The two primary techniques are the UNIQUE-LOC technique and the UNIQUEBY technique. The UNIQUEBY technique can analyze data in the order that they appear, regardless of whether the data are sorted, as shown in the following program:

proc iml;
use Grades;   read all var {First Score};   close;
/* the UNIQUEBY function does not require sorted data */
uu = uniqueby(First);      /* get first row for each student */
result = j(nrow(uu), 1);   /* allocate vector to hold results */
u = uu // (nrow(First)+1); /* trick: append (N+1) to end of indices */
do i = 1 to nrow(u)-1;     /* for each group... */
   idx = u[i]:(u[i+1]-1);  /* get rows in group */
   Y = Score[idx];
   result[i,] = mean(Y);
end;
print result[rowname=(First[uu]) colname="Mean" format=6.2];

In summary, if your data are grouped by an identifying variable, you can analyze the data without sorting. The identifying variable does not have to have unique values. In SAS procedures and the DATA step, you can use the NOTSORTED option in the BY statement. In PROC IML, you can use the UNIQUEBY function.


Compute a moving average in SAS

A common question on SAS discussion forums is how to compute a moving average in SAS. This article shows how to use PROC EXPAND and contains links to articles that use the DATA step or macros to compute moving averages in SAS.

Moving average in SAS

In a previous post, I explained how to define a moving average and provided an example, which is shown here. The graph is a scatter plot of the monthly closing price for IBM stock over a 20-year period. The three curves are moving averages. The "MA" curve is a five-point (trailing) moving average. The "WMA" curve is a weighted moving average with weights 1 through 5. (When computing the weighted moving average at time t, the value y_t has weight 5, the value y_{t-1} has weight 4, the value y_{t-2} has weight 3, and so forth.) The "EWMA" curve is an exponentially weighted moving average with smoothing factor α = 0.3.

This article shows how to use the EXPAND procedure in SAS/ETS software to compute a simple moving average, a weighted moving average, and an exponentially weighted moving average in SAS. For an overview of PROC EXPAND and its many capabilities, I recommend reading the short paper "Stupid Human Tricks with PROC EXPAND" by David Cassell (2010).

Because not every SAS customer has a license for SAS/ETS software, there are links at the end of this article that show how to compute a simple moving average in SAS by using the DATA step.

Create an example time series

Before you can compute a moving average in SAS, you need data. The following call to PROC SORT creates an example time series with 233 observations. There are no missing values. The data are sorted by the time variable, T. The variable Y contains the monthly closing price of IBM stock during a 20-year period.

/* create example data: IBM stock price */
title "Monthly IBM Stock Price";
proc sort data=sashelp.stocks(where=(STOCK='IBM') rename=(Date=t Close=y)) 
          out=Series;
  by t;
run;

Compute a moving average in SAS by using PROC EXPAND

PROC EXPAND computes many kinds of moving averages and other rolling statistics, such as rolling standard deviations, correlations, and cumulative sums of squares.

In the procedure, the ID statement identifies the time variable, T. The data should be sorted by the ID variable. The CONVERT statement specifies the names of the input and output variables. The TRANSFORMOUT= option specifies the method and parameters that are used to compute the rolling statistics.

/* create three moving average curves */
proc expand data=Series out=out method=none;
   id t;
   convert y = MA   / transformout=(movave 5);
   convert y = WMA  / transformout=(movave(1 2 3 4 5)); 
   convert y = EWMA / transformout=(ewma 0.3);
run;

The example uses three CONVERT statements:

  • The first specifies that MA is an output variable that is computed as a (backward) moving average that uses five data values (k=5).
  • The second CONVERT statement specifies that WMA is an output variable that is a weighted moving average. The weights are automatically standardized by the procedure, so the formula is WMA(t) = (5 y_t + 4 y_{t-1} + 3 y_{t-2} + 2 y_{t-3} + 1 y_{t-4}) / 15.
  • The third CONVERT statement specifies that EWMA is an output variable that is an exponentially weighted moving average with parameter 0.3.

Notice the METHOD=NONE option on the PROC EXPAND statement. By default, the EXPAND procedure fits cubic spline curves to the nonmissing values of variables. The METHOD=NONE option ensures that the raw data points are used to compute the moving averages, rather than interpolated values.

Visualizing moving averages

An important use of a moving average is to overlay a curve on a scatter plot of the raw data. This enables you to visualize short-term trends in the data. The following call to PROC SGPLOT creates the graph at the top of this article:

proc sgplot data=out cycleattrs;
   series x=t y=MA   / name='MA'   legendlabel="MA(5)";
   series x=t y=WMA  / name='WMA'  legendlabel="WMA(1,2,3,4,5)";
   series x=t y=EWMA / name='EWMA' legendlabel="EWMA(0.3)";
   scatter x=t y=y;
   keylegend 'MA' 'WMA' 'EWMA';
   xaxis display=(nolabel) grid;
   yaxis label="Closing Price" grid;
run;

To keep this article as simple as possible, I have not discussed how to handle missing data when computing moving averages. See the documentation for PROC EXPAND for various issues related to missing data. In particular, you can use the METHOD= option to specify how to interpolate missing values. You can also use transformation options to control how moving averages are defined for the first few data points.

Create a moving average in SAS by using the DATA step

If you do not have SAS/ETS software, the following references show how to use the SAS DATA step to compute simple moving averages by using the LAG function:

The DATA step, which is designed to handle one observation at a time, is not the best tool for time series computations, which naturally require multiple observations (lags and leads). In a future blog post, I will show how to write SAS/IML functions that compute simple, weighted, and exponentially weighted moving averages. The matrix language in PROC IML is easier to work with for computations that require accessing multiple time points.


What is a moving average?

A moving average (also called a rolling average) is a statistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling average of previous values. For example, the following graph shows the monthly closing price of IBM stock over a 20-year period. Three kinds of moving averages are overlaid on a scatter plot of the data.

Moving average of stock price

The IBM stock price increased in some time periods and decreased in others. The moving-average curves help to visualize these trends and identify these time periods. For a simple moving average, the smoothness of a curve is determined by the number of time points, k, that is used to compute the moving average. Small values of k result in curves that reflect the short-term ups and downs of the data; curves for large values of k undulate less. For stock charts that show daily prices, the 30-day moving average and the 5-day moving average are popular choices.

How do you define a moving average?

The most common moving averages are the simple moving average (MA), the weighted moving average (WMA), and the exponentially weighted moving average (EWMA). The following list provides a brief description and mathematical formula for these kinds of moving averages. See the Wikipedia article on moving averages for additional details.

Let {y_0, y_1, ..., y_t, ...} be the time series that you want to smooth, where y_t is the value of the response at time t.

  • The simple moving average at time t is the arithmetic mean of the series at y_t and the previous k-1 time points. In symbols,
          MA(t; k) = (1/k) Σ y_i
    where the summation is over the k values {y_{t-k+1}, ..., y_t}.
  • The weighted moving average (WMA) at time t is a weighted average of the series at y_t and the previous k-1 time points. Typically the weights monotonically decrease so that data from "long ago" contribute less to the average than recent data. If the weights sum to unity (Σ w_i = 1), then
          WMA(t; k) = Σ w_i y_i
    If the weights do not sum to unity, then divide that expression by Σ w_i.
  • The exponentially weighted moving average (EWMA) does not use a finite rolling window. Instead of the parameter k, the EWMA uses a decay parameter α, where 0 < α < 1. The smoothed value at time t is defined recursively as
          EWMA(t; α) = α y_t + (1 - α) EWMA(t-1; α)
    You can "unwind" this equation to obtain the EWMA as a WMA whose weights decrease geometrically. The choice of α determines the smoothness of the EWMA. A value of α ≈ 1 implies that older data contribute very little to the average. Conversely, small values of α imply that older data contribute to the moving average almost as much as newer data.

Each of these definitions contains an ambiguity for the first few values of the moving average. For example, if t < k, then there are fewer than k previous values in the MA and WMA methods. Some practitioners assign missing values to the first k-1 values, whereas others average the values even when fewer than k previous data points exist. For the EWMA, the recursive definition requires a value for EWMA(0; α), which is often chosen to be y_0.
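You can check the "unwinding" claim numerically: with EWMA(0; α) = y_0, the recursive EWMA equals a weighted sum with geometrically decreasing weights α(1-α)^j, plus a (1-α)^t contribution from the initial value. A Python sketch (the function and the data values are mine):

```python
import numpy as np

def ewma(y, alpha):
    """Recursive EWMA with EWMA(0) = y[0]."""
    out = np.empty(len(y))
    out[0] = y[0]
    for i in range(1, len(y)):
        out[i] = alpha * y[i] + (1 - alpha) * out[i - 1]
    return out

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
alpha = 0.3
rec = ewma(y, alpha)

# Unwound form at time t: sum of alpha*(1-alpha)^j * y[t-j] for j = 0..t-1,
# plus (1-alpha)^t * y[0] from the initial condition
t = len(y) - 1
j = np.arange(t)
unwound = np.sum(alpha * (1 - alpha)**j * y[t - j]) + (1 - alpha)**t * y[0]
print(rec[t], unwound)   # the two values agree
```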

My next blog post shows how to compute various moving averages in SAS. The article shows how to create the IBM stock price example, which is a time series plot overlaid with MA, WMA, and EWMA curves.


Banking to 45 degrees: Aspect ratios for time series plots

In SAS, the aspect ratio of a graph is the physical height of the graph divided by the physical width. Recently I demonstrated how to set the aspect ratio of graphs in SAS by using the ASPECT= option in PROC SGPLOT or by using the OVERLAYEQUATED statement in the Graph Template Language (GTL). I mentioned that a unit aspect ratio is important when you want to visually display distance between points in a scatter plot.

A second use of the aspect ratio is when plotting a time series. A time series is a sequence of line segments that connect data values (xi, yi), i = 0..N. Thus a plot involves N line segments, each with a slope and length. In the late 1980s and early '90s, William Cleveland and other researchers investigated how humans perceive graphical information. For a time series plot, the rate of change (slope) of the time series segments is important. Cleveland suggested that practitioners should use "banking to 45 degrees" as a rule-of-thumb that improves perception of the rate-of-change in the plot.

My SAS colleague Prashant Hebbar wrote a nice article about how to use the GTL to specify the aspect ratio of a graph so that the median absolute slope is 45 degrees. Whereas Prashant focused on creating the graphics, the current article shows how to compute aspect ratios by using three banking techniques from Cleveland.

Three computational methods for choosing an aspect ratio

The simplest computation of an aspect ratio is the median absolute slope (MAS) method (Cleveland, McGill, and McGill, JASA, 1988). As its name implies, the method computes the median of the absolute values of the slopes. To obtain a graph in which the median slope of the (physical) line segments is unity, set the aspect ratio of the graph to be the reciprocal of the median value.

The median-absolute-slope method is a simple way to choose an aspect ratio. A more complex analysis (Cleveland, Visualizing Data, 1993, p. 90) uses the orientation of the line segments. The orientation of a line segment is the quantity arctan(dy/dx) where dy is the change in physical units in the vertical direction and dx is the change in physical units in the horizontal direction.

Cleveland proposes setting the aspect ratio so that the average of the absolute values of the orientations is 45 degrees. This is called "banking the average orientation to 45 degrees" or the AO (absolute orientation) method. Another approach is to weight the line segments by their length (in physical units), compute a weighted mean, and find the aspect ratio that results in an average weighted orientation of 45 degrees. This is called "banking the weighted average of the absolute orientations to 45 degrees" or the AWO (average weighted orientation) method.

The AO and AWO methods require solving for the aspect ratio that satisfies a nonlinear equation that involves the arctangent function. This is equivalent to finding the root of a function, and the easiest way to find a root in SAS is to use the FROOT function in SAS/IML software.
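If you want to experiment with the AO criterion outside of SAS/IML, the root-finding step is easy to mimic with bisection. The following Python sketch (all names are mine; this is a stand-in for a FROOT-based solution, not the SAS implementation) solves for the aspect ratio a at which the mean absolute orientation of the segments, mean(|atan(a*s_i)|), equals 45 degrees:

```python
import math

def ao_aspect_ratio(x, y, tol=1e-10):
    """Average-orientation (AO) banking: find the aspect ratio a for which
    the mean absolute orientation of the segments is 45 degrees (pi/4)."""
    rx = max(x) - min(x)
    ry = max(y) - min(y)
    # slopes scaled by the data ranges, as in the MAS method
    s = [((y[i]-y[i-1])/ry) / ((x[i]-x[i-1])/rx) for i in range(1, len(x))]
    def f(a):                          # increases monotonically with a
        return sum(abs(math.atan(a*si)) for si in s)/len(s) - math.pi/4
    lo, hi = 1e-8, 1e8                 # bracket that contains the root
    while hi - lo > tol * max(1.0, lo):
        mid = math.sqrt(lo * hi)       # bisect on a logarithmic scale
        if f(mid) < 0:                 # orientations too shallow: increase a
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The AWO method is similar, except that each |atan(a*s_i)| term is weighted by the physical length of the i_th segment, which itself depends on a.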

For a nice summary of these and other banking techniques, see Heer and Agrawala (2006). However, be aware that Heer and Agrawala define the aspect ratio as width divided by height, which is the reciprocal of the definition that is used by Cleveland (and SAS).

Banking to 45 degrees

To illustrate the various methods, I used the MELANOMA data set, which contains the incidences of melanoma (skin cancer) per one million people in the state of Connecticut that were observed each year from 1936 through 1972. This is one of the data sets in Visualizing Data (Cleveland, 1993). You can download all data sets in the book and create a libref to the data set directory. (My libref is called vizdata.)


The line plot shows the melanoma time series. The graph's subtitle indicates that the aspect ratio is 0.4, which means that the plot is about 40% as tall as it is wide. With this scaling, about half of the line segments have slopes that are between ±1. This aspect ratio was chosen by using the median-absolute-slope method.

Prashant has already shown how to use the DATA step and Base procedures to implement the median-absolute-slope method, so I will provide an implementation in the SAS/IML language. The following SAS/IML program reads the data and uses the DIF function to compute the differences between yearly values. These differences are then scaled by the ranges of the data. Assuming that the physical sizes of the axes are proportional to the data ranges (and ignoring offsets and margins), the ratios of these scaled differences are the physical slopes of the line segments. The computation shows that the median absolute slope is about 2.7. Therefore, the reciprocal of this value (0.37) is the scale factor for which half the slopes are between ±1 and the other half are more extreme. For convenience, I rounded that value to 0.4.

/* Data from Cleveland (1993) */
libname vizdata "C:\Users\frwick\Documents\My SAS Files\vizdata"; 
title "Melanoma Data";
proc iml;
use vizdata.Melanoma;               /* see Cleveland (1993, p. 158) */
   read all var "Year" into x;
   read all var "Incidence" into y;
dx = dif(x) / range(x);             /* scaled change in horizontal directions */ 
dy = dif(y) / range(y);             /* scaled change in vertical directions */ 
m = median( abs(dy/dx) );           /* median of absolute slopes */
print m[L="Median Slope"] (1/m)[L="Aspect Ratio"];

Three methods of banking to 45 degrees

I used SAS/IML to implement Cleveland's three techniques (MAS, AO, AWO) and ran each technique on the melanoma data. You can download the complete SAS/IML program that computes the aspect ratios. The following table gives the ideal aspect ratios for the melanoma data as computed by each of Cleveland's methods:


As you can see, the three methods are roughly in agreement for these data. An aspect ratio of approximately 0.4 (as shown previously) will create a time series plot in which the rate of change of the segments is apparent. If there is a large discrepancy between the values, Cleveland recommends using the average weighted orientation (AWO) technique.

The complete SAS program also analyzes carbon dioxide (CO2) data that were analyzed by Cleveland (1993, p. 159) and by Heer and Agrawala. Again, the three methods give similar results to each other.

You might wonder why I didn't choose an exact value such as 0.3518795 for the aspect ratio of the melanoma time series plot. The reason is that these "banking" computations are intended to guide the creation of statistical graphics, and using 0.4 (a 2:5 ratio of vertical to horizontal scales) is more interpretable. Think of the numbers as suggestions rather than mandates. As Cleveland, McGill, and McGill (1988) said, banking to 45 degrees "cannot be used in a fully automated way. (No data-analytic procedure can be.) It needs to be tempered with judgment."

So use your judgment in conjunction with these computations to create time series plots in SAS whose line segments are banked to 45 degrees. After you compute an appropriate aspect ratio, you can create the graph by using the ASPECT= option in PROC SGPLOT or the ASPECTRATIO option on a LAYOUT OVERLAY statement in GTL.


Create a SAS macro variable that contains a list of values

Parameters in SAS procedures are specified as a list of values that you manually type into the procedure syntax. For example, if you want to specify a list of percentile values in PROC UNIVARIATE, you need to type the values into the PCTLPTS= option as follows:

proc univariate data=sashelp.Cars noprint;
   var MPG_City;
   output out=pctl pctlpre=p 
          pctlpts=2.5 5 10 25 50 75 90 95 97.5; /* <== type long list */
run;

Obviously, typing many values is tedious and error prone. Furthermore, you might have parameter values that are contained in a SAS data set because they were computed by a previous program. A few procedures support reading parameters directly from a SAS data set, but most require that you type the values manually. Wouldn't it be convenient if you could get the parameter values out of a data set and insert them into the syntax without having to type them?

This is a common question that SAS programmers ask on discussion forums. This article shows how you can read the values of a variable from a data set into a SAS macro variable. The macro variable will contain a blank-delimited list of values. You can then use the "macro list" to specify the parameters to a SAS procedure.

Read values into a SAS macro variable

As an example, the following DATA step contains the parameter values for the PCTLPTS= option in the previous PROC UNIVARIATE call:

data ParamData;
input x @@;
datalines;
2.5 5 10 25 50 75 90 95 97.5
;

The SQL procedure provides an easy way to concatenate the values of a variable into a macro variable. Use the SEPARATED BY keywords to specify a character to delimit the values. Usually the delimiter is a blank character, as below, but in some situations you might want to use a comma.

/* put all values into macro variable named ParamList */
proc sql noprint;                              
 select x into :ParamList separated by ' '
 from ParamData;
quit;
%put ParamList = &ParamList;   /* display list in SAS log */
ParamList = 2.5 5 10 25 50 75 90 95 97.5

The PROC SQL statement reads the values from the x variable in the ParamData data set. It converts them to character values and concatenates them into a blank-separated string, which is then stored into the macro variable named ParamList. To use the parameter list, simply use an ampersand (&) to reference the value of the macro variable, as follows:

proc univariate data=sashelp.Cars noprint;
   var MPG_City;
   output out=pctl pctlpre=p pctlpts=&ParamList; /* long list in macro */
run;

Read values into a "macro list" from a SAS/IML vector

You can use a similar technique to create a macro variable that contains values that are in a SAS/IML vector. To create a macro variable, use ideas from the articles about concatenating a vector of values into a string and about using the SYMPUTX subroutine. For convenience, the following statements define a SAS/IML module that implements the technique:

proc iml;
/* Concatenate values into a string separated by a delimiter (by default, 
   a blank). Create a macro variable with the specified name. */
start CreateMacro(values, macroName, delimiter=' ');
   if type(values)='N' then          
      y = rowvec( char(values) );   /* convert numeric to character */
   else
      y = rowvec(values);           /* character values */
   s = rowcat(y + delimiter);       /* delimit and concatenate */
   s = substr(s, 1, nleng(s)-nleng(delimiter)); /* remove delimiter at end */
   call symputx(macroName, s);      /* create macro variable */
finish;

use ParamData; read all var {x}; close;
call CreateMacro(x, "ParamList");

When you call the CreateMacro module, you specify a vector of values and the name of a macro variable. By default, the module creates a blank-delimited list, but you can override that default. After the call, you can then use the macro variable in other SAS procedures.
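The logic of the module (convert numeric values to text, then join them with a delimiter) is the same in any language. Here is an illustrative Python equivalent (the function name is mine); of course, Python has no SAS macro variables, so the sketch shows only the string-building step:

```python
def create_macro_list(values, delimiter=' '):
    """Convert values to strings and join them with a delimiter,
    mirroring the logic of the CreateMacro module."""
    return delimiter.join(str(v) for v in values)

create_macro_list([2.5, 5, 10, 25])      # '2.5 5 10 25'
create_macro_list(['a', 'b', 'c'], ',')  # 'a,b,c'
```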

By the way, you don't need a macro variable if you intend to call a SAS procedure directly from your SAS/IML program. You can pass SAS/IML vectors directly to SAS procedures by using the SUBMIT statement. However, when you do need a macro variable, this technique is worth remembering.


Compute the centroid of a polygon in SAS

Recently I blogged about how to compute a weighted mean and showed that you can use a weighted mean to compute the center of mass for a system of N point masses in the plane. That led me to think about a related problem: computing the center of mass (called the centroid) for a planar polygon.

If you cut a convex polygon out of stiff cardboard, the centroid is the position where the polygon would balance on the tip of a needle. However, for non-convex polygons, the centroid might not be located in the interior of the polygon.

For a general planar figure, the mathematical formula for computing the centroid requires evaluating an integral. However, for polygons there are some well-known equations that involve only the locations of the vertices (provided that the vertices are enumerated in a consistent clockwise or counter-clockwise manner). The Wikipedia article about the centroid gives a useful computational formula. It turns out that the (signed) area of a simple polygon with vertices (x0, y0), (x1, y1), ..., (xN-1, yN-1), is given by

A = (1/2) Σ_{i=0}^{N-1} ( x_i y_{i+1} − x_{i+1} y_i )

In the formula, the vertex (xN, yN) is identified with the first vertex, (x0, y0). The centroid of the polygon occurs at the point (Cx, Cy), where

Cx = 1/(6A) Σ_{i=0}^{N-1} ( x_i + x_{i+1} ) ( x_i y_{i+1} − x_{i+1} y_i )
Cy = 1/(6A) Σ_{i=0}^{N-1} ( y_i + y_{i+1} ) ( x_i y_{i+1} − x_{i+1} y_i )

A polygon whose vertices are enumerated in a counter-clockwise manner has a positive value of A. The value of A is negative when the vertices are enumerated in a clockwise manner.

In the SAS language, arrays use 1-based indexing by convention. Therefore, when implementing the formula in SAS, it is customary for the vertices to be enumerated from 1 to N. The limits in the summation formulas are modified similarly. The (N+1)st vertex is identified with the first vertex.
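As a quick numeric check of the area and centroid formulas, the following Python sketch (variable names are mine) implements them directly with a loop and 0-based indexing:

```python
def polygon_centroid(pts):
    """Return the signed area A and the centroid (Cx, Cy) of a simple
    polygon, where pts is a list of (x, y) vertices in order."""
    n = len(pts)
    A = Sx = Sy = 0.0
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i+1) % n]        # vertex N is identified with vertex 0
        d = x0*y1 - x1*y0              # common factor in all three sums
        A  += d
        Sx += (x0 + x1) * d
        Sy += (y0 + y1) * d
    A *= 0.5
    return A, (Sx/(6*A), Sy/(6*A))

# L-shaped polygon with vertices enumerated counter-clockwise
L = [(0, 0), (6, 0), (6, 1), (2, 1), (2, 3), (0, 3)]
A, centroid = polygon_centroid(L)      # A = 10.0, centroid = (2.2, 1.1)
```

Reversing the order of the vertices changes the sign of A, but the centroid is unchanged because the sign cancels in the ratios.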

Centroids are sometimes used as a position to label a country or state on a map. For SAS customers who have a license for SAS/GRAPH, the official SAS %CENTROID macro computes the centroid of a polygonal region. It is designed for use with map data sets, which have variables named ID, X, and Y. My colleague Sanjay Matange provides a similar macro in his blog post "A Macro for Polygon Area and Center."

A vectorized SAS/IML function that computes a centroid

Implementing the centroid formulas in SAS/IML is a good exercise in vectorization. Recall that a function is vectorized if it avoids unnecessary loops and instead uses vector and matrix operations to perform the computations. For the area and centroid formulas, you can form a vector that contains all of the values that are to be summed. You can use the elementwise multiplication operator (#) to compute the elementwise product of two vectors. The SUM function adds up all elements in a vector.

The following function computes the centroid of a single polygon in a vectorized manner. The first few statements build the vectors from the values of the vertices; the last four statements compute the area and centroid by using vectorized operations. There are no loops.

proc iml;
/* _PolyCentroid: Return the centroid of a simple polygon.
   P  is an N x 2 matrix of (x,y) values for consecutive vertices. N > 2. */
start _PolyCentroid(P);
   lagIdx = 2:nrow(P) || 1;              /* vertices 2, 3, ..., N, 1 */
   xi   = P[ ,1];       yi = P[ ,2];     /* x_i and y_i for i=1..N   */
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; /* x_{i+1} and y_{i+1}, x_{N+1}=x_1 */
   d = xi#yip1 - xip1#yi;                /* vector of difference of products */
   A = 0.5 * sum(d);                     /* signed area of polygon */
   cx = sum((xi + xip1) # d) / (6*A);    /* X coord of centroid */
   cy = sum((yi + yip1) # d) / (6*A);    /* Y coord of centroid */
   return( cx || cy );
finish;

/* test the function by using an L-shaped polygon */
L = {0 0, 6 0, 6 1, 2 1, 2 3, 0 3};
centroid = _PolyCentroid(L);
print centroid;

For this L-shaped polygon, the centroid is located outside the polygon. This example shows that the centroid might not be the best position for a label for a polygon. Examples of this phenomenon are easy to find on real maps. For example, the states of Louisiana and Florida have centroids that are near the boundaries of the states.

Compute the centroid of many polygons

The %CENTROID macro can compute the centroids for multiple polygons, assuming that there is an identifying categorical variable (called the ID variable) whose unique values distinguish one polygon from another.

If you want to compute the centroids of multiple polygons in SAS/IML, you can use the UNIQUEBY function to extract the rows that correspond to each polygon. For each polygon, the previous _PolyCentroid function is called, as follows:

/* If the polygon P has two columns, return the centroid. If it has three 
   columns, assume that the third column is an ID variable that identifies
   distinct polygons. Return the centroids of the multiple polygons. */
start PolyCentroid(P);
   if ncol(P)=2 then  return( _PolyCentroid(P) );
   ID = P[,3];
   u = uniqueby(ID);         /* starting index for each group */
   result = j(nrow(u), 2);   /* allocate vector to hold results */
   u = u // (nrow(ID)+1);    /* append (N+1) to end of indices */
   do i = 1 to nrow(u)-1;    /* for each group... */
      idx = u[i]:(u[i+1]-1); /* get rows in group */
      result[i,] = _PolyCentroid( P[idx, 1:2] ); 
   end;
   return( result );
finish;
/* test it on an example */
rect = {0 0,  2 0,  2 1,  0 1};
ID = j(nrow(L), 1, 1) // j(nrow(rect), 1, 2);
P = (L // rect) || ID;
centroids = PolyCentroid(P);
print centroids;

In summary, you can construct an efficient function in SAS/IML to compute the centroid of a simple polygon. The construction shows how to vectorize a computation by putting the vertices in vectors, rather than the less efficient alternative, which is to iterate over the vertices in a loop. Because polygons are often used to display geographical regions, I also included a function that iterates over polygons and computes the centroid of each.


Twelve posts from 2015 that deserve a second look

I began 2016 by compiling a list of popular articles from my blog in 2015. This "People's Choice" list contains many interesting articles, but some of my personal favorites did not make the list. Today I present the "Editor's Choice" list of articles that deserve a second look.

I've grouped the articles into four broad topics: SAS programming, computational statistics, statistical graphics, and matrix computations.

SAS programming

These two articles use Base SAS to accomplish some useful tasks:

Computational statistics

For most analyses, you can call a built-in SAS procedure. However, I enjoy showing how to extend, modify, or explain a standard analysis by writing additional programming statements.

Statistical graphics

Although many SAS procedures automatically produce appropriate statistical graphics, it is sometimes necessary to use the SGPLOT procedure or the Graph Template Language (GTL) to create your own graphics.

Matrix computations

From the many articles that I wrote about matrix computations, the following list presents three topics that SAS programmers often ask about on discussion forums:

There you have it, 12 topics that I think are worth a second look. Did I omit your favorite article from The DO Loop in 2015? Leave a comment.
