Create dummy variables in SAS

A dummy variable (also known as indicator variable) is a numeric variable that indicates the presence or absence of some level of a categorical variable. The word "dummy" does not imply that these variables are not smart. Rather, dummy variables serve as a substitute or a proxy for a categorical variable, just as a "crash-test dummy" is a substitute for a crash victim, or a "sewing dummy" is a dressmaker's proxy for the human body.

In regression and other statistical analyses, a categorical variable can be replaced by dummy variables. For example, a categorical variable with levels "Low," "Moderate," and "High" can be represented by using three binary dummy variables. The first dummy variable has the value 1 for observations that have the level "Low," and 0 for the other observations. The second dummy variable has the value 1 for observations that have the level "Moderate," and zero for the others. The third dummy variable encodes the "High" level.
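For concreteness, here is a minimal DATA step sketch that manually encodes a variable with the levels "Low," "Moderate," and "High" by using Boolean expressions. The data set and variable names (Have, Want, Risk) are hypothetical:

data Want;
   set Have;                        /* assumes Have contains the variable Risk */
   Low      = (Risk = "Low");       /* 1 if Risk="Low", else 0 */
   Moderate = (Risk = "Moderate");
   High     = (Risk = "High");
run;

Each comparison evaluates to 1 or 0, so the three new variables are binary dummy variables.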

There are many ways to construct dummy variables in SAS. Some programmers use the DATA step, but there is an easier way. This article discusses the GLMMOD procedure, which produces basic binary dummy variables. A follow-up article discusses other SAS procedures that create a design matrix for representing categorical variables.

Why generate dummy variables in SAS?

Many programmers never have to generate dummy variables in SAS because most SAS procedures that model categorical variables contain a CLASS statement. If a procedure contains a CLASS statement, then the procedure will automatically create and use dummy variables as part of the analysis.

However, it can be useful to create a SAS data set that explicitly contains a design matrix, which is a numerical matrix that uses dummy variables to represent categorical variables. A design matrix also includes columns for continuous variables, the intercept term, and interaction effects. A few reasons to generate a design matrix are:

  • Students might need to create a design matrix so that they can fully understand the connections between regression models and matrix computations.
  • If a SAS procedure does not support a CLASS statement, you can often use dummy variables in place of a classification variable. An example is PROC REG, which does not support the CLASS statement, although for most regression analyses you can use PROC GLM or PROC GLMSELECT. Another example is the MCMC procedure, whose documentation includes an example that creates a design matrix for a Bayesian regression model.
  • In simulation studies of regression models, it is easy to generate responses by using matrix computations with a numerical design matrix. It is harder to use classification variables directly.
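As a sketch of the last idea, the following SAS/IML statements simulate responses from a design matrix that has six numeric columns named COL1-COL6, such as the GLMDesign data set that PROC GLMMOD creates in the next section. The coefficient values are hypothetical:

proc iml;
varNames = "COL1":"COL6";            /* names of the dummy-variable columns */
use GLMDesign;  read all var varNames into X;  close GLMDesign;
beta = {200, 10, -10, 5, 0, -5};     /* hypothetical regression coefficients */
call randseed(1);
eps = randfun(nrow(X), "Normal", 0, 25);   /* random errors */
y = X*beta + eps;                    /* simulated responses: y = X*beta + eps */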

PROC GLMMOD: Design matrices that use the GLM parameterization

The following DATA step creates a data set with 10 observations. It has one continuous variable (Cholesterol) and two categorical variables. One categorical variable (Sex) has two levels and the other (BP_Status) has three levels.

data Patients;
   keep Cholesterol Sex BP_Status;
   set sashelp.heart;
   if 18 <= _N_ <= 27;
run;
proc print data=Patients;  var Cholesterol Sex BP_Status;  run;
Original data with categorical variables

The GLMMOD procedure can create dummy variables for each categorical variable. If a categorical variable contains k levels, the GLMMOD procedure creates k binary dummy variables. The GLMMOD procedure uses the same MODEL statement syntax as PROC GLM, so it is easy to specify interaction effects.

The following call to PROC GLMMOD creates an output data set that contains the dummy variables. The output data set is named by using the OUTDESIGN= option. The OUTPARAM= option creates a second data set that associates each dummy variable to a level of a categorical variable:

proc glmmod data=Patients outdesign=GLMDesign outparm=GLMParm;
   class Sex BP_Status;
   model Cholesterol = Sex BP_Status;
run;
proc print data=GLMDesign; run;
proc print data=GLMParm; run;
Dummy variables in SAS for each level of the categorical variables

The OUTDESIGN= data set contains the design matrix, which includes variables named COL1, COL2, COL3, and so forth. The OUTPARM= data set associates levels of the original variables to the dummy variables. For these data, the GLMMOD procedure creates six binary columns. The first is the intercept column. The next two encode the Sex variable. The last three encode the BP_Status variable. If you specify interactions between the original variables, additional dummy variables are created. Notice that the order of the columns is the sort order of the values of their levels. For example, the "Female" column appears before the "Male" column.

When you use this design matrix in a regression analysis, the parameter estimates of main effects estimate the difference in the effects of each level compared to the last level (in alphabetical order). The following statements show that using the dummy variables in PROC REG gives the same parameter estimates as are obtained by using the original classification variables in PROC GLM:

ods graphics off;
/* regression analysis by using dummy variables */
proc reg data=GLMDesign;
   DummyVars: model Cholesterol = COL2-COL6; /* dummy variables except intercept */
   ods select ParameterEstimates;
quit;
/* same analysis by using the CLASS statement */
proc glm data=Patients;
   class Sex BP_Status;              /* generates dummy variables internally */
   model Cholesterol = Sex BP_Status / solution;
   ods select ParameterEstimates;
quit;
PROC REG output for dummy variables in SAS

The parameter estimates from PROC REG are shown. The parameter estimates from PROC GLM are identical. Notice that the parameter estimates for the last level of each categorical variable are set to zero and the standard errors are assigned missing values. This occurs because the last dummy variable for each categorical variable is redundant. For example, the second dummy variable for the Sex variable ("Male") is a linear combination of the intercept column and the dummy variable for "Female". Similarly, the last dummy variable for the BP_Status variable ("Optimal") is a linear combination of the intercept column and the "High" and "Normal" dummy variables. By setting the parameter estimate to zero, the last column for each set of dummy variables does not contribute to the model.

For this reason, the GLM encoding is called a singular parameterization. In my next blog post I will present other ways to parameterize the levels of categorical variables. These alternative parameterizations lead to nonsingular design matrices.


A simple trick to include (and order!) all categories in SGPLOT legends

Last week Sanjay Matange wrote about a new SAS 9.4m3 option that enables you to show all categories in a graph legend, even when the data do not contain all the categories. Sanjay's example was a chart that showed medical conditions classified according to the scale "Mild," "Moderate," and "Severe." He showed how to display all these categories in a legend, even if the data set does not contain any instances of "Severe." The technique is valid for SAS 9.4m3, in which the DATTRMAP= data set supports a special column named SHOW that tells the legend the complete list of categories.

If you are running an earlier version of SAS, you can use a trick that accomplishes the same result. The trick is to create a set of all categories (in the order you want them to appear in the legend) and prepend these "fake observations" to the top of your data set. All other variables for the fake observations are set to missing values. When PROC SGPLOT reads the data for the categorical variable, it encounters all categories. However, the missing values in the other variables prevent the fake observations from appearing in the graph. (The exception is a graph that shows ONLY the categorical variable, but you can handle that case, too.)

Data that excludes a valid category

Let's create a data set that shows the problem. The SasHelp.Heart data set contains observations about patients in a medical study. The data set includes variables for the height and weight of the patient and a categorical variable called Weight_Status that has the values "Underweight," "Normal," and "Overweight." The following DATA step extracts a subset of 200 observations in which no patient is "Underweight." The call to PROC SGPLOT creates a scatter plot of the heights and weights and uses the GROUP= option to color each marker according to the Weight_Status variable.

data Heart;
set Sashelp.Heart;
where Weight >= 125;
keep Height Weight Weight_Status;
if _N_ <= 200;
run;
/* This scatter plot shows three problems: 
   1) The order of GROUP= variable is unspecified (default is GROUPORDER=DATA)
   2) The colors are assigned to the wrong categories
   3) The "Underweight" category is missing from the legend */
title "This scatter plot has problems!";
proc sgplot data=Heart; 
  /* attempt to assign colors for underweight=green, normal=blue, overweight=red */
  styleattrs datacontrastcolors = (lightgreen blue lightred); 
  scatter x=Height y=Weight / group=Weight_Status 
                              markerattrs=(size=9 symbol=CircleFilled);
run;
Legend does not show all categories

There are several problems with this scatter plot. I tried to use the STYLEATTRS statement to assign the colors green, blue, and red to the categories "Underweight," "Normal," and "Overweight," respectively. However, that effort was thwarted by the fact that the default order of the colors is determined by the order of the (two!) categories in the data set. How can I get the correct colors, and also get the legend to display the "Underweight" category?

A useful trick: Prepend fake data

The "fake data" trick is useful in many situations, not just for legends. I have used it to specify the order of a categorical variable in a graph or analysis. For example, it is useful for a PROC FREQ analysis because PROC FREQ supports an ORDER=DATA option.

The trick has three steps:

  1. Create a data set that contains only one categorical variable. Specify the complete set of possible values in the order in which you want the values to be displayed.
  2. Use the SET statement in the DATA step to append the real data after the fake data. The DATA step will automatically assign missing values to all unspecified variables for the fake observations. On the SET statement, use the IN= data set option to create a new indicator variable called FREQ. This new variable will have the value 0 for the fake observations and the value 1 for the real observations. (Or, if your data set already has a frequency variable, multiply the existing variable by 0 or 1.)
  3. Use the newly appended data set to plot the data as usual. When you use the GROUP= option, the legends, colors, and order of categories will appear correctly because the data now contains all categories. Missing values prevent the fake observations from appearing in your plots.

The following statements illustrate the three steps for the Weight_Status variable in the Heart data set:

/* Step 1: Create a data set that contains all categories, in order */
data AllCategories;
input Weight_Status $11.;
datalines;
Underweight
Normal
Overweight
;
run;
/* Step 2: Append the fake and real data. Create indicator variable. */
data Heart2;
set AllCategories         /* the fake data, which contains all categories */
    Heart(in=IsRealData); /* the original data */
Freq = IsRealData;        /* 1 for the real data; 0 for the fake data */
run;
/* Step 3: Use appended data set and plot as usual */
title "Include All Categories in Legend";
proc sgplot data=Heart2; 
  styleattrs datacontrastcolors = (lightgreen blue lightred);
  scatter x=Height y=Weight / group=Weight_Status 
                              markerattrs=(size=9 symbol=CircleFilled);
run;
Legend shows all categories

In this graph, the legend displays all possible categories, and the categories appear in the correct order. The STYLEATTRS statement correctly assigns colors to categories because you know the order of the categories in the data set.

Graphs of the categorical variable

Adding new observations can create problems if you aren't careful. For example, suppose you use the Heart2 data set and create a bar chart of the Weight_Status variable. Unless you correct for the fake data, the bar chart will show 203 observations and display a bar for the "Underweight" category, which is not part of the original data.

The solution to this dilemma is to use a FREQ= or WEIGHT= option when you create graphs of the modified variable. The DATA step that appended the fake data also added an indicator variable, which you can use to prevent SAS procedures from displaying or analyzing the fake data, as follows:

title "Restrict to Real Data";
proc sgplot data=Heart2; 
  vbar Weight_Status /  freq=Freq;   /* do not use the fake data */

Notice that the bar chart shows only the 200 original observations. The FREQ=Freq option uses the indicator variable (Freq) to omit the fake data.

In summary, by prepending "fake data" to your data set, you can ensure that all categories appear in legends. As a bonus, the same trick enables you to specify the order of categories in a legend. In short, prepending fake data is a useful trick to add to your SAS toolbox of techniques.


Four essential sampling methods in SAS

Many simulation and resampling tasks use one of four sampling methods. When you draw a random sample from a population, you can sample with or without replacement. At the same time, all individuals in the population might have equal probability of being selected, or some individuals might be more likely than others. These four common sampling methods can be arranged in the following 2 x 2 table.

Four sampling methods in SAS: Sampling with and without replacement, with equal and unequal probability

The SURVEYSELECT procedure in SAS/STAT is one way to generate random samples. The previous table lists the four sampling methods, summarizes the SURVEYSELECT syntax for each method, and shows how to use the SAMPLE function in SAS/IML.

Sampling without replacement

When you sample without replacement, the size of the sample cannot exceed the number of items in the population.

Simple random sampling (SRS): Survey statisticians use "SRS" for sampling without replacement and with equal probability. Dealing cards from a 52-card deck is an example of SRS. Use the METHOD=SRS option in PROC SURVEYSELECT to request simple random sampling.

Probability proportional to size (PPS): Survey statisticians use "PPS" for sampling without replacement and with unequal probability. As an example, suppose that you want to draw samples of colored marbles from an urn that contains colors in different proportions. The proportion of each color in the urn determines the expected count for each color in the sample. In PROC SURVEYSELECT, you can use METHOD=PPS in conjunction with the SIZE statement to specify the relative sizes (or the probabilities) for the colors in the urn.
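The two methods might be requested as follows. This is a sketch under stated assumptions: the data sets Deck and Marbles are hypothetical, and Marbles is assumed to contain a Count variable with the number of marbles of each color:

/* SRS: deal 5 cards from a 52-observation data set */
proc surveyselect data=Deck out=Hand method=srs n=5 seed=1;
run;

/* PPS: draw 2 colors with probability proportional to Count */
proc surveyselect data=Marbles out=Draw method=pps n=2 seed=1;
   size Count;
run;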

Sampling with replacement

When you sample with replacement, the size of the sample can be arbitrarily large.

Unrestricted random sampling (URS): Survey statisticians use "URS" for sampling with replacement and with equal probability. Rolling a six-sided die and recording the face that appears is an example of URS. Use the METHOD=URS option in PROC SURVEYSELECT to request unrestricted random sampling.

Probability proportional to size with replacement: Survey statisticians use "PPS with replacement" for sampling with replacement and with unequal probability. As an example, suppose that you want to repeatedly toss two dice and record the sum of the faces. The sum will be 2 (or 12) with probability 1/36. The sum will be 3 (or 11) with probability 2/36, will be 4 (or 10) with probability 3/36, and so forth. In PROC SURVEYSELECT, you can use the SIZE statement to specify the probability for each sum. You can use the METHOD=PPS_WR option (PPS sampling with replacement) to simulate random sums from a pair of dice.
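The dice example can be sketched as follows. The DATA step computes the probability of each sum, and the SIZE statement passes those probabilities to PROC SURVEYSELECT. The data set names are arbitrary:

data DiceSums;                     /* sums 2-12 and their probabilities */
   do Sum = 2 to 12;
      Prob = (6 - abs(Sum - 7)) / 36;   /* 1/36, 2/36, ..., 6/36, ..., 1/36 */
      output;
   end;
run;

proc surveyselect data=DiceSums out=Tosses method=pps_wr n=100 seed=1;
   size Prob;                      /* unequal probabilities for the sums */
run;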

These four sampling methods are useful to the statistical programmer because they are often used in simulation studies. For information about using the SAS DATA step and PROC SURVEYSELECT for basic sampling, see "Selecting Unrestricted and Simple Random with Replacement Samples Using Base SAS and PROC SURVEYSELECT" (Chapman, 2012). See the PROC SURVEYSELECT documentation for a detailed explanation of these and many other sampling methods.


Sample with replacement and unequal probability in SAS

How do you sample with replacement in SAS when the probability of choosing each observation varies? I was asked this question recently. The programmer thought he could use PROC SURVEYSELECT to generate the samples, but he wasn't sure which sampling technique he should use to sample with unequal probability. This article describes how to sample with replacement and unequal probability in SAS.

Sample with replacement and unequal probability with PROC SURVEYSELECT

The programmer's confusion is completely understandable. The SURVEYSELECT documentation is written for survey statisticians, who use specialized language to describe sampling methods. To a survey statistician, sampling with unequal probability is known as sampling with probability proportional to size (PPS), which is often abbreviated as PPS sampling. The SURVEYSELECT procedure has many methods for PPS sampling. For PPS sampling with replacement, specify the METHOD=PPS_WR option.

Sampling in proportion to size has many applications. For example, if you want to survey people at your company, you could randomly select a certain number of people in each building, where the probability of selection is proportional to the number of people who work in each building. Or you could use PPS sampling to obtain a representative sample across departments.

The following example demonstrates sampling with replacement and with unequal probability. Suppose a small town has five busy intersections. The town planners believe that the probability of an accident at an intersection is proportional to the traffic volume. They want to simulate the locations of 500 accidents by using PPS sampling with replacement, where the relative traffic volumes determine the probability of an accident's location.

The following data shows the annual average daily traffic data for each intersection. The call to the SURVEYSELECT procedure uses METHOD=PPS_WR and N=500 to simulate 500 accidents for these intersections. The SIZE statement specifies the relative traffic, which determines the probability of an accident at each intersection.

data Traffic;
label VehiclesPerDay = "Average Annualized Daily Traffic";
input Intersection $21. VehiclesPerDay;
format VehiclesPerDay comma.;
datalines;
Crash Pkwy/Danger Rd  25000
Fast Dr/Danger Rd     20000
Crazy St/Smash Blvd   17000
Crazy St/Collision Dr 14000
Crash Pkwy/Dent Dr    10000
;
run;
/* sample with replacement, probability proportional to size */
proc surveyselect noprint data=Traffic out=Sample method=pps_wr
     seed=123 N=500;       /* use OUTHITS option if you want 500 obs */
   size VehiclesPerDay;    /* specify the probability variable */
run;
proc print data=Sample noobs; 
   var Intersection VehiclesPerDay NumberHits ExpectedHits; 
run;
Sample with replacement and unequal probability

As you can see, the counts of crashes in the simulated sample are close to their expected values. Each time you run a simulation you will get slightly different values. You can use the REPS= option in the PROC SURVEYSELECT statement to generate multiple samples.

I have blogged about this technique before, but my previous focus was on simulating data from a multinomial distribution. The METHOD=PPS_WR option generates counts that follow a multinomial distribution with proportion equal to the standardized "size" variable (which is VehiclesPerDay / sum(VehiclesPerDay)).

If you want a data set that contains all 500 draws, rather than the counts, you can add the OUTHITS option to the PROC SURVEYSELECT statement.

Sample with replacement and unequal probability with PROC IML

For SAS/IML programmers, it might be more convenient to simulate data within PROC IML. The SAS/IML language provides two routines for sampling with replacement and with unequal probability. If you want only the counts (as shown in the previous example) you can use the RANDMULTINOMIAL function and specify the proportion of the traffic that passes through each intersection, as follows:

proc iml;
call randseed(54321);
use Traffic;
   read all var {"Intersection" "VehiclesPerDay"};
Proportion = VehiclesPerDay / sum(VehiclesPerDay);   
Counts = RandMultinomial(1, 500, Proportion);    /* 1 sample of 500 events */

If instead you want the full sample (all 500 values in a random ordering), you can use the SAMPLE function. The third argument to the SAMPLE function enables you to specify whether the sampling is done with or without replacement. The fourth argument enables you to specify the unequal sampling probabilities, as follows:

Sample = sample(Intersection, 500, "Replace", Proportion);

In summary, when you want to sample with replacement and with unequal probabilities, use the METHOD=PPS_WR option in PROC SURVEYSELECT or use the SAMPLE function in SAS/IML.


Read data into vectors or into a matrix: Which is better?

In the SAS/IML language, you can read data from a SAS data set into a set of vectors (each with its own name) or into a single matrix. Beginning programmers might wonder about the advantages of each approach. When should you read data into vectors? When should you read data into a matrix?

Read data into SAS/IML vectors

You can specify the names of data set variables in the SAS/IML READ statement, as follows:

proc iml;
use Sashelp.Class;                       /* open the data set */
read all var {"Name" "Height" "Weight"}; /* read 3 vars into vectors */
close Sashelp.Class;                     /* close the data set */

The previous statements create three vectors, whose names are the same as the variable names. You can perform univariate analyses on the vectors, such as descriptive statistics. You can also create new variables from arbitrary transformations of the vectors, such as the following computation of the body mass index:

BMI = weight / height##2 * 703;
print BMI[rowname=Name];

Some of the advantages of reading data into vectors are:

  • Variables are given informative names.
  • You can use a single READ statement to read character variables and numerical variables.

When you load summarized data, you might want to read the variables into vectors. For example, to read the ParameterEstimates table from a regression analysis, you probably want to read the variable names, parameter estimates, standard errors, and p-values into separate vectors.

Read data into a SAS/IML matrix

You can use the INTO clause in the READ statement to read data set variables into a SAS/IML matrix. All the variables have to be the same type, such as numeric. For example, the following statements read three numeric variables into a matrix:

proc iml;
varNames =  {"Age" "Height" "Weight"};   /* these vars have the same type */
use Sashelp.Class;                       /* open the data set */
read all var varNames into X;            /* read 3 vars into matrix */
close Sashelp.Class;                     /* close the data set */

The matrix X contains the raw data. Each column is a variable; each row is an observation. For many descriptive statistics, you can use a single function call to compute statistics across all columns. You can also compute multivariate statistics such as a correlation matrix:

mean = mean(X);                          /* mean of each column */
corr = corr(X);                          /* correlation matrix */
print mean[colname=varNames],
      corr[colname=varNames rowname=varNames];

You can use this technique to create vectors whose names are different from the names of data set variables. For example, in my blog posts I often load data into vectors named x and y to emphasize that the subsequent analysis will work for any data, not just for the example data.
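For example, the INTO clause can read a single variable into a vector with a generic name:

use Sashelp.Class;
read all var {"Height"} into x;   /* vector is named x, not Height */
read all var {"Weight"} into y;
close Sashelp.Class;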

Some of the advantages of reading data into a matrix are:

  • You can compute descriptive statistics for all columns by using a single function call.
  • You can sort, transpose, or reorder columns of the data.
  • You can compute row operations, such as the sum across rows.
  • You can compute multivariate statistics such as finding complete cases or computing a correlation matrix.
  • Many statistical analyses, such as least squares regression, have a natural formulation in terms of matrix operations.
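A few of these operations can be sketched with subscript reduction operators and column subscripts. Continuing the example, X is the matrix that was read from Sashelp.Class:

rowSum = X[, +];          /* sum across each row */
rowMax = X[, <>];         /* maximum of each row */
Z      = X[, {3 1 2}];    /* reorder the columns of X */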

I usually read raw data into a matrix and summarized data into vectors, but as you can see, there are advantages to both approaches.

What technique do you use to read data into SAS/IML?


Rolling statistics in SAS/IML

Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed in a previous post. However, for complex rolling statistics, the SAS/IML language, which enables you to easily access previous observations (and even future ones!), is a more powerful tool for computing rolling statistics.

This article shows how to implement various rolling statistics in SAS/IML. To keep the explanations and programs simple, the functions assume that there are no missing values in the data. The article "What is a moving average" explains the mathematical formulas used in this post.

A simple moving average function

The key to computing most rolling statistics is to define a rolling window of observations. At each time point, you extract the observations in the rolling window and use them to compute the statistic. You then move on to the next time point and repeat the computation. You might need to perform special computations at the beginning of the time series.

The following SAS/IML program implements a simple moving average. The arguments to the MA function are a column vector, Y, of time series values and a scalar value, k, which indicates the number of values to use for each moving average computation. The program loops over elements in Y. For each element, the program computes the mean of the current element and previous k-1 values.

proc iml;
/* Simple moving average of k data values.
   First k-1 values are assigned the mean of all previous values.
   Inputs:     y     column vector of length N >= k
               k     number of data points to use to construct each average
*/
start MA(y, k);
   MA = j(nrow(y), 1, .);
   do i = 1 to nrow(y);
      idx = max(1,(i-k+1)):i;   /* rolling window of data points */
      MA[i] = mean( y[idx] );   /* compute average */
   end;
   return ( MA );
finish;

The first k-1 values require special handling because these values have fewer than k-1 prior observations to average. You could handle these special values by using a separate loop. However, I chose to use the expression max(1, (i-k+1)) to select the first element for the rolling mean computation. When i is less than k, this expression returns 1 for the first element, and the program computes the mean of the first i values. Otherwise, this expression returns i minus k-1 (which is i-k+1) for the first element, and the program computes the mean of k values.

The most important part of this computation is enumerating the time points to use in the computation (for example, idx = (i-k+1):i;) followed by extracting the associated data (for example, y[idx]). With these two expressions, you can compute any rolling statistic. For example, by changing the function call from MEAN to STD, you can compute a rolling standard deviation. The rolling median, rolling minimum, and rolling maximum are also trivial to implement. By changing the time points, you can compute rolling statistics for centered windows. If your data contain several variables, you can compute a rolling correlation.
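For example, a rolling standard deviation differs from the MA function only in the function that is called inside the loop:

start RollingStd(y, k);
   s = j(nrow(y), 1, .);
   do i = 1 to nrow(y);
      idx = max(1,(i-k+1)):i;   /* same rolling window as the MA function */
      s[i] = std( y[idx] );     /* rolling standard deviation */
   end;
   return ( s );
finish;

Notice that s[1] is missing because the standard deviation of a single value is undefined.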

A weighted moving average function

The following function computes a weighted moving average. The arguments to the WMA function are a column data vector, Y, and a vector of weights that has k elements. For each time point, wt[k] (the last weight) is the weight for the current data value, wt[k-1] is the weight for the previous data value, and so forth. The function internally standardizes the weights so that they sum to unity. (This ordering was chosen so that the WMA function uses the same syntax as PROC EXPAND.) This function handles the first few values in a separate loop:

/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of k weights. wt[k] is the weight for the most
                     recent data value; wt[1] is for the most distant data value.
                     The function internally standardizes the weights so that sum(wt)=1.
   Example call: WMA = WMA(y, 1:5);
*/
start WMA(y, wt);
   w = colvec(wt) / sum(wt);       /* standardize weights so that sum(w)=1 */
   k = nrow(w);
   MA = j(nrow(y), 1, .);
   /* handle first k-1 values separately */
   do i = 1 to k-1;
      wIdx = (k-i+1):k;            /* index for last i weights */
      tIdx = 1:i;                  /* index for first i data values */
      MA[i] = sum(w[wIdx]#y[tIdx]) / sum(w[wIdx]);  /* weighted average */
   end;
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;             /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );     /* weighted sum of k data values */
   end;
   return ( MA );
finish;

Notice that the function requires computing a weighted mean, which is described in a previous article.

An exponentially weighted moving average function

An exponentially weighted moving average is defined recursively. The average at time t is a weighted average of the data point at time t and the average from time t-1. The relative weights are determined by the smoothing parameter, α. The following function implements that definition:

/* Exponentially weighted moving average (EWMA) with smoothing parameter alpha.
   Inputs:      y     column vector of length N
                alpha scalar value 0 < alpha < 1
*/
start EWMA(y, alpha);
   MA = j(nrow(y), 1, .);
   MA[1] = y[1];              /* initialize first value of smoother */
   do i = 2 to nrow(y);
      MA[i] = alpha*y[i] + (1-alpha)*MA[i-1];
   end;
   return ( MA );
finish;

The three moving average functions are now defined. You can read the time series data into a vector and call the functions. If necessary, you can write the rolling statistics to a SAS data set:

/* read time series data */
use Sashelp.Air;  
   read all var "date" into t;
   read all var "air" into y;
close Sashelp.Air;
MA   = MA(y, 5);           /* moving average, k=5 */
WMA  = WMA(y, 1:5);        /* weighted moving average */
EWMA = EWMA(y, 0.3);       /* exponentially WMA, alpha=0.3 */
create out var{t y MA WMA EWMA};  append;  close out;

You can use the SGPLOT procedure to visualize the rolling statistics, as shown in my previous article.

Vectorizing time series computations

The experienced SAS/IML programmer will notice that these functions are not heavily vectorized. The MA and WMA computations use vectors of length k to compute the means and weighted means, respectively. It is possible to write these functions by using a matrix operation, but if the time series has N data points, the transformation matrix is an N x N lower-triangular banded matrix, which requires a lot of memory for large values of N.

Notice that the EWMA function uses scalar quantities inside a loop. For time series computations that use lagged data values, you can sometimes vectorize the time series computations. However, for operations that are defined recursively, such as the EWMA, the effort required to vectorize the computation might exceed the benefit you gain from the vectorization. In many cases, a function that uses a simple loop is fast and easy to maintain.


This article presents functions for computing rolling statistics in SAS/IML. Examples include a simple moving average (MA), a weighted moving average (WMA), and an exponentially weighted moving average (EWMA). The article describes how to modify these functions to compute other rolling statistics in SAS.

Computations of rolling statistics might not be easy to vectorize. Even when you can vectorize a computation, a simpler approach might run faster.


Group processing in SAS: The NOTSORTED option

Novice SAS programmers quickly learn the advantages of using PROC SORT to sort data, followed by a BY-group analysis of the sorted data. A typical example is to analyze demographic data by state or by ZIP code. A BY statement enables you to produce multiple analyses from a single procedure call.

In the usual BY-group processing of data, the data are sorted by the BY variables. However, there are situations in which you might not want to sort the data. For example, sorting can be an expensive operation for a huge data set. If your data are already grouped by ZIP code, you might want to analyze the data for each ZIP code in the order that they appear in the data set. You can sort the summarized statistics later, if necessary, which will be much faster than sorting the raw data.

SAS supports BY-group analysis of unsorted data. When you construct the BY statement, use the NOTSORTED option to tell a SAS procedure that groups are to be handled in the order in which they appear in the data.

The NOTSORTED option in the BY statement

The following example uses the NOTSORTED option in the BY statement to analyze data by groups. A teacher has recorded test scores for students in her class. The following DATA step creates 20 observations. There are four observations for each student:

data Grades;
input First $8. Last $10. @;
do Test = 1 to 4;
   input Score @; output;
end;
datalines;
Tim     Albertson  93  89  78  84
Sharad  Gupta     100  95  92  98
Tim     Williams   85  82  70  74
Mandy   Albertson  95  86  90  95
Min     Chen       88  92  85  95
;
run;

The teacher has committed a serious data-quality mistake: she has failed to include a unique identifier (a "student ID") for her students! Nevertheless, provided that adjacent students in the data set do not share the same first name, the teacher can use the NOTSORTED option in the BY statement to analyze the scores without sorting the data:

proc means data=Grades noprint;
   by First notsorted;      /* <== use NOTSORTED option by first names */
   var Score;
   output out=Averages mean=;
run;

proc print data=Averages noobs; 
   var First Score; 
run;
example of using the NOTSORTED option

The NOTSORTED option tells the procedure to analyze groups that are defined by the FIRST variable. It does not matter that the first names are not sorted or that there are two students named "Tim." The NOTSORTED option prevents the error message:

ERROR: Data set WORK.GRADES is not sorted in ascending sequence.

For this example, the data include the last names of the students, so the teacher could sort the data by last name and first name and use the statement BY Last First to analyze the data. However, the NOTSORTED option does not require sorted data, which can be a huge advantage. The NOTSORTED option is supported in almost every SAS procedure and DATA step—with the obvious exception of PROC SORT!

For more examples of using the NOTSORTED option in BY-group analyses, see Samudral and Giddings (2006).

Group processing of unsorted data in SAS/IML

SAS/IML software does not support a BY statement, but you can use various techniques to process groups. The two primary techniques are the UNIQUE-LOC technique and the UNIQUEBY technique. The UNIQUEBY technique can analyze data in the order that they appear, regardless of whether the data are sorted, as shown in the following program:

proc iml;
use Grades;   read all var {First Score};   close;
/* the UNIQUEBY function does not require sorted data */
uu = uniqueby(First);      /* get first row for each student */
result = j(nrow(uu), 1);   /* allocate vector to hold results */
u = uu // (nrow(First)+1); /* trick: append (N+1) to end of indices */
do i = 1 to nrow(u)-1;     /* for each group... */
   idx = u[i]:(u[i+1]-1);  /* get rows in group */
   Y = Score[idx];
   result[i,] = mean(Y);
end;
print result[rowname=(First[uu]) colname="Mean" format=6.2];
example of using the UNIQUEBY function
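The same idea—treating consecutive runs of an identifying variable as separate groups without sorting—exists in other languages. As a sketch, Python's itertools.groupby groups adjacent equal keys, so non-adjacent duplicate keys (like the two students named "Tim") form separate groups, just as with the NOTSORTED option. The data below are a small illustrative subset:

```python
from itertools import groupby

scores = [("Tim", 93), ("Tim", 89), ("Sharad", 100), ("Sharad", 95),
          ("Tim", 85), ("Tim", 82)]          # unsorted; two runs of "Tim"

# compute the mean score for each consecutive run of first names
means = []
for name, grp in groupby(scores, key=lambda rec: rec[0]):
    vals = [s for _, s in grp]
    means.append((name, sum(vals) / len(vals)))

print(means)   # [('Tim', 91.0), ('Sharad', 97.5), ('Tim', 83.5)]
```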

In summary, if your data are grouped by an identifying variable, you can analyze the data without sorting. The identifying variable does not have to have unique values. In SAS procedures and the DATA step, you can use the NOTSORTED option in the BY statement. In PROC IML, you can use the UNIQUEBY function.


Compute a moving average in SAS

A common question on SAS discussion forums is how to compute a moving average in SAS. This article shows how to use PROC EXPAND and contains links to articles that use the DATA step or macros to compute moving averages in SAS.

Moving average in SAS

In a previous post, I explained how to define a moving average and provided an example, which is shown here. The graph is a scatter plot of the monthly closing price for IBM stock over a 20-year period. The three curves are moving averages. The "MA" curve is a five-point (trailing) moving average. The "WMA" curve is a weighted moving average with weights 1 through 5. (When computing the weighted moving average at time t, the value yt has weight 5, the value yt-1 has weight 4, the value yt-2 has weight 3, and so forth.) The "EWMA" curve is an exponentially weighted moving average with smoothing factor α = 0.3.

This article shows how to use the EXPAND procedure in SAS/ETS software to compute a simple moving average, a weighted moving average, and an exponentially weighted moving average in SAS. For an overview of PROC EXPAND and its many capabilities, I recommend reading the short paper "Stupid Human Tricks with PROC EXPAND" by David Cassell (2010).

Because not every SAS customer has a license for SAS/ETS software, there are links at the end of this article that show how to compute a simple moving average in SAS by using the DATA step.

Create an example time series

Before you can compute a moving average in SAS, you need data. The following call to PROC SORT creates an example time series with 233 observations. There are no missing values. The data are sorted by the time variable, T. The variable Y contains the monthly closing price of IBM stock during a 20-year period.

/* create example data: IBM stock price */
title "Monthly IBM Stock Price";
proc sort data=sashelp.stocks(where=(STOCK='IBM') rename=(Date=t Close=y)) 
          out=Series;
   by t;
run;

Compute a moving average in SAS by using PROC EXPAND

PROC EXPAND computes many kinds of moving averages and other rolling statistics, such as rolling standard deviations, correlations, and cumulative sums of squares.

In the procedure, the ID statement identifies the time variable, T. The data should be sorted by the ID variable. The CONVERT statement specifies the names of the input and output variables. The TRANSFORMOUT= option specifies the method and parameters that are used to compute the rolling statistics.

/* create three moving average curves */
proc expand data=Series out=out method=none;
   id t;
   convert y = MA   / transformout=(movave 5);
   convert y = WMA  / transformout=(movave(1 2 3 4 5)); 
   convert y = EWMA / transformout=(ewma 0.3);
run;

The example uses three CONVERT statements:

  • The first specifies that MA is an output variable that is computed as a (backward) moving average that uses five data values (k=5).
  • The second CONVERT statement specifies that WMA is an output variable that is a weighted moving average. The weights are automatically standardized by the procedure, so the formula is WMA(t) = (5yt + 4yt-1 + 3yt-2 + 2yt-3 + 1yt-4) / 15.
  • The third CONVERT statement specifies that EWMA is an output variable that is an exponentially weighted moving average with parameter 0.3.
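The following sketch (in Python, with illustrative names) reproduces what the first two CONVERT statements compute for the interior of a series. It is a handy way to verify the standardized-weights formula above:

```python
def trailing_ma(y, t, k):
    """Simple trailing moving average of the k values ending at index t."""
    window = y[t - k + 1 : t + 1]
    return sum(window) / k

def trailing_wma(y, t, weights):
    """Weighted trailing moving average; weights[-1] applies to y[t].
    The weights are standardized by their sum, as PROC EXPAND does."""
    k = len(weights)
    window = y[t - k + 1 : t + 1]
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(trailing_ma(y, 5, 5))                 # (2+3+4+5+6)/5 = 4.0
print(trailing_wma(y, 5, [1, 2, 3, 4, 5]))  # (1*2+2*3+3*4+4*5+5*6)/15 = 70/15
```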

Notice the METHOD=NONE option on the PROC EXPAND statement. By default, the EXPAND procedure fits cubic spline curves to the nonmissing values of variables. The METHOD=NONE option ensures that the raw data points are used to compute the moving averages, rather than interpolated values.

Visualizing moving averages

An important use of a moving average is to overlay a curve on a scatter plot of the raw data. This enables you to visualize short-term trends in the data. The following call to PROC SGPLOT creates the graph at the top of this article:

proc sgplot data=out cycleattrs;
   series x=t y=MA   / name='MA'   legendlabel="MA(5)";
   series x=t y=WMA  / name='WMA'  legendlabel="WMA(1,2,3,4,5)";
   series x=t y=EWMA / name='EWMA' legendlabel="EWMA(0.3)";
   scatter x=t y=y;
   keylegend 'MA' 'WMA' 'EWMA';
   xaxis display=(nolabel) grid;
   yaxis label="Closing Price" grid;
run;

To keep this article as simple as possible, I have not discussed how to handle missing data when computing moving averages. See the documentation for PROC EXPAND for various issues related to missing data. In particular, you can use the METHOD= option to specify how to interpolate missing values. You can also use transformation options to control how moving averages are defined for the first few data points.

Create a moving average in SAS by using the DATA step

If you do not have SAS/ETS software, the following references show how to use the SAS DATA step to compute simple moving averages by using the LAG function:

The DATA step, which is designed to handle one observation at a time, is not the best tool for time series computations, which naturally require multiple observations (lags and leads). In a future blog post, I will show how to write SAS/IML functions that compute simple, weighted, and exponentially weighted moving averages. The matrix language in PROC IML is easier to work with for computations that require accessing multiple time points.


What is a moving average?

A moving average (also called a rolling average) is a statistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling average of previous values. For example, the following graph shows the monthly closing price of IBM stock over a 20-year period. Three kinds of moving averages are overlaid on a scatter plot of the data.

Moving average of stock price

The IBM stock price increased in some time periods and decreased in others. The moving-average curves help to visualize these trends and identify these time periods. For a simple moving average, the smoothness of a curve is determined by the number of time points, k, that is used to compute the moving average. Small values of k result in curves that reflect the short-term ups and downs of the data; curves that use large values of k undulate less. For stock charts that show daily prices, the 30-day moving average and the 5-day moving average are popular choices.

How do you define a moving average?

The most common moving averages are the simple moving average (MA), the weighted moving average (WMA), and the exponentially weighted moving average (EWMA). The following list provides a brief description and mathematical formula for these kinds of moving averages. See the Wikipedia article on moving averages for additional details.

Let {y0, y1, ..., yt, ...} be the time series that you want to smooth, where yt is the value of the response at time t.

  • The simple moving average at time t is the arithmetic mean of the series at yt and the previous k-1 time points. In symbols,
          MA(t; k) = (1/k) Σ yi
    where the summation is over the k values {yt-k+1, ..., yt}.
  • The weighted moving average (WMA) at time t is a weighted average of the series at yt and the previous k-1 time points. Typically the weights monotonically decrease so that data from "long ago" contribute less to the average than recent data. If the weights sum to unity (Σ wi = 1) then
          WMA(t; k) = Σ wi yi
    If the weights do not sum to unity, then divide that expression by Σ wi.
  • The exponentially weighted moving average (EWMA) does not use a finite rolling window. Instead of the parameter k, the EWMA uses a decay parameter α, where 0 < α < 1. The smoothed value at time t is defined recursively as
          EWMA(t; α) = α yt + (1 - α) EWMA(t-1; α)
    You can "unwind" this equation to obtain the EWMA as a WMA where the weights decrease geometrically. The choice of α determines the smoothness of the EWMA. A value of α ≈ 1 implies that older data contribute very little to the average. Conversely, small values of α imply that older data contribute to the moving average almost as much as newer data.

Each of these definitions contains an ambiguity for the first few values of the moving average. For example, if t < k, then there are fewer than k previous values in the MA and WMA methods. Some practitioners assign missing values to the first k-1 values, whereas others average the values even when fewer than k previous data points exist. For the EWMA, the recursive definition requires a value for EWMA(0; α), which is often chosen to be y0.
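You can verify numerically that "unwinding" the EWMA recursion gives geometrically decreasing weights. The following Python sketch (with illustrative names) compares the recursive definition, initialized with EWMA(0) = y0, to the explicit sum α Σj (1-α)^j yt-j plus a (1-α)^t weight on the initial value:

```python
def ewma_recursive(y, alpha):
    """EWMA at the final time point, computed by the recursive definition."""
    m = y[0]                                   # EWMA(0) = y[0]
    for t in range(1, len(y)):
        m = alpha * y[t] + (1 - alpha) * m
    return m

def ewma_unwound(y, alpha):
    """Same quantity as an explicit weighted sum: geometric weights on
    recent data plus a (1-alpha)^t weight on the initial value."""
    t = len(y) - 1
    total = sum(alpha * (1 - alpha) ** j * y[t - j] for j in range(t))
    return total + (1 - alpha) ** t * y[0]

y = [10.0, 12.0, 11.0, 13.0]
print(ewma_recursive(y, 0.3))
print(ewma_unwound(y, 0.3))    # same value, up to floating-point rounding
```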

My next blog post shows how to compute various moving averages in SAS. The article shows how to create the IBM stock price example, which is a time series plot overlaid with MA, WMA, and EWMA curves.


Banking to 45 degrees: Aspect ratios for time series plots

In SAS, the aspect ratio of a graph is the physical height of the graph divided by the physical width. Recently I demonstrated how to set the aspect ratio of graphs in SAS by using the ASPECT= option in PROC SGPLOT or by using the OVERLAYEQUATED statement in the Graph Template Language (GTL). I mentioned that a unit aspect ratio is important when you want to visually display distance between points in a scatter plot.

A second use of the aspect ratio is when plotting a time series. A time series is a sequence of line segments that connect data values (xi, yi), i = 0..N. Thus a plot involves N line segments, each with a slope and length. In the late 1980s and early '90s, William Cleveland and other researchers investigated how humans perceive graphical information. For a time series plot, the rate of change (slope) of the time series segments is important. Cleveland suggested that practitioners should use "banking to 45 degrees" as a rule-of-thumb that improves perception of the rate-of-change in the plot.

My SAS colleague Prashant Hebbar wrote a nice article about how to use the GTL to specify the aspect ratio of a graph so that the median absolute slope is 45 degrees. Whereas Prashant focused on creating the graphics, the current article shows how to compute aspect ratios by using three banking techniques from Cleveland.

Three computational methods for choosing an aspect ratio

The simplest computation of an aspect ratio is the median absolute slope (MAS) method (Cleveland, McGill, and McGill, JASA, 1988). As its name implies, it computes the median of the absolute values of the slopes. To obtain a graph in which the median slope of the (physical) line segments is unity, set the aspect ratio of the graph to be the reciprocal of that median value.

The median-absolute-slope method is a simple way to choose an aspect ratio. A more complex analysis (Cleveland, Visualizing Data, 1993, p. 90) uses the orientation of the line segments. The orientation of a line segment is the quantity arctan(dy/dx) where dy is the change in physical units in the vertical direction and dx is the change in physical units in the horizontal direction.

Cleveland proposes setting the aspect ratio so that the average of the absolute values of the orientations is 45 degrees. This is called "banking the average orientation to 45 degrees" or the AO (absolute orientation) method. Another approach is to weight the line segments by their lengths (in physical units), compute a weighted mean, and find the aspect ratio that results in an average weighted orientation of 45 degrees. This is called "banking the weighted average of the absolute orientations to 45 degrees" or the AWO (average weighted orientation) method.

The AO and AWO methods require solving for the aspect ratio that satisfies a nonlinear equation that involves the arctangent function. This is equivalent to finding the root of a function, and the easiest way to find a root in SAS is to use the FROOT function in SAS/IML software.
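As a concrete sketch of the AO method (in Python, with illustrative names and made-up data), the following code finds the aspect ratio a for which the mean absolute orientation of the segments is 45 degrees. Because the mean absolute orientation increases monotonically with the aspect ratio, a simple bisection search suffices; this is the same root-finding task that the FROOT function performs in SAS/IML:

```python
from math import atan, pi

def mean_abs_orientation(a, dx, dy):
    """Mean absolute orientation (radians) of the line segments
    when the plot's aspect ratio (height/width) is a."""
    slopes = [d / e for d, e in zip(dy, dx)]   # range-scaled segment slopes
    return sum(abs(atan(a * s)) for s in slopes) / len(slopes)

def bank_to_45_ao(x, y, lo=1e-6, hi=10.0, tol=1e-10):
    """Find the aspect ratio that banks the average orientation to 45 degrees.
    Differences are scaled by the data ranges, as in the SAS/IML computation."""
    rx, ry = max(x) - min(x), max(y) - min(y)
    dx = [(x[i+1] - x[i]) / rx for i in range(len(x) - 1)]
    dy = [(y[i+1] - y[i]) / ry for i in range(len(y) - 1)]
    # bisection: mean_abs_orientation is an increasing function of a
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_abs_orientation(mid, dx, dy) < pi / 4:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = list(range(10))
y = [0, 1, 3, 2, 5, 4, 7, 6, 9, 8]   # an illustrative zig-zag series
a = bank_to_45_ao(x, y)
print(round(a, 4))
```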

For a nice summary of these and other banking techniques, see Heer and Agrawala (2006). However, be aware that Heer and Agrawala define the aspect ratio as width divided by height, which is the reciprocal of the definition that is used by Cleveland (and SAS).

Banking to 45 degrees

To illustrate the various methods, I used the MELANOMA data set, which contains the incidences of melanoma (skin cancer) per one million people in the state of Connecticut that were observed each year from 1936–1972. This is one of the data sets in Visualizing Data (Cleveland, 1993). You can download all data sets in the book and create a libref to the data set directory. (My libref is called vizdata.)


The line plot shows the melanoma time series. The graph's subtitle indicates that the aspect ratio is 0.4, which means that the plot is about 40% as tall as it is wide. With this scaling, about half of the line segments have slopes that are between ±1. This aspect ratio was chosen by using the median-absolute-slope method.

Prashant has already shown how to use the DATA step and Base procedures to implement the median absolute slope method, so I will provide an implementation in the SAS/IML language. The following SAS/IML program reads the data and uses the DIF function to compute differences between yearly values. These values are then scaled by the ranges of the data. Assuming that the physical sizes of the axes are proportional to the data ranges (and ignoring offsets and margins), the ratios of these scaled differences are the physical slopes of the line segments. The following computation shows that the median slope is about 2.7. Therefore, the reciprocal of this value (0.37) is the scale factor for which half the slopes are between ±1 and the other half are more extreme. For convenience, I rounded that value to 0.4.

/* Data from Cleveland (1993) */
libname vizdata "C:\Users\frwick\Documents\My SAS Files\vizdata"; 
title "Melanoma Data";
proc iml;
use vizdata.Melanoma;               /* see Cleveland (1993, p. 158) */
   read all var "Year" into x;
   read all var "Incidence" into y;
close;
dx = dif(x) / range(x);             /* scaled change in horizontal directions */ 
dy = dif(y) / range(y);             /* scaled change in vertical directions */ 
m = median( abs(dy/dx) );           /* median of absolute slopes */
print m[L="Median Slope"] (1/m)[L="Aspect Ratio"];

Three methods of banking to 45 degrees

I used SAS/IML to implement Cleveland's three techniques (MAS, AO, AWO) and ran each technique on the melanoma data. You can download the complete SAS/IML program that computes the aspect ratios. The following table gives the ideal aspect ratios for the melanoma data as computed by each of Cleveland's methods:


As you can see, the three methods are roughly in agreement for these data. An aspect ratio of approximately 0.4 (as shown previously) will create a time series plot in which the rate of change of the segments is apparent. If there is a large discrepancy between the values, Cleveland recommends using the average weighted orientation (AWO) technique.

The complete SAS program also analyzes carbon dioxide (CO2) data that were analyzed by Cleveland (1993, p. 159) and by Heer and Agrawala. Again, the three methods give similar results to each other.

You might wonder why I didn't choose an exact value such as 0.3518795 for the aspect ratio of the melanoma time series plot. The reason is that these "banking" computations are intended to guide the creation of statistical graphics, and using 0.4 (a 2:5 ratio of vertical to horizontal scales) is more interpretable. Think of the numbers as suggestions rather than mandates. As Cleveland, McGill, and McGill (1988) said, banking to 45 degrees "cannot be used in a fully automated way. (No data-analytic procedure can be.) It needs to be tempered with judgment."

So use your judgment in conjunction with these computations to create time series plots in SAS whose line segments are banked to 45 degrees. After you compute an appropriate aspect ratio, you can create the graph by using the ASPECT= option in PROC SGPLOT or the ASPECTRATIO option on a LAYOUT OVERLAY statement in GTL.
