# Simulation in SAS: The slow way or the BY way

Over the past few years, and especially since I posted my article on eight tips to make your simulation run faster, I have received many emails (often with attached SAS programs) from SAS users who ask for advice about how to speed up their simulation code. For this reason, I am writing a book on Simulating Data with SAS that describes dozens of tips and techniques for writing efficient Monte Carlo simulations.

Of all the tips in the book, the simplest is also the most important: Never use a macro loop to create a simulation. I estimate that this is the root cause of inefficiency in 60% of the SAS simulations that I see.

### The basics of statistical simulation

A statistical simulation often consists of the following steps:

1. Simulate a random sample of size N from a statistical model.
2. Compute a statistic for the sample.
3. Repeat 1 and 2 many times and accumulate the results.
4. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

For example, a simple simulation might investigate the distribution of the sample mean of a sample of size 10 that is drawn randomly from the uniform distribution on [0,1].
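For this particular example the answer is also known from standard theory, which gives a useful check on any simulation. The following is a textbook result rather than something derived in this article: if $X_1, \dots, X_{10}$ are independent draws from the uniform distribution on [0,1], each has mean 1/2 and variance 1/12, so the sample mean $\bar{X}$ satisfies

$$
E[\bar{X}] = \frac{1}{2}, \qquad
\mathrm{Var}(\bar{X}) = \frac{\mathrm{Var}(X_i)}{n} = \frac{1/12}{10} = \frac{1}{120}, \qquad
\mathrm{SD}(\bar{X}) = \sqrt{\frac{1}{120}} \approx 0.0913 .
$$

The simulated sample means should therefore be centered near 0.5 with a standard deviation close to 0.09.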

You can program this simulation in two ways: the slow way, which uses macro loops, or the fast way, which uses the SAS BY statement.

### Macro loops lead to slow simulations

It is understandable that some programmers look at the simulation algorithm and want to write a macro loop for the "repeat many times" portion of the algorithm. A first attempt at a simulation in SAS might look like this example:

```
/****** DO NOT MIMIC THIS CODE: INEFFICIENT! ******/
%macro Simulate(N, NumSamples);
options nonotes;          /* prevents the SAS log from overflowing */
proc datasets nolist;
   delete OutStats;       /* delete this data set if it exists */
run;

%do i = 1 %to &NumSamples;   /* repeat many times (the dreaded macro loop!) */
   data Temp;                /* 1. create a sample of size &N */
   do i = 1 to &N;
      x = rand("Uniform");
      output;
   end;
   run;

   proc means data=Temp noprint;   /* 2. compute a statistic */
      var x;
      output out=Out mean=SampleMean;
   run;

   /* use PROC APPEND to accumulate statistics */
   proc append base=OutStats data=Out;
   run;
%end;
options notes;
%mend;

/* call macro to simulate data and compute statistics */
%Simulate(10, 1000)

/* 4. analyze the sampling distribution of the statistic */
proc univariate data=OutStats;
   histogram SampleMean;
run;
```

For each iteration of the macro loop, the program creates a data set (Temp) with one sample that contains 10 observations. The MEANS procedure runs and creates one statistic, the sample mean. Then the APPEND procedure adds the newly computed sample mean to a data set that contains the means of the previous samples. This process repeats 1,000 times. When the macro finishes, PROC UNIVARIATE analyzes the distribution of the sample means.

I'm sure you'll agree that this is just about the World's Simplest Simulation. The data simulation step is trivial, and computing a sample mean is also trivial. So how long does this World's Simplest Simulation take to complete?

About 60 seconds! This code is terribly inefficient!

Would you like to perform the same computation more than 100 times faster? Read on.

### Improve the performance of your simulation by using BY processing

The key to improving the simulation is to restructure the simulation algorithm as follows:

1. Simulate many random samples of size N from a statistical model.
2. Compute a statistic for each sample.
3. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

To implement this restructured (but equivalent) algorithm, insert an extra DO loop inside the DATA step and use a BY statement in the procedure that computes the statistics, as shown in the following example:

```
/* efficient simulation that calls a SAS procedure */
%let N = 10;
%let NumSamples = 1000;
data Uniform(keep=SampleID x);
do SampleID = 1 to &NumSamples;   /* 1. create many samples */
   do i = 1 to &N;                /* sample of size &N */
      x = rand("Uniform");
      output;
   end;
end;
run;

proc means data=Uniform noprint;
   by SampleID;                   /* 2. compute many statistics */
   var x;
   output out=OutStats mean=SampleMean;
run;

/* 3. analyze the sampling distribution of the statistic */
proc univariate data=OutStats;
   histogram SampleMean;
run;
```

The first step is to simulate the data. You already know how to write statements that simulate one random sample, so just add a DO loop around those statements. The second step is to compute the statistics for each sample. You already know how to compute one statistic, so just add a BY statement to the procedure syntax. That's it. It's a simple technique, but it makes a huge difference in performance.

How long does the BY-group analysis require? It's essentially instantaneous. You can use OPTIONS FULLSTIMER to time the operations. On my computer it takes about 0.07 seconds to simulate the data and the same amount of time to analyze it.
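A minimal sketch of how the timing might be set up is shown below. The FULLSTIMER option makes the log report real time and CPU time for each step. The macro variable `t0` and the final `%put` statement are additions for illustration that give only a rough wall-clock total; the timings will differ from machine to machine.

```sas
/* Sketch: time the two steps of the BY-group simulation */
options fullstimer;                 /* detailed timing notes in the log */
%let N = 10;
%let NumSamples = 1000;
%let t0 = %sysfunc(datetime());     /* illustrative: rough start time */

data Uniform(keep=SampleID x);      /* step 1: simulate all samples */
   do SampleID = 1 to &NumSamples;
      do i = 1 to &N;
         x = rand("Uniform");
         output;
      end;
   end;
run;

proc means data=Uniform noprint;    /* step 2: one mean per sample */
   by SampleID;
   var x;
   output out=OutStats mean=SampleMean;
run;

/* rough total; read the per-step FULLSTIMER notes for details */
%put NOTE: Elapsed seconds = %sysevalf(%sysfunc(datetime()) - &t0);
options nofullstimer;
```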

### Why BY-group processing is fast and macro processing is slow

The first published description of this technique that I know of is the article "A Remark on Efficient Simulations in SAS" by Ilya Novikov (2003, J. RSS). Novikov mentions that the macro approach minimizes disk space, but that the BY-group technique minimizes time. (He also thanks Phil Gibbs of SAS Technical Support, who has been teaching SAS customers this technique since the mid-1990s.) For the application in his paper, Novikov reports that the macro approach required five minutes on his Pentium III computer, whereas the BY-group technique completed in five seconds.

You can use a variation of this technique to do bootstrap computation in SAS. See David Cassell's 2007 SAS Global Forum paper, "Don't Be Loopy: Re-Sampling and Simulation the SAS Way" for a discussion of implementing bootstrap methods in SAS.
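To give a flavor of the bootstrap variation, here is a hedged sketch that resamples with PROC SURVEYSELECT (part of SAS/STAT) and then analyzes all resamples with a BY statement. It is not code taken from Cassell's paper. The Sashelp.Class sample data set stands in for your data, and the names `NumBoot`, `BootSamples`, `BootStats`, and `BootMean` are illustrative.

```sas
/* Sketch: bootstrap distribution of a mean without a macro loop */
%let NumBoot = 1000;                 /* illustrative number of resamples */

proc surveyselect data=Sashelp.Class(keep=Height) noprint
     seed=12345                      /* arbitrary seed */
     out=BootSamples                 /* all resamples in one data set */
     method=urs                      /* sample with replacement */
     samprate=1                      /* each resample has the original size */
     outhits                         /* one row per selected observation */
     reps=&NumBoot;                  /* adds the Replicate variable */
run;

proc means data=BootSamples noprint;
   by Replicate;                     /* one statistic per resample */
   var Height;
   output out=BootStats mean=BootMean;
run;

proc univariate data=BootStats;      /* approximate sampling distribution */
   var BootMean;
   histogram BootMean;
run;
```

The structure mirrors the simulation: one step creates every sample, and one BY-group analysis computes every statistic.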

So why is one approach so slow and the other so fast? Programmers who work with matrix/vector languages such as SAS/IML software are familiar with the idea of vectorizing a computation. The idea is to perform a few matrix computations on matrices and vectors that hold a lot of data. This is much more efficient than looping over data and performing many scalar computations.
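As an illustration of what vectorizing means, the uniform-means simulation can be sketched in SAS/IML (which requires a SAS/IML license). The matrix layout, with one sample per column, and the data set name `OutStatsIML` are choices made for this sketch.

```sas
/* Sketch: vectorized version of the simulation in SAS/IML */
proc iml;
   call randseed(123);               /* arbitrary seed */
   N = 10;
   NumSamples = 1000;
   x = j(N, NumSamples);             /* allocate: each column is one sample */
   call randgen(x, "Uniform");       /* fill the whole matrix in one call */
   SampleMean = T(mean(x));          /* column means, as a column vector */

   create OutStatsIML var {"SampleMean"};   /* write results to a data set */
   append;
   close OutStatsIML;
quit;
```

There is no loop over samples: one call generates all the data and one call computes all the means.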

Although SAS is not a vector language, the same ideas apply. The macro approach suffers from a high "overhead-to-work" ratio. The DATA step and the MEANS procedure are called 1,000 times, but they generate or analyze only 10 observations in each call. This is inefficient because every time that SAS encounters a procedure call, it must parse the SAS code, open the data set, load data into memory, do the computation, close the data set, and exit the procedure. When a procedure computes complicated statistics on a large data set, these "overhead" costs are small relative to the computation performed by the procedure. However, for this example, the overhead costs are large relative to the computational work.

The BY-group approach has a low overhead-to-work ratio. The DATA step and PROC MEANS are each called once and they do a lot of work during each call.

So if you want to write an efficient simulation in SAS, use BY-group processing. It is often hundreds of times faster than writing a macro loop.

1. Fareeza
Posted July 18, 2012 at 1:19 pm

I ran into a problem where the dataset was too big to work with efficiently using BY and using a macro with smaller portions of the dataset ended up being faster.
Basically, you may run into computational limits with BY groups based on the dataset size, RAM and HD space.

• Posted July 18, 2012 at 1:40 pm

Do you remember what procedure you were using? Most (or all?) procedures compute with each BY group in sequence, so that the whole data set is not held in memory. I'd be interested in hearing more if you recall any details.

• Geoffrey Brent
Posted July 27, 2012 at 3:30 am

```
%let n=1000;       /* number of iterations */
%let m=1000000;    /* size of one iteration */

%macro calculate_means();

options nonotes;
proc datasets nolist;
   delete outstats;
run;

%do i=1 %to &n;

   data temp;
   do i=1 to &m;
      x=rand("uniform");
      output;
   end;
   run;

   proc means data=temp noprint;
      var x;
      output out=out mean=samplemean;
   run;
   proc append base=outstats data=out;
   run;
%end;

options notes;
%mend;

%calculate_means;

data test;
do sampleid=1 to &n;
   do i=1 to &m;
      x=rand("uniform");
      output;
   end;
end;
run;

proc means data=test noprint;
   by sampleid;
   var x;
   output out=outstats mean=samplemean;
quit;
```

For small values of m, the 'macro' option was much faster: at m=1000 n=1000 the macro method took 16 seconds compared to 0.45 seconds using BY.

But when I increased m to 100,000, the difference became negligible: 33 seconds for macro vs 34 for BY.

At m=1,000,000 the macro method was substantially faster: 187 seconds, vs 473 seconds for BY. Note that the BY approach would be creating a file on the order of 3 gigabytes here, which may cause its own overhead.

• Posted July 27, 2012 at 7:44 am

Thanks for the feedback. I think you meant to say "For small values of m, the 'macro' option was much SLOWER."

My main point is that every time you call a SAS procedure you incur overhead costs. You want to make sure that the work that the procedure does is substantial compared with those costs. For most simulation studies, the sample size (my &N, your &m) is small, because for large samples you can use asymptotic results to get standard errors and confidence intervals. I'd guess that 99% of simulations that I've seen are done for sample sizes less than 1,000.

As the sample sizes get large, other factors come into play. You report that m=100,000 is about the size at which the two methods become comparable for your hardware and that the BY-group approach is slower for huge samples. You have essentially rediscovered the reason behind the "hybrid method" proposed by Novikov and Oberman (2007), which combines the macro and BY-group approaches when doing massive simulations. In their paper, they note that both approaches are "non-optimal with respect to computing time for large simulations." [emphasis added] See their paper for discussion, examples, and a SAS macro that combines the two methods.
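The hybrid idea can be sketched as follows. This is not the macro from Novikov and Oberman's paper, only an illustration of the structure: a short macro loop runs over chunks of samples, and each chunk is analyzed with a BY statement, so no single data set has to hold every sample. The macro name `HybridSim`, its parameters, the data set names, and the choice of a different seed for each chunk are all assumptions made for this sketch.

```sas
/* Sketch: macro loop over chunks, BY-group processing within a chunk */
%macro HybridSim(N, NumSamples, ChunkSize, Seed=1);
   %local k NumChunks;
   %let NumChunks = %sysevalf(&NumSamples / &ChunkSize, ceil);

   options nonotes;
   proc datasets library=work nolist nowarn;
      delete OutStats;               /* start with an empty result set */
   quit;

   %do k = 1 %to &NumChunks;         /* few iterations, each does a lot of work */
      data Chunk(keep=SampleID x);
         /* assumption: a distinct seed per chunk so chunks are not identical */
         call streaminit(%eval(&Seed + &k));
         FirstID = (&k - 1) * &ChunkSize + 1;
         LastID  = min(&k * &ChunkSize, &NumSamples);
         do SampleID = FirstID to LastID;
            do i = 1 to &N;
               x = rand("Uniform");
               output;
            end;
         end;
      run;

      proc means data=Chunk noprint;
         by SampleID;                /* many statistics per procedure call */
         var x;
         output out=ChunkStats(keep=SampleID SampleMean) mean=SampleMean;
      run;

      proc append base=OutStats data=ChunkStats;
      run;
   %end;
   options notes;
%mend;

/* illustrative call: 1,000 samples of size 100,000, in chunks of 50 samples */
%HybridSim(100000, 1000, 50)
```

The chunk size trades disk space against procedure overhead; a suitable value depends on the hardware and would need to be found by experiment.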

• Geoffrey Brent
Posted July 29, 2012 at 8:55 pm

Yes, that should have been "macro is slower for small values of m".

FWIW, in my job a data set with 30k-100k observations is unremarkable, and some run well into the millions, so that may have coloured my views on what counts as a "small" data set. The fact that we deal with data sets of that size is a big part of why we're using SAS in the first place...

2. Doc Muhlbaier
Posted July 20, 2012 at 9:15 am

"Performance," broadly speaking, is a tradeoff. The characteristics that most impact it are CPU speed, memory size, disk space, data transmission speed, and user effort.

In simulations, with relatively simple data structures, the BY approach generally works better as disk space is often a non-issue, even with large numbers of replications. Bootstrapping, on the other hand, can have both wide and long data, as well as more replicates; then the disk space and data transmission speed become parameters that need to be considered. Most SAS procedures are designed to minimize memory usage (not putting all of the data in memory at once), but a few do try to hold all of the data in memory; the individual procedure documentation helps here.

I included "user effort" in my tradeoff list because simple code can make a real difference in maintenance and understanding.

Doc Muhlbaier
Duke

3. Oksana Naumenko
Posted April 7, 2013 at 9:21 pm

Could you please show an example where many data sets are created from a standard multivariate normal distribution? Examples of storing the generated data would be helpful, also.


Rick Wicklin, PhD, is a senior researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio.  His areas of expertise include computational statistics, statistical graphics, statistical simulation, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.