Simulation in SAS: The slow way or the BY way

Over the past few years, and especially since I posted my article on eight tips to make your simulation run faster, I have received many emails (often with attached SAS programs) from SAS users who ask for advice about how to speed up their simulation code. For this reason, I am writing a book on Simulating Data with SAS that describes dozens of tips and techniques for writing efficient Monte Carlo simulations.

Of all the tips in the book, the simplest is also the most important: Never use a macro loop to create a simulation. I estimate that this is the root cause of inefficiency in 60% of the SAS simulations that I see.

The basics of statistical simulation

A statistical simulation often consists of the following steps:

  1. Simulate a random sample of size N from a statistical model.
  2. Compute a statistic for the sample.
  3. Repeat 1 and 2 many times and accumulate the results.
  4. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

For example, a simple simulation might investigate the distribution of the sample mean of a sample of size 10 that is drawn randomly from the uniform distribution on [0,1].

You can program this simulation in two ways: the slow way, which uses macro loops, or the fast way, which uses the SAS BY statement.

Macro loops lead to slow simulations

It is understandable that some programmers look at the simulation algorithm and want to write a macro loop for the "repeat many times" portion of the algorithm. A first attempt at a simulation in SAS might look like this example:

%macro Simulate(N, NumSamples);
options nonotes;     /* prevents the SAS log from overflowing */
proc datasets nolist; 
  delete OutStats;   /* delete this data set if it exists */
%do i = 1 %to &NumSamples;       /* repeat many times (the dreaded macro loop!) */
   data Temp;                    /* 1. create a sample of size &N */
   do i = 1 to &N;               
      x = rand("Uniform");
   proc means data=Temp noprint; /* 2. compute a statistic */
      var x;
      output out=Out mean=SampleMean;
   /* use PROC APPEND to accumulate statistics */
   proc append base=OutStats data=Out; run;
options notes;
/* call macro to simulate data and compute statistics */
%Simulate(10, 1000)
/* 4. analyze the sampling distribution of the statistic */
proc univariate data=OutStats;
   histogram SampleMean;

For each iteration of the macro loop, the program creates a data set (Temp) with one sample that contains 10 observations. The MEANS procedure runs and creates one statistic, the sample mean. Then the APPEND procedure adds the newly computed sample mean to a data set that contains the means of the previous samples. This process repeats 1,000 times. When the macro finishes, PROC UNIVARIATE analyzes the distribution of the sample means.

I'm sure you'll agree that this is just about the World's Simplest Simulation. The data simulation step is trivial, and computing a sample mean is also trivial. So how long does this World's Simplest Simulation take to complete?

About 60 seconds! This code is terribly inefficient!

Would you like to perform the same computation more than 100 times faster? Read on.

Improve the performance of your simulation by using BY processing

The key to improving the simulation is to restructure the simulation algorithm as follows:

  1. Simulate many random samples of size N from a statistical model.
  2. Compute a statistic for each sample.
  3. Examine the union of the statistics, which approximates the sampling distribution of the statistic and tells you how the statistic varies due to sampling variation.

To implement this restructured (but equivalent) algorithm, insert an extra DO loop inside the DATA step and use a BY statement in the procedure that computes the statistics, as shown in the following example:

/* efficient simulation that calls a SAS procedure */
%let N = 10;
%let NumSamples = 1000;
data Uniform(keep=SampleID x); 
do SampleID = 1 to &NumSamples;  /* 1. create many samples */
   do i = 1 to &N;               /* sample of size &N */
      x = rand("Uniform");
proc means data=Uniform noprint; 
   by SampleID;                  /* 2. compute many statistics */
   var x;
   output out=OutStats mean=SampleMean;
/* 3. analyze the sampling distribution of the statistic */
proc univariate data=OutStats;
   histogram SampleMean;

The first step is to simulate the data. You already know how to write statements that simulate one random sample, so just add a DO loop around those statements. The second step is to compute the statistics for each sample. You already know how to compute one statistic, so just add a BY statement to the procedure syntax. That's it. It's a simple technique, but it makes a huge difference in performance. (Remember to use the NOPRINT option so you don't get a "Log Window full" message!)

How long does the BY-group analysis require? It's essentially instantaneous. You can use OPTIONS FULLSTIMER to time the operations. On my computer it takes about 0.07 seconds to simulate the data and the same amount of time to analyze it.

Why BY-group processing is fast and macro processing is slow

The first published description of this technique that I know of is the article "A Remark on Efficient Simulations in SAS" by Ilya Novikov (2003, J. RSS). Novikov mentions that the macro approach minimizes disk space, but that the BY-group technique minimizes time. (He also thanks Phil Gibbs of SAS Technical Support, who has been teaching SAS customers this technique since the mid-1990s.) For the application in his paper, Novikov reports that the macro approach required five minutes on his Pentium III computer, whereas the BY-group technique completed in five seconds.

You can use a variation of this technique to do bootstrap computation in SAS. See David Cassell's 2007 SAS Global Forum paper, "Don't Be Loopy: Re-Sampling and Simulation the SAS Way" for a discussion of implementing bootstrap methods in SAS.

So why is one approach so slow and the other so fast? Programmers who work with matrix/vector languages such as SAS/IML software are familiar with the idea of vectorizing a computation. The idea is to perform a few matrix computations on matrices and vectors that hold a lot of data. This is much more efficient than looping over data and performing many scalar computations.

Although SAS is not a vector language, the same ideas apply. The macro approach suffers from a high "overhead-to-work" ratio. The DATA step and the MEANS procedure are called 1,000 times, but they generate or analyze only 10 observations in each call. This is inefficient because every time that SAS encounters a procedure call, it must parse the SAS code, open the data set, load data into memory, do the computation, close the data set, and exit the procedure. When a procedure computes complicated statistics on a large data set, these "overhead" costs are small relative to the computation performed by the procedure. However, for this example, the overhead costs are large relative to the computational work.

The BY-group approach has a low overhead-to-work ratio. The DATA step and PROC MEANS are each called once and they do a lot of work during each call.

So if you want to write an efficient simulation in SAS, use BY-group processing. It is often hundreds of times faster than writing a macro loop.

tags: Efficiency, SAS Programming, Simulation


  1. Fareeza
    Posted July 18, 2012 at 1:19 pm | Permalink

    I ran into a problem where the dataset was too big to work with efficiently using BY and using a macro with smaller portions of the dataset ended up being faster.
    Basically, you may run into computational limits with BY groups based on the dataset size, RAM and HD space.

    • Posted July 18, 2012 at 1:40 pm | Permalink

      Do you remember what procedure you were using? Most (or all?) procedures compute with each BY group in sequence, so that the whole data set is not held in memory. I'd be interested in hearing more if you recall any details.

      • Geoffrey Brent
        Posted July 27, 2012 at 3:30 am | Permalink

        I tested this advice out and had similar results to Fareeza:

        %let n=1000; /* number of iterations */
        %let m=1000000; /* size of one iteration */

        %macro calculate_means();

        options nonotes;
        proc datasets nolist;
        delete outstats;

        %do i=1 %to &n;

        data temp;
        do i=1 to &m;

        proc means data=temp noprint;
        var x;
        output out=out mean=samplemean;
        proc append base=outstats data=out;

        options notes;


        data test;
        do sampleid=1 to &n;
        do i=1 to &m;

        proc means data=test noprint;
        by sampleid;
        var x;
        output out=outstats mean=samplemean;

        For small values of m, the 'macro' option was much faster: at m=1000 n=1000 the macro method took 16 seconds compared to 0.45 seconds using BY.

        But when I increase m to 100000 the difference became negligible: 33 seconds for macro vs 34 for BY.

        At m=1000,000 the macro method was substantially faster: 187 seconds, vs 473 seconds for BY. Note that the BY approach would be creating a file on the order of 3 gigabytes here, which may cause its own overhead.

        • Posted July 27, 2012 at 7:44 am | Permalink

          Thanks for the feedback. I think you meant to say "For small values of m, the 'macro' option was much SLOWER."

          My main point is that every time you call a SAS procedure you incur overhead costs. You want to make sure that the work that the procedure does is substantial compared with those costs. For most simulation studies, the sample size (my &N, your &m) is small, because for large samples you can use asymptotic results to get standard errors and confidence intervals. I'd guess that 99% of simulations that I've seen are done for samples sizes less than 1,000.

          As the sample sizes get large, other factors come into play. You report that m=100,000 is about the size at which the two methods become comparable for your hardware and that the BY-group approach is slower for huge samples. You have essentially rediscovered the reason behind the "hybrid method" proposed by Novikov and Oberman (2007), which combines the macro and BY-group approaches when doing massive simulations. In their paper, they note that both approaches are "non-optimal with respect to computing time for large simulations." [emphasis added] See their paper for discussion, examples, and a SAS macro that combines the two methods.

          • Geoffrey Brent
            Posted July 29, 2012 at 8:55 pm | Permalink

            Yes, that should have been "macro is slower for small values of m".

            FWIW, in my job a data set with 30k-100k observations is unremarkable, and some run well into the millions, so that may have coloured my views on what counts as a "small" data set. The fact that we deal with data sets of that size is a big part of why we're using SAS in the first place...

  2. Doc Muhlbaier
    Posted July 20, 2012 at 9:15 am | Permalink

    "Performance," broadly spoke, it a tradeoff. The characteristics that most impact it are CPU speed, memory size, disk space, data transmission speed, and user effort.

    In simulations, with relatively simple data structures, the BY approach generally works better as disk space is often a non-issue, even with large numbers of replications. Bootstrapping, on the other hand, can have both wide and long data, as well as more replicates; then the disk space and data transmission speed become parameters that need to be considered. Most SAS procedures are designed to minimize memory usage (not putting all of the data in memory at once), but there are a few that try; the individual procedure documentation helps here.

    I included "user effort" in my tradeoff list because simple code can make a real difference in maintenance and understanding.

    Doc Muhlbaier

  3. Oksana Naumenko
    Posted April 7, 2013 at 9:21 pm | Permalink

    Could you please show an example where many data sets are created from a standard multivariate normal distribution? Examples of storing the generated data would be helpful, also.

  4. Dallas
    Posted June 3, 2013 at 1:12 pm | Permalink

    I can't tell you how much of a help this post was! I was doing the wrong way by using macros. It took over 24 hours for my simulation to complete. I ended up cutting my number of samples from 500 to 100 just to get it to run in about an hour. When I implemented your method, the whole thing ran in 33 seconds! Thank you so much for this time-saving information!

  5. Chris Meaney
    Posted November 25, 2013 at 10:36 am | Permalink

    Great post @Rick. The simulation you describe is embarrassingly parallel. As such, is there a way in SAS 9.3 to distribute your multiple BY statement tasks across different cores of your PC so that the code could run even faster? Something like parallel() or foreach() in R? A piece of example code would be greatly appreciated as I can't seem to find any such reference on the web.

    • Posted November 25, 2013 at 11:04 am | Permalink

      This simulation runs in a fraction of a second, so you don't need to parallelize it. In fact, if I run the hundreds of programs in my 300-page book Simulating Data with SAS, the cumulative time is only a few minutes, with the longest-running program requiring only about 30 seconds. By using the techniques in my book, you can write efficient programs that run quickly.

      That said, many SAS procedures have been multithreaded since SAS 9.2, and will automatically use multiple cores on your PC. For simulations in which you are varying one or more parameter (e.g., power computations), there are various schemes for distributing computations. Do a web search that includes the terms "sas grid doninger" and you'll get dozens of hits.

  6. Hatem
    Posted August 28, 2014 at 5:24 pm | Permalink

    How to simulate sample data in the interval [0,1]? I mean simulated data include value zero and value one and also values between 0 and 1 using SAS package

    • Posted August 29, 2014 at 6:04 am | Permalink

      I would advise you not to worry about whether 0 and 1 are possible values that are returned from a random number generator. The uniform distribution on the closed unit interval (U[0,1]) and on the open interval (U(0,1)) have the same cumulative distribution. Consequently, they are identical distributions. Therefore use the RAND("uniform") function in SAS to sample from the uniform distribution as described in this article on generating uniform random numbers.

  7. kifayat
    Posted September 26, 2014 at 10:33 am | Permalink

    Dear All
    I Need Simulation codes for Location based routing protocols for under water sensor networks.
    please help me

  8. Rasheed
    Posted July 5, 2015 at 12:47 am | Permalink

    I have multiple data sets named
    dsnAC16 dsnAC17 dsnAC18 dsnAC19 dsnAC110.........dsnAC125
    dsnAC26 dsnAC27 dsnAC28 dsnAC29 dsnAC210.........dsnAC225
    dsnAC56 dsnAC57 dsnAC58 dsnAC59 dsnAC510.........dsnAC525

    I want to combine all above data set and I am using following code but getting some errors pls guide
    I want to do this task without macro loop
    %let i=5; %let pat=20;

    data combineAC;
    do %let k = 1 to &i;

    do %let h= %eval(1+&i) to %eval(&pats+&i);



  9. Kat
    Posted August 13, 2015 at 4:33 pm | Permalink

    Hi Rick,

    Thank you for these tips! I had started my simulations using a macro, and have updated my code as much as possible (using the "by" statement) and it runs much faster! I have a question about figuring out where "warnings" occur, though ... my log is giving warnings that look like in about 15 or so of the 500 simulations I'm running, the logistic model is not converging (this is expected). But, it doesn't say which simulation numbers these warnings are occurring at (and the output dataset has values for all simulations, so it doesn't help to look there). Any way I could have the simulation number written to in the log for those warnings when I've used the "by" command?


    • Posted August 13, 2015 at 4:53 pm | Permalink

      The log contains NOTES that give the BY group information, but that's not what you want to use. Probably you are using the OUTEST= option on the PROC LOGISTIC step to get the parameter estimates. The OUTEST= data set contains a variable called _STATUS_ which has values like "0 Converged" or "1 Warning." Thus if you only want to analyze the parameter estimates for which the model converged, you can say
      WHERE _STATUS_ = "0 Converged";
      or if you want to see the runs that failed
      WHERE _STATUS_ ^= "0 Converged";

      For more details, see my blog post about logistic simulation.

      • Kat
        Posted August 13, 2015 at 10:30 pm | Permalink

        Great, thanks so much for the quick reply!

  10. IVY zhang
    Posted September 30, 2015 at 6:14 pm | Permalink

    Thank you for the posting. This is very helpful. My questions, when running 1000 simulations for negative binomial regression, there is one simulation having some extreme value and run into ERROR: Floating Point Overflow. And the whole procedure will be terminated. And there will be no output. Is there anyway we can skip this particular simulation and still do rest of the all?

    • Posted September 30, 2015 at 9:10 pm | Permalink

      Yes. Identify the statement that is overflowing and trap the condition before it overflows. For example, if the overflow is in calling the EXP function, use the CONSTANT function to make sure that the argument never exceeds LOGBIG. For details and an example, see the article "Constants in SAS".

  11. Kat
    Posted October 22, 2015 at 11:00 am | Permalink

    Hi Rick,

    I have been using the BY-group processing for running simulations, which has worked well and saved me lots of time up to this point! However, I need to increase the "sample size" per simulation as well as the number of simulations I'm running, and now SAS is giving me errors in that it is "unable to allocate sufficient memory". (For example, I need 5,000 observations per simulated dataset, and am running 3,000 simulations -- so the base dataset has 15 million observations. Then, I'm repeating that by 12 different outcomes, so there would be 180 million observations.) I have tried increasing the memory used, but that did not help. My only other thought is to create multiple datasets (each as large as possible) and repeat the BY-group processing on each dataset. Would you have any other solutions/suggestions?


    • Posted October 22, 2015 at 11:16 am | Permalink

      You've done all the right things. 180 million observations is not huge by modern standards, so you might be running out of space in WORK. Your SAS administrator might be able to help you allocate more space for the WORK directory. Or you can use a permanent libref, but remember to delete those files when you are finished!

      Your idea to break up the problem into smaller ones is quite reasonable. That is one of the tips I give for running huge simulations in Chapter 6 of Simulating Data with SAS (particularly section 6.4.5).

6 Trackbacks

  1. By Using macro loops for simulation - The DO Loop on July 25, 2012 at 5:27 am

    [...] Not only is the macro loop approach slow, but there are other undesirable side effects that can occur. For example, in the example code that I presented, I used the OPTION NONOTES statement and made the comment that the option "prevents the SAS log from overflowing." Someone asked me to explain what I meant by that comment. [...]

  2. [...] my article "Simulation in SAS: The slow way or the BY way," I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze [...]

  3. [...] I have written previously, use BY-group processing to carry out efficient simulation and analysis in SAS. Also, be sure to suppress the display of tables and graphs during the analysis by using the the [...]

  4. [...] enjoy this week's free book excerpt. Additionally, Rick kindly provided a link to his blog post Simulation in SAS: The slow way or the BY way - for an example of what NOT to do in Base [...]

  5. […] The program in my previous post consists of four steps. (Or five, if you count creating a data set as a separate step.) To simulate multiple samples, put a DO loop around Step 4, the step that generates a random binary response vector from the probabilities that were computed for each observation in the model. The following program writes a single data set that contains 100 samples. Each sample is identified by an ordinal variable named SampleID. The data set is in the form required for BY-group processing, which is the efficient way to simulate data with SAS. […]

  6. […] during a data analysis, lack-of-convergence during a simulation study can be maddening. Recall that the efficient way to implement a simulation study in SAS is to use BY-group processing to analyze thousands of independent samples. A simulation study is designed to run in batch mode […]

Post a Comment

Your email is never published nor shared. Required fields are marked *


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>