An easy way to run thousands of regressions in SAS

3

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated sample or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of time are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.

Stuck in a macro loop? BY-group processing can be more efficient. #SASTip Click To Tweet

Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   quit;
 
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
 
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.

Conclusions

You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.

Can you think of other ways to use this trick? Leave a comment.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

3 Comments

  1. Peter Lancashire on

    Ah transpose - a neat trick. BY group processing is good and efficient when you can do everything in one PROC. The output for each group is gathered together. It suits the monolithic procedure design for which SAS is known. However, I almost always find I need to do something with the output from the PROC, usually with ODS graphics. In that case, the output becomes scattered. The introduction of PROC PLM also seems not to fit with the BY group processing model in that two PROCs are required, so you get two groups of output.
    .
    I know that you can use PROC DOCUMENT to reorganise the output but it seems to know nothing about BY groups, so some complex messing about is required. Here's a suggestion: make PROC DOCUMENT able to re-order pages into BY groups.

    • Rick Wicklin

      Hi Peter,
      I believe PROC DOCUMENT will do what you want. The BY-group information is part of the path for the tables and graphs. You have to use the LIST statement to examine the output. Then you can use the REPLAY statement to replay the items in BY-group order. But you are correct that there is some manual work. it is not automatic.

  2. Pingback: Reorder the output from a BY-group analysis in SAS - The DO Loop

Leave A Reply

Back to Top