A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.
One model: Many samples
As I've written before, BY-group analysis is also an efficient way to analyze simulated sample or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.
In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of time are relatively slow.
As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.Stuck in a macro loop? BY-group processing can be more efficient. #SASTip Click To Tweet
Many models: One sample
There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.
One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:
%macro RunReg(DSName, NumVars); ... %do i = 1 %to &NumVars; /* repeat for each x&i */ proc reg data=&DSName noprint outest=PE(rename=(x&i=Value)); /* save parameter estimates */ model Y = x&i; /* model Y = x_i */ quit; /* ...then accumulate statistics... */ %end; %mend;
The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.
If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.
The BY way for many models
An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.
To run a BY-group analysis:
- Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
- Sort the data by the BY-group variable.
- Run the SAS procedure, which uses the BY statement to specify each model.
1. Transpose the data
In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.
/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */ data Long; set Wide; /* <== specify data set name HERE */ array x [*] x1-x&nCont; /* <== specify explanatory variables HERE */ do varNum = 1 to dim(x); VarName = vname(x[varNum]); /* variable name in char var */ Value = x[varNum]; /* value for each variable for each obs */ output; end; drop x:; run;
2. Sort the data
In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:
/* 2. Sort by BY-group variable */ proc sort data=Long; by VarName; run;
3. Run the analyses
You can now call a SAS procedure one time to compute all regression models:
/* 3. Call PROC REG and use BY statement to compute all regressions */ proc reg data=Long noprint outest=PE; by VarName; model Y = Value; quit; /* Look at the results */ proc print data=PE(obs=5); var VarName Intercept Value; run;
The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.
You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).
This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
Can you think of other ways to use this trick? Leave a comment.