Last week I read an interesting paper by Bob Rodriguez: "Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT Software." In it, Rodriguez summarizes five modern techniques for building predictive models and highlights recent SAS/STAT procedures that implement those techniques. The paper discusses the following high-performance (HP) procedures in SAS/STAT software:
- The GLMSELECT procedure builds parsimonious linear regression models. This procedure supports modern effect selection techniques such as LASSO, LAR, and the elastic net. (Technically, the GLMSELECT procedure pre-dates the HP procedures. However, it is multithreaded. The HPREG procedure provides similar functionality in a distributed environment.)
- The HPGENSELECT procedure builds generalized linear models such as Poisson models, zero-inflated models, Tweedie models, and more. The procedure supports the selection of effects.
- The QUANTSELECT procedure builds quantile regression models and features effect selection.
- The GAMPL procedure fits generalized additive models. I have previously shown the power of the GAMPL procedure to fit binary response data.
- The HPSPLIT procedure builds classification and regression trees. Tree-based procedures are a popular choice when you want a nonparametric model that is interpretable in terms of business rules.
If you are unfamiliar with these newer procedures, I encourage you to read the Rodriguez paper. Although they are designed for distributed computations, you can also run these procedures in single-machine mode.Five procedures for building statistical models in #SAS for large complex data Click To Tweet
Thoughts on the "SELECT" procedures
The GLMSELECT, HPGENSELECT, and (HP)QUANTSELECT procedures support choosing a small number of effects from among a large list of possible effects. Some statisticians call these "variable selection" procedures, but "effect selection" is a more appropriate term because an important use case is to use the procedures to select important interaction effects from the list of all second order interactions. Regardless of what you call them, the procedures automatically select a small number of effects that provide a good predictive model from among hundreds or thousands of possible effects. You can either accept the selected model or use traditional modeling techniques to refine the model from among the selected candidate effects.
While thinking about the use of the word "SELECT" in the names of these procedures, it occurred to me that there is another SAS/STAT procedure that contains the word SELECT, and that is the SURVEYSELECT procedure. However, the SURVEYSELECT procedure is different in that it randomly selects observations (rows in the design matrix) whereas the previous procedures select variables (columns in the design matrix).
This is not the only example of this observation-variable dichotomy in SAS/STAT. The cluster procedures all have "CLUS" as part of their names. The ACECLUS, CLUSTER, FASTCLUS, and MODECLUS procedures all attempt to group observations into clusters. However, the VARCLUS procedure is a dimension reduction technique that groups variables together.