Which variables are in the final selected model?

2

When you use a regression procedure in SAS that supports variable selection (GLMSELECT or QUANTSELECT), did you know that the procedures automatically produce a macro variable that contains the names of the selected variables? This article provides examples and details. A previous article provides an overview of the 'SELECT' procedures in SAS for building statistical models.

The final model in PROC GLMSELECT

PROC GLMSELECT uses variable selection techniques such as LAR and LASSO to fit a parsimonious linear model from a large number of potential regressors. The following call to PROC GLMSELECT is adapted from the "Getting Started" example from the documentation, which models the log-transformed salaries of baseball players by using on-the-field statistics, player characteristics, and team attributes. When PROC GLMSELECT runs, it creates the _GLSIND macro variable, which contains the names of the effects in the final model.

/* GLMSELECT creates &_GLSInd for selected model */
proc glmselect data=Sashelp.Baseball;
   class league division;
   model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / selection=lasso;
quit;
 
%put &_GLSInd;      /* display names of effects for selected model */
 
/* Use _GLSInd to call PROC GLM on final selected model */
proc glm data=Sashelp.Baseball;
   class league division;
   model logSalary = &_GLSInd / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   nHits nRBI nBB YrMajor CrHits CrRuns CrRbi

The MODEL statement in PROC GLMSELECT includes 18 independent variables, but the final LASSO model contains only seven variables. The _GLSInd macro contains the name of the selected variables. As shown in the example, the macro can be used in subsequent analyses. The example uses the macro on the MODEL statement of PROC GLM. The _GLSIND variable contains the original variable names, which means that if you use the CLASS statement (or EFFECT statement) in the GLMSELECT model, you should specify the same statement in subsequent procedures.

If you want to use the selected variables in a procedure that does not support the CLASS statement (or EFFECT statement), you should use the OUTDESIGN= option in PROC GLMSELECT to generate a design matrix and use the _GLSMOD macro variable to specify the names of the dummy variables in the final model. An example is given at the end of this article.

The final model in PROC QUANTSELECT

Similarly, PROC QUANTSELECT creates a macro variable (named _QSLInd) that names the independent variables in the final selected model. For example, the following example models the conditional 90th percentile of the LogSalary variable for the same data set:

/* QUANTSELECT creates _QRSInd for selected model */
proc quantselect data=sashelp.baseball;
class league division;
model logSalary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi
                  crBB league division nOuts nAssts nError / 
                  selection=lasso quantile=0.9;
run;
 
%put &_QRSInd;
--- SAS Log ---
%put &_QRSInd;
   nRBI CrHome nBB nRuns

When modeling the top 10% of salaries, the selected model includes two factors (career home runs and the number of runs scored in the previous year) that did not appear when predicting the conditional mean of the salaries. You can use the _QRSInd macro to build a predictive model in PROC QUANTREG.

PROC HPQUANTSELECT also creates a macro variable that contains the final selected independent variables. The macro is named _HPQRSIND.

Macros created by other selection procedures in SAS

The HPGENSELECT and LOGISTIC procedures, which can perform variable selection for generalized linear models, do not create a macro variable that contains the selected variables. However, the STEPDISC procedure creates a macro variable named _STDVar that contains the names of the quantitative variables that best discriminate among the classes in a discriminant analysis.

What happens if you use splines or split classification variables?

You might wonder what happens if you use the EFFECT statement to generate spline effects and/or use the SPLIT option on the CLASS statement to enable individual levels of a classification variable to enter/leave the model independently of other levels. (For the LASSO, LAR, and ELASTICNET methods, all spline effects and classification effects are split.) In this case, the final model is likely to contain only certain levels of a categorical variable or only certain basis functions for a spline effect. Consequently, probably you will need to use the design matrix and _GLSMOD macro variable in subsequent procedures.

For example, the following example creates a spline effect with five internal knots and splits the levels of two categorical variables as part of selecting a LASSO model. The final model contains the "Type='Sedan'" level, the "Origin='Asia'" level, and the first and fourth spline basis for the Weight variable:

proc glmselect data=Sashelp.Cars(where=(Type^="Hybrid"))
               outdesign=GLSSplitDesign;      /* create design matrix */
  class origin type / split;                  /* LASSO will force the SPLIT option */
  effect splWt  = spline(weight / details naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model mpg_city = origin | type | splWt @ 2  /  selection=Lasso;
quit;
 
%put &_GLSInd;    /* model refers to effects that are not part of the input data */
%put &_GLSMod;    /* model refers to dummy variables that are in the OUTDESIGN= data set */
 
/* for split variables, use the _GLSMOD macro and the OUTDESIGN= data set */
proc glm data=GLSSplitDesign;
   model mpg_city = &_GLSMod / solution;
quit;
--- SAS Log ---
%put &_GLSInd;
   Type_Sedan Type_Sports Origin_Asia*Type_Sedan splWt:1 splWt:4 splWt:4*Origin_USA
%put &_GLSMod;
   Type_Sedan Type_Sports Origin_Asia_Type_Sedan splWt_1 splWt_4 splWt_4_Origin_USA

Notice that the _GLSIND macro contains names such as "splWt:1" that are not in the input data set. However, the names in the _GLSMOD macro are all valid columns in the GLSSplitDesign data set, which was created by using the OUTDESIGN= option. So if you are using the LASSO, LAR, or ELASTICNET methods of variable selection or if you manually specify the SPLIT option, you will want to use the _GLSMOD macro and the design data set in subsequent analyses.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

2 Comments

  1. This is a very easy method to find the best model.
    But somehow I got a error message from running "Proc GLMSELECT; model logSalary = &_GLSInd / solution;"
    It indicates the 'solution' in model statement is not recognized. I wonder why.

Leave A Reply

Back to Top