Meaningful names for columns of a design matrix


A programmer recently asked a question on a SAS discussion forum about design matrices for categorical variables. He had generated a design matrix by using PROC GLMMOD and wanted to use the design columns in a subsequent procedure. However, the columns were named COL1, COL2, COL3,..., so he couldn't tell which dummy variables correspond to each categorical variable. The following example illustrates his situation for the Weight_Status and Smoking_Status variables in the Sashelp.Heart data set:

proc glmmod data=Sashelp.Heart outdesign=GLMDesign;
   class Weight_Status Smoking_Status;
   model Cholesterol = Weight_Status Smoking_Status;
   ods select Parameters;
proc contents data=GLMDesign varnum short; run;
The association betwen columns in a design matrix and levels of the original categorical variables

The "Parameters" table shows the association between columns of the design matrix and levels of the categorical variables in the model. The output from PROC CONTENTS shows that the columns of the design matrix (as stored in the GLMDesign data set) are named COL1, COL2, and so forth. On the discussion forum, I showed how to save the "Parameters" table to a SAS data and use a DATA _NULL_ step to form macro variables that you can use to associate the design columns to the original variables.

In retrospect, I missed a golden opportunity to mention that the GLMMOD procedure (which always produces a singular design matrix) is not as friendly or powerful as other procedures in SAS that can generate design matrices. In particular, the GLMSELECT and TRANSREG procedures can create design matrices for many different parameterizations of the classification variables. Furthermore, these procedures automatically create macro variables that tell you the names of the columns in the design matrix.

As a reminder, a major reason to create a design matrix is to perform an analysis with a SAS procedure that does not support a CLASS statement. For example, the documentation of the MCMC procedure shows how to use PROC TRANSREG to create a design matrix as preparation for performing a Bayesian regression analysis. For details of the many ways to generate design matrices in SAS, see my previous article, "Four ways to create a design matrix in SAS."

GLMSELECT: An easier way to associate columns of a design matrix

The GLMSELECT procedure supports the OUTDESIGN= option, which enables you to output a design matrix for the variables in a regression model. The GLMSELECT procedure has the following advantages of the GLMMOD procedure:

  • The procedure supports the EFFECT statement, which you can use to define spline effects, collection effects, and more. The OUTDESIGN= option creates columns for any effect that you can define.
  • The procedure supports nonsingular parameterizations such as the 'effect' or 'reference' parameterizations (PARAM=EFFECT or PARAM=REF, respectively).
  • The dummy variables have meaningful names of the form VarName_Level, where VarName is the name of a categorical variable and Level is one of its values.
  • The procedure automatically creates a macro variable (called _GLSMod) that contains the names of the columns of the design matrix.

The following statements create a design matrix for the Weight_Status and Smoking_Status variables. The design matrix uses the 'effect coding' parameterization and is written to the GLSDesign data set. The value of the _GLSMod macro is displayed in the SAS log and can be used in a subsequent procedure call:

proc glmselect data=Sashelp.Heart outdesign(fullmodel)=GLSDesign noprint;
   class Weight_Status(ref='Normal')
         Smoking_Status(ref='Non-smoker') / param=effect;
   model Cholesterol = Weight_Status Smoking_Status / selection=none; /* include ALL dummy variables */
ods select Position;
proc contents data=GLSDesign varnum; run;   /* display all variables in OUTDESIGN= data set */
%put &_GLSMod;                              /* names of columns in design matrix */
--- SAS Log ---
Weight_Status_Overweight Weight_Status_Underweight
Smoking_Status_Heavy__16_25_ Smoking_Status_Light__1_5_
Smoking_Status_Moderate__6_15_ Smoking_Status_Very_Heavy____25_

The output from PROC CONTENTS show the variables in the GLSDesign data set, which are INTERCEPT, CHOLESTEROL (the response variable in the model), and the dummy variables listed in the _GLSMod macro variable. By default, the dummy variables have the pattern VarName_Level, although you can use the OUTDESIGN(PREFIX=prefix) option to generate column names of the form prefix1, prefix2, and so forth, where prefix is a user-specified prefix. You can see that values such as "Heavy (16-25)" are transformed into valid names for SAS variables.

In this example, the _GLSMOD macro contains the names of ALL dummy variables because the SELECTION=NONE option is used. If you use an option (such as SELECTION=LASSO) to perform variable selection, the _GLSMOD variable will contain the names of the dummy variables that are selected in the final model. This means that you can use the _GLSMOD variable to perform additional analyses. For example, if you want to use POC REG to output collinearity diagnostics for the variables in the final model, you could execute the following:

/* use the variables in the final model in a procedure that does not support a CLASS statement */
proc reg data=GLSDesign plots=none; 
   model Cholesterol = &_GLSMod / collin;

TRANSREG: Dummy variables and transformed variable

In a similar way, the TRANSREG procedure supports the DESIGN option, which adds dummy variables to the output data set, as shown in the following call. The syntax is different, but the dummy variables in the TRGDesign data set are formed by the same 'effect coding' as in the previous example. The procedure automatically creates a macro variable (called _TRGInd) that contains the names of the columns of the design matrix.

proc transreg data=Sashelp.Heart design;
   model class(Weight_Status Smoking_Status / EFFECTS 
          zero="Normal"      "Non-smoker");
   output out=TRGDesign;
%put &_TRGInd;
Weight_StatusOverweight Weight_StatusUnderweight Smoking_StatusHeavy__16_25_ Smoking_StatusLight__1_5_
Smoking_StatusModerate__6_15_ Smoking_StatusVery_Heavy____25_

The dummy variables contain the same values as in the previous example, but the two procedures construct slightly different names. The GLMSELECT names contain an extra underscore. For example, a dummy variable in the GLSDesign data set is Weight_Status_Overweight whereas the corresponding variable in the TRGDesign data set is Weight_StatusOverweight.

It is worth noting the PROC TRANSREG also supports macro variables that specify the names of dependent variables and transformed variables. You can even use the MACRO option to specify the name of the macro variables that are created. For details, see the documentation for the PROC TRANSREG statement.

In summary, when you generate a design matrix by using the GLMSELECT or TRANSREG procedures, the procedures create dummy variables that have meaningful names of the form VarName_Level or VarNameLevel. The procedures each create a macro variable that contains the names of the dummy variables. The _GLSMOD macro that is created by PROC GLMSELECT contains the names of dummy variables in the final selected model, so use SELECTION=NONE if you want all names. These macros make it easy for a programmer to refer to columns of the design matrix.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

1 Comment

  1. Pingback: The best way to generate dummy variables in SAS - The DO Loop

Leave A Reply

Back to Top