Encodings of CLASS variables in SAS regression procedures: A cheat sheet


SAS regression procedures support several parameterizations of classification variables. When a categorical variable is used as an explanatory variable in a regression model, the procedure generates dummy variables that are used to construct a design matrix for the model. The process of forming columns in a design matrix is called a parameterization or encoding. In SAS, most regression procedures use either the GLM encoding, the EFFECT encoding, or the REFERENCE encoding. This article summarizes the default and optional encodings for each regression procedure in SAS/STAT. In many SAS procedures, you can use the PARAM= option to change the default encoding.
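To make the idea of an encoding concrete, here is a minimal Python sketch of how indicator (dummy) columns can be formed from the levels of a categorical variable. It is an illustration only, not SAS code and not how any SAS procedure is implemented. The function name `glm_columns` and the sample data are hypothetical, and ordering the levels by sorted value is an assumption made for the example.

```python
def glm_columns(values):
    """Return (levels, rows): one 0/1 indicator column per level.

    Assumption for this sketch: levels are ordered by their sorted values.
    """
    levels = sorted(set(values))
    rows = [[1 if v == lev else 0 for lev in levels] for v in values]
    return levels, rows


# Hypothetical data: a classification variable with three levels
values = ["B", "A", "C", "A"]
levels, rows = glm_columns(values)

print("levels:", levels)
for v, row in zip(values, rows):
    print(v, row)
```

Each observation gets a 1 in the column for its own level and 0 elsewhere, which is the pattern that the GLM encoding uses for a main effect.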

The documentation section "Parameterization of Model Effects" provides a complete list of the encodings in SAS and shows how the design matrices are constructed from the levels. (The levels are the values of a classification variable.) Pasta (2005) gives examples and further discussion.

Default and optional encodings for SAS regression procedures

The following SAS regression procedures support the CLASS statement or a similar syntax. The columns GLM, REFERENCE, and EFFECT indicate the three most common encodings. The word "Default" indicates the default encoding, and the word "Yes" indicates an encoding that is supported but is not the default. For procedures that support the PARAM= option, the last column indicates the supported encodings. The word "All" means that the procedure supports the complete list of SAS encodings. Most procedures default to the GLM encoding; the exceptions are listed in the comments after the table.

Procedure                 GLM      REFERENCE  EFFECT   PARAM=
ANOVA                     Default  -          -        -
CATMOD                    -        -          Default  -
FMM                       Default  -          -        -
GAM                       Default  -          -        -
GAMPL                     Default  Yes        -        GLM | REF
GEE                       Default  -          -        -
GENMOD                    Default  Yes        Yes      All
GLM                       Default  -          -        -
GLMSELECT                 Default  Yes        Yes      All
HP regression procedures  Default  Yes        -        GLM | REF
ICPHREG                   Default  Yes        Yes      All
LOGISTIC                  Yes      Yes        Default  All
MIXED                     Default  -          -        -
ORTHOREG                  Default  Yes        Yes      All
PLS                       Default  -          -        -
PROBIT                    Default  -          -        -
PHREG                     Yes      Default    Yes      All
QUANTSELECT               Default  Yes        Yes      All
RMTSREG                   Default  Yes        Yes      All
SURVEYPHREG               Default  Yes        Yes      All
TRANSREG                  Yes      Default    Yes      -

A few comments:

  • The REFERENCE encoding is the default for PHREG and TRANSREG.
  • The EFFECT encoding is the default for CATMOD, LOGISTIC, and SURVEYLOGISTIC.
  • The HP regression procedures all use the GLM encoding by default and support only PARAM=GLM or PARAM=REF. The HP regression procedures include HPFMM, HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPPLS, HPQUANTSELECT, and HPREG. In spite of its name, GAMPL is also an HP procedure. In spite of its name, HPMIXED is NOT an HP procedure!
  • PROC LOGISTIC and PROC HPLOGISTIC use different default encodings: LOGISTIC uses the EFFECT encoding, whereas HPLOGISTIC uses the GLM encoding.
  • CATMOD does not have a CLASS statement because all variables are assumed to be categorical.
  • PROC TRANSREG does not support a CLASS statement. Instead, it uses a CLASS() transformation list. It uses different syntax to support parameter encodings.

How to interpret main effects for the SAS encodings

The GLM parameterization is a singular parameterization. The other encodings are nonsingular. The "Other Parameterizations" section of the documentation gives a simple one-sentence summary of how to interpret the parameter estimates for the main effects in each encoding:

  • The GLM encoding estimates the difference in the effect of each level compared to the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. The design matrix for the GLM encoding is singular.
  • The REFERENCE encoding estimates the difference in the effect of each nonreference level compared to the effect of the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. Notice that the REFERENCE encoding gives the same interpretation as the GLM encoding. The difference is that the design matrix for the REFERENCE encoding excludes the column for the reference level, so the design matrix for the REFERENCE encoding is (usually) nonsingular.
  • The EFFECT encoding estimates the difference in the effect of each nonreference level compared to the average effect over all levels.
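The three interpretations above can be checked on a toy example. The following Python sketch is an illustration under stated assumptions, not SAS code: it uses a one-way model with a single classification variable, takes the last sorted level as the reference level, and treats "the average effect over all levels" as the unweighted average of the level means. The function names `encode` and `predict` and the level means are hypothetical.

```python
def encode(level, levels, param):
    """Design columns for one observation; the last level is the reference."""
    ref = levels[-1]
    nonref = levels[:-1]
    if param == "GLM":
        # One indicator column per level, including the reference level
        return [1 if level == lev else 0 for lev in levels]
    if param == "REFERENCE":
        # Indicator columns for nonreference levels; reference row is all 0
        return [1 if level == lev else 0 for lev in nonref]
    if param == "EFFECT":
        # Same as REFERENCE, except the reference row is all -1
        if level == ref:
            return [-1] * len(nonref)
        return [1 if level == lev else 0 for lev in nonref]
    raise ValueError(f"unknown encoding: {param}")


def predict(level, levels, param, params):
    """Intercept plus design columns, multiplied by the parameters."""
    row = [1] + encode(level, levels, param)
    return sum(x * b for x, b in zip(row, params))


levels = ["A", "B", "C"]
means = {"A": 10.0, "B": 14.0, "C": 6.0}  # hypothetical level means

# GLM: the indicator columns sum to 1 in every row, which duplicates the
# intercept column, so the design matrix is singular.
glm_row_sums = [sum(encode(lev, levels, "GLM")) for lev in levels]

# REFERENCE: intercept = reference mean; others = difference from reference
ref_params = [means["C"]] + [means[lev] - means["C"] for lev in levels[:-1]]

# EFFECT: intercept = average of level means; others = difference from average
grand = sum(means.values()) / len(means)
eff_params = [grand] + [means[lev] - grand for lev in levels[:-1]]

for lev in levels:
    print(lev,
          predict(lev, levels, "REFERENCE", ref_params),
          predict(lev, levels, "EFFECT", eff_params))
```

In this sketch, both sets of parameters reproduce the same level means. Only the meaning of the individual parameters changes: differences from the reference level for REFERENCE, and differences from the average for EFFECT.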

This article lists the encodings that are supported by each SAS regression procedure. I hope you will find it to be a useful reference. If I've missed your favorite regression procedure, let me know in the comments.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
