Tell Me about those Pesky CLASS Variables, Part 3: Effect Coding


In the previous post on CLASS variables, we developed the idea of creating design variables and examined GLM and reference coding. Another coding scheme, known as ‘Effect Coding’ or ‘Deviation from the Mean Coding’, is also commonly used. Consider our previous scenario of modeling the average amount spent on a credit card (SPEND) as a function of the variable INCOME (which has three levels: Low, Medium, and High). Here is the simple data set we’ll use:

Effect coding creates design variables by using -1s, 0s, and +1s. We replace the variable INCOME with two new design (or indicator) variables, ‘Low’ and ‘Medium’ as shown below:

data spend3;
  set spend1;
  Low=1*(income='Low')+0*(income='Medium')-1*(income='High');
  Medium=0*(income='Low')+1*(income='Medium')-1*(income='High');
run;
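A quick way to confirm that the design variables came out as intended is to cross-tabulate them against INCOME. The step below is only a sketch and assumes the values of INCOME are spelled exactly 'Low', 'Medium', and 'High'. The comparisons in the DATA step are case-sensitive, so a value such as 'low' would silently be coded 0 on both design variables.

```sas
/* List each combination of INCOME with its Low and Medium codes */
proc freq data=spend3;
  tables income*low*medium / list nocum nopercent;
run;
```

Each level of INCOME should appear on exactly one row of the listing: Low with (1, 0), Medium with (0, 1), and High with (-1, -1).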

 

This is similar to reference coding, in that it uses only two design variables for the three-level INCOME variable. It differs in that the last level (High) is coded as -1 instead of 0. Recall that GLM coding creates three design variables and as a result “overparameterizes” the model.

When writing the PROC REG code for the model, use the design variables instead of the original variable INCOME. In the code below, High is the reference level because it is the level coded as -1 on both design variables, so it has no design variable of its own and does not appear on the MODEL statement:

proc reg data=spend3;
  model spend=low medium;
run;

How do we interpret these parameters? For effect coding, the intercept now represents the average of the three income-level means (which equals the overall average SPEND when every level has the same number of observations) and is estimated to be 1333.3333. The parameter estimates for Low and Medium indicate how far above (for a positive parameter) or below (for a negative parameter) the average SPEND amount for each of these categories is, compared to that overall average.
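One way to see where this interpretation comes from is to write out the model for each income level. The symbols here are notation introduced for illustration and do not appear in the SAS output: $\beta_0$ is the intercept, $\beta_{\text{Low}}$ and $\beta_{\text{Medium}}$ are the parameters for the two design variables, and $\mu$ denotes a level mean.

```latex
\begin{align*}
\mu_{\text{Low}}    &= \beta_0 + \beta_{\text{Low}} \\
\mu_{\text{Medium}} &= \beta_0 + \beta_{\text{Medium}} \\
\mu_{\text{High}}   &= \beta_0 - \beta_{\text{Low}} - \beta_{\text{Medium}} \\[4pt]
\mu_{\text{Low}} + \mu_{\text{Medium}} + \mu_{\text{High}} &= 3\beta_0
\quad\Longrightarrow\quad
\beta_0 = \tfrac{1}{3}\left(\mu_{\text{Low}} + \mu_{\text{Medium}} + \mu_{\text{High}}\right)
\end{align*}
```

Adding the three equations cancels the design-variable parameters, so the intercept is the simple average of the three level means.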

The arithmetic gets a little messy due to the decimals, but for Low income, the SPEND amount is estimated to be:

1333.33333 - 758.33333 = 575

For Medium income, the estimated SPEND amount is:

1333.33333 - 8.33333 = 1325

To find the estimated SPEND amount for observations in the High category, look at the coding for the design variables for observations in that category:

Notice that for these two observations the values of the indicator variables for Low and Medium are both -1. It may not be obvious, but to find the parameter for High, we multiply the parameter estimates for Low and Medium by -1 and add the results:

-1*(-758.33333) -1*(-8.33333) = 766.66666

This represents the deviation from the average. So the SPEND amount for High income is:

1333.33333 + 766.66666 = 2100 (rounding off)
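The hand arithmetic above can also be left to SAS. The DATA step below is a minimal sketch: the estimates are typed in by hand from the values used in this post, and the variable names (intercept, b_low, b_medium, and so on) are made up for illustration.

```sas
/* Parameter estimates copied by hand from the values used in the text */
data _null_;
  intercept = 1333.33333;
  b_low     = -758.33333;
  b_medium  = -8.33333;

  /* High is coded -1 on both design variables */
  b_high = -1*b_low - 1*b_medium;

  /* Estimated SPEND = intercept + deviation for the level */
  pred_low    = intercept + b_low;
  pred_medium = intercept + b_medium;
  pred_high   = intercept + b_high;

  /* Write the results to the SAS log */
  put b_high= pred_low= pred_medium= pred_high=;
run;
```

Apart from rounding in the last decimal places, the values written to the log should agree with the 575, 1325, and 2100 worked out by hand.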

To summarize, the average SPEND amounts are 575, 1325, and 2100 for the income categories Low, Medium, and High, respectively. While the parameter estimates are different from the ones obtained previously for GLM and reference coding, the estimated SPEND amounts are the same.
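Because the model contains only the one categorical predictor, these estimated amounts are simply the sample means of SPEND within each income level. As a sanity check, assuming the original data set spend1 holds the variables SPEND and INCOME as used above, the means can be computed directly:

```sas
/* Sample size and mean of SPEND for each level of INCOME */
proc means data=spend1 n mean;
  class income;
  var spend;
run;
```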

While we can always create our own design variables, several modeling procedures have a CLASS statement that will create them for us: GENMOD, GLM, GLIMMIX, LOGISTIC, MIXED, and others. Check the online documentation for each procedure to find what coding schemes and options are available for those pesky CLASS variables.
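As one illustration, here is a sketch of how the same effect-coded model might be requested in PROC GENMOD without building the design variables by hand. It assumes a normal distribution with an identity link for SPEND and names High as the reference level explicitly. The options available on the CLASS statement differ between procedures and releases, so treat this as a starting point rather than a recipe.

```sas
/* Let the CLASS statement build the effect-coded design variables */
proc genmod data=spend1;
  class income (ref='High') / param=effect;
  model spend = income / dist=normal link=identity;
run;
```

With PARAM=EFFECT and High as the reference level, the parameters reported for INCOME correspond to the Low and Medium design variables created manually earlier.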


About Author

Chris Daman

Sr Analytical Training Consultant

Chris Daman is a statistical training specialist and course developer in the Education Division at SAS. She has more than 20 years of teaching experience—both nationally and internationally—in the fields of programming, statistics, and mathematics. Before joining SAS in 2005, she taught classes at N.C. State University and IBM, worked in the pharmaceutical and financial industries, and was a survey statistician at an international research organization. She currently teaches advanced statistics courses covering mixed models, generalized linear mixed models, hierarchical linear models, and design of probability surveys; in addition, she teaches design of experiments and analysis of complex data, such as longitudinal data, multilevel data, or data from complex surveys. She also teaches data mining classes, including applied analytics and advanced decision trees. She has a bachelor's degree in mathematics from the University of North Carolina at Greensboro and a master's degree in statistics from N.C. State University. Chris's favorite part of teaching is the interaction with the students. To keep them involved with the material and each other, she often uses a variety of teaching techniques (such as analogies, optical illusions, stories, object lessons, and group interactions) rather than the standard instructor-to-student lecture format. As a result, students give high ratings to her classes and typically include comments such as "I enjoyed Chris's teaching style very much. She did an excellent job of engaging the class and fostering interactions between all the students and herself" or "I love Chris's sense of humor. It definitely helps you get through complicated material". In her spare time, Chris enjoys dancing, reading, spending time with her family, and traveling.
