The 7-year question


I have been at SAS for 7 years and up until 10 days ago, I had never been asked this question. Since then, I've been asked four times, so now must be the time to answer it!

Question: Can we simply use a linear regression model to predict the response rate from a binomial experiment rather than logistic regression? My colleagues prefer least squares regression even for binary outcomes (respond vs. not respond) because log odds are hard to interpret and to translate into additive probabilities.

Answer: Using least squares regression for binary outcomes dates back to before the mid-1980s; at that time, regression analysis was the only widely available approach. There are several problems with using regression analysis for binary outcomes. The assumptions of (1) normality and (2) constant variance are violated, so you haven’t met the underlying assumptions of the model. (A binary response with success probability p has variance p(1-p), which changes as p changes.) Another problem: (3) we want our predicted probabilities (in our case, response rates) to be in the interval (0,1). However, predicted values from regression analysis are unbounded, and you could get predicted probabilities greater than 1 or less than 0. What would predicted probabilities of 2.43 or -3.2 mean? Since those are nonsensical, elaborate rules for discarding predicted values have to be developed, based on subject matter knowledge.
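To make problem (3) concrete, here is a minimal sketch in Python (not SAS) that fits an ordinary least squares line to a small binary data set. The data values and variable names are made up purely for illustration and do not come from the course data. The only point is that the fitted straight line drops below 0 and climbs above 1 at the ends of the range.

```python
# Hypothetical toy data: x is a predictor, y is a binary outcome (1 = respond).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form simple linear regression (ordinary least squares).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# "Predicted probabilities" read off the straight line.
ols_predictions = [intercept + slope * xi for xi in x]
out_of_range = [p for p in ols_predictions if p < 0 or p > 1]

print("smallest prediction:", round(min(ols_predictions), 4))
print("largest prediction:", round(max(ols_predictions), 4))
print("predictions outside [0, 1]:", len(out_of_range))
```

With real data the violations can be much larger, especially when you predict for predictor values near or beyond the edge of the observed range.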

Even with those problems, regression analysis was the only game in town and, for large samples, the normal approximation would provide a solution. Ordinary least squares regression remained in common use for binary outcomes until generalized linear models, which John Nelder and Robert Wedderburn formulated in 1972, became widely adopted in the 1980s. In their formulation, the distribution is no longer restricted to the normal distribution. Instead, the response can come from an entire family of distributions, known as the exponential family. This includes such well-known distributions as the normal, gamma, Bernoulli, binomial, multinomial, negative binomial, Poisson, and others. Furthermore, the linear model is related to the response variable via a link function, which restricts predicted probabilities to the range (0,1) for binomial outcomes; the formulation also allows the magnitude of the variance of each measurement to be a function of its predicted value. So, generalized linear models overcome all three problems mentioned above; they unify many statistical models, such as regression analysis, logistic regression, and Poisson regression (among others).
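As a rough illustration of the link function idea for binary outcomes, the sketch below (Python, standard library only) uses the logit link, which is the link that logistic regression uses. The function names and the sample linear predictor values are my own choices for illustration; two of the values are the "nonsensical" 2.43 and -3.2 mentioned earlier, which become perfectly valid probabilities once they are treated as log odds.

```python
import math

def logit(p):
    """Link function: maps a probability in (0, 1) to the log odds."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: maps any linear predictor value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_variance(p):
    """Variance of a single binary response with success probability p."""
    return p * (1.0 - p)

# Hypothetical linear predictor values, including some far from zero.
linear_predictors = [-10.0, -3.2, 0.0, 2.43, 10.0]
for eta in linear_predictors:
    p = inv_logit(eta)
    print(f"eta={eta:6.2f}  probability={p:.5f}  variance={binomial_variance(p):.5f}")
```

The variance column changes with the predicted probability and is largest at 0.5, which is the non-constant variance that a generalized linear model builds in rather than assumes away.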

You can use PROC LOGISTIC, PROC GENMOD, or PROC GLIMMIX for logistic regression. In each procedure, you get your predicted probabilities using the ESTIMATE statement. Below is sample code using data from our Design of Experiments for Direct Marketing course, but the syntax can be used for any binary outcome. Adding the ILINK option after the slash on the ESTIMATE statement gives the predicted probabilities (i.e., our predicted response rates); they are the values labeled ‘Mean’ in the output that follows:
 
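proc logistic data=s.dem0304;
   /* GLM (less-than-full-rank) coding for the classification variables */
   class interestRate Sticker / param=glm;
   /* Model the probability that response = 1 */
   model response(event="1") = interestRate Sticker riskScr;
   /* ILINK applies the inverse link, adding the predicted probability ('Mean') to the output */
   estimate 'Response for High Interest Rate' int 1 interestRate 1 0 / ilink;
   estimate 'Response for Low Interest Rate' int 1 interestRate 0 1 / ilink;
   /* Wald test of the difference between the two interest rate levels */
   contrast 'Response for High vs Low Interest Rate' interestRate 1 -1;
   estimate 'overall response rate' int 1 / ilink;
run;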

Contrast Test Results

Contrast                                 DF   Wald Chi-Square   Pr > ChiSq
Response for High vs Low Interest Rate    1          290.5752       <.0001

Estimate

Label                             Estimate   Standard Error   z Value   Pr > |z|      Mean   Standard Error of Mean
Response for High Interest Rate    -3.3428           0.3179    -10.52     <.0001   0.03413                  0.01048

Estimate

Label                             Estimate   Standard Error   z Value   Pr > |z|      Mean   Standard Error of Mean
Response for Low Interest Rate     -2.2162           0.3145     -7.05     <.0001   0.09831                  0.02788

Estimate

Label                             Estimate   Standard Error   z Value   Pr > |z|      Mean   Standard Error of Mean
overall response rate              -2.7795           0.3145     -8.84     <.0001   0.05844                  0.01730

The first output, labeled ‘Contrast Test Results’, indicates that the predicted probability (our response rate) for the HIGH interest rate offer is significantly different from the one for the LOW interest rate offer. The predicted probabilities (response rates) are in the tables labeled ‘Estimate’, and can be found in the columns labeled ‘Mean’. These results indicate that the predicted response rate is 0.03413 for the High Interest Rate offer, 0.09831 for the Low Interest Rate offer, and 0.05844 overall. (Standard errors are shown in the last column of each table.)
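If you want to see where the ‘Mean’ column comes from, the following Python sketch applies the inverse logit to each value in the ‘Estimate’ column of the output above. The standard error calculation uses the usual delta-method approximation; that this matches what the procedure does internally is an assumption on my part, although it does agree with the printed values to the digits shown.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# (label, estimate on the logit scale, standard error), copied from the output above.
rows = [
    ("Response for High Interest Rate", -3.3428, 0.3179),
    ("Response for Low Interest Rate", -2.2162, 0.3145),
    ("overall response rate", -2.7795, 0.3145),
]

results = {}
for label, estimate, std_err in rows:
    mean = inv_logit(estimate)
    # Delta-method approximation (an assumption here):
    # SE(mean) is roughly mean * (1 - mean) * SE(estimate).
    se_mean = mean * (1.0 - mean) * std_err
    results[label] = (mean, se_mean)
    print(f"{label}: mean={mean:.5f}, se_mean={se_mean:.5f}")
```

This is also a handy way to explain log odds to colleagues who prefer probabilities: the model is fit on the logit scale, but every estimate can be converted back to a response rate.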

While this example is for direct marketing, the ideas may be generalized to any data set with a dichotomous outcome.


About Author

Chris Daman

Sr Analytical Training Consultant

Chris Daman is a statistical training specialist and course developer in the Education Division at SAS. She has more than 20 years of teaching experience—both nationally and internationally—in the fields of programming, statistics, and mathematics. Before joining SAS in 2005, she taught classes at N.C. State University and IBM, worked in the pharmaceutical and financial industries, and was a survey statistician at an international research organization. She currently teaches advanced statistics courses covering mixed models, generalized linear mixed models, hierarchical linear models, and design of probability surveys; in addition, she teaches design of experiments and analysis of complex data, such as longitudinal data, multilevel data, or data from complex surveys. She also teaches data mining classes, including applied analytics and advanced decision trees. She has a bachelor's degree in mathematics from the University of North Carolina at Greensboro and a master's degree in statistics from N.C. State University. Chris's favorite part of teaching is the interaction with the students. To keep them involved with the material and each other, she often uses a variety of teaching techniques (such as analogies, optical illusions, stories, object lessons, and group interactions) rather than the standard instructor-to-student lecture format. As a result, students give high ratings to her classes and typically include comments such as "I enjoyed Chris's teaching style very much. She did an excellent job of engaging the class and fostering interactions between all the students and herself" or "I love Chris's sense of humor. It definitely helps you get through complicated material". In her spare time, Chris enjoys dancing, reading, spending time with her family, and traveling.

1 Comment

  1. Cat Truxillo on

    Great post, Chris! I sometimes get the same question, and I'm sure lots of people will feel better having your post to clarify the problem when faced with pressure to use OLS regression to model a binomial outcome variable.
