I have been at SAS for 7 years and up until 10 days ago, I had never been asked this question. Since then, I've been asked four times, so now must be the time to answer it!
Question: Can we simply use a linear regression model to predict the response rate from a binomial experiment rather than logistic regression? My colleagues prefer least squares regression even for binary outcomes (respond vs. not respond) because log odds are hard to interpret and translate to additive probabilities.
Answer: The practice of using least squares regression for binary outcomes dates back to before the mid-1980s, when regression analysis was the only available approach. There are several problems with using regression analysis for binary outcomes. The assumptions of (1) normality and (2) constant variance are both violated, so you haven’t met the underlying assumptions of the model. Another problem: (3) we want our predicted probabilities (in our case, response rates) to be in the interval (0,1). However, predicted values from regression analysis are unbounded, and you could get predicted probabilities greater than 1 or less than 0. What would predicted probabilities of 2.43 or -3.2 mean? Since those are nonsensical, elaborate rules for discarding predicted values have to be developed, based on subject matter knowledge.
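To make the first two problems concrete, here is a short sketch of the algebra. It assumes a single predictor x and writes p(x) for the response probability; these symbols are introduced only for illustration and do not come from the course data.

```latex
% Y is a 0/1 response with P(Y = 1 | x) = p(x)
\begin{align*}
E[Y \mid x] &= p(x), \\
\operatorname{Var}(Y \mid x) &= p(x)\,\bigl(1 - p(x)\bigr).
\end{align*}
% A least squares fit models the mean as a straight line in x:
\[
p(x) = \beta_0 + \beta_1 x
\]
```

Because the variance depends on p(x), it cannot be constant unless the response rate is the same for every x. Because Y takes only the values 0 and 1, the errors cannot be normally distributed. And whenever the slope is nonzero, the straight line eventually leaves the interval (0,1) for large or small enough x.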
Even with those problems, regression analysis was the only game in town and, for large samples, the normal approximation would provide a solution. Ordinary least squares regression remained in common use for binary outcomes until generalized linear models, which John Nelder and Robert Wedderburn formulated in 1972, became widely adopted during the 1980s. In their formulation, the distribution of the response is no longer restricted to the normal distribution. Instead, the response can come from an entire family of distributions, known as the exponential family. This includes such well-known distributions as the normal, gamma, Bernoulli, binomial, multinomial, negative binomial, Poisson, and others. Furthermore, the linear model is related to the mean of the response via a link function; for binomial outcomes, the inverse of the link restricts predicted probabilities to the range (0,1). The formulation also allows the variance of each measurement to be a function of its predicted value. So, generalized linear models overcome all three problems mentioned above, and they unify many statistical models, such as regression analysis, logistic regression, and Poisson regression (among others).
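For logistic regression, the link in question is the logit. As a sketch, with η standing for the linear predictor and x_1, ..., x_k for generic predictors (notation assumed here, not taken from the course material):

```latex
\[
g(p) = \log\!\left(\frac{p}{1 - p}\right) = \eta
     = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}} .
\]
```

The linear predictor η can take any real value, yet the inverse link always returns a p strictly between 0 and 1. That is how problem (3) is avoided, while the binomial variance p(1 - p) takes care of problem (2).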
You can use PROC LOGISTIC, PROC GENMOD, and PROC GLIMMIX for logistic regression. In each procedure, you get your predicted probabilities using the ESTIMATE statement. Below is sample code using data from our Design of Experiments for Direct Marketing course, but the syntax can be used for any binary outcome. Adding the ILINK option after the slash on the ESTIMATE statement gives the predicted probabilities (i.e., our predicted response rates). They are the values labeled ‘Mean’ in the output that follows:
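proc logistic data=s.dem0304;
   /* treat interestRate and Sticker as classification variables */
   class interestRate Sticker / param=glm;
   /* model the probability that response = 1 */
   model response(event="1") = interestRate Sticker riskScr;
   /* ILINK reports each estimate on the probability scale ('Mean') */
   estimate 'Response for High Interest Rate' int 1 interestRate 1 0 / ilink;
   estimate 'Response for Low Interest Rate' int 1 interestRate 0 1 / ilink;
   /* test whether the high and low interest rate offers differ */
   contrast 'Response for High vs Low Interest Rate' interestRate 1 -1;
   estimate 'overall response rate' int 1 / ilink;
run;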
Contrast Test Results

| Contrast | DF | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|
| Response for High vs Low Interest Rate | 1 | 290.5752 | <.0001 |

Estimate

| Label | Estimate | Standard Error | z Value | Pr > \|z\| | Mean | Standard Error of Mean |
|---|---|---|---|---|---|---|
| Response for High Interest Rate | -3.3428 | 0.3179 | -10.52 | <.0001 | 0.03413 | 0.01048 |

Estimate

| Label | Estimate | Standard Error | z Value | Pr > \|z\| | Mean | Standard Error of Mean |
|---|---|---|---|---|---|---|
| Response for Low Interest Rate | -2.2162 | 0.3145 | -7.05 | <.0001 | 0.09831 | 0.02788 |

Estimate

| Label | Estimate | Standard Error | z Value | Pr > \|z\| | Mean | Standard Error of Mean |
|---|---|---|---|---|---|---|
| overall response rate | -2.7795 | 0.3145 | -8.84 | <.0001 | 0.05844 | 0.01730 |

The first output, labeled ‘Contrast Test Results’, indicates that the predicted probability (our response rate) for the HIGH interest rate offer is significantly different from the one for the LOW interest rate offer. The predicted probabilities (response rates) are in the tables labeled ‘Estimate’, and can be found in the columns labeled ‘Mean’. These results indicate that the predicted response rate is 0.03413 for the High Interest Rate offer, 0.09831 for the Low Interest Rate offer, and 0.05844 overall. (Standard errors are shown in the last column of each table.)
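The ‘Mean’ values can be checked by hand from the ‘Estimate’ column. The sketch below assumes the logit link (the default in PROC LOGISTIC) and applies its inverse to each log-odds estimate. The standard error check assumes a delta-method approximation; that assumption is consistent with the reported numbers but is not stated in the output itself.

```latex
% Inverse logit applied to each value in the 'Estimate' column
\begin{align*}
\text{High interest rate:} \quad & \hat{p} = \frac{1}{1 + e^{3.3428}} \approx 0.0341 \\
\text{Low interest rate:}  \quad & \hat{p} = \frac{1}{1 + e^{2.2162}} \approx 0.0983 \\
\text{Overall:}            \quad & \hat{p} = \frac{1}{1 + e^{2.7795}} \approx 0.0584
\end{align*}
% Assumed delta-method approximation for the 'Standard Error of Mean' column
\[
\operatorname{SE}(\hat{p}) \approx \hat{p}\,(1 - \hat{p})\,\operatorname{SE}(\hat{\eta}),
\qquad \text{e.g.}\quad 0.03413 \times 0.96587 \times 0.3179 \approx 0.0105 .
\]
```

This is also a practical answer to the interpretability concern in the question: the model is fit on the log-odds scale, but the reported results can be read directly as response rates.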
While this example is for direct marketing, the ideas may be generalized to any data set with a dichotomous outcome.
1 Comment
Great post, Chris! I sometimes get the same question, and I'm sure lots of people will feel better having your post to clarify the problem when faced with pressure to use OLS regression to model a binomial outcome variable.