Last week I discussed ordinary least squares (OLS) regression models and showed how to illustrate the assumptions about the conditional distribution of the response variable. For a single continuous explanatory variable, the illustration is a scatter plot with a regression line and several normal probability distributions along the line.
The term "conditional distribution of the response" is a real mouthful. For brevity, I will say that the graph shows the assumed "error distributions."
A similar graph can illustrate regression models that involve a transformation of the response variable. A common transformation is to model the logarithm of the response variable, which means that the predicted curve is exponential.
There are two common ways to construct an exponential fit of a response variable, Y, with an explanatory variable, X. The two models are as follows:
- A generalized linear model of Y that uses LOG as a link function. In SAS you can construct this model with PROC GENMOD by setting DIST=NORMAL and LINK=LOG.
- An OLS model of log(Y), followed by exponentiation of the predicted values. In SAS you can construct this model with PROC GLM or REG, although for consistency I will use PROC GENMOD with an identity link function.
To illustrate the two models, I will use the same 'cars' data as last time. These data were used by Arthur Charpentier, whose blog post about GLMs inspired me to create my own graphs in SAS. Thanks to Randy Tobias and Stephen Mistler for commenting on an early draft of this post.
A generalized linear model with a log link
A generalized linear model of Y with a log link function assumes that the response is predicted by an exponential function of the form Y = exp(b0 + b1X) + ε and that the errors are normally distributed with a constant variance. In terms of the mean value of Y, it models the log of the mean: log(E(Y)) = b0 + b1X.
The graph to the left illustrates this model for the "cars" data used in my last post. The X variable is the speed of a car and the Y variable is the distance required to stop.
How can you create this graph in SAS? As shown in my last post, you can run a SAS procedure to get the parameter estimates, then obtain the predicted values by scoring the model on evenly spaced values of the explanatory variable. However, when you create the data for the probability distributions, be sure to apply the inverse link function, which in this case is the EXP function. This centers the error distributions on the prediction curve.
The call to PROC GENMOD is shown below. For the details of constructing the graph, download the SAS program used to create these graphs.
title "Generalized Linear Model of Y with log link"; proc genmod data=MyData; model y = x / dist=normal link=log; ods select ParameterEstimates; store work.GenYModel / label='Normal distribution for Y; log link'; run; proc plm restore=GenYModel noprint; score data=ScoreX out=Model pred=Fit / ILINK; /* apply inverse link to predictions */ run;
An OLS model of log(Y)
An alternative model is to fit an OLS model for log(Y). The data set already contains a variable called LogY = log(Y). The OLS model assumes that log(Y) is predicted by a model of the form b0 + b1X + ε. The model assumes that the errors are normally distributed and that the expected value of log(Y) is linear: E(log(Y)) = b0 + b1X. You can use PROC GLM to fit the model, but the following statement uses PROC GENMOD and PROC PLM to provide an "apples-to-apples" comparison:
title "Linear Model of log(Y)"; proc genmod data=MyData; model logY = x / dist=normal link=identity; ods select ParameterEstimates; store work.LogYModel / label='Normal distribution for log(Y); identity link'; run; proc plm restore=LogYModel noprint; score data=ScoreX out=Model pred=Fit; run;
On the log scale, the regression line and the error distributions look like the graph in my previous post. However, transforming to the scale of the original data provides a better comparison with the generalized linear model from the previous section. When you exponentiate the log(Y) predictions and error distribution, you obtain the graph at the left.
Notice that the error distributions are NOT normal. In fact, by definition, the distributions are lognormal. Furthermore, on this scale the assumed error distribution is heteroscedastic, with smaller variances when X is small and larger variances when X is large. These are the consequences of exponentiating the OLS model.
Incidentally, you can obtain this same model by using the FMM procedure, which enables you to fit lognormal distributions directly.
A comparison of log-link versus log(Y) models
It is worth noting that the two models result in different predictions. The following graph displays both predicted curves. They first curve (the generalized linear model with log link) goes through the "middle" of the data points, which makes sense when you think about the assumed error distributions for that model. The second curve (the exponentiated OLS model of log(Y)) is higher for large values of X than you might expect, until you consider the assumed error distributions for that model.
Both models assume that the effect of X on the mean value of Y is multiplicative, rather than additive. However, as one of my colleagues pointed out, the second model also assumes that the effect of errors is multiplicative, whereas in the generalized linear model the effect of the errors is additive. Researchers must use domain-specific knowledge to determine which model makes sense for the data. For applications such as exponential growth or decay, the second model seems more reasonable.
So there you have it. The two exponential models make different assumptions and consequently lead to different predictions. I am not an expert in generalized linear models, so I found the graphs in this article helpful to visualize the differences between the two models. I hope you did, too. If you want to see how the graphs were created, download the SAS program.