I was recently asked about how to interpret the output from the COLLIN (or COLLINOINT) option on the MODEL statement in PROC REG in SAS. The example in the documentation for PROC REG is correct but is somewhat terse regarding how to use the output to diagnose collinearity and how to determine which variables are collinear. This article uses the same data but goes into more detail about how to interpret the results of the COLLIN and COLLINOINT options.
An overview of collinearity in regression
Collinearity (sometimes called multicollinearity) involves only the explanatory variables. It occurs when a variable is nearly a linear combination of other variables in the model. Equivalently, there a set of explanatory variables that is linearly dependent in the sense of linear algebra. (Another equivalent statement is that the design matrix and the X`X matrices are not full rank.)
For example, suppose a model contains five regressor variables and the variables are related by X3 = 3*X1 - X2 and X5 = 2*X4;. In this case, there are two sets of linear relationships among the regressors, one relationship that involves the variables X1, X2, and X3, and another that involves the variables X4 and X5. In practice, collinearity means that a set of variables are almost linearly combinations of each other. For example, the vectors u = X3 - 3*X1 + X2 and v = X5 - 2*X4; are close to the zero vector.
Unfortunately, the words "almost" and "close to" are difficult to quantify. The COLLIN option on the MODEL statement in PROC REG provides a way to analyze the design matrix for potentially harmful collinearities.
Why should you avoid collinearity in regression?
The assumptions of ordinary least square (OLS) regression are not violated if there is collinearity among the independent variables. OLS regression still provides the best linear unbiased estimates of the regression coefficients.
The problem is not the estimates themselves but rather the variance of the estimates. One problem caused by collinearity is that the standard errors of those estimates will be big. This means that the predicted values, although the "best possible," will have wide prediction limits. In other words, you get predictions, but you can't really trust them.
A second problem concerns interpretability. The sign and magnitude of a parameter estimate indicate how the dependent variable changes due to a unit change of the independent variable when the other variables are held constant. However, if X1 is nearly collinear with X2 and X3, it does not make sense to discuss "holding constant" the other variables (X2 and X3) while changing X1. The variables necessarily must change together. Collinearities can even cause some parameter estimates to have "wrong signs" that conflict with your intuitive notion about how the dependent variable should depend on an independent variable.
A third problem with collinearities is numerical rather than statistical. Strong collinearities cause the cross-product matrix (X`X) to become ill-conditioned. Computing the least squares estimates requires solving a linear system that involves the cross-product matrix. Solving an ill-conditioned system can result in relatively large numerical errors. However, in practice, the statistical issues usually outweigh the numerical one. A matrix must be extremely ill-conditioned before the numerical errors become important, whereas the statistical issues are problematic for moderate to large collinearities.
How to interpret the output of the COLLIN option?
The following example is from the "Collinearity Diagnostics" section of the PROC REG documentation. Various health and fitness measurements were recorded for 31 men, such as time to run 1.5 miles, the resting pulse, the average pulse rate while running, and the maximum pulse rate while running. These measurements are used to predict the oxygen intake rate, which is a measurement of fitness but is difficult to measure directly.
data fitness; input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@; datalines; 44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185 44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172 38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176 40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170 44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186 44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192 45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164 54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185 51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168 48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168 57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165 52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155 51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172 51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155 49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176 52 82.78 47.467 10.50 53 170 172 ; proc reg data=fitness plots=none; model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / collin; ods select ParameterEstimates CollinDiag; ods output CollinDiag = CollinReg; quit;
The output from the COLLIN option is shown. I have added some colored rectangles to the output to emphasize how to interpret the table. To determine collinearity from the output, do the following:
- Look at the "Condition Index" column. Large values in this column indicate potential collinearities. Many authors use 30 as a number that warrants further investigation. Other researchers suggest 100. Most researchers agree that no single number can handle all situations.
- For each row that has a large condition index, look across the columns in the "Proportion of Variation" section of the table. Identify cells that have a value of 0.5 or greater. The columns of these cells indicate which variables contribute to the collinearity. Notice that at least two variables are involved in each collinearity, so look for at least two cells with large values in each row. However, there could be three or more cells that have large values. "Large" is relative to the value 1, which is the sum of each column.
Let's apply these rules to the output for the example:
- If you use 30 as a cutoff value, there are three rows (marked in red) whose condition numbers exceed the cutoff value. They are rows 5, 6, and 7.
- For the 5th row (condition index=33.8), there are no cells that exceed 0.5. The two largest cells (in the Weight and RestPulse columns) indicate a small near-collinearity between the Weight and RestPulse measurements. The relationship is not strong enough to worry about.
- For the 6th row (condition index=82.6), there are two cells that are 0.5 or greater (rounded to four decimals). The cells are in the Intercept and Age columns. This indicates that the Age and Intercept terms are nearly collinear. Collinearities with the intercept term can be hard to interpret. See the comments at the end of this article.
- For the 7th row (condition index=196.8), there are two cells that are greater than 0.5. The cells are in the RunPulse and MaxPulse columns, which indicates a very strong linear relationship between these two variables.
Your model has collinearities. Now what?
After you identify the cause of the collinearities, what should you do? That is a difficult and controversial question that has many possible answers.
- Perhaps the simplest solution is to use domain knowledge to omit the "redundant" variables. For example, you might want to drop MaxPulse from the model and refit. However, in this era of Big Data and machine learning, some analysts want an automated solution.
- You can use dimensionality reduction and an (incomplete) principal component regression. (This link also discussed partial least squares (PLS).)
- You can use a biased estimation technique such as ridge regression, which allows bias but reduces the variance of the estimates.
- Some practitioners use variable selection techniques to let the data decide which variables to omit from the model. However, be aware that different variable-selection methods might choose different variables from among the set of nearly collinear variables.
Should you center the data before performing the collinearity check?
Equally controversial is the question of whether to include the intercept term as a variable when running the collinearity diagnostics. The COLLIN option in PROC REG includes the intercept term among the variables to be analyzed for collinearity. The COLLINOINT option excludes the intercept term and, more importantly, centers the data by subtracting the mean of each column in the data matrix. Which should you use? Here are two opinions that I found:
- Do not center the data (Use the intercept term): Belsley, Kuh, and Welsch (Regression Diagnostics, 1980, p. 120) state that centering is "inappropriate in the event that [the design matrix]contains a constant column." They go on to say (p. 157) that "centering the data [when the model has an intercept term]can mask the role of the constant in any near dependencies and produce misleading diagnostic results." These quotes strongly favor using the COLLIN option, which analyzes exactly the design matrix that is used to construct the parameter estimates for the regression.
- Center the data (Do not use the intercept term): If the intercept is outside of the data range, Freund and Littell (SAS System for Regression, 3rd Ed, 2000) argue that including the intercept term in the collinearity analysis is not always appropriate. "The intercept is an estimate of the response at the origin, that is, where all independent variables are zero. .... [F]or most applications the intercept represents an extrapolation far beyond the reach of the data. For this reason, the inclusion of the intercept in the study of multicollinearity can be useful only if the intercept has some physical interpretation and is within reach of the actual data space." For the example data, it is impossible for a person to have zero age, weight, or pulse rate, therefore I suspect Freund and Little would recommend using the COLLINOINT option for these data. Remember that the COLLINOINT option centers the data, so you are performing collinearity diagnostics on a different data matrix than is used to construct the regression estimates.
So what should you do if the experts disagree? Well, SAS provides both options, but I usually defer to the math. Although I am reluctant to contradict Freund and Littell (both widely published experts and Fellows of the American Statistical Association), the mathematics of the collinearity analysis seems to favor the opinion of Belsley, Kuh, and Welsch, who argue for analyzing the same design matrix that is used for the regression estimates. This means using the COLLIN option. I have written a separate article that investigates this issue in more detail.
Do you have an opinion on this matter? Leave a comment.