I previously wrote about partial leverage plots for regression diagnostics and why they are useful. You can generate a partial leverage plot in SAS by using the PLOTS=PARTIALPLOT option in PROC REG. One useful property of partial leverage plots is the ability to graphically represent the null hypothesis that a regression coefficient is 0. John Sall ("Leverage Plots for General Linear Hypotheses", TAS, 1990) showed that you could add a confidence band to the partial regression plot such that if the line segment for Y=0 is not completely inside the band, then you reject the null hypothesis.
This article shows two ways to get the confidence band. The easy way is to write the partial leverage data to a SAS data set and use the REG statement in PROC SGPLOT to overlay a regression fit and a confidence band for the mean. This way is easy to execute, but it requires performing an additional regression analysis for every effect in the model. A second way (which is due to Sall, 1990) computes the confidence band directly from the statistics on the original data. This article shows how to use the SAS/IML language to compute the confidence limits from the output of PROC REG.
Confidence limits on a partial leverage plot: The easy way
The following example is taken from my previous article, which explains the code. The example uses a subset of Fisher's iris data. For convenience, the variables are renamed to Y (the response) and X1, X2, and X3 (the regressors). The call to PROC REG includes
data Have; set sashelp.iris( rename=(PetalLength = Y PetalWidth = x1 SepalLength = x2 SepalWidth = x3) where=(Species="Versicolor")); label Y= x1= x2= x3=; run; proc reg data=Have plots(only)=PartialPlot; model Y = x1 x2 x3 / clb partial partialdata; ods exclude PartialPlotData; ods output PartialPlotData=PD; quit; /* macro to create a partial leverage plot (with CLM) for each variable */ %macro MakePartialPlot(VarName); title "Partial Leverage Plot for &VarName"; title2 "with 95% CL for H0: Beta_&VarName = 0"; proc sgplot data=PD ; reg y=part_Y_&VarName x=part_&VarName / CLM; refline 0 / axis=y; run; %mend; %MakePartialPlot(x2); %MakePartialPlot(x3); |
Confidence limits on a partial leverage plot: The efficient way
The previous section shows an easy way to produce partial leverage regression plots in SAS. The confidence bands are a graphical way to examine the null hypotheses βi = 0. Look at the confidence band for the partial leverage plot for the X2 variable. The Y=0 axis (the gray line) is NOT wholly contained in the confidence band. You can therefore reject (with 95% confidence) the null hypotheses that β2=0. In contrast, look at the confidence band for the partial leverage plot for the X3 variable. The Y=0 axis IS wholly contained in the confidence band. You therefore cannot reject the null hypotheses that β3=0.
Although this method is easy, Sall (1990) points out that it is not the most computationally efficient method. For every variable, the SGPLOT procedure computes a bivariate regression, which means that this method requires performing k+2 linear regressions: the regression for the original data that has k regressors, followed by additional bivariate regression for the intercept and each of the k regressors. In practice, each bivariate regression can be computed very quickly, so the entire process is not very demanding. Nevertheless, Sall (1990) showed that all the information you need is already available from the regression on the original data. You just need to save that information and use it to construct the confidence bands.
The following section uses Sall's method to compute the quantities for the partial leverage plots. The program is included for the mathematically curious; I do not recommend that you use it in practice because the previous method is much simpler.
Sall's method for computing partial leverage regression plots
Sall's method requires the following output from the original regression model:
- The means of the regressor variables. You can get these statistics by using the SIMPLE option on the PROC REG statement and by using ODS OUTPUT SimpleStatistics=Simple to save the statistics to a data set.
- The inverse of the crossproduct matrix (the X`*X matrix). You can get this matrix by using the I option on the MODEL statement and by using ODS OUTPUT InvXPX=InvXPX to save the matrix to a data set.
- The mean square error and the error degrees of freedom for the regression model. You can get these statistics by using ODS OUTPUT ANOVA=Anova to save the ANOVA table to a data set.
- The parameter estimates and the F statistics for the null hypotheses βi=0. You can get these statistics by using the CLB option on the MODEL statement and by using ODS OUTPUT ParameterEstimates=PE to save the parameter estimates table to a data set. The data set contains t statistics for the null hypotheses, but the F statistic is the square of the t statistic.
The following call to PROC REG requests all the statistics and saves them to data sets:
proc reg data=Have plots(only)=PartialPlot simple; model Y = x1 x2 x3 / clb partial partialdata I; ods exclude InvXPX PartialPlotData; ods output SimpleStatistics=Simple InvXPX=InvXPX ANOVA=Anova ParameterEstimates=PE PartialPlotData=PD(drop=Model Dependent Observation); quit; |
Sall (1990) shows how to use these statistics to create the partial leverage plots. The following SAS/IML program implements Sall's method.
/* Add confidence bands to partial leverage plots: Sall, J. P. (1990). "Leverage Plots for General Linear Hypotheses." American Statistician 44:308–315. */ proc iml; alpha = 0.05; /* Use statistics from original data to estimate the confidence bands */ /* xBar = mean of each original variable */ use Simple; read all var "Mean" into xBar; close; p = nrow(xBar)-1; /* number of parameters */ xBar = xBar[1:p]; /* remove the Y mean */ /* get (X`*X)^{-1} matrix */ use InvXPX; read all var _NUM_ into xpxi; close; xpxi = xpxi[1:p, 1:p]; /* remove column and row for Y */ hbar = xbar` * xpxi * xbar; /* compute hbar from inv(X`*X) and xbar */ /* get s2 = mean square error and error DF */ use ANOVA; read all var {DF MS}; close; s2 = MS[2]; /* extract MSE */ DFerror = DF[2]; /* extract error DF */ /* get parameter estimates and t values */ use PE; read all var {Estimate tValue}; close; FValue = tValue#tValue; /* F statistic is t^2 for OLS */ b = Estimate; /* lines have slope b[i] and intercept 0 */ tAlpha = quantile("T", 1-alpha/2, DFerror); /* assume DFerror > 0 */ FAlpha = tAlpha#tAlpha; /* = quantile( "F", 1-alpha, 1, DFerror); */ /* Get partial plot data from REG output */ /* read in part_: names for intercept and X var names */ use PD; read all var _NUM_ into partial[c=partNames]; close; /* The formula for the confidence limits is from Sall (1990) */ nPts = 100; /* score CL at nPts values on [xMin, xMax] */ XPts = j(nPts, p, .); /* allocate matrices to store the results */ Pred = j(nPts, p, .); Lower = j(nPts, p, .); Upper = j(nPts, p, .); do i = 1 to p; levX = partial[,2*i-1]; /* partial x variable */ xMin = min(levX); xMax = max(levX); /* [xMin, xMax] */ /* partial regression line is (t,z) and CL band is (t, z+/- SE) */ t = T( do(xMin,xMax,(xMax-xMin)/(nPts-1))); /* assume xMin < xMax */ z = b[i]*t; /* predicted mean: yHat = b[i]*x + 0 */ SE = sqrt( s2#FAlpha#hbar + FAlpha/FValue[i] # (z#z) ); /* Sall, 1990 */ XPts[,i] = t; Pred[,i] = z; Lower[,i] = z - SE; Upper[,i] = z + SE; end; /* create names for new variables for the CL band */ oddIdx = do(1,ncol(partNames),2); /* names we want are the ODD elements */ Names = partNames[,oddIdx]; XNames = compress('X_' +Names); PredNames = compress('Pred_'+Names); LowerNames = compress('LCL_' +Names); UpperNames = compress('UCL_' +Names); /* write regression lines and confidence bands to a data set */ Results = XPts || Pred || Lower || Upper; create PartCL from Results[c=(XNames||PredNames||LowerNames||UpperNames)]; append from Results; close; QUIT; |
The data set PartCL contains the fit lines (which have slopes equal to b[i]) and the lower and upper confidence limits for the means of the partial leverage plots. The lines are evaluated at 100 evenly spaced points in the range of the "partial variables." The program uses matrix multiplication to evaluate the quadratic form \(\bar{x}^\prime (X^\prime X)^{-1} \bar{x}\). The remaining computations are scalar operations.
The following statements merge the regression lines and confidence intervals with the data for the partial leverage plots. You can then create a scatter plot of the partial leverage values and overlay the regression line and confidence bands.
data Want; set PD PartCL; run; /* macro for plotting a partial leverage plot with CL from the IML program */ %macro MakePartialPlotCL(VarName); title "Partial Leverage Plot for &VarName"; proc sgplot data=Want noautolegend; band x=x_part_&VarName lower=LCL_part_&VarName upper=UCL_part_&VarName; scatter x=part_&VarName y=part_Y_&VarName; series x=x_part_&VarName y=Pred_part_&VarName; refline 0 / axis=y; run; %mend; ods graphics / PUSH AttrPriority=None width=360px height=270px; %MakePartialPlotCL(x2); %MakePartialPlotCL(x3); ods graphics / POP; /* restore ODS GRAPHICS options */ |
As expected, the graphs are the same as the ones in the previous section. However, the quantities in the graphs are all produced by PROC REG as part of the computations for the original regression model. No additional regression computations are required.
Summary
This article shows two ways to get a confidence band for the conditional mean of a partial regression plot. The easy way is to use the REG statement in PROC SGPLOT. A more efficient way is to compute the confidence band directly from the statistics on the original data. This article shows how to use the SAS/IML language to compute the confidence limits from the output of PROC REG.
This article also demonstrates a key reason why SAS customers choose SAS/IML software. They want to compute some statistic that is not directly produced by any SAS procedure. The SAS/IML language enables you to combine the results of SAS procedures and use matrix operations to compute other statistics from formulas in textbooks or journal articles. In this case, PROC IML consumes the output of PROC REG and creates regression lines and confidence bands for partial leverage plots.
4 Comments
Interesting and very clear, as usual, thanks a lot, Rick!
Does this method apply in case of specification error (eg when the partial leverage plot exhibits a curvilinear pattern for some X, suggesting to re-specific the model after adding a quadratic term, X²)?
I am also wondering about the added value of this graphical tool over the t-value of each regression coefficient which is displayed by the default output of PROC REG.
Maybe, while they both tools (analytic and graphical) test for the same H0, one could take advantage (but how?) of the fact that instead of being limited to one figure (the t value), a partial leverage plot offers a multi-points insight.
Anyway, thanks for your always stimulating posts on the Do Loop!
Jean-Claude
< Does this method apply in case of specification error
The partial leverage regression line indicates whether the LINEAR effect is significant. The curvilinear pattern might be visible in the residual values, but not in the line or its confidence bands.
< I am also wondering about the added value of this graphical tool over the t-value of each regression coefficient
See the previous article for some thoughts on this. If all you care about is the p-value, the graphical method does not provide any advantage.
The general rule for graphical methods is that you expect a graph to provide more insight than a single statistic. The statistic is only valid if the assumptions of the statistical method are met. As you mentioned, diagnostic plots can help diagnose issues such as a misspecified model or heteroscedastic errors, whereas the p-value does not reveal those issues.
Many thanks for your above explanations.
I notice that, in the Fisher's iris data you are using as an example, you select only continuous variables (for both the dependent and the independent variables).
What is the validity and interpretation of partial leverage plots when the full model includes dummies as explanatory variables ("individual" dummies as well as dummies belonging to a set of dummies standing for a more than two categories nominal variable).
And how to cope with partial leverage plots when the model specification includes quadratic terms (X and X²) and/or interaction terms between, say, two variables? Are these plots still usable ?
Yes, Sall's paper addresses this on p. 331. Short answer: the partial leverage plots make sense. He describes how to interpret a partial leverage plot if the sole regressor is a classification variable. Also the case of a classification effect and a continuous covariate.
I don't think quadratic terms or interaction effects present any problems. They are merely columns in the design matrix and are treated the same as the main effects.