In my previous blog post about Goldilocks and the negative R-Square, I think I left you with the impression that regression fits are garbage unless you trim down your models. Basically, your attitude should be that of a sculptor: You cut away at the model until you have the best image of reality.
Now I should confess that in order to get models that behaved so badly, I had to ratchet up the error variation and ratchet down the coefficient scale. My error standard deviation was 50 times the standard deviation of the coefficients, but that is somewhat unrealistic. Most models are not going to have negative R-Squares, even if they are overfit.
To see that, consider another set of simulations, where I choose a number of different variances for the residual error and a number of different variances for the scale of the parameters. This is for the same 200-regressor by 512-row problem from the previous blog post.
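The setup described above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions, not the code behind the graphs: the design is assumed to be a random two-level (+1/-1) matrix, the holdback is assumed to be 128 random rows (so that 384 remain for fitting), the grid of sigma values is hypothetical, and only the full 200-regressor fit is scored rather than every step of the sequence. The function names are made up for illustration.

```python
import numpy as np


def holdout_r2(y_holdback, y_pred):
    """R-Square on held-back rows: 1 - SSE/SST.

    Assumption: SST is taken around the holdback mean. The value goes
    negative whenever the fit predicts worse than that mean does.
    """
    sse = np.sum((y_holdback - y_pred) ** 2)
    sst = np.sum((y_holdback - np.mean(y_holdback)) ** 2)
    return 1.0 - sse / sst


def simulate_holdout_r2(sigma_error, sigma_beta, n_rows=512, n_regressors=200,
                        n_holdback=128, seed=0):
    """Simulate one data set, fit all regressors, score the holdback."""
    rng = np.random.default_rng(seed)
    # Assumption: a random +1/-1 design stands in for the real experimental design.
    X = rng.choice([-1.0, 1.0], size=(n_rows, n_regressors))
    beta = rng.normal(0.0, sigma_beta, size=n_regressors)
    y = X @ beta + rng.normal(0.0, sigma_error, size=n_rows)

    # Assumption: a purely random holdback of n_holdback rows.
    held = np.zeros(n_rows, dtype=bool)
    held[rng.choice(n_rows, size=n_holdback, replace=False)] = True

    # Least-squares fit with an intercept on the rows that were not held back.
    Z = np.column_stack([np.ones(n_rows), X])
    coef, *_ = np.linalg.lstsq(Z[~held], y[~held], rcond=None)
    return holdout_r2(y[held], Z[held] @ coef)


if __name__ == "__main__":
    sigmas = (1, 2, 4, 8, 16)  # hypothetical grid of standard deviations
    for sigma_beta in sigmas:
        row = [simulate_holdout_r2(se, sigma_beta) for se in sigmas]
        print(f"sigma_beta={sigma_beta:>2}: " + "  ".join(f"{r:6.3f}" for r in row))
```

Each printed row corresponds to one coefficient scale and each column to one error standard deviation, which mirrors the layout of the graphs.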
In the graphs of crossvalidation R-Square below, the columns are the different error standard deviations, and the rows are the standard deviations used to generate the model coefficients.
Now we see that the negative R-Squares occur only in the upper right, where the error standard deviation is large (16) and the coefficient scale is small (standard deviation 1). All the other combinations behave pretty well. In fact, their R-Squares rise consistently as the model captures more of the support from all the variables. You see five simulations here, with the generated errors and coefficients different for each track.
That is a reassuring picture. Unfortunately, it is not the picture that actually comes out when you do a random holdback from the experimental data. Here is the actual picture:
What a mess! There are nine tracks of crossvalidation R-Square as the fit sequences, adding regressors up to all 200. There are three different holdback selections, identified by the three colors, and three repeats of the same holdback using different simulations of the error and coefficients. Notice especially how misbehaved the blue tracks are. That was the first holdback, the one I used in my last post. The blue tracks not only go bad for low coefficient scale and high error (upper right), but some go bad for large coefficients and small error. Notice the cell for column Sigma Error 2 and row Sigma Beta 8: it has one track with R-Square near zero for most of the stepping to 200 regressors.
But notice that the red and green tracks are much more reasonable, behaving more like the tracks in the first graph.
It turns out that the blue holdback must have really damaged the experimental design; it took away enough important points that important corners of the experimental region were left unsupported, and thus the estimates were poorly determined.
The lesson here is that you shouldn’t just take a random holdback from an experimental design that is fairly thin on data (200 variables supported by 384 rows). The holdback that was used in the first picture was carefully selected, basically by making another factor in the design and using it to select the holdback. This ensured that the remaining rows still supported the rest of the variables well.
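One rough way to see how much a candidate holdback hurts the design is to look at how well the remaining rows pin down the coefficients. The sketch below is not the extra-factor method described above; it is a swapped-in diagnostic that scores each random holdback by the average diagonal of (Z'Z)^-1 for the remaining rows, which is the average variance of the coefficient estimates per unit of error variance. The function names, the random +1/-1 design, and the 128-row holdback are all assumptions for illustration.

```python
import numpy as np


def mean_coef_variance(X_train):
    """Average diagonal of (Z'Z)^-1, where Z is X_train plus an intercept.

    This is the average variance of the coefficient estimates per unit of
    error variance. np.linalg.inv raises LinAlgError if the remaining rows
    cannot support all the terms (Z'Z singular).
    """
    Z = np.column_stack([np.ones(len(X_train)), X_train])
    return float(np.trace(np.linalg.inv(Z.T @ Z)) / Z.shape[1])


def screen_holdbacks(X, n_holdback, n_candidates, seed=0):
    """Score several random holdbacks; the least damaging one comes first."""
    rng = np.random.default_rng(seed)
    scored = []
    for _ in range(n_candidates):
        holdback = rng.choice(len(X), size=n_holdback, replace=False)
        keep = np.ones(len(X), dtype=bool)
        keep[holdback] = False
        scored.append((mean_coef_variance(X[keep]), holdback))
    # Sort on the score only, so the index arrays are never compared.
    scored.sort(key=lambda pair: pair[0])
    return scored


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Assumption: a random +1/-1 design stands in for the real one.
    X = rng.choice([-1.0, 1.0], size=(512, 200))
    for score, _ in screen_holdbacks(X, n_holdback=128, n_candidates=5):
        print(f"average relative variance of estimates: {score:.5f}")
```

For columns coded +1/-1, the score can never fall below 1 divided by the number of remaining rows, and it reaches that bound only when the remaining columns are mutually orthogonal. A holdback whose score sits far above that bound has taken away rows the design needed.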
So Goldilocks may have seen Papa Bear’s big model and discovered that it was not so bad after all. With big effects and reasonable error variance, models are going to be worth fitting to the data, even large ones, as long as you don’t mess up your estimates by a poor choice of validation sample.