Learning from data is good, so you would think that bigger studies to fit bigger models is always good. How could it ever be worse to learn more through data? It’s time for a story.
Goldilocks found the home of the research bears and looked into the first lab and found the work of Brother Bear. She found a little study, underpowered and thus unable to find anything significant, so Goldilocks left Brother Bear’s lab to look for better studies. She stepped into the next lab and found the work of Mama Bear. She found a bigger study than Baby Bear's, with some significant factors, but it only whetted Goldilocks' appetite to digest even bigger studies. So she stepped into the next lab and found the work of Papa Bear. Papa Bear’s study had thousands of cases on hundreds of variables and was performed by a careful experimental design. Goldilocks gobbled up the big results from Papa Bear and promptly got deathly sick.
Goldilocks was rushed to the hospital, where it was learned that she was the victim of a poisonous level of overfitting. When the fitted model was studied on an independent crossvalidation sample, the R-Square of the fit was negative! It fit worse than a model that didn’t know any variables, a fit worse than the simple mean.
The statistical pathologist had seen plenty of overfit poisonings, but they were fits to happenstance data. This isn’t supposed to happen in healthy well-designed experiments! But it turns out that size matters. Even in a well-designed experiment, you can overfit.
You don’t believe this fairy tale? Let’s simulate some data and see.
First, we create a fractional factorial experimental design with 200 factors in 512 observations. Then, we create a response that is just random normal. Here is the summary of fit.
The R-Square for the model is .46. We are capturing 46% of the variation in the model with the regressors. But the response is random, so the variation that is captured in the coefficients of the estimated model is random too. We are misled by the R-Square.
But we held back a random selection of 25% of the rows for crossvalidation. What is the R-Square of that? It is negative: -6.26. It is not merely negative, meaning that it the predictions are worse than the mean. The sum of squared errors is more than seven times worse than just fitting a mean! Of course, fitting a mean in this case is the correct model, i.e., just fit an intercept parameter.
The problem is that if we didn’t hold back a crossvalidation sample, we wouldn’t have known that the intercept-only model was the right one. An R-Square of 46% would seem good, and many regressors would test as significant.
This was the NULL case where there was no real effect. Does it improve if there were real effects? We simulated a second example with all effects being real, but with the coefficients of 10 of the 200 being large (around 2) and the rest being small (random with standard deviation .2) with normal random error with standard deviation 10. With plenty of real effects, can we predict well with new data (i.e., on the crossvalidation set)? If you fit all the variables, you still get a negative R-Square for validation. The model fits better when there are real effects, but it still doesn’t fit better than just fitting a mean; in fact, it fits with more than twice the sum of squared errors of fitting the mean!
This is a real model with all real effects in a designed experiment! The only problem is that the error is big and the effects are small, and there are a lot of effects. The effects absorb random variation as well as some of their true effect variation.
With this introduction, we need to consider two questions, first a question of attitude and second a question of criterion.
The first question is whether we should be reducing the model, removing terms that seem to not be significant. If we are doing the study only to test hypotheses, then maybe we shouldn’t be cutting it down. In any testing situation, we expect to end up with 5% of the null-hypothesis tests being falsely significant — that is the way we sized the test, with an alpha level of .05. When we cut down the model to just the significant terms, we still expect 5% of the original null terms to be falsely significant, but now they don’t appear with all the other terms, so we may mislead ourselves into thinking 5% of the new set, not the original set of terms. What used to be an alpha-sized now becomes test selection bias. The correct answer is NO, don’t reduce; reducing the model makes the tests invalid!
But if we don’t cut down the model, we can’t predict well — our predictions will be loaded up with random variation from the estimates. So we have no choice but to cut down. So the answer is also YES, we have to reduce the model if we want to predict.
Thus we have two camps, equally valid, answering differently because they have different goals. One camp is trying to test, the other to predict. In industrial statistics, in data mining and in many other fields, we don’t care about tests; we care about prediction, so we need to reduce our models.
The second question is, given that we need to cut down the model, how do we decide to select a model? The old default in JMP’s Stepwise is to forward select regressors until the effects are no longer significant at some p-value level. Here we see the sequence of p-values — remember that 10 terms have big coefficients, and all the others have small coefficients. It looks like Stepwise gets all the good terms, and at .05 selects about 5% of the weak regressors, as one would expect. The default p-value stopping criterion is .25, meant to ensure that it gets all the good terms. How well is the model predicting across this sequence? What is the validation R-Square at 30 terms? It turns out that it is about 5%, not very good. If you use alpha=.25 to stop, you get 66 terms and a validation R-Square of (negative) -.10, embarrassingly bad.
Here is the sequence of R-Squares with the training data in red, and the crossvalidation data in blue. As you would expect, the crossvalidation stops soon after 10 when it starts getting the weak regressors. The training R-Square can only increase.
A holdback crossvalidation set is the gold standard for determining the size of the model. It measures predictability in a very direct way. If you can afford to hold back data, you should definitely do it!
What if you don’t have the luxury of holding back data? Is there some other way? There are many criteria, but some of the most popular are the “information” criteria, the Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC). You choose the model with the smallest criterion. Here is a plot of the AICc (in red) and BIC (in blue) for this model with 10 strong regressors and 190 weak ones.
BIC will choose smaller models than AIC. In this case, BIC gets it right, choosing 15 to 20 terms, which the crossvalidation shows will be close to the minimum prediction variance. So we recommend BIC, and in the next version of JMP, BIC will be the default stopping rule if there is no holdback crossvalidation set.
Let’s go back to the NULL data with no true regression effects. Below are the corresponding plots for the NULL case response. The Validation R-Square actually thinks you should select four terms, too many, but it is not a suggesting very strongly.
The BIC gets it right, selecting the minimum number of terms, and AICc overselects around 30 terms, where the validation R-Square will be around -.33.
So Goldilocks, stay away from Big Papa Bear models unless you are equipped to cut them down to size. SAS has a whole PROC dedicated to cutting models down to size, PROC GLMSELECT. JMP needs more of these kind of features, which will be coming with its next release.