Friday, July 31, 2009
What Is the Error if the Probability of Rain Is .5 and It Rains?
Suppose that a mortgage aggregator came to you and said that this triple-A assemblage of loans had one chance in a billion of losing money. Then you evaluated the package and found out that it really had a one-in-a-million probability of failing. The error in the probability estimate was just .000001, not much. So with so little error in the claimed probability of failure, you ignore the discrepancy. The problem, of course, is that the expected loss in the latter case is a thousand times greater. Sometimes with small probabilities, little errors become big errors.
So what if we were told that the chance of something was zero -- that the event was impossible? If you believed that, with no reservation, you would bet your life on the certainty. Now suppose that that event claimed to have probability zero actually happens. What should the penalty be for being wrong? Should it be a dollar for getting the probability wrong by 1? Should it be infinity dollars, because it was a lie?
This post is all about accounting systems for events that happen in the face of fitted probabilities that they would not happen. The events are response categories. The categories will be important, e.g., whether someone will buy a product or not, survive a disease or not, make a fraudulent transaction or not, engage in money laundering or not, choose one product versus others, etc. We fit models that attribute the probabilities, and then we need to find out how well these models predict. What is the best way to measure how good a prediction is in this situation?
We have always known pretty well how to measure prediction for continuous responses. We use squared error to measure the fit, estimating to minimize the sum of squared residuals. Least squares is the foundation of most of our fitting arsenal for continuous responses. A measure of fit is the R-square, based on the sum of squared errors.
With categorical responses, it’s not so obvious how to measure how well a model fits. What is the model supposed to do? We can think of predicting either as classification, i.e., picking a category we predict will result, or fitting probabilities so that the actual response is generally associated with a larger probability. For weather, the first approach would be to predict that it will rain; the second approach would be to assert that the probability of precipitation is 90%. For a statistician, the latter expression in terms of probability is preferred, because it expresses the degree of uncertainty.
If you are calculating gains and losses from planning an event and you know the probabilities, you can make decisions that maximize your expected gain. If you are planning an open outdoor concert and it is very expensive to have valuable instruments rained on, you will avoid presenting the concert unless the probability of no rain times the revenue of the event is more than the probability of rain times the lost value in ruining the instruments. If the model only asserted that it will rain or not rain, you won’t be able to calculate your expected gain and make the best decision about whether to go ahead with the outdoor concert.
Consider the following four measures of error. In each case, p is the probability we attribute to the event that actually occurred.
Misclassification error = 1 if p < .5, 0 if p > .5
Absolute error = 1 - p
Squared error = (1 - p)^2
Entropy error = -log(p)
For all these, we calculate the error for each observation and average them, except that for squared error, we take the square root of the mean squared error; so we will call the average measure RMSE (root mean squared error). RMSE is really the standard deviation estimate, but here we divide by n instead of n-1. Remember that the goal here is to fit the model so that p is always close to 1, i.e., we associate a high probability with the outcome that actually occurred.
Here is a graph showing the four measures of error.
[Figure: the four measures of error]
Each measure scores events that happen with fitted probability of 1 as zero. Saying an event is certain, and then the event occurs, has no loss, no error. It is also the case that for all measures, the only way to get a perfect score of zero error is to have a fitted probability of 1 for the events that happen. Obtaining a fitted probability of 1 can be hard, considering the models we often use. For example, in logistic regression the probability is modeled as:
p = 1/(1+exp(-Z))
where Z is a linear model in terms of regression variables. In order to get a probability of 1, Z will have to be infinite.
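As a quick numerical check of the logistic formula p = 1/(1+exp(-Z)), the sketch below evaluates it at a few finite values of Z and prints the entropy error -log(p) alongside. The Z values are arbitrary choices for illustration, and the function name `logistic` is just a label for this sketch, not something from any particular software.

```python
import math

def logistic(z):
    """Fitted probability for a linear predictor z: p = 1/(1+exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative values of the linear predictor Z (an assumption).
z_values = [1, 2, 5, 10, 20]
probs = [logistic(z) for z in z_values]

for z, p in zip(z_values, probs):
    # The entropy error -log(p) shrinks toward zero but stays positive.
    print(f"Z = {z:>2}  p = {p:.12f}  entropy error = {-math.log(p):.3e}")
```

For much larger Z, floating-point rounding will eventually show p as exactly 1, but that is a limit of the arithmetic, not a property of the model.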
Getting a zero is just as hard, needing Z to be -infinity. Of the four error measures, the only one that gives you a good chance of attributed perfection is misclassification error. It makes fitting like taking a pass-fail course in school. Each trial is likely to give you a perfect score of zero, but over many trials, a few misclassifications will creep in and spoil the grade. But of course, the problem with pass-fail courses is that they don’t measure very precisely. Misclassification error doesn’t care if you are slightly good or very good in estimating the probabilities.
In the middle of the range, we have absolute error varying linearly with probability, misclassification error is flat except for the sudden jump at .5, and the other two are increasing at increasing rates.
The squared error and the absolute error are very similar, given that for squared error, you take the square root of the mean of the squared errors. But RMSE and Absolute error differ in the middle. Suppose that you have two situations, one situation where half are all wrong and the other situation where all are half wrong. The first case can be predicting that it will always rain. The second case would be like predicting that the probability of rain is always .5. We suppose that it rains half the time. In the first case, the error in probability is either 1 or zero, each in half of the rows. In the second case, the error in probability is a half in all the rows. If you believe that the two situations are equally bad, then you should like Absolute error. If you believe that the half all wrong is worse than all half wrong, then you should like root mean squared error. With Absolute error, both situations have an average error of 1/2. With RMSE, the error for the first situation is 1/sqrt(2), but for the second it is 1/2.
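The comparison of "half are all wrong" with "all are half wrong" can be checked with a minimal sketch. It takes absolute error as 1 - p and squared error as (1 - p)^2, which is how I read the descriptions above, and it treats p exactly equal to .5 as not misclassified; all three are assumptions of the sketch, as are the function names and the choice of 100 rows.

```python
import math

def misclassification_error(p):
    # 1 if the event that occurred had fitted probability below .5, else 0.
    # Counting p == .5 as "not misclassified" is an assumption of this sketch.
    return 1.0 if p < 0.5 else 0.0

def absolute_error(p):
    return 1.0 - p

def squared_error(p):
    return (1.0 - p) ** 2

def entropy_error(p):
    # -log(p), which is infinite when the occurred event was given p == 0.
    return math.inf if p == 0 else -math.log(p)

def mean(values):
    return sum(values) / len(values)

def summarize(ps):
    """Average each measure over ps, the fitted probabilities of the events
    that actually occurred. Squared error is reported as RMSE (divide by n)."""
    return {
        "misclassification": mean([misclassification_error(p) for p in ps]),
        "absolute": mean([absolute_error(p) for p in ps]),
        "rmse": math.sqrt(mean([squared_error(p) for p in ps])),
        "entropy": mean([entropy_error(p) for p in ps]),
    }

# Situation 1, "half are all wrong": always predict rain and it rains half
# the time, so the occurred event has p = 1 in half the rows and p = 0 in the rest.
half_all_wrong = [1.0, 0.0] * 50
# Situation 2, "all are half wrong": always say the probability of rain is .5.
all_half_wrong = [0.5] * 100

s1 = summarize(half_all_wrong)
s2 = summarize(all_half_wrong)
for name, summary in (("half all wrong", s1), ("all half wrong", s2)):
    print(name, summary)
```

Under these assumptions the absolute error averages 1/2 in both situations, while RMSE is 1/sqrt(2) in the first and 1/2 in the second. The entropy error is infinite in the first situation, because some events that occurred were given probability zero.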
Consider the case of flipping a coin. The RMSE player will always say that the probability is fifty-fifty. That sounds reasonable. But the Absolute player could just as easily say that the probability is always 1 (or always 0), and be just as well off. The coin flip RMSE player just seems more right to me.
Now let’s look at the other end of the scale, where an event is attributed a probability of zero and the event actually occurs. Three of the error measures agree that the error should be 1. The entropy measure says it should be infinity. It is a good thing that logistic regression makes it hard to reach zero, because the cost of making an error there is infinitely great. But this makes sense. When you attribute a probability of zero, you are saying that an event will never happen, that it is impossible. If that event ever happens, it is not just an estimation error. It is a refutation.
I like entropy error most. First, it is precisely the error that we minimize in fitting the logistic regression by maximum likelihood. Maximum likelihood is simply minimizing the sum of the negative logarithms of the probabilities that the model produces for the events that we have in the data. Entropy is a good name for this kind of error because it is the accounting measure for information theory. Taking the log of probabilities makes a lot of sense; just consider doing a binary search among n equally likely items: finding an item with probability 1/n takes log2(n) comparisons. Getting an event with rarity p takes -log2(p) coin flips. Bits of log-probability uncertainty are additive, just as joint probabilities are multiplicative.
I like misclassification error the least. It is a crude measure that doesn’t care about the probabilities we fit. Nevertheless, it is the easiest to understand, being a simple count of predicted categories being wrong.
For each measure of error, we can define an “RSquare” measure.
In least squares regression, this is the percent of variation in the response that is accounted for by the model. Another way is to say that it is 1 minus the ratio of unexplained variation to total variation.
RSquare = 1 - (error in model)/(error around the mean)
If we have no model terms, the estimate for the response will be the mean, so RSquare is 1 minus the error in the fitted model as a proportion of the error in a simpler model that has no regression terms. So if we define it this way, we can have an RSquare for each of the error measures defined above for categorical responses. For example, misclassification RSquare is:
RSquare(misclass.) = 1 - (number misclassified)/(number of less-common responses)
This follows because the reduced model with no regressors will give a constant probability, and the level with the highest probability will have an error of zero, and the other level an error of 1.
The entropy version of RSquare is a by-product of the estimation process:
RSquare = 1 - (-loglikelihood for full model)/(-loglikelihood for reduced model)
McFadden (1973) calls the entropy RSquare the Likelihood Ratio Index. The RMSE RSquare also seems to have merit. Efron (1978) calls this the Pseudo-RSquare.
So which measure should our software report? In my product group, we have tended to just report the entropy measure, because that is what we are minimizing. But since other software reports results with other measures, we now feel that we should report all four of these measures, so this is what the report will look like in the future:
[Figure: report listing the error average and RSquare for each measure]
We report both the error average and the RSquare. For the misclassification measure, people never like the RSquare scaling. They want the misclassification rate, which in this case is .1203, i.e., 12% misclassification. For the other measures, RSquare is a more natural measure. However, there are some downsides to this measure in categorical models.
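To make the RSquare definitions above concrete, here is a minimal sketch that computes one RSquare per error measure for a two-level response. The data are made up for illustration, and the function names are hypothetical. The reduced model is taken to be a constant probability equal to the observed rate of the response, and the squared-error version uses the ratio of sums of squared errors (as in least squares) rather than a ratio of RMSEs; both are assumptions of the sketch. It also assumes fitted probabilities strictly between 0 and 1 and an observed rate other than 0, .5, or 1.

```python
import math

def occurred_probs(y, p1):
    """Probability attributed to the event that actually occurred in each row.
    y holds 0/1 outcomes and p1 the fitted probability of outcome 1."""
    return [p if yi == 1 else 1.0 - p for yi, p in zip(y, p1)]

def total_errors(ps):
    """Sum each error measure over rows; ps are probabilities of occurred events."""
    return {
        "misclassification": sum(1.0 for p in ps if p < 0.5),
        "absolute": sum(1.0 - p for p in ps),
        "squared": sum((1.0 - p) ** 2 for p in ps),
        "entropy": sum(-math.log(p) for p in ps),  # negative log-likelihood
    }

def rsquares(y, p1):
    """RSquare = 1 - (error in model)/(error in reduced model), per measure."""
    base_rate = sum(y) / len(y)  # constant probability of the reduced model
    full = total_errors(occurred_probs(y, p1))
    reduced = total_errors(occurred_probs(y, [base_rate] * len(y)))
    return {name: 1.0 - full[name] / reduced[name] for name in full}

# Hypothetical outcomes and fitted probabilities, not taken from any real report.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p1 = [0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.1, 0.2, 0.1]

result = rsquares(y, p1)
for name, value in result.items():
    print(f"RSquare ({name}): {value:.4f}")
```

In this made-up example the reduced model misclassifies the three less-common responses and the fitted model misclassifies two rows, so the misclassification RSquare is 1 - 2/3.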
First, it is hard to get a high RSquare, even if your model is very good. People can get used to seeing RSquare of .9 for continuous responses, and then see a much lower RSquare for categorical models. (Maddala [1983] recommends using the un-logged likelihood and taking the (n/2)th root of the ratio, preserving the 0 to 1 scale, but making RSquares closer to 1.)
If you have lots of data, you should hold back some to get a crossvalidation estimate of the error:
[Figure: report with Training and Validation measures]
The report above contains two pairs of measures, one pair for Training, the other for Validation. The validation measures are run against a held-back set of data that is not used to estimate the model. In this way, we see how well the model performs against new data that is not in the “training” set used to make the estimates. A future blog post will consider validation in detail.
We also have to address the zero problem in a future blog entry. With contingency table estimates of probabilities (or in degenerate logistic models), probabilities can go to zero, but validation data can contain these zero-probability events, thus spoiling the calculation with infinities. Solving this problem is harder, but there are some good approaches.
And I have neglected to describe one of the most important measures, sorting efficiency, which is developed in ROC curves and lift curves. But enough for now.
References
Long [1997] Regression Models for Categorical and Limited Dependent Variables, Sage, 102-109.
Maddala [1983] Limited-Dependent and Qualitative Variables in Econometrics, Cambridge University Press.
Efron [1978] “Regression and ANOVA with zero-one data: Measures of residual variation,” JASA 73, 113-121.
McFadden [1973] “Conditional logit analysis of qualitative choice behavior,” in Zarembka, Frontiers of Econometrics, 105-142. Academic Press.
ABOUT THIS BLOG John Sall is a co-founder and Executive Vice President of SAS. He leads the JMP business division, which creates interactive and highly visual data analysis software for the desktop and provides a visual interface to SAS.