The Normal approximation to the binomial distribution: How the quantiles compare

Most statistical programmers have seen a graph of a normal distribution that approximates a binomial distribution. The figure is often accompanied by a statement that gives guidelines for when the approximation is valid. For example, if the binomial distribution describes an experiment with n trials and the probability of success for each trial is p, then the quantity np(1-p) must be larger than some cutoff (often 5 is used, but sometimes 10) in order for the approximation to be valid.

The following SAS/IML statements compute the binomial probabilities for n=50 and p=0.8 (so that np(1-p)=8) and for the approximating normal curve, which has mean np and variance np(1-p). These values are plotted by the SGPLOT procedure (SGPLOT statements not shown):

proc iml;
n = 50;  /* number of trials: better approx for 100, 200, etc. */
p = 0.8; /* probability of success for each trial */
mu = n*p;/* classic approximation of binomial pdf by normal pdf */
stddev = sqrt(n*p*(1-p));
xMin = mu - 4*stddev; xMax = mu+4*stddev;
x1 = T(floor(xMin):ceil(xMax)); /* evaluate binomial at integers */
y1 = pdf("Binom", x1, p, n);
x2 = T(do(xMin, xMax, (xMax-xMin)/101));
y2 = pdf("Normal", x2, mu, stddev);
/* write values to SAS data set; plot with SGPLOT */

The approximation looks good. However, the figure suggests that you need to be careful if you use a normal distribution to approximate extreme quantiles (such as 0.01 or 0.99) of the binomial distribution. Although the "middles" of the two distributions agree well, there appears to be les agreement in the tails. Also, recall that the quantiles of a discrete distribution are integers, which provides yet another source of error when approximating a binomial quantile with a normal quantile. This is evident when you overlay the normal CDF on the binomial CDF, as in the following figure:

prob = T( do(0.01, 0.99, 0.005) );
q = quantile("binom", prob, p, n);  /* ALWAYS an integer! */
qNormal = quantile("normal", prob, mu, stddev);
diff = q - qNormal; /* error from approximating binomial quantiles */
/* write values to SAS data set; plot with SGPLOT */

The graph shows that there is considerable error for quantiles near zero and one. Is this important? It can be. For example, if you are creating a funnel plot for proportions the curves on the funnel plot are computed with the quantiles 0.001, 0.025, 0.975, and 0.999. These are extreme quantiles, but they are used to compute the funnel-like control limits, which are the most important feature of the plot. I suspect this is why David Spiegelhalter in his paper "Funnel plots for comparing institutional performance" used a rather complicated formula (in Appendix A.1.1) to interpolate the binomial quantiles in the funnel plot for proportions, rather than using a binomial approximation.

The following graph shows a close-up of the values of the binomial quantiles versus the normal approximation for the extreme quantiles near one. These are the values used to compute the upper control limits in a funnel plot. You can see that the normal approximation exhibits a systematic error, due to differences in the size of the binomial and normal tails.

The lesson is this: even though it is common to use a normal distribution to approximate the binomial, the extreme quantiles of the distributions might not be close. Even when np(1-p) is fairly large, there are still sizeable differences in the values of the extreme quantiles of the distributions.

tags: Data Analysis, Statistical Programming

4 Comments

  1. Posted March 14, 2012 at 9:12 am | Permalink

    Very interesting and good to know.

  2. Bhupinder Farmaha
    Posted March 14, 2012 at 9:23 am | Permalink

    Good information but normally ignored...

  3. Posted March 15, 2012 at 2:11 pm | Permalink

    How much of a difference would the 'continuity correction' make?

    • Posted March 15, 2012 at 2:23 pm | Permalink

      That's a good question and I invite you to pursue it. The standard continuity correction shifts the PDF by 1/2 unit, which will also affect the CDF and quantiles. I played around with the idea briefly, but I think it deserves further exploration.

One Trackback

  1. [...] The computation says that if you ask for M = 2,741 Poisson(0.5) variates, that there is a 99.9% chance that at least 1,000 of them will be positive. (For this example, you could also use the normal approximation to the binomial distribution.) [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>