On the median of the chi-square distribution

6

I was at the Wikipedia site the other day, looking up properties of the Chi-square distribution. I noticed that the formula for the median of the chi-square distribution with d degrees of freedom is given as ≈ d(1-2/(9d))3. However, there is no mention of how well this formula approximates the true value of the median, nor is there a reference. (Please post a comment if you know where this formula is derived.)

The formula is a model for the value of the median. As such, it contains modeling error. By using the SAS/IML language or the SAS DATA step, you can compare this model to a direct numerical computation of the median.

As John D. Cook discussed, the error introduced by a model (the formula) can be orders of magnitude greater than the approximation error from a finite precision numerical computation. Let's compare the formula to a numerical computation of the median for a range of degrees of freedom. The median is the 0.5 quantile, so you can use the QUANTILE function to compute the "true" median. ("True" is in quotes because the computation is itself an approximation, but the approximation error should be small.) The following program computes the difference between the median values of the chi-square distribution as returned by the QUANTILE function and as computed by using the Wikipedia formula:

proc iml;
d = do(0.1,0.3,0.005) || do(0.4, 6, 0.1); /* range of degree of freedom */
q = quantile("chisquare", 0.5, d); /* "true" computation */
formula = d#(1-2/(9*d))##3;
diff = q-formula;

I've plotted the results. In the top graph, the "true" and approximate medians look identical except when the degree of freedom, d, is very small (for example, d < 0.125). However, there is a general rule to follow when you are trying to understand how two similar quantities differ: Do not plot the quantities themselves (the top graph), but rather plot the difference between the quantities (the bottom graph).

When you plot the difference, you get a better sense for the approximation. The formula appears to be biased for d > 1, since the formula is always less than the true median.

Is the formula of any value? Yes. It is a model, and models are good for showing us the big picture. The formula shows that the median increases roughly linearly with the degree of freedom. The formula is also useful for understanding the asymptotic value of the median: For a large degree of freedom, d, the formula indicates that the median is approximately d - 2/3. (I've used the approximation (1-x)3 ≈ (1-3x) for small x.)

Incidentally, the formula is a Laurent polynomial, which means that it includes both positive and negative powers of d. Laurent polynomials can be useful in numerical analysis, but you don't see them as often as their better-known cousin, the Taylor polynomial. A Laurent polynomial can capture the behavior of a function near zero (where Taylor polynomials also succeed) and far from zero (where Taylor polynomials often fail).

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

6 Comments

  1. Dear Rick Wicklin,

    Thanks for posting useful info about SAS/IML on this weblog and sorry that this comment in not related to this post.

    I have been looking for a command in SAS/IML which finds the location of the matched values in two different vectors.
    I came across this website: http://stackoverflow.com/questions/4952548/value-matching-in-sas-iml where the same question was asked and you suggested using XSECT function to find the intersection of the two sets. However, XSECT function gives the common elements between two sets rather that giving the location of those elements.

    For example, in the following code:

    proc iml;
    x={10 11 12 13 12 13 14 15};
    v={12 13};
    z=xsect(x,v);
    print z;
    quit;

    the SAS results are: Z= {12 13}, but I’m looking for a vector like {3 4 5 6} which is the location of common elements of vector V in vector X.

    Would you please provide me with the appropriate function in SAS/IML?
    Thank you very much.

  2. Dear Rick Wicklin,

    You ask if anybody knows where the approximation ≈ d(1-2/(9d))3 comes from. I looked up chi-square in the bible of functions, Abramowitz and Stegun, and stumbled upon formula 26.4.14, which states that if d>30, then the transformed variable x_2 = ( (x/d)^(1/3) –(1-2/9d) ) / \sqrt(2/9d) is standard normal distributed. And when that’s the case, the median Is X_2=0 and hence the above approximation.

    This approximation is a statement about the entire distribution; however from your graphs it seems safe to apply the median approximation for even lower degrees of freedom.

    (Where Abramowitz and Stegun have their result from I have no idea.)

    • Rick Wicklin

      Interesting. Thanks!

      While trying to find your formula, I stumbled upon 26.2.49, which approximates the inverse CDF (quantile function) of an ARBITRARY distribution in terms of Hermite polynomials! It's amazing what's in that tome.

  3. Dear Rick,
    About the approximation for the median of a chi-square distribution that you recommend, (1-x)^3, I just wanted to ask you:
    1. For which values of x does it apply? I guess that just for x<1, but please let me know if I'm wrong.
    2. Is there a published paper (yours of from someone else) that stands after such an approximation?. This is because I would like to use such a formula in a document that I'm working in, but I would like to have a formal reference about it.

    Thanks in advance for your kind repply,

    • Rick Wicklin

      I think you misread the article. I did not claim that the median is approximately (1-x)^3. What I said is that the median of a chi-sq(d) distribution is approximately d(1-2/(9d))^3 when d > 0.125. I also said that the median is approximately d - 2/3 for sufficiently large values of d. "Sufficiently large" means that d >> 2/9 (which is read "much much greater than 2/9"). Empirically, d > 1 is probably good enough.

      Regarding citation, you can cite this blog as
      Wicklin, Rick (2011) "On the median of the chi-square distribution," published in The DO Loop blog, 09NOV2011.
      URL: https://blogs.sas.com/content/iml/2011/11/09/on-the-median-of-the-chi-square-distribution.html

Leave A Reply

Back to Top