What is Mahalanobis distance?

I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how you can compute Mahalanobis distance.

Distance in standard units

In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. Often "scale" means "standard deviation." For univariate data, we say that an observation that is one standard deviation from the mean is closer to the mean than an observation that is three standard deviations away. (You can also specify the distance between two observations by specifying how many standard deviations apart they are.)

For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability. Specifically, you are more likely to observe a value that is about one standard deviation from the mean than one that is several standard deviations away. Why? Because the probability density function is high near the mean and falls to nearly zero as you move many standard deviations away.

For normally distributed data, you can specify the distance from the mean by computing the so-called z-score. For a value x, the z-score of x is the quantity z = (x-μ)/σ, where μ is the population mean and σ is the population standard deviation. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.
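
To make the idea concrete, here is a minimal SAS/IML sketch (with made-up values for μ and σ) that converts a few observations into z-scores:

    proc iml;
    x     = {8, 10, 13, 16};      /* observations */
    mu    = 10;                   /* population mean (assumed known) */
    sigma = 2;                    /* population standard deviation (assumed known) */
    z = (x - mu) / sigma;         /* number of standard deviations from the mean */
    print x z;
    quit;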

Distance is not always what it seems

You can generalize these ideas to the multivariate normal distribution. The following graph shows simulated bivariate normal data that is overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses that are farther away, such as the 90% prediction ellipse.
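
The exact covariance matrix behind the simulated data is not stated here, but you can produce a similar graph with a DATA step and the ELLIPSE statement in PROC SGPLOT. The following sketch assumes a diagonal covariance in which the X variable has standard deviation 3 and the Y variable has standard deviation 1:

    data BivNormal;
       call streaminit(4321);
       do i = 1 to 1000;
          x = 3*rand("Normal");        /* assumed: standard deviation 3 in X */
          y =   rand("Normal");        /* assumed: standard deviation 1 in Y */
          output;
       end;
       drop i;
    run;

    proc sgplot data=BivNormal noautolegend;
       scatter x=x y=y;
       ellipse x=x y=y / type=predicted alpha=0.1;   /* 90% prediction ellipse */
       ellipse x=x y=y / type=predicted alpha=0.5;   /* 50% prediction ellipse */
       ellipse x=x y=y / type=predicted alpha=0.9;   /* 10% prediction ellipse */
    run;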

In the graph, two observations are displayed by using red stars as markers. The first observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.)

The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2, respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is.

Notice the position of the two observations relative to the ellipses. The point (0,2) is located at the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4,0) than to observe one near (0,2). The probability density is higher near (4,0) than it is near (0,2).

In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation." You can use the bivariate probability contours to compare distances to the bivariate mean. A point p is closer than a point q if the contour that contains p is nested within the contour that contains q.
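
To make the comparison concrete, the following SAS/IML sketch uses the squared-distance formula that is derived in the next section to compute the distance of each marked point from the origin and the level of the prediction ellipse that passes through it. The covariance matrix below is an assumption (the one used for the graph is not stated), so the printed levels only roughly match the ellipses in the picture:

    proc iml;
    Sigma = {9 0,
             0 1};                           /* assumed covariance: more variance in X */
    p1 = {4, 0};   p2 = {0, 2};              /* the two marked points */
    d2_1 = p1` * inv(Sigma) * p1;            /* squared distance from the origin */
    d2_2 = p2` * inv(Sigma) * p2;
    level_1 = cdf("ChiSquare", d2_1, 2);     /* prediction ellipse through p1 */
    level_2 = cdf("ChiSquare", d2_2, 2);     /* prediction ellipse through p2 */
    print d2_1 d2_2 level_1 level_2;         /* p1 has the smaller distance and level */
    quit;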

Defining the Mahalanobis distance

You can use the probability contours to define the Mahalanobis distance. The Mahalanobis distance has the following properties:

  • It accounts for the fact that the variances in each direction are different.
  • It accounts for the covariance between variables.
  • It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.

For univariate normal data, the univariate z-score standardizes the distribution (so that it has mean 0 and unit variance) and gives a dimensionless quantity that specifies the distance from an observation to the mean in terms of the scale of the data. For multivariate normal data with mean μ and covariance matrix Σ, you can decorrelate the variables and standardize the distribution by applying the Cholesky transformation z = L⁻¹(x - μ), where L is the Cholesky factor of Σ, that is, Σ = LLᵀ.

After transforming the data, you can compute the standard Euclidean distance from the point z to the origin. In order to get rid of square roots, I'll compute the square of the Euclidean distance, which is dist²(z,0) = zᵀz. This measures how far from the origin a point is, and it is the multivariate generalization of a z-score.

You can rewrite zᵀz in terms of the original correlated variables. The squared distance Mahal²(x,μ) is
= zᵀ z
= (L⁻¹(x - μ))ᵀ (L⁻¹(x - μ))
= (x - μ)ᵀ (LLᵀ)⁻¹ (x - μ)
= (x - μ)ᵀ Σ⁻¹ (x - μ)
The last formula is the definition of the squared Mahalanobis distance. The derivation uses several matrix identities such as (AB)ᵀ = BᵀAᵀ, (AB)⁻¹ = B⁻¹A⁻¹, and (A⁻¹)ᵀ = (Aᵀ)⁻¹. Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ.
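
As a quick check of the algebra, the following SAS/IML sketch (with an assumed covariance matrix) computes the squared distance both ways: directly from Σ⁻¹ and through the Cholesky transformation. The two results agree:

    proc iml;
    Sigma = {9 0,
             0 1};                   /* assumed covariance matrix */
    mu = {0, 0};                     /* mean vector */
    x  = {4, 0};                     /* observation */

    /* direct formula: (x-mu)` * inv(Sigma) * (x-mu) */
    d2_direct = (x - mu)` * inv(Sigma) * (x - mu);

    /* Cholesky route: Sigma = L*L`, z = inv(L)*(x-mu), squared distance = z`z */
    U = root(Sigma);                 /* ROOT returns upper triangular U with Sigma = U`*U */
    L = U`;                          /* lower triangular Cholesky factor */
    z = solve(L, x - mu);            /* z = inv(L)*(x - mu) without forming the inverse */
    d2_chol = z` * z;

    print d2_direct d2_chol;         /* both equal 16/9 */
    quit;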

The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data. In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.

tags: Data Analysis, Statistical Thinking

49 Comments

  1. Posted March 8, 2012 at 8:43 am | Permalink

    Thanks! I was reading about clustering recently and there was a little bit about how to calculate the mahalanobis distance, but this provides a much more intuitive feel for what it actually *means*.

    • Posted September 13, 2012 at 5:21 pm | Permalink

      is this better than wikipedia?

      • Annie Chas
        Posted February 25, 2013 at 10:54 am | Permalink

        definitely..

  2. Andrea
    Posted March 23, 2012 at 9:22 am | Permalink

    Great entry! Thanks.

  3. Sumit Adhikari
    Posted June 29, 2012 at 1:44 am | Permalink

    Great Article. Thanks for your effort

  4. murali ambekar
    Posted July 5, 2012 at 6:22 am | Permalink

    Sir, I have calculated the MD of 20 vectors, each having 9 elements (for example, [1 2 3 3 2 1 2 1 3]), using the formula available in the literature. I got 20 values of MD: [2.6, 10, 3, -6.4, 9.5, 0.4, 10.9, 10.5, 5.8, 6.2, 17.4, 7.4, 27.6, 24.7, 2.6, 2.6, 2.6, 1.75, 2.6, 2.6]. Actually, I wanted to calculate divergence. Can you please help me understand how to interpret these results and represent them graphically?
    Thanking you

  5. Posted September 13, 2012 at 5:25 pm | Permalink

    Sir, can you elaborate the relation between Hotelling t-squared distribution and Mahalanobis Distance?

    • Posted September 14, 2012 at 8:52 am | Permalink

      They are closely related. Mahalanobis distance is a way of measuring distance that accounts for correlation between variables. In multivariate hypothesis testing, the Mahalanobis distance is used to construct test statistics. For example, if you have a random sample and you hypothesize that the multivariate mean of the population is mu0, it is natural to consider the Mahalanobis distance between xbar (the sample mean) and mu0. This is an example of a Hotelling T-square statistic. By knowing the sampling distribution of the test statistic, you can determine whether or not it is reasonable to conclude that the data are a random sample from a population with mean mu0.

      There are other T-square statistics that arise. For example, there is a T-square statistic for testing whether two groups have the same mean, which is a multivariate generalization of the two-sample t-test. All of the T-square statistics use the Mahalanobis distance to compute the quantities that are being compared.
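
      A minimal SAS/IML sketch of the relationship (the data below are simulated for illustration): the one-sample Hotelling T-square statistic is n times the squared Mahalanobis distance from the sample mean to the hypothesized mean, where the sample covariance is used in the distance:

          proc iml;
          call randseed(1);
          mu0 = {0 0};                                    /* hypothesized mean */
          X = randnormal(50, {0.3 0.1}, {2 0.5, 0.5 1});  /* simulated sample, n=50 */
          n = nrow(X);
          xbar = X[:,];                                   /* sample mean (row vector) */
          S = cov(X);                                     /* sample covariance */
          d2 = (xbar - mu0) * inv(S) * (xbar - mu0)`;     /* squared Mahalanobis distance */
          T2 = n * d2;                                    /* Hotelling T-square statistic */
          print d2 T2;
          quit;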

  6. vidya
    Posted September 26, 2012 at 6:39 am | Permalink

    Sir, please explain the difference and the relationship between Euclidean and Mahalanobis distance.

    • Posted September 26, 2012 at 8:45 am | Permalink

      Mahalanobis distance adjusts for correlation. To measure the Mahalanobis distance between two points, you first apply a linear transformation that "uncorrelates" the data, and then you measure the Euclidean distance of the transformed points.

  7. Thomas
    Posted November 8, 2012 at 2:36 pm | Permalink

    Don't you mean "like a MULTIVARIATE z-score" in your last sentence? Apologies for the pedantry.

    Thomas

    • Posted November 8, 2012 at 4:01 pm | Permalink

      Math is a pedantic discipline. I welcome the feedback. I think the sentence is okay because I am comparing the Mahal distance to the concept of a univariate z-score. Therefore it is LIKE a univariate z-score. As you say, I could have written it differently. How about we agree that it is the "multivariate analog of a z-score"?

  8. Tommy Carstensen
    Posted November 10, 2012 at 8:38 pm | Permalink

    This is much better than Wikipedia. How did you generate the plot with the prediction ellipses? How did you convert the Mahalanobis distances to P-values? Do you have some sample data and a tutorial somewhere on how to generate the plot with the ellipses?

  9. Ali Fakhraee Seyedabad
    Posted November 13, 2012 at 9:50 am | Permalink

    Hi,
    In the context of clustering, let's say k-means, when we want to calculate the distance of a given point from a given cluster, which one of the following is suggested:
    1. calculate the covariance matrix of the whole data once and use the transformed data with euclidean distance?
    or
    2. each time we want to calculate the distance of a point from a given cluster, calculate the covariance matrix of that cluster and then compute the distance?

    I hope I could convey my question. I'm working on a project with neuronal data, and I want to compare the results from k-means when Euclidean distance is used with k-means when Mahalanobis distance is used.

    Thanks in advance.

    • Posted November 13, 2012 at 10:30 am | Permalink

      The first option is simpler and assumes that the covariance is equal for all clusters. The second option assumes that each cluster has its own covariance. A third option is to consider the "pooled" covariance, which is an average of the covariances for each cluster. It all depends on how you want to model your data. These options are discussed in the documentation for PROC CANDISC and PROC DISCRIM. Look at the Iris example in PROC CANDISC and read about the POOL= option in PROC DISCRIM.

  10. waqas
    Posted November 27, 2012 at 9:53 pm | Permalink

    Nice...Thanks for such a nice tutorial

  11. Tim
    Posted February 26, 2013 at 11:52 am | Permalink

    At the end, you take the squared distance to get rid of square roots. Since you had previously put the Mahalanobis distance in the context of outlier detection, this reminded me of the least squares method, which seeks to minimize the sum of squared residuals. I've heard the "square" explained variously as a way to put special emphasis on large deviations in single points over small deviations in many, or explained as a way to get a favourable convex property of the minimization problem. Are any of these explanations correct and/or worth keeping in mind when working with the Mahalanobis distance?

    • Posted February 26, 2013 at 12:42 pm | Permalink

      Yes. In the least squares context, the sum of the squared errors is actually the squared (Euclidean) distance between the observed response (y) and the predicted response (y_hat). In both contexts, we say that a distance is "large" if it is large in any one component (dimension).

  12. jack
    Posted March 20, 2013 at 6:30 pm | Permalink

    Thanks. It is very useful to me. Can I use the Mahalanobis distance as a z-score and feed it into the probability function ChiSquareDensity to calculate a probability?

    Thanks.

  13. Andy
    Posted May 8, 2013 at 8:38 pm | Permalink

    Hi Rick. Sorry for two basic questions. I have read that Mahalanobis distance theoretically requires the input data to be Gaussian distributed. Is it just because it uses the inverse of the covariance matrix? Or is there any other reason? Second, it is said this technique is scale-invariant (Wikipedia), but my experience is that this might only be possible with Gaussian data, and since real data are generally not Gaussian distributed, the scale-invariance property does not hold? Please comment. Thanks

    • Posted May 9, 2013 at 6:06 am | Permalink

      I think these are great questions (and not basic). From a theoretical point of view, MD is just a way of measuring distances. You choose any covariance matrix, and then measure distance by using a weighted sum of squares formula that involves the inverse covariance matrix. (The Euclidean distance is unweighted sum of squares, where the covariance matrix is the identity matrix.) So the definition of MD doesn't even refer to data, Gaussian or otherwise.

      If you read my article "Use the Cholesky transformation to uncorrelate variables," you can understand how the MD works.
      What makes MD useful is that IF your data are MVN(mu, Sigma) and also you use Sigma in the MD formula, then the MD has the geometric property that it is equivalent to first transforming the data so that they are uncorrelated, and then measuring the Euclidean distance in the transformed space.

      So to answer your questions: (1) the MD doesn't require anything of the input data. However, it is a natural way to measure the distance between correlated MVN data. (2) The scale invariance only applies when choosing the covariance matrix. If you change the scale of your variables, then the covariance matrix also changes. If you measure MD by using the new covariance matrix to measure the new (rescaled) data, you get the same answer as if you used the original covariance matrix to measure the original data.
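
      Here is a small SAS/IML sketch of point (2), with made-up data: rescale one variable, recompute the covariance from the rescaled data, and the Mahalanobis distance is unchanged:

          proc iml;
          call randseed(123);
          X = randnormal(100, {0 0}, {4 1, 1 2});      /* simulated data (assumed values) */
          x0 = X[1,];                                   /* measure distance from this point... */
          c  = X[:,];                                   /* ...to the sample mean */
          S  = cov(X);
          d2 = (x0 - c) * inv(S) * (x0 - c)`;

          Y = X;   Y[,1] = 100 * Y[,1];                 /* change the units of the first variable */
          y0 = Y[1,];   cY = Y[:,];   SY = cov(Y);
          d2Y = (y0 - cY) * inv(SY) * (y0 - cY)`;
          print d2 d2Y;                                 /* the two squared distances are equal */
          quit;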

      • Andy
        Posted May 9, 2013 at 9:38 pm | Permalink

        Thanks a lot for your prompt response. Appreciate your posts

  14. FREDRICK SAMSON
    Posted May 14, 2013 at 3:43 am | Permalink

    Sir, I'm trying to develop a calibration model for near-infrared analysis, and I'm required to plug in a Mahalanobis distance that will be used for prediction in my model. However, I'm stuck because I don't know where to start. Can you give some help on how I can use the Mahalanobis formula?

  15. Geovin George
    Posted June 24, 2013 at 7:11 am | Permalink

    Hello,
    I have read a couple of articles that say if the M-distance value is less than 3.0, then the sample is represented in the calibration model; if the M-distance value is greater than 3.0, this indicates that the sample is not well represented by the model. How did they come up with this limit?

  16. bahman
    Posted July 24, 2013 at 8:01 pm | Permalink

    thank you very much! It made my night! :)
    The funny thing is that the time now is around 4 in the morning and when I started reading I was too asleep. But now, I am quite excited about how great was the idea of mahalanobis distance and how beautiful is it! All this sense is because of your clear and great explanation of the method. Many thanks!

  17. Hasibul haque
    Posted September 25, 2013 at 1:36 pm | Permalink

    How to derive mahalanobis distribution?

    • Posted September 25, 2013 at 2:40 pm | Permalink

      This is a classical result, probably known to Pearson and Mahalanobis. For a modern derivation, see R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis (3rd Ed), 1992, p. 140, which shows that if X is p-dimensional MVN(mu, Sigma), then the squared Mahalanobis distances for X are distributed as chi-square with p degrees of freedom. The result is approximately true (see 160) for a finite sample with estimated mean and covariance provided that n-p is large enough.
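
      For readers asking how to convert distances to probabilities: assuming the data are p-dimensional MVN so that the squared distances follow a chi-square(p) distribution, a minimal SAS/IML sketch (with example values) is:

          proc iml;
          p  = 2;                              /* number of variables (assumed) */
          d2 = {0.5, 1.78, 4, 9.2};            /* example squared Mahalanobis distances */
          prob   = cdf("ChiSquare", d2, p);    /* level of the prediction ellipse through each point */
          pValue = 1 - prob;                   /* chance of an observation at least this far out */
          print d2 prob pValue;
          quit;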

  18. RvE
    Posted November 6, 2013 at 11:37 am | Permalink

    If I compare a cluster of points to itself (so, comparing identical datasets) and the value is, e.g., 2.2, can I say that a point is on average 2.2 standard deviations away from the centroid of the cluster?

    Actually, there is no real mean or centroid determined, right? Mahalanobis distance is only defined on two points, so only pairwise distances are calculated, no?

    • Posted November 6, 2013 at 8:26 pm | Permalink

      You can compute an estimate of multivariate location (mean, centroid, etc) and compute the Mahalanobis distance from observations to that point. The statement "the average Mahalanobis distance from the centroid is 2.2" makes perfect sense.

      • RvE
        Posted November 11, 2013 at 4:18 pm | Permalink

        Thx for the reply. Does this statement make sense after the calculation you describe, or also with e.g. 100 vs. 100 pairwise comparisons? I guess both; only in the latter, the centroid is not calculated, so the statement is not precise...

        • Posted November 11, 2013 at 4:46 pm | Permalink

          I think calculating pairwise MDs makes mathematical sense, but it might not be useful. The MD is a generalization of a z-score. In 1-D, you say z_i = (x_i - mu)/sigma to standardize a set of univariate data, and the standardized distance to the center of the data is d_i = |x_i-mu|/sigma. What you are proposing would be analogous to looking at the pairwise distances d_ij = |x_i - x_j|/sigma.

          • Posted November 12, 2013 at 12:33 pm | Permalink

            Well, I guess there are two different ways to calculate the Mahalanobis distance between two clusters of data like you explain above, but to be sure we are talking about the same thing, I list them below:
            1) you compare each data point from your sample set to the mu and sigma matrices calculated from your reference distribution (although labeling one cluster the sample set and the other the reference distribution may be arbitrary), thereby calculating the distance from each point to this so-called Mahalanobis centroid of the reference distribution.
            or
            2) you compare each data point from matrix Y to each data point of matrix X, with X the reference distribution (mu and sigma are calculated from X only)

            The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2. I actually wonder, when comparing 10 different clusters to a reference matrix X, or to each other, whether the order of the dissimilarities would differ using method 1 or method 2. Also, I can't imagine a situation where one method would be wrong and the other not. Although method one seems more intuitive in some situations.

  19. Posted November 12, 2013 at 12:40 pm | Permalink
  20. Posted November 12, 2013 at 1:12 pm | Permalink

    I tested both methods, and they gave very similar results for me: the ordinal order is preserved, and even the relative difference between cluster dissimilarities seems to be similar for both methods.

  21. Chris
    Posted November 15, 2013 at 11:28 pm | Permalink

    I have seen several papers across very different fields use PCA to reduce a highly correlated set of variables observed for n individuals, extract individual factor scores for components with eigenvalues > 1, and use the factor scores as new, uncorrelated variables in the calculation of a Mahalanobis distance. The purpose of data reduction is two-fold: it identifies relevant commonalities among the raw data variables and gives a better sense of anatomy, and it reduces the number of variables so that the within-sample cov matrices are not singular due to p being greater than n. Is this appropriate?

    I understand that the new PCs are uncorrelated but this is ACROSS populations. The within-population cov matrices should still maintain correlation. Results seem to work out (that is, make sense in the context of the problem) but I have seen little documentation for doing this. I understand from the above that a Euclidean distance using all PCs would be equivalent to the Mahalanobis distance but it sometimes isn't clear that using the PCs with very small eigenvalues is desirable. Thanks

    • Posted November 16, 2013 at 12:18 pm | Permalink

      Since the distance is a sum of squares, the PCA method approximates the distance by using the sum of squares of the first k components, where k < p. Provided that most of the variation is in the first k PCs, the approximation is good, but it is still an approximation, whereas the MD is exact.
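
      A sketch of that statement in SAS/IML (with simulated data): the exact squared MD equals the sum over all PCs of (score)²/λ, and keeping only the first k terms gives the PCA approximation:

          proc iml;
          call randseed(1);
          X = randnormal(200, {0 0 0}, {4 1 1, 1 3 1, 1 1 2});  /* assumed example data */
          c = X[:,];   S = cov(X);
          x0 = X[1,];
          d2 = (x0 - c) * inv(S) * (x0 - c)`;              /* exact squared MD */

          call eigen(lambda, V, S);                        /* S = V*diag(lambda)*V` */
          score = (x0 - c) * V;                            /* PC scores of the point */
          d2_all = sum( score##2 / lambda` );              /* equals d2 exactly */
          d2_k   = sum( (score[,1:2])##2 / lambda[1:2]` ); /* approximation with k=2 */
          print d2 d2_all d2_k;
          quit;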

  22. Deepa Kadam
    Posted November 29, 2013 at 1:09 am | Permalink

    Hi Rick..
    I have a data set of 10,000 observations and 10 parameters, so I have a centroid for each parameter. Now I want to calculate the Mahalanobis distance for each observation and assign a probability. As per my understanding, there are two ways to do so: 1. using principal components, and 2. using the hat matrix. For my scenario I can't use the hat matrix, and I find the principal component method a little tedious.
    Is there any other way to do the same using SAS?
    Need your help..

    Regards,
    Deepa.

  23. ratan srivastava
    Posted July 4, 2014 at 12:48 am | Permalink

    How do you apply the concept of Mahalanobis distance in self-organizing maps?

    • Posted July 5, 2014 at 7:28 am | Permalink

      That's a very broad question. I did an internet search and obtained many results. See if this paper provides the kind of answers you are looking for.

  24. Daniel
    Posted July 15, 2014 at 3:37 am | Permalink

    Suppose I wanted to define an isotropic normal distribution for the point (4,0) in your example for which 2 std devs touch 2 std devs of the plotted distribution. How could I proceed to find the std dev of my new distribution? It seems to be related to the MD.

    • Posted July 15, 2014 at 6:19 am | Permalink

      I don't understand what "touching" means, even in the case of univariate distributions. Do you mean that the centers are 2 (or 4?) MD units apart?

      • Daniel
        Posted July 15, 2014 at 1:49 pm | Permalink

        Thanks, already solved the problem, my hypothesis was correct.

