I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how you can compute Mahalanobis distance.
Distance in standard units
In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. Often "scale" means "standard deviation." For univariate data, we say that an observation that is one standard deviation from the mean is closer to the mean than an observation that is three standard deviations away. (You can also specify the distance between two observations by specifying how many standard deviations apart they are.)
For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability. Specifically, it is more likely to observe an observation that is about one standard deviation from the mean than it is to observe one that is several standard deviations away. Why? Because the probability density function is higher near the mean and nearly zero as you move many standard deviations away.
For normally distributed data, you can specify the distance from the mean by computing the so-called z-score. For a value x, the z-score of x is the quantity z = (x-μ)/σ, where μ is the population mean and σ is the population standard deviation. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.
Distance is not always what it seems
You can generalize these ideas to the multivariate normal distribution. The following graph shows simulated bivariate normal data that is overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses are further away, such as the 90% prediction ellipse.
In the graph, two observations are displayed by using red stars as markers. The first observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.)
The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2, respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is.
Notice the position of the two observations relative to the ellipses. The point (0,2) is located at the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4,0) than to observe one near (0,2). The probability density is higher near (4,0) than it is near (0,2).
In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation." You can use the bivariate probability contours to compare distances to the bivariate mean. A point p is closer than a point q if the contour that contains p is nested within the contour that contains q.
Defining the Mahalanobis distance
You can use the probability contours to define the Mahalanobis distance. The Mahalanobis distance has the following properties:
- It accounts for the fact that the variances in each direction are different.
- It accounts for the covariance between variables.
- It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.
For univariate normal data, the univariate z-score standardizes the distribution (so that it has mean 0 and unit variance) and gives a dimensionless quantity that specifies the distance from an observation to the mean in terms of the scale of the data. For multivariate normal data with mean μ and covariance matrix Σ, you can decorrelate the variables and standardize the distribution by applying the Cholesky transformation z = L-1(x - μ), where L is the Cholesky factor of Σ, Σ=LLT.
After transforming the data, you can compute the standard Euclidian distance from the point z to the origin. In order to get rid of square roots, I'll compute the square of the Euclidean distance, which is dist2(z,0) = zTz. This measures how far from the origin a point is, and it is the multivariate generalization of a z-score.
You can rewrite zTz in terms of the original correlated variables. The squared distance Mahal2(x,μ) is
= zT z
= (L-1(x - μ))T (L-1(x - μ))
= (x - μ)T (LLT)-1 (x - μ)
= (x - μ)T Σ -1 (x - μ)
The last formula is the definition of the squared Mahalanobis distance. The derivation uses several matrix identities such as (AB)T = BTAT,
(AB)-1 = B-1A-1, and (A-1)T = (AT)-1. Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ.
The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data. In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.

28 Comments
Thanks! I was reading about clustering recently and there was a little bit about how to calculate the mahalanobis distance, but this provides a much more intuitive feel for what it actually *means*.
is this better than wikipedia?
definitely..
Great entry! Thanks.
Great Article. Thanks for your effort
sir, I have calculate MD of 20 vectors each having 9 elements for ex.[1 2 3 3 2 1 2 1 3] using the formula available in the literature. I got 20 values of MD [2.6 10 3 -6.4 9.5 0.4 10.9 10.5 5.8,6.2,17.4,7.4,27.6,24.7,2.6,2.6,2.6,1.75,2.6,2.6]. Actually I wanted to calculate divergence. Can you please help me to understand how to interpret these results and represent graphically.
Thanking you
I suggest you post your question to the discussion forum at https://communities.sas.com/community/support-communities/sas_statistical_procedures and provide a link to your definition of "divergence."
Sir, can you elaborate the relation between Hotelling t-squared distribution and Mahalanobis Distance?
They are closely related. Mahalanobis distance is a way of measuring distance that accounts for correlation between variables. In multivariate hypothesis testing, the Mahalanobis distance is used to construct test statistics. For example, if you have a random sample and you hypothesize that the multivariate mean of the population is mu0, it is natural to consider the Mahalanobis distance between xbar (the sample mean) and mu0. This is an example of a Hotelling T-square statistic. By knowing the sampling distribution of the test statistic, you can determine whether or not it is reasonable to conclude that the data are a random sample from a population with mean mu0.
There are other T-square statistics that arise. For example, there is a T-square statistic for testing whether two groups have the same mean, which is a multivariate generalization of the two-sample t-test. All of the T-square statistics use the Mahalanobis distance to compute the quantities that are being compared.
Sir please explain the difference and the relationships betweeen euclidean and mahalanobis distance
Mahalanobis distance adjusts for correlation. To measure the Mahalanobis distance between two points, you first apply a linear transformation that "uncorrelates" the data, and then you measure the Euclidean distance of the transformed points.
how to determine euclidean distance??
The usual way: the square root of the sum of the squares of the differences between coordinates dist(p,q)=||p-q||.
See http://en.wikipedia.org/wiki/Euclidean_distance. In SAS, you can use PROC DISTANCE to calculate the Euclidean distance.
Don't you mean "like a MULTIVARIATE z-score" in your last sentence. Apologies for the pedantry.
Thomas
Math is a pedantic discipline. I welcome the feedback. I think the sentence is okay because I am comparing the Mahal distance to the concept of a univariate z-score. Therefore it is LIKE a univariate z-score. As you say, I could have written it differently. How about we agree that it is the "multivariate analog of a z-score"?
This is much better than Wikipedia. How did you generate the plot with the prediction ellipses? How did you convert the Mahalanobis distances to P-values? Do you have some sample data and a tutorial somewhere on how to generate the plot with the ellipses?
Hi,
In the context of clustering, lets say k-means, when we want to calculate the distance of a given point from a given cluster which one of the following is suggested:
1. calculate the covariance matrix of the whole data once and use the transformed data with euclidean distance?
or
2. each time we want to calculate the distance of a point from a given cluster, calculate the covariance matrix of that cluster and then compute the distance?
I hope I could convey my question. for I'm working on my project, which is a neuronal data, and I want to compare the result from k-means when euclidean distance is used with k-means when mahalanobis distance is used.
Thanks in advance.
The first option is simpler and assumes that the covaraince is equal for all clusters. The second option assumes that each cluster has it's own covariance. A third option is to consider the "popoled" covariance, which is an average of the covariances for each cluster. It all depends on how you want to model your data. These options are discussed in the documentation for PROC CANDISC and PROC DISCRIM. Look at the Iris example in PROC CANDISC and read about the POOL= option in PROC DISCRIM.
Nice...Thanks for such a nice tutorial
At the end, you take the squared distance to get rid of square roots. Since you had previously put the mahalanobis distance in the context of outlier detection, this reminded me of the least squares method, which seeks to minimize the sum of squared residuals. I've heard the "square" explained variously as a way to put special emphasize on large deviations in single points over small deviations in many, or explained as a way to get a favourable convex property of the minimization problem. Are any of these explanations correct and/or worth keeping in mind when working with the mahalanobis distance?
Yes. In the least squares context, the sum of the squared errors is actually the squared (Euclidean) distance between the observed response (y) and the predicted response (y_hat). In both contexts, we say that a distance is "large" if it is large in any one component (dimension).
Thanks. It is very useful to me. Can use Mahala. distance as z-score feed into probability function ChiSquareDensity to calculate probability?
Thanks.
Kind of. Two common uses for the Mahalanobis distance are
1) For MVN data, the square of the Mahalanobis distance is asymptotically distributed as a chi-square. See the article "Testing Data for Multivariate Normality" for details.
2) You can use Mahalanobis distance to detect multivariate outliers.
In both of these applications, you use the Mahalanobis distance in conjunction with the chi-square distribution function to draw conclusions.
Hi Rick. Sorry for two basic questions. I have read that Mahalanobis distance theoretically requires input data to be Gaussian distributed. Is it just because it possess the inverse of the covariance matrix? Or is there any other reason? Second, it is said this technique is scale-invariant (wikipedia) but my experience is that this might only be possible with Gaussian data and that since real data is generally not Gaussian distributed, scale-variance feature does not hold? Please comment. Thanks
I think these are great questions (and not basic). From a theoretical point of view, MD is just a way of measuring distances. You choose any covariance matrix, and then measure distance by using a weighted sum of squares formula that involves the inverse covariance matrix. (The Euclidean distance is unweighted sum of squares, where the covariance matrix is the identity matrix.) So the definition of MD doesn't even refer to data, Gaussian or otherwise.
If you read my article "Use the Cholesky transformation to uncorrelate variables," you can understand how the MD works.
What makes MD useful is that IF your data are MVN(mu, Sigma) and also you use Sigma in the MD formula, then the MD has the geometric property that it is equivalent to first transforming the data so that they are uncorrelated, and then measuring the Euclidean distance in the transformed space.
So to answer your questions: (1) the MD doesn't require anything of the input data. However, it is a natural way to measure the distance between correlated MVN data. (2) The scale invariance only applies when choosing the covariance matrix. If you change the scale of your variables, then the covariance matrix also changes. If you measure MD by using the new covariance matrix to measure the new (rescaled) data, you get the same answer as if you used the original covariance matrix to measure the original data.
Thanks a lot for your prompt response. Appreciate your posts
Sir, Im trying to develop a calibration model for near infrared analysis, and Im required to plug in a Mahalanobis distance that will be used for prediction of my model, however, im stuck as I dont know where to start, can you give a help on how can i use mahalanobis formula?
If you use SAS software, you can see my article on how to compute Mahalanobis distance in SAS. Other SAS procedures, such as PROC DISCRIM, also use MD. If you do not use SAS, I suggest you ask your research supervisor for more details about how to implement his suggestion.
7 Trackbacks
[...] recently blogged about Mahalanobis distance and what it means geometrically. I also previously showed how Mahalanobis distance can be used to compute outliers in multivariate [...]
[...] to compute the Mahalanobis distance between each point and the origin. The Mahalanobis distance is a standardized distance that takes into account correlations between the [...]
[...] distance has an approximate chi-squared distribution when the data are MVN. See the article "What is Mahalanobis distance?" for an explanation of Mahalanobis distance and its geometric interpretation. I will use a SAS/IML [...]
[...] argument to the EXP function involves the expression d2=(x-μ)TΣ-1(x-μ), where d is the Mahalanobis distance between the multidimensional point x and the mean vector μ. I have previous written a SAS/IML [...]
[...] What is Mahalanobis distance? [...]
[...] What is Mahalanobis distance?: I describe the geometry of Mahalanobis distance, which provides a way to measure distances that takes into account correlations in the data. [...]
[...] areas. There are many ways to define the distance between observations. I have previously written an article that explains Mahalanobis distance, which is used often in multivariate analysis, and I have showed how to compute the Mahalanobis [...]