How to compute Mahalanobis distance in SAS

I recently blogged about Mahalanobis distance and what it means geometrically. I also previously showed how Mahalanobis distance can be used to compute outliers in multivariate data. But how do you compute Mahalanobis distance in SAS?

Computing Mahalanobis distance with built-in SAS procedures and functions

There are several ways to compute the Mahalanobis distances between observations and the sample mean. One way is to compute the leverage statistic by using a regression procedure, and then using a mathematical relationship between the leverage and the Mahalanobis distance. To find the Mahalanobis distance between pairs of points, you can use principal component analysis and the DISTANCE procedure. And, as mentioned in a previous blog, the MCD routine in SAS/IML software provides the classical Mahalanobis distance for a data matrix.

Writing a Mahalanobis distance function

The previous methods all have a disadvantage: they provide the Mahalanobis distance as a consequence of computing something else (regression, principal components, or MCD). However, if your goal is to compute the Mahalanobis distance, it is more efficient to call a function that is designed for that purpose. You can use the SAS/IML language to write such a function.

Rather than just present a Mahalanobis distance function in its final form, I'm going to describe three functions: the naive "first attempt" function, a more efficient alternative, and a final function that is both efficient and can be used in a variety of situations.

Update: As of SAS/IML 12.1, the MAHALANOBIS function is distributed as part of SAS/IML software. It is no longer necessary to implement the function yourself. You can skip the next two sections of this article unless you are interested in the mathematical ideas.

A first attempt: Implement the mathematical formula

One of the advantages of the SAS/IML language is that it enables you to use a natural syntax to implement mathematical formulas that appear in textbooks and journal articles. Given an observation x from a multivariate normal distribution with mean μ and covariance matrix Σ, the squared Mahalanobis distance from x to μ is given by the formula d2 = (x - μ)T Σ -1 (x - μ). You can use this definition to define a function that returns the Mahalanobis distance for a row vector x, given a center vector (usually μ or an estimate of μ) and a covariance matrix:

proc iml;
/* simplest approach: x and center are row vectors */
start mahal1(x, center, cov);
   y = (x - center)`; /* col vector */
   d2 = y` * inv(cov) * y; /* explicit inverse. Not optimal */
   return (sqrt(d2));   
finish;
 
x = {1 0};  /* test it */
center = {1 1};
cov = {9 1, 
       1 1};
md1 = mahal1(x, center, cov);
print md1;

This naive implementation computes the Mahalanobis distance, but it suffers from the following problems:

  1. The function uses the SAS/IML INV function to compute an explicit inverse matrix. However, it is rarely necessary to compute an explicit matrix inverse. (See also the comments to John D. Cook's article "Don’t invert that matrix.") Explicit computations are less stable numerically and are less efficient than matrix factorization methods.
  2. The function as written takes a single observation for x and returns a single distance. The function should be generalized so that you can pass in a matrix for x, where each row of the matrix is an observation. The function should return a vector of distances, one for each row of x.

A second attempt: More efficient and more general

Instead of computing the explicit inverse for the given covariance matrix, you can use the Cholesky decomposition (found by the ROOT function in SAS/IML software) followed by solving a linear system by using the TRISOLV function, as described in my article on the Cholesky transformation. You can compute the distance for multiple rows of x by using a DO loop. A second version of the Mahalanobis function follows:

/* (1) Use ROOT and TRISOLV instead of INV to solve linear system
   (2) Let x be a matrix. Compute distances from row x[i,] to center. */
start mahal2(x, center, cov);
   /* x[i,] and center are row vectors */
   n = nrow(x);
   d2 = j(n,1);               /* allocate vector for distances */
   U = root(cov);      
   /* Want   d2 = v * inv(cov) * v` = v * w  where cov*w = v`
   ==> (U`U)*w = v`            Cholesky decomp of cov
   ==>  First solve U` z = v`  for z,
         then solve  U w = z   for w */
   y = x - center;
   do i = 1 to n;            /* using the DO loop is not optimal */
      v = y[i,];             /* get i_th centered row */
      z = trisolv(2, U, v`); /* solve linear system */
      w = trisolv(1, U, z);
      d2[i,] = v * w;        /* dot product of vectors */
   end;
   return (sqrt(d2));
finish;
 
x = {1 0,  /* test it: x has more than one row */
     0 1,
    -1 0,
     0 -1};
md2 = mahal2(x, center, cov);
print md2;

This is not a bad algorithm, but it can be improved. Currently it computes distances from a bunch of points (the rows of x) to a single point (passed in as center). However, if you want to compute pairwise distances between two sets of points, you would need to call this function multiple times, and each function call would perform the same Cholesky decomposition. An improvement is to allow the center argument to also be a matrix, and have the function return a matrix of distances, where the (i,j)th distance is the Mahalanobis distance between the ith row of x and the jth row of center.

There are two tricks that you can use to make the new implementation of the Mahalanobis distance function as efficient as possible:

  1. The TRISOLV function enables you to solve multiple linear systems in a single call by passing in multiple "right-hand sides." Consequently, you don't actually need the DO loop in the previous section to loop over the rows of x.
  2. It turns out that the squared distance can be written in terms of the diagonal elements of a certain matrix product. In a previous blog, I showed an efficient shortcut for computing the diagonal elements of a matrix product without actually forming the product.
The following SAS/IML module is a final version of a function that computes the pairwise Mahalanobis distance between points. If the second argument is a row vector (such as the mean vector), this algorithm computes the Mahalanobis distance between observations and a single point:

/* 3) Enable pairwise Mahalanobis distances by allowing 2nd arg to be a matrix.
   4) Solve linear system for many right-hand sides at once */
start mahalanobis(x, center, cov);
   /* x[i,] and center[i,] are row vectors */
   n = nrow(x);      
   m = nrow(center);
   d2 = j(n,m);     /* return matrix of pairwise Mahalanobis distances */
   U = root(cov);   /* solve for Cholesky root once */
   do k = 1 to m;   /* for each row of the center matrix */
      v = x - center[k,];    /* v is matrix of right-hand sides */
      z = trisolv(2, U, v`); /* solve n linear systems in one call */
      w = trisolv(1, U, z);
      /* Trick: vecdiag(A*B) = (A#B`)[,+] without computing product */
      d2[,k] = (v # w`)[,+]; /* shortcut: diagonal of matrix product */
   end;
   return (sqrt(d2));
finish;
 
y = {1 1, 0 0, 1 2}; /* x and y are BOTH matrices */
md = mahalanobis(x, y, cov); /* compute pairwise distances between x and y */
print md;

In summary, you can write a function in the SAS/IML language to compute the Mahalanobis distance in SAS. This single function enables you to compute the Mahalanobis distance for three common situations:

  • from each observation to the mean: md = mahalanobis(x, mean(x), cov(x))
  • from each observation to a specific observation: md = mahalanobis(x, MyObs, cov(x))
  • from each observation to all other observations (all possible pairs): p=x; md = mahalanobis(x, p, cov(x))

tags: Efficiency, Matrix Computations, Statistical Programming

13 Comments

  1. K
    Posted March 9, 2012 at 11:55 am | Permalink

    For the mahalnobis example, would it be possible to provide an example of what the dataset should look like, please? Thank you

    • Posted March 9, 2012 at 12:58 pm | Permalink

      To use this code with a SAS data set, the p variables should be in p columns of the data set.
      For example:

      data mydata;
      input x1-x10;
      datalines;
      ...data here...
      ;

      In PROC IML you can then say:

      use mydata;
      read all var _NUM_ into X; /* or specify names of vars */
      center = mean(X);
      cov = cov(X);
      md = mahalanobis(x, center, cov);

      • Kevin
        Posted February 18, 2014 at 4:52 pm | Permalink

        I have 8 variables: x1-x8

        I want to compute the Mahalanobis Distance from a reference point of 7 for each of the variables.

        How do I modify the above code to perform this?

        • Posted February 18, 2014 at 4:54 pm | Permalink

          You can ask SAS-related questions at the SAS Support Communities. See the links in the right-hand sidebar.

  2. Ju
    Posted October 15, 2012 at 1:34 pm | Permalink

    Hey, I want to know if sumone can help me on how to calculate the covariance between two matrices X and Y in proc iml sas. For example
    X = {1 2, 3 4}
    Y = {1 2 , 5 6}
    cov(X,Y)

    Thank you

    • Posted October 16, 2012 at 4:38 pm | Permalink

      You need to explain what you mean. What is cov(X,Y) for your example? Are you asking for the matrix of pairwise covariances for each distinct pair of variables X_i and Y_j (which results in a 2x2 matrix) or some kind of 4x4 block covariance matrix?

      • Ju
        Posted October 20, 2012 at 3:27 am | Permalink

        Good day, I have actually obtain the correct answer after looking through several code you have posted around the web. Thanks I lot.
        /*Calculate the Covariance*/
        start covar(x,y);
        n_sim = max(Ncol(x),Nrow(x));
        ans = (sum(x#y) - (sum(x)*sum(y))/n_sim)/n_sim;
        return(ans);
        finish;

        • Posted October 21, 2012 at 6:03 am | Permalink

          This isn't any covariance that I am familiar with. It returns a scalar, rather than a matrix. Furthermore,
          if X=Y=Identity(n) [the nxn identity matrix] the "Ju Covariance" of an identity matrix with itself is 0. I suggest that you ask help from a local statistician at your workplace or school.

  3. Maz
    Posted September 4, 2013 at 4:45 am | Permalink

    Hi!

    I really need your help. Say I have multiple-group data (group=4, number of variables=4, where each group has 5 observation). I am using MD formula [D^2=(x1bar-x2bar)'S(inverse)(x1bar-x2bar)] for calculation. My intention was; I wanted to rank the variables based on the D^2 (as above formula). I have calculated the D^2, and the table of D^2 that I got describes the distance for groups pairs. The question is how can I find the D^2 for each variable so that I can rank the variables? Please help me:) Thank you very much.

  4. Mark Lundin
    Posted March 25, 2014 at 3:07 pm | Permalink

    Dr. Wicklin. Thank you for your work, which I have found to be an exceptional resource.
    A practical question: Would you care to comment on the possibility of computing the MD or MD^2 in the presence of a covariance matrix (and also invereted covariance matrix) which is not positive definite? Some MD^2's could be negative as a result. Assuming the original covariance matrix did not become non positive definite because of sampling problems (id est, those negative values in the covariance matrix have true meaning), is it reasonable to attempt an interpretation of a negative MD^2, or the magnitude of negativity? Should just those negative MD^2 points be discounted, or are ALL results coming from a non positive definite covaraince matrix invalid to begin with?

    • Posted March 26, 2014 at 6:25 pm | Permalink

      I think it depends on your application. Mathematically, a non-PD matrix does not generate a metric and the resulting formula does not satisfy the axioms of a distance. At the same time, in factor analysis, you often estimate a matrix that might not be PD so this case does arise in practice. In that case the estimate of the "Total variance" can be greater than 1. See the doc for PROC FACTOR or this page: http://support.sas.com/documentation/cdl/en/imlsug/63546/HTML/default/viewer.htm#imlsug_ugmultfactor_sect001.htm
      I am not an expert on factor analysis, but it seems to me that the problem is that the ESTIMATES are not PD and so statisticians try to jump through loops to make sense of something that is (I think) inherently nonsensical.

      • Mark Lundin
        Posted March 26, 2014 at 7:17 pm | Permalink

        Thank you for taking the time to think aobut my question. You confirm my own instinct, it depends on the application.
        The subject comes up in finance, for example with the application of mean-variance portfolio theory in constructing an "efficient frontier". See, for example:
        http://epublications.bond.edu.au/cgi/viewcontent.cgi?article=1078&context=ejsie
        The same portfolio covariance matrix in that application, containing some negative price-change (return) covariance (which does occur in practice, and often results in a non PD matrix) leads to my question with respect to the MD^2.

4 Trackbacks

  1. [...] Uses the Mahalanobis module to compute the Mahalanobis distance between each point and the origin. The Mahalanobis distance is a standardized distance that takes into account correlations between the variables. [...]

  2. [...] distance between the multidimensional point x and the mean vector μ. I have previous written a SAS/IML function that computes the Mahalanobis distance in SAS, so it is simple to construct a SAS/IML function that computes the multivariate normal PDF for any [...]

  3. [...] concepts, including bivariate normal cumulative distribution, inverse hyperbolic functions and Mahalanobis distance ... to name a [...]

  4. [...] that explains Mahalanobis distance, which is used often in multivariate analysis, and I have showed how to compute the Mahalanobis distance in SAS. Today's article is simpler: how do you compute the usual Euclidean distance in SAS? Recall that [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>