Detecting outliers in SAS: Part 3: Multivariate location and scatter

In two previous blog posts I worked through examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Robust estimates of location in a univariate setting are well-known, with the median statistic being the classical example. Robust estimates of scale are less well-known, with the best known example being interquartile range (IQR), but a more modern statistic being the MAD function.

For multivariate data, the classical (nonrobust) estimate of location is the vector mean, c, which is simply the vector whose ith component is the mean of the ith variable. The classical (nonrobust) estimate of scatter is the covariance matrix. An outlier is defined as an observation whose Mahalanobis distance from c is greater than some cutoff value. As in the univariate case, both classical estimators are sensitive to outliers in the data. Consequently, statisticians have created robust estimates of the center and the scatter (covariance) matrix.

MCD: Robust estimation by subsampling

A popular algorithm that computes a robust center and scatter of multivariate data is known as the Minimum Covariance Determinant (MCD) algorithm. The main idea is due to Rousseeuw (1985), but the algorithm that is commonly used was developed by Rousseeuw and Van Driessen (1999). The MCD algorithm works by sampling h observations from the data over and over again, where h is typically in the range n/2 < h < 3n/4. The "winning" subset is the h points whose covariance matrix has the smallest determinant. Points far from the center of this subset are excluded, and the center and scatter of the remaining points are used as the robust estimates of location and scatter.

This Monte Carlo approach works very well in practice, but it does have the unfortunate property that it is not deterministic: a different random number seed could result in different robust estimates. Recently, Hubert, Rousseeuw, and Verdonck (2010) have published a deterministic algorithm for the MCD.

Robust MCD estimates in SAS/IML software

The SAS/IML language includes the MCD function for robust estimation of multivariate location and scatter. The following matrix defines a data matrix from Brownlee (1965) that correspond to certain measurements taken on 21 consecutive days. The points are shown in a three-dimensional scatter plot that was created in SAS/IML Studio.

proc iml;
x = { 80  27  89,  80  27  88,  75  25  90, 
      62  24  87,  62  22  87,  62  23  87, 
      62  24  93,  62  24  93,  58  23  87, 
      58  18  80,  58  18  89,  58  17  88, 
      58  18  82,  58  19  93,  50  18  89, 
      50  18  86,  50  19  72,  50  19  79, 
      50  20  80,  56  20  82,  70  20  91 };
 
/* classical estimates */
labl = {"x1" "x2" "x3"};
mean = mean(x);
cov = cov(x);
print mean[c=labl format=5.2], cov[r=labl c=labl format=5.2];

Most researchers think that observations 1, 2, 3, and 21 are outliers, with others including observation 2 as an outlier. (These points are shown as red crosses in the scatter plot.) The following statement runs the MCD algorithm on these data and prints the robust estimates:

/* robust estimates */
N = nrow(x);   /* 21 observations */
p = ncol(x);   /*  3 variables */
 
optn = j(8,1,.); /* default options for MCD */
optn[1] = 0;     /* =1 if you want printed output */
optn[4]= floor(0.75*N); /* h = 75% of obs */
 
call MCD(sc, est, dist, optn, x);
RobustLoc = est[1, ];     /* robust location */
RobustCov = est[3:2+p, ]; /* robust scatter matrix */
print RobustLoc[c=labl format=5.2], RobustCov[r=labl c=labl format=5.2];

The robust estimate of the center of the data is not too different from the classical estimate, but the robust scatter matrix is VERY different from the classical covariance matrix. Each robust estimate excludes points that are identified as outliers.

If you take these robust estimates and plug them into the classical Mahalanobis distance formula, the corresponding distance is known as the robust distance. It measures the distance between each observation and the estimate of the robust center by using a metric that depends on the robust scatter matrix. The MCD subroutine returns distance information in a matrix that I've called DIST (the third argument). The first row of DIST is the classical Mahalanobis distance. The second row is the robust distance, which is based on the robust estimates of location and scatter. The third row is an indicator variable with the value 1 if an observation is closer to the robust center than some cutoff value, and 0 otherwise. Consequently, the following statements find the outliers.
/* rules to detect outliers */
outIdx = loc(dist[3,]=0); /* RD > cutoff */
print outIdx;

The MCD algorithm has determined that observations 1, 2, 3, and 21 are outliers.

Incidentally, the cutoff value used by MCD is based on a quantile of the chi-square distribution because the squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degress of freedom, where p is the number of variables. The cutoff used is as follows:
cutoff = sqrt( quantile("chisquare", 0.975, p) ); /* dist^2 ~ chi-square */

In a recent paper, Hardin and Rocke (2005) propose a different criterion, based on the distribution of robust distances.

Robust MCD estimates in SAS/STAT software: How to "trick" PROC ROBUSTREG

The ROBUSTREG procedure can also compute MCD estimates. Usually, the ROBUSTREG procedure is used as a regression procedure, but you can also use it to obtain the MCD estimates by "inventing" a response variable. The MCD estimates are produced for the explanatory variables, so the choice of a response variable is unimportant. In the following example, I generate random values for the response variable.

In a regression context, the word "outlier" is reserved for an observation for which the value of the response variable is far from the predicted value. In other words, in regression an outlier means "far away (from the model) in the Y direction." In contrast, the ROBUSTREG procedure uses the MCD algorithm to identify influential observations in the space of explanatory (that is, X) variables. These are also called high-leverage points. They are observations that are far from the center of the X variables. High-leverage points are very influential in ordinary least squares regression, and that is why it is important to identify them.

To generate the MCD estimates, specify the DIAGNOSTICS and the LEVERAGE(MCDINFO) options on the MODEL statement, as shown in the following statements:

/* write data from SAS/IML (or use a DATA step) */
create X from x[c=labl]; append from x; close;
quit;
 
data X;
set X;
y=rannor(1); /* random response variable */
run;
 
proc robustreg data=X method=lts;
   model y = x1 x2 x3 / diagnostics leverage(MCDInfo);
   ods select MCDCenter MCDCov Diagnostics;
   ods output diagnostics=Diagnostics(where=(leverage=1));
run;
 
proc print data=Diagnostics; 
var Obs Mahalanobis RobustDist Leverage;
run;

The robust estimates of location and scatter are the same, as are the robust distances. The "leverage" variable is an indicator variable that tells you which observations are far from the center of the explanatory variables. They are multivariate "outliers" in the the space of the X variables, although they are not necessarily outliers for the response (Y) variable.

tags: Data Analysis, Statistical Programming

9 Comments

  1. Posted February 2, 2012 at 2:57 pm | Permalink

    Good stuff!

    When there are a bunch of dimensions, every data point is an outlier .... it might make a good post to talk about the "curse of dimensionality".

  2. Jim
    Posted August 6, 2013 at 9:48 am | Permalink

    Mentioning "curse of dimensionality": how can I simulate the distance convergence of (dist_max-dist_min)/dist_min -> 0 like mentioned in http://en.wikipedia.org/wiki/Curse_of_dimensionality?

    • Posted August 6, 2013 at 10:08 pm | Permalink

      Sorry, but that paragraph is poorly written. Distances between what? And what is being held constant as n-->infinity? I don't understand what they are trying to say.

  3. Michael Kinnear
    Posted August 17, 2014 at 6:14 pm | Permalink

    This data analysis is incorrect. The data results from Rousseeuw's Minvol (Minimum Volume Ellipsoid) is presented , not the Rousseeuw MCD data results.

    • Posted August 18, 2014 at 9:00 am | Permalink

      I don't understand your comment. This article presents the results of an MCD analysis of the Brownlee stackloss data. Can you explain why you say it is incorrect?

      Yes, you can also analyze these data by using the MVE method. The SAS/IML chapter about robust statistics provides a comparison of the MVE and MCD methods applied to these data. Researchers have used these data to compare many different analyses.

  4. Michael Kinnear
    Posted August 18, 2014 at 9:15 pm | Permalink

    I used the FORTRAN (source) programs downloaded directly from the Rousseeuw website. The source was compiled (Intel; Lahey/Fujitsu).
    Results from Minvol (MVE) Results from fastMCD:
    1 2.254 5.528 1 2.2536 12.17328
    2 2.325 5.637 2 2.32474 12.25568
    3 1.594 4.197 3 1.59371 9.26399
    21 2.177 3.657 21 2.17676 7.62277
    Since this dataset is small, the fastMCD algorithm will calculate the exact MCD.
    Note that the corresponding scatter plots look substantially different from each other.

    • Posted August 19, 2014 at 8:52 am | Permalink

      Yes, I get those exact same numbers for the MD and RD of those observations when I use Rousseeuw's default values for the MVE and MCD routines.

  5. Michael Kinnear
    Posted August 19, 2014 at 8:35 pm | Permalink

    I guess that I need to clarify my previous statements.Your last three' tables' are labeled MCD Center, MCD Covariance, and the third has no label. A novice reader may erroneously conclude that the last 'table' contains the results of the MCD analysis. It does not, but rather contains the results of a MVE analysis. For the sake of clarity, you should probably either replace the data with the actual MCD numbers or label it as the results of a MVE analysis. Hope this clarifies what I was trying to say earlier; I didn't mean to cause any confusion.
    Your kind cooperation in this matter is most gratefully appreciated.

    • Posted August 20, 2014 at 10:47 am | Permalink

      All of the tables are from MCD. The column labeled "RobustDist" is the robust MCD distance. I suspect that the source of the discrepancy is that I am not using the default value of h. If you have further questions, post them to the SAS Statistical Community.

4 Trackbacks

  1. By What is Mahalanobis distance? - The DO Loop on February 15, 2012 at 5:43 am

    [...] previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how [...]

  2. [...] my post on detecting outliers in multivariate data in SAS by using the MCD method, Peter Flom commented "when there are a bunch of dimensions, every data [...]

  3. [...] blogged about Mahalanobis distance and what it means geometrically. I also previously showed how Mahalanobis distance can be used to compute outliers in multivariate data. But how do you compute Mahalanobis distance in [...]

  4. [...] Detecting outliers in SAS: Multivariate location and scatter, which desribes how to use SAS software to find multivariate outliers [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>