# Identifying multivariate inliers and outliers

We’re nearing the end of this series of posts on fraud detection in clinical trials and some upcoming features of JMP Clinical 4.1 that help identify unusual observations. We’ve described how visit dates and measurements taken in the clinic can signify problems at the clinical site, and discussed how trial participants can appear multiple times within the same study. For the last two posts, we focus on some methods that use as much of a subject’s data as possible to see how he or she compares to other subjects. Today, we’ll use Mahalanobis distance to identify multivariate inliers and outliers.

Mahalanobis distance can be used to calculate the distance between two random vectors or from a vector to a particular point in multivariate space (typically the multivariate mean, or centroid). It differs from Euclidean distance in that it accounts for the correlation between the variables, so it allows for points that are not distributed spherically around the centroid. Mahalanobis distance is straightforward to compute in SAS, and for our purposes we’ll be comparing vectors of subject data to the centroid.
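To make the contrast with Euclidean distance concrete, here is a minimal sketch in Python with NumPy (not the SAS code used by JMP Clinical). The function name `mahalanobis_to_centroid` and the six-subject data set are invented for illustration, and the sketch assumes the sample covariance matrix is an acceptable estimate of the correlation structure.

```python
import numpy as np

def mahalanobis_to_centroid(X):
    """Mahalanobis distance from each row of X to the centroid (column means)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance matrix of the variables (columns), n - 1 denominator
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Pseudo-inverse so a singular covariance matrix does not raise an error
    cov_inv = np.linalg.pinv(cov)
    # d_i^2 = (x_i - mean)' S^-1 (x_i - mean), computed for all rows at once
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))

# Hypothetical data: two correlated measurements on six subjects.
# The last subject has unremarkable values individually but breaks the pattern.
X = np.array([
    [1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8], [2.0, 8.0],
])
mahal = mahalanobis_to_centroid(X)
euclid = np.linalg.norm(X - X.mean(axis=0), axis=1)
for i, (m, e) in enumerate(zip(mahal, euclid), start=1):
    print(f"subject {i}: Mahalanobis = {m:.2f}, Euclidean = {e:.2f}")
```

In this toy data set the sixth subject has the largest Mahalanobis distance, while the first subject is farthest from the centroid in Euclidean terms. That is the effect of accounting for correlation: a point can be ordinary on each variable and still sit far from the bulk of the data.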

In statistics, most of us are familiar with the term “outlier.” Merriam-Webster defines outlier as “a statistical observation that is markedly different in value from the others of the sample.” In a box plot, an outlier may be identified as a point that lies more than 1.5 times the interquartile range (the distance between the first and third quartiles) below the first quartile or above the third quartile.
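The box plot rule can be sketched in a few lines of Python. The function name `boxplot_outliers` and the sample values are hypothetical, and quartile conventions differ between software packages, so the fences computed here may not match a given tool exactly.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the first or third quartile."""
    # "inclusive" treats the data as the full population when interpolating;
    # other quartile definitions can shift the fences slightly
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one value far from the rest
values = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.5]
print(boxplot_outliers(values))
```

Note that this rule only ever flags values far from the middle of the distribution, which is why it cannot reveal inliers.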

On the other hand, an “inlier” is a value that lies close to the mean. In a univariate setting, a value close to the mean would not raise any eyebrows. However, as Evans (2001) points out, it would be unlikely for an observation to lie near the mean for a large number of variables. So while outliers may be problematic, inliers may be more likely to represent observations that are “too good to be true” or “too good to be real.” Of course, a Mahalanobis distance of zero would represent a subject who lies exactly on the mean for every variable. (How better to escape detection than to create individual values that don’t stand out?)

JMP Clinical’s upcoming Multivariate Inliers and Outliers analytical process (AP) creates a data set comprising one row per subject from a set of CDISC-formatted data sets. Variables for lab tests, vital signs, symptoms and other Findings data are generated by visit number and time point, and variables representing frequencies of adverse events, medications and medical history terms are computed. Based on dialog options, variables exceeding a certain percentage of missing data (default 5%) are excluded from the analysis, since Mahalanobis distance cannot be computed for a subject with any missing data among the variables of interest. This is an important point: a balance must be struck between using as much data as possible and excluding too many subjects from your review. Subjects who discontinue early would be expected to have numerous missing values, so it is reasonable to expect to lose some subjects when considering the majority of the data. Alternatively, an option excludes every variable that contains missing values, which keeps every subject in the analysis.
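The two-stage filtering described above (drop variables with too much missing data, then drop subjects with any remaining gaps) might look like the following sketch. It is not JMP Clinical’s implementation: the function name `filter_for_distance`, the variable names, and the tiny data set are hypothetical, and the example uses a 25% threshold only because four subjects are too few to illustrate the 5% default.

```python
import numpy as np

def filter_for_distance(X, names, max_missing=0.05):
    """Drop variables whose missing rate exceeds max_missing, then drop
    subjects that still have any missing value among the kept variables."""
    X = np.asarray(X, dtype=float)
    # Fraction of subjects with a missing (NaN) value, per variable
    missing_rate = np.isnan(X).mean(axis=0)
    keep_cols = missing_rate <= max_missing
    X_kept = X[:, keep_cols]
    # A subject is usable only if every kept variable is present
    complete_rows = ~np.isnan(X_kept).any(axis=1)
    kept_names = [n for n, keep in zip(names, keep_cols) if keep]
    return X_kept[complete_rows], kept_names, complete_rows

nan = float("nan")
# Hypothetical one-row-per-subject data; column names are made up
names = ["ALT_V1", "ALT_V2", "SYSBP_V1"]
X = [
    [1.0, nan, 5.0],
    [2.0, nan, 6.0],
    [3.0, 7.0, nan],
    [4.0, 8.0, 9.0],
]
X_clean, kept_names, complete_rows = filter_for_distance(X, names, max_missing=0.25)
print(kept_names, X_clean.shape, complete_rows.tolist())
```

Raising `max_missing` keeps more variables but loses more subjects, while setting it to zero keeps every subject at the cost of discarding any variable with a gap.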

The resulting data set is then used to compute Mahalanobis distance. Figure 1 shows a box plot of the distance from each subject to the centroid of the multivariate distribution. The distance measure is computed from 268 of 744 possible variables; the remaining variables had missing data rates exceeding 5% and were removed from the analysis. Fifty-two subjects did not have distance measures computed because 15 of the retained variables had some missing data. This figure can identify outliers (large distances) or inliers (small distances). The dotted red reference line is derived from a chi-square distribution with k degrees of freedom, where k is the number of variables in the analysis. When the data are multivariate normal, the square of Mahalanobis distance approximately follows this distribution.

Figure 1. Box plot of Mahalanobis distance for all sites
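A reference value of this kind can be sketched as the square root of a chi-square quantile. The post does not say which quantile the reference line uses, so the probability levels below are assumptions, as is the function naming. To stay within the Python standard library, the sketch uses the Wilson-Hilferty approximation to the chi-square quantile; an exact value is available from `scipy.stats.chi2.ppf` if SciPy is installed.

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile_wh(p, k):
    """Approximate p-th quantile of a chi-square distribution with k degrees
    of freedom (Wilson-Hilferty cube-root approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * sqrt(c)) ** 3

def distance_cutoff(p, k):
    """Mahalanobis distance whose square equals the p-th chi-square quantile."""
    return sqrt(chi2_quantile_wh(p, k))

k = 268  # number of variables in the analysis described in the post
# Assumed probability levels; the actual level used for Figure 1 is not stated
upper = distance_cutoff(0.975, k)  # larger distances suggest outliers
lower = distance_cutoff(0.025, k)  # smaller distances suggest inliers
print(f"lower cutoff = {lower:.2f}, upper cutoff = {upper:.2f}")
```

With hundreds of variables the expected distance is far from zero, which is why a subject sitting close to the centroid is as noteworthy as one sitting far away.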

Figure 2 presents Mahalanobis distance by study site to identify whether any particular site is extreme. This may uncover possible data errors among the analyzed variables, but it may also describe key differences in study population across the sites. The length of each box plot indicates how variable the subjects within a given site are around the multivariate mean. Large variability can reflect a diverse study population or a site that may benefit from additional training. Low variability may reflect a particularly homogeneous population. In any event, any site that “stands out” may require greater scrutiny.

Figure 2. Box plots of Mahalanobis distance by site
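The same by-site comparison can be summarized numerically. The sketch below groups per-subject distances by site and reports the median and interquartile range for each; the function name `summarize_by_site`, the site identifiers, and the distance values are all hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles

def summarize_by_site(site_ids, distances):
    """Group subject distances by site and report n, median and IQR per site."""
    by_site = defaultdict(list)
    for site, dist in zip(site_ids, distances):
        by_site[site].append(dist)
    summary = {}
    for site, vals in by_site.items():
        if len(vals) >= 2:
            q1, _, q3 = quantiles(vals, n=4, method="inclusive")
            iqr = q3 - q1
        else:
            iqr = 0.0  # a single subject gives no spread to measure
        summary[site] = {"n": len(vals), "median": median(vals), "iqr": iqr}
    return summary

# Hypothetical distances for two sites
site_ids = ["01", "01", "01", "02", "02", "02", "02"]
distances = [15.0, 16.0, 17.0, 16.2, 16.3, 16.4, 16.5]
summary = summarize_by_site(site_ids, distances)
for site, stats in sorted(summary.items()):
    print(site, stats)
```

A site whose median or spread differs markedly from the others is a candidate for closer review, though the summary alone cannot say whether the cause is the population, the data handling, or something else.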

Evans, S. (2001). Statistical aspects of the detection of fraud. In: Lock, S., Wells, F. & Farthing, M., eds. Fraud and Misconduct in Biomedical Research, Third Edition. BMJ Books.

### One Comment

1. Mike Clayton
Posted December 10, 2012 at 12:48 pm

MV outliers and inliers are often detected in semiconductor wafer test data, and these devices can be found to be "risky" in subsequent stress testing. They are all from the same wafer, so they saw the same processes and tools, but tiny defects make them "maverick circuits" and they can be contained.
Luckily thousands are made on each wafer, and x-y coordinates are known, and that data is tracked through all manufacturing steps, SPC and Test steps, until the die are singulated and packaged. But the DieID (lot and wafer and x-y location) is marked on each part in a 2D barcode or by other methods for customer return issues. So MV analysis is heavily used to contain statistical mavericks, which often turn out to be physical mavericks as well. Thanks for this nice tutorial to add to our collection.
