How to compute Mahalanobis distance in SAS

I recently blogged about Mahalanobis distance and what it means geometrically. I also previously showed how Mahalanobis distance can be used to compute outliers in multivariate data. But how do you compute Mahalanobis distance in SAS?

Computing Mahalanobis distance with built-in SAS procedures and functions

There are several ways to compute the Mahalanobis distances between observations and the sample mean. One way is to compute the leverage statistic by using a regression procedure, and then using a mathematical relationship between the leverage and the Mahalanobis distance. To find the Mahalanobis distance between pairs of points, you can use principal component analysis and the DISTANCE procedure. And, as mentioned in a previous blog, the MCD routine in SAS/IML software provides the classical Mahalanobis distance for a data matrix.

Writing a Mahalanobis distance function

The previous methods all have a disadvantage: they provide the Mahalanobis distance as a consequence of computing something else (regression, principal components, or MCD). However, if your goal is to compute the Mahalanobis distance, it is more efficient to call a function that is designed for that purpose. You can use the SAS/IML language to write such a function.

Rather than just present a Mahalanobis distance function in its final form, I'm going to describe three functions: the naive "first attempt" function, a more efficient alternative, and a final function that is both efficient and can be used in a variety of situations.

A first attempt: Implement the mathematical formula

One of the advantages of the SAS/IML language is that it enables you to use a natural syntax to implement mathematical formulas that appear in textbooks and journal articles. Given an observation x from a multivariate normal distribution with mean μ and covariance matrix Σ, the squared Mahalanobis distance from x to μ is given by the formula d2 = (x - μ)T Σ -1 (x - μ). You can use this definition to define a function that returns the Mahalanobis distance for a row vector x, given a center vector (usually μ or an estimate of μ) and a covariance matrix:

proc iml;
/* simplest approach: x and center are row vectors */
start mahal1(x, center, cov);
   y = (x - center)`; /* col vector */
   d2 = y` * inv(cov) * y; /* explicit inverse. Not optimal */
   return (sqrt(d2));   
finish;
 
x = {1 0};  /* test it */
center = {1 1};
cov = {9 1, 
       1 1};
md1 = mahal1(x, center, cov);
print md1;

This naive implementation computes the Mahalanobis distance, but it suffers from the following problems:

  1. The function uses the SAS/IML INV function to compute an explicit inverse matrix. However, it is rarely necessary to compute an explicit matrix inverse. (See also the comments to John D. Cook's article "Don’t invert that matrix.") Explicit computations are less stable numerically and are less efficient than matrix factorization methods.
  2. The function as written takes a single observation for x and returns a single distance. The function should be generalized so that you can pass in a matrix for x, where each row of the matrix is an observation. The function should return a vector of distances, one for each row of x.

A second attempt: More efficient and more general

Instead of computing the explicit inverse for the given covariance matrix, you can use the Cholesky decomposition (found by the ROOT function in SAS/IML software) followed by solving a linear system by using the TRISOLV function, as described in my article on the Cholesky transformation. You can compute the distance for multiple rows of x by using a DO loop. A second version of the Mahalanobis function follows:

/* (1) Use ROOT and TRISOLV instead of INV to solve linear system
   (2) Let x be a matrix. Compute distances from row x[i,] to center. */
start mahal2(x, center, cov);
   /* x[i,] and center are row vectors */
   n = nrow(x);
   d2 = j(n,1);               /* allocate vector for distances */
   U = root(cov);      
   /* Want   d2 = v * inv(cov) * v` = v * w  where cov*w = v`
   ==> (U`U)*w = v`            Cholesky decomp of cov
   ==>  First solve U` z = v`  for z,
         then solve  U w = z   for w */
   y = x - center;
   do i = 1 to n;            /* using the DO loop is not optimal */
      v = y[i,];             /* get i_th centered row */
      z = trisolv(2, U, v`); /* solve linear system */
      w = trisolv(1, U, z);
      d2[i,] = v * w;        /* dot product of vectors */
   end;
   return (sqrt(d2));
finish;
 
x = {1 0,  /* test it: x has more than one row */
     0 1,
    -1 0,
     0 -1};
md2 = mahal2(x, center, cov);
print md2;

This is not a bad algorithm, but it can be improved. Currently it computes distances from a bunch of points (the rows of x) to a single point (passed in as center). However, if you want to compute pairwise distances between two sets of points, you would need to call this function multiple times, and each function call would perform the same Cholesky decomposition. An improvement is to allow the center argument to also be a matrix, and have the function return a matrix of distances, where the (i,j)th distance is the Mahalanobis distance between the ith row of x and the jth row of center.

There are two tricks that you can use to make the new implementation of the Mahalanobis distance function as efficient as possible:

  1. The TRISOLV function enables you to solve multiple linear systems in a single call by passing in multiple "right-hand sides." Consequently, you don't actually need the DO loop in the previous section to loop over the rows of x.
  2. It turns out that the squared distance can be written in terms of the diagonal elements of a certain matrix product. In a previous blog, I showed an efficient shortcut for computing the diagonal elements of a matrix product without actually forming the product.
The following SAS/IML module is a final version of a function that computes the pairwise Mahalanobis distance between points. If the second argument is a row vector (such as the mean vector), this algorithm computes the Mahalanobis distance between observations and a single point:

/* 3) Enable pairwise Mahalanobis distances by allowing 2nd arg to be a matrix.
   4) Solve linear system for many right-hand sides at once */
start mahalanobis(x, center, cov);
   /* x[i,] and center[i,] are row vectors */
   n = nrow(x);      
   m = nrow(center);
   d2 = j(n,m);     /* return matrix of pairwise Mahalanobis distances */
   U = root(cov);   /* solve for Cholesky root once */
   do k = 1 to m;   /* for each row of the center matrix */
      v = x - center[k,];    /* v is matrix of right-hand sides */
      z = trisolv(2, U, v`); /* solve n linear systems in one call */
      w = trisolv(1, U, z);
      /* Trick: vecdiag(A*B) = (A#B`)[,+] without computing product */
      d2[,k] = (v # w`)[,+]; /* shortcut: diagonal of matrix product */
   end;
   return (sqrt(d2));
finish;
 
y = {1 1, 0 0, 1 2}; /* x and y are BOTH matrices */
md = mahalanobis(x, y, cov); /* compute pairwise distances between x and y */
print md;

In summary, you can write a function in the SAS/IML language to compute the Mahalanobis distance in SAS. This single function enables you to compute the Mahalanobis distance for three common situations:

  • from each observation to the mean: md = mahalanobis(x, mean(x), cov(x))
  • from each observation to a specific observation: md = mahalanobis(x, MyObs, cov(x))
  • from each observation to all other observations (all possible pairs): p=x; md = mahalanobis(x, p, cov(x))

Post a Comment

The DATA step and the implied AND operator

The SAS DATA step supports a special syntax for determining whether a value is contained in an interval:

y = (-2 < x < 2);
This expression creates an indicator variable with the value 1 if x is in the interval (-2,2) and 0 otherwise.

The documentation for the AND operator states that "two comparisons with a common variable linked by [the AND operator] can be condensed" into a single statement with an implied AND operator. For example, the following two statements are equivalent:

y = (-2 < x < 2);
y = (-2<x & x<2);
The second syntax is more familiar to programmers in languages such as C/C++ that do not support an "implied AND" comparison operator.

The reason that I mention this syntax is that it is NOT available in the SAS/IML language (nor in R, nor in MATLAB). Sometimes experienced DATA step programmers expect the implied AND operator to work in their SAS/IML programs. The syntax does parse and execute, but it gives a different value than in the DATA step! For example, consider the following SAS/IML program:

proc iml;
x = -3:3;   /* the vector {-3 -2 -1 0 1 2 3} */
y = (-2 < x < 2);
There are no errors when you run this program. The expression for y is parsed as
y = ((-2 < x) < 2);
Regardless of the values in x, the variable y is a vector of ones. Why? The previous statement is equivalent to the following two statements:
v = (-2 < x);  /* {0 0 1 1 1 1 1} */
y = (v < 2);   /* {1 1 1 1 1 1 1} */
In the first statement, v is an indicator variable: v[i]=1 when the expression is true, and 0 otherwise. In the second statement, the zeros and ones in v are compared with the value 2. Not surprisingly, all of the zeros and ones are less than 2, so y is a vector of all ones.

Conclusion: Don't use the DATA step syntax in your SAS/IML programs. Instead, use an explicit AND operator such as (-2<x & x<2).

I am told that Python and Perl 6 (but not Perl 5) also support this implied AND operator. The SQL procedure in SAS software supports it, although other implementations of SQL do not. Do you know of other languages that also support a compact syntax for testing whether a value is within an interval?

Post a Comment

Convert a covariance matrix to a correlation matrix in SAS

I have previously blogged about how to convert a covariance matrix into a correlation matrix in SAS (and the other way around). However, I still get questions about it, perhaps because my previous post demonstrated more than one way to accomplish each transformation.

To eliminate all confusion, the following SAS/IML program implements a function that converts a covariance matrix into the associated correlation matrix. For an explanation of why this works, see my previous article.

proc iml;
/* convert a covariance matrix, S, to a correlation matrix */
start Cov2Corr(S);
   D = sqrt(vecdiag(S));
   return( S / D` / D ); /* divide columns, then divide rows */
finish;
 
S = {1.0  1.0  8.1,  /* test it */
     1.0 16.0 18.0,
     8.1 18.0 81.0 };
Corr = Cov2Corr(S);
print Corr;

There are many covariance matrices that produce the same correlation matrix, therefore the inverse operation (which produces a covariance matrix from a correlation matrix) is not uniquely defined. However, if you specify the standard deviations for each variable, you can write a SAS/IML function that converts a correlation matrix to a covariance matrix:

/* R = correlation matrix
   sd = (vector of) standard deviations for each variable 
   Return covariance matrix with sd##2 on the diagonal */
start Corr2Cov(R, sd);
   std = colvec(sd);
   return( std` # R #std );
finish;
 
/* convert correlation matrix to covariance matrix */
sd = {1 4 9}; /* std devs = sqrt(vecdiag(S)) */
Cov = Corr2Cov(Corr, sd);
print Cov;

Converting a covariance matrix to a correlation matrix with SAS/STAT software

You can, of course, use the DATA step to convert a covariance matrix to a correlation matrix (and the other way around), but here's a neat trick that you can do in SAS/STAT software: the FACTOR procedure can read a covariance matrix (technically, a TYPE=COV data set) and write the corresponding correlation matrix to a SAS data set, as shown in the following example.

/* define covariance matrix as TYPE=COV data set (lower triangular) */
data cov(type=cov);
_type_='COV';
input x1-x3;
cards;
1    .   .
1   16   .
8.1 18  81
;
run;
/* convert covariance matrix to correlation matrix */
proc factor data=cov outstat=outstat noprint; 
run;
proc print data=outstat noobs; 
where _TYPE_="CORR";
run;

Does anyone know of a slick trick that converts a correlation matrix (and standard deviations) into a covariance matrix without using SAS/IML software? Post your solution in the comments.

Post a Comment

What is Mahalanobis distance?

I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how you can compute Mahalanobis distance.

Distance in standard units

In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. Often "scale" means "standard deviation." For univariate data, we say that an observation that is one standard deviation from the mean is closer to the mean than an observation that is three standard deviations away. (You can also specify the distance between two observations by specifying how many standard deviations apart they are.)

For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability. Specifically, it is more likely to observe an observation that is about one standard deviation from the mean than it is to observe one that is several standard deviations away. Why? Because the probability density function is higher near the mean and nearly zero as you move many standard deviations away.

For normally distributed data, you can specify the distance from the mean by computing the so-called z-score. For a value x, the z-score of x is the quantity z = (x-μ)/σ, where μ is the population mean and σ is the population standard deviation. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.

Distance is not always what it seems

You can generalize these ideas to the multivariate normal distribution. The following graph shows simulated bivariate normal data that is overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses are further away, such as the 90% prediction ellipse.

In the graph, two observations are displayed by using red stars as markers. The first observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.)

The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2, respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is. </p

Notice the position of the two observations relative to the ellipses. The point (0,2) is located at the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4,0) than to observe one near (0,2). The probability density is higher near (4,0) than it is near (0,2).

In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation." You can use the bivariate probability contours to compare distances to the bivariate mean. A point p is closer than a point q if the contour that contains p is nested within the contour that contains q.

Defining the Mahalanobis distance

You can use the probability contours to define the Mahalanobis distance. The Mahalanobis distance has the following properties:

  • It accounts for the fact that the variances in each direction are different.
  • It accounts for the covariance between variables.
  • It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.

For univariate normal data, the univariate z-score standardizes the distribution (so that it has mean 0 and unit variance) and gives a dimensionless quantity that specifies the distance from an observation to the mean in terms of the scale of the data. For multivariate normal data with mean μ and covariance matrix Σ, you can decorrelate the variables and standardize the distribution by applying the Cholesky transformation z = L-1(x - μ), where L is the Cholesky factor of Σ, Σ=LLT.

After transforming the data, you can compute the standard Euclidian distance from the point z to the origin. In order to get rid of square roots, I'll compute the square of the Euclidean distance, which is dist2(z,0) = zTz. This measures how far from the origin a point is, and it is the multivariate generalization of a z-score.

You can rewrite zTz in terms of the original correlated variables. The squared distance Mahal2(x,μ) is
= zT z
= (L-1(x - μ))T (L-1(x - μ))
= (x - μ)T (LLT)-1 (x - μ)
= (x - μ)T Σ -1 (x - μ)
The last formula is the definition of the squared Mahalanobis distance. The derivation uses several matrix identities such as (AB)T = BTAT, (AB)-1 = B-1A-1, and (A-1)T = (AT)-1. Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ.

The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data. In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.

Post a Comment

Avoid unnecessary IF-THEN statements in loops

Way back when I learned to program, I remember a computer instructor explaining that an IF-THEN statement can be a relatively slow operation. He said "If a multiplication takes one unit of time, an IF statement requires about 70 units." I don't know where his numbers came from, or even if they were correct, but I know that I spent a lot of time avoiding IF statements in order to make my programs run as fast as possible. I learned to write expressions such as y = (x>0) rather than the equivalent statement if (x>0) then y=1; else y=0.

Modern computers are much faster than they were back then, but old habits die hard. The following program was written by a colleague and examines the relative performances of a DO loop (=slow) versus a call to the CUSUM function (=fast) on a vector with one million elements. When I saw the program, I could hear my old instructor's voice whispering in my ear....watch out for those IF statements....

proc iml;
N = 1e6;
z = rannorm(j(N,1)); /* generate random values */
y = j(N, 1, .);
 
/* time DO loop with compute cumulative sum (inefficient) */
t0 = time(); 
do i = 1 to n;
   if i = 1 then y[i] = z[i];
   else y[i] = y[i-1] + z[i];
end;
tDO_IF = time() - t0;
 
/* time vectorized computation (fast) */
t0 = time(); 
x = cusum(z);
tCUSUM = time() - t0;

As expected, the CUSUM function is much faster than writing a DO loop (the times are in the table that follows), which is why I always try to vectorize computations. That result was not surprising, but what caught my eye was the IF-THEN statement sitting inside of the DO loop. It occurred to me that part of the time spent by the first computation was because the IF-THEN statement is called one million times. To make a fairer comparison between the time taken by the DO loop and the CUSUM function, you can eliminate the IF-THEN statement entirely:

t0 = time();
y[1]=z[1]; /* initialize */
do i = 2 to n;
   y[i] = y[i-1] + z[i];
end;
tDO = time() - t0;
print tDO_IF tDO tCUSUM;

The table shows that omitting the IF-THEN statement results in the DO loop running 20% faster.

What if the IF-THEN statement is unavoidable?

You can't always eliminate an IF-THEN statement, but when you have an IF-THEN/ELSE statement inside a loop, you can usually restructure your program by writing two loops, one for when the IF condition is TRUE, and one for when it is FALSE. For example, consider the following statements that (inefficiently) compute two sums: the sum of the positive elements in a vector, and the sum of the negative elements:

/* loop that "needs" an IF-THEN statement (?) */
/* IF-THEN inside loop...SLOW! */
t0 = time(); 
posSum = 0; negSum = 0;
do i = 1 to n;
   if z[i] < 0 then 
      negSum = negSum + z[i];
   else 
      posSum = posSum + z[i];
end;
tDO_IF = time() - t0;   
 
/* rewrite as two loops: one over positive elements and one over negative */
t0 = time(); 
negIdx = loc(z<0);
posIdx = loc(z>0);;
negSum = 0;
do i = 1 to ncol(negIdx);
   negSum = negSum + z[negIdx[i]];
end;
posSum = 0;
do i = 1 to ncol(posIdx);
   posSum = posSum + z[posIdx[i]];
end;
t2DO_IF = time() - t0;

Of course, neither of these examples are as efficient as using a vectorized computation that calls the SUM function (see below). However, once again we observe a 20% improvement in the performance of the code by removing IF-THEN statements from inside the DO loop, as shown by the following table:

/* vectorized computation (fast) */
t0 = time();
negIdx = loc(z<0);
negSum = sum(z[negIdx]);
posIdx = loc(z>0);
posSum = sum(z[posIdx]);
tCUSUM = time() - t0;
print tDO_IF t2DO_IF tCUSUM;

Avoiding IF-THEN statements in a loop is just one of several ways to optimize loop performance. In general, you should remove any conditions that do not change inside of the loop. This is sometimes called "loop hoisting." For some compiled programming languages such as FORTRAN and C/C++, an optimizing compiler can sometimes determine that an expression is constant and move it outside of the loop for you.

So remember this little efficiency tip: avoid unnecessary statements inside loops. If it is easy to restructure the code to remove an IF-THEN statement inside a loop, or to move a constant expression outside of a loop, do so. It will pay off in terms of efficiency.

Do you have an efficiency tip to share? Post a comment.

Post a Comment

Use the Cholesky transformation to correlate and uncorrelate variables

A variance-covariance matrix expresses linear relationships between variables. Given the covariances between variables, did you know that you can write down an invertible linear transformation that "uncorrelates" the variables? Conversely, you can transform a set of uncorrelated variables into variables with given covariances. The transformation that works this magic is called the Cholesky transformation; it is represented by a matrix that is the "square root" of the covariance matrix.

The Square Root Matrix

Given a covariance matrix, Σ, it can be factored uniquely into a product Σ=UTU, where U is an upper triangular matrix with positive diagonal entries and the superscript denotes matrix transpose. The matrix U is the Cholesky (or "square root") matrix. Some people (including me) prefer to work with lower triangular matrices. If you define L=UT, then Σ=LLT. This is the form of the Cholesky decomposition that is given in Golub and Van Loan (1996, p. 143). Golub and Van Loan provide a proof of the Cholesky decomposition, as well as various ways to compute it.

Geometrically, the Cholesky matrix can transformation uncorrelated variables into variables whose variances and covariances are given by Σ. In particular, if you generate p standard normal variates, the Cholesky transformation maps the variables into variables for the multivariate normal distribution with covariance matrix Σ and centered at the origin (denoted MVN(0, Σ)).

The Cholesky Transformation: The Simple Case

Let's see how the Cholesky transofrmation works in a very simple situation. Suppose that you want to generate multivariate normal data that are uncorrelated, but have non-unit variance. The covariance matrix for this situation is the diagonal matrix of variances: Σ = diag(σ21,...,σ2p). The square root of Σ is the diagonal matrix D that consists of the standard deviations: Σ = DTD where D = diag(σ1,...,σp).

Geometrically, the D matrix scales each coordinate direction independently of the other directions. This is shown in the following image. The X axis is scaled by a factor of 3, whereas the Y axis is unchanged (scale factor of 1). The transformation D is diag(3,1), which corresponds to a covariance matrix of diag(9,1). If you think of the circles in the top image as being probability contours for the multivariate distribution MVN(0, I), then the bottom shows the corresponding probability ellipses for the distribution MVN(0, D).

The General Cholesky Transformation Correlates Variables

In the general case, a covariance matrix contains off-diagonal elements. The geometry of the Cholesky transformation is similar to the "pure scaling" case shown previously, but the transformation also rotates and shears the top image.

Computing a Cholesky matrix for a general covariance matrix is not as simple as for a diagonal covariance matrix. However, you can use ROOT function in SAS/IML software to compute the Cholesky matrix. Given any covariance matrix, the ROOT function returns a matrix U such that the product UTU equals the covariance matrix and U is an upper triangular matrix with positive diagonal entries. The following statements compute a Cholesky matrix in PROC IML:

proc iml;
Sigma = {9 1, 
         1 1};
U = root(Sigma);
print U (U`*U)[label="Sigma=U`*U"];

You can use the Cholesky matrix to create correlations among random variables. For example, suppose that X and Y are independent standard normal variables. The matrix U (or its transpose, L=UT) can be used to create new variables Z and W such that the covariance of Z and W equals Σ. The following SAS/IML statements generate X and Y as rows of the matrix xy. That is, each column is a point (x,y). (Usually the variables form the columns, but transposing xy makes the linear algebra easier.)The statements then map each (x,y) point to a new point, (z,w), and compute the sample covariance of the Z and W variables. As promised, the sample covariance is close to Σ, the covariance of the underlying population.

/* generate x,y ~ N(0,1), corr(x,y)=0 */
xy = j(2, 1000);
call randseed(12345);
call randgen(xy, "Normal"); /* each col is indep N(0,1) */
 
L = U`; 
zw = L * xy; /* apply Cholesky transformation to induce correlation */
cov = cov(zw`); /* check covariance of transformed variables */
print cov;

The following graph shows the geometry of the transformation in terms of the data and in terms of probability ellipses. The top graph is a scatter plot of the X and Y variables. Notice that they are uncorrelated and that the probability ellipses are circles. The bottom graph is a scatter plot of the Z and W variables. Notice that they are correlated and the probability contours are ellipses that are tilted with respect to the coordinate axes. The bottom graph is the transformation under L of points and circles in the top graph.

The Inverse Cholesky Transformation Uncorrelates Variables

You might wonder: Can you go the other way? That is, if you start with correlated variables, can you apply a linear transformation such that the transformed variables are uncorrelated? Yes, and it's easy to guess the transformation that works: it is the inverse of the Cholesky transformation!

Suppose that you generate multivariate normal data from MVN(0,Σ). You can "uncorrelate" the data by transforming the data according to L-1. In SAS/IML software, you might be tempted to use the INV function to compute an explicit matrix inverse, but as I have written, an alternative is to use the SOLVE function, which is more efficient than computing an explicit inverse matrix. However, since L is a lower triangular matrix, there is an even more efficient way to solve the linear system: use the TRISOLV function, as shown in the following statements:

/* Start with correlated data. Apply inverse of L. */
zw = T( RandNormal(1000, {0, 0}, Sigma) );
xy = trisolv(4, L, zw); /* more efficient than inv(L)*zw or solve(L, zw) */
 
/* did we succeed? Compute covariance of transformed data */
cov = cov(xy`);
print cov;

Success! The covariance matrix is essentially the identity matrix. The inverse Cholesky transformation "uncorrelates" the variables.

The TRISOLV function, which uses back-substitution to solve the linear system, is extremely fast. Anytime you are trying to solve a linear system that involves a covariance matrix, you should try to solve the system by computing the Cholesky factor of the covariance matrix, followed by back-substitution.

In summary, you can use the Cholesky factor of a covariance matrix in several ways:

  • To generate multivariate normal data with a given covariance structure from uncorrelated normal variables.
  • To remove the correlations between variables. This task requires using the inverse Cholesky transformation.
  • To quickly solve linear systems that involve a covariance matrix.

Post a Comment

How to access SAS sample programs

Have you ever wanted to run a sample program from the SAS documentation or wanted to use a data set that appears in the SAS documentation? You can: all programs and data sets in the documentation are distributed with SAS, you just have to know where to look!

Sample data in the SASHELP library

The data that are used in the SAS documentation are obtained in one of two ways: from a data set in the SASHELP library, or from a DATA step.

Many documentation examples use data in the SASHELP library, such as the Class, Iris, and Cars data sets. To use these data sets, specify the two-level name (libref.DataSetName) on the DATA= option of a procedure. For example, to display descriptive statistics about numerical variables in the Iris data, use the following call to the MEANS procedure:

title "Descriptive statistics for SASHELP.Iris";
proc means data=sashelp.iris;
run;

In other examples, the data is presented as a DATA step. However, sometimes the DATA step is so long that all of the data do not appear in the documentation. For example, the documentation of the QUANTREG procedure contains the following truncated DATA step:

data ozone;
  days = _n_;
  input ozone @@;
datalines;
0.0060 0.0060 0.0320 0.0320 0.0320 0.0150 0.0150 0.0150 0.0200 0.0200
0.0160 0.0070 0.0270 0.0160 0.0150 0.0240 0.0220 0.0220 0.0220 0.0185
0.0150 0.0150 0.0110 0.0070 0.0070 0.0240 0.0380 0.0240 0.0265 0.0290
   ... more lines ...   
0.0220 0.0210 0.0210 0.0130 0.0130 0.0130 0.0330 0.0330 0.0330 0.0325
0.0320 0.0320 0.0320 0.0120 0.0200 0.0200 0.0200 0.0320 0.0320 0.0250
0.0180 0.0180 0.0270 0.0270 0.0290
;

The complete data is contained in the sample program for this example. The next sections show how to access sample programs.

Access sample programs from the Help menu

You can access SAS sample programs by using the Help menu inside the SAS Windowing Environment, which is the default GUI environment. The following image shows a way to access the SAS/STAT sample programs by using the SAS GUI:

Access sample programs from the installation directory

Michael A. Raithel, in a SAS Tip on sasCommunity.org, points out that sample programs exist for all products, and gives examples of directories in which you can find them. For example, on my desktop PC, which is running SAS 9.3, I can browse to

C:\Program Files\SASHome\SASFoundation\9.3\ProductName\sample\
in order to see the sample programs.

You need to replace ProductName with the name of a SAS product, such as stat or ets. But how can you find the path of the "root directory" for your SAS installation? You can run the following DATA step in order to find it:

data _NULL_;
   %put %sysget(sasroot);
run;
     C:\Program Files\SASHome\SASFoundation\9.3

This is the value of sasroot that Michael refers to in his article. As Michael points out, it can be useful to browse programs in the sample directory, especially for products that you are trying to learn. The names of the files can be somewhat cryptic, so you should to use a search tool to find the correct file. For example, if you search the SAS/STAT sample directory for the term "data ozone," you will discover that the DATA step for the PROC QUANTREG example is in the file qregex4.sas. Therefore, the complete path for the PROC QUANTREG example on my PC is

     C:\Program Files\SASHome\SASFoundation\9.3\stat\sample\qregex4.sas

I'd like to be able to end this post by saying "the examples for PROC IML are great and help you learn how to program in the SAS/IML language." Unfortunately, many of the SAS/IML examples (which are based on the documentation) are old and do not contain many comments or explanatory text. But don't dispair. For Getting Started examples, read The DO Loop every Monday.

Post a Comment

Detecting outliers in SAS: Part 3: Multivariate location and scatter

In two previous blog posts I worked through examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Robust estimates of location in a univariate setting are well-known, with the median statistic being the classical example. Robust estimates of scale are less well-known, with the best known example being interquartile range (IQR), but a more modern statistic being the MAD function.

For multivariate data, the classical (nonrobust) estimate of location is the vector mean, c, which is simply the vector whose ith component is the mean of the ith variable. The classical (nonrobust) estimate of scatter is the covariance matrix. An outlier is defined as an observation whose Mahalanobis distance from c is greater than some cutoff value. As in the univariate case, both classical estimators are sensitive to outliers in the data. Consequently, statisticians have created robust estimates of the center and the scatter (covariance) matrix.

MCD: Robust estimation by subsampling

A popular algorithm that computes a robust center and scatter of multivariate data is known as the Minimum Covariance Determinant (MCD) algorithm. The main idea is due to Rousseeuw (1985), but the algorithm that is commonly used was developed by Rousseeuw and Van Driessen (1999). The MCD algorithm works by sampling h observations from the data over and over again, where h is typically in the range n/2 < h < 3n/4. The "winning" subset is the h points whose covariance matrix has the smallest determinant. Points far from the center of this subset are excluded, and the center and scatter of the remaining points are used as the robust estimates of location and scatter.

This Monte Carlo approach works very well in practice, but it does have the unfortunate property that it is not deterministic: a different random number seed could result in different robust estimates. Recently, Hubert, Rousseeuw, and Verdonck (2010) have published a deterministic algorithm for the MCD.

Robust MCD estimates in SAS/IML software

The SAS/IML language includes the MCD function for robust estimation of multivariate location and scatter. The following matrix defines a data matrix from Brownlee (1965) that correspond to certain measurements taken on 21 consecutive days. The points are shown in a three-dimensional scatter plot that was created in SAS/IML Studio.

proc iml;
x = { 80  27  89,  80  27  88,  75  25  90, 
      62  24  87,  62  22  87,  62  23  87, 
      62  24  93,  62  24  93,  58  23  87, 
      58  18  80,  58  18  89,  58  17  88, 
      58  18  82,  58  19  93,  50  18  89, 
      50  18  86,  50  19  72,  50  19  79, 
      50  20  80,  56  20  82,  70  20  91 };
 
/* classical estimates */
labl = {"x1" "x2" "x3"};
mean = mean(x);
cov = cov(x);
print mean[c=labl format=5.2], cov[r=labl c=labl format=5.2];

Most researchers think that observations 1, 2, 3, and 21 are outliers, with others including observation 2 as an outlier. (These points are shown as red crosses in the scatter plot.) The following statement runs the MCD algorithm on these data and prints the robust estimates:

/* robust estimates */
N = nrow(x);   /* 21 observations */
p = ncol(x);   /*  3 variables */
 
optn = j(8,1,.); /* default options for MCD */
optn[1] = 0;     /* =1 if you want printed output */
optn[4]= floor(0.75*N); /* h = 75% of obs */
 
call MCD(sc, est, dist, optn, x);
RobustLoc = est[1, ];     /* robust location */
RobustCov = est[3:2+p, ]; /* robust scatter matrix */
print RobustLoc[c=labl format=5.2], RobustCov[r=labl c=labl format=5.2];

The robust estimate of the center of the data is not too different from the classical estimate, but the robust scatter matrix is VERY different from the classical covariance matrix. Each robust estimate excludes points that are identified as outliers.

If you take these robust estimates and plug them into the classical Mahalanobis distance formula, the corresponding distance is known as the robust distance. It measures the distance between each observation and the estimate of the robust center by using a metric that depends on the robust scatter matrix. The MCD subroutine returns distance information in a matrix that I've called DIST (the third argument). The first row of DIST is the classical Mahalanobis distance. The second row is the robust distance, which is based on the robust estimates of location and scatter. The third row is an indicator variable with the value 1 if an observation is closer to the robust center than some cutoff value, and 0 otherwise. Consequently, the following statements find the outliers.
/* rules to detect outliers */
outIdx = loc(dist[3,]=0); /* RD > cutoff */
print outIdx;

The MCD algorithm has determined that observations 1, 2, 3, and 21 are outliers.

Incidentally, the cutoff value used by MCD is based on a quantile of the chi-square distribution because the squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degress of freedom, where p is the number of variables. The cutoff used is as follows:
cutoff = sqrt( quantile("chisquare", 0.975, p) ); /* dist^2 ~ chi-square */

In a recent paper, Hardin and Rocke (2005) propose a different criterion, based on the distribution of robust distances.

Robust MCD estimates in SAS/STAT software: How to "trick" PROC ROBUSTREG

The ROBUSTREG procedure can also compute MCD estimates. Usually, the ROBUSTREG procedure is used as a regression procedure, but you can also use it to obtain the MCD estimates by "inventing" a response variable. The MCD estimates are produced for the explanatory variables, so the choice of a response variable is unimportant. In the following example, I generate random values for the response variable.

In a regression context, the word "outlier" is reserved for an observation for which the value of the response variable is far from the predicted value. In other words, in regression an outlier means "far away (from the model) in the Y direction." In contrast, the ROBUSTREG procedure uses the MCD algorithm to identify influential observations in the space of explanatory (that is, X) variables. These are also called high-leverage points. They are observations that are far from the center of the X variables. High-leverage points are very influential in ordinary least squares regression, and that is why it is important to identify them.

To generate the MCD estimates, specify the DIAGNOSTICS and the LEVERAGE(MCDINFO) options on the MODEL statement, as shown in the following statements:

/* write data from SAS/IML (or use a DATA step) */
create X from x[c=labl]; append from x; close;
quit;
 
data X;
set X;
y=rannor(1); /* random response variable */
run;
 
proc robustreg data=X method=lts;
   model y = x1 x2 x3 / diagnostics leverage(MCDInfo);
   ods select MCDCenter MCDCov Diagnostics;
   ods output diagnostics=Diagnostics(where=(leverage=1));
run;
 
proc print data=Diagnostics; 
var Obs Mahalanobis RobustDist Leverage;
run;

The robust estimates of location and scatter are the same, as are the robust distances. The "leverage" variable is an indicator variable that tells you which observations are far from the center of the explanatory variables. They are multivariate "outliers" in the the space of the X variables, although they are not necessarily outliers for the response (Y) variable.

Post a Comment

Random number seeds: Only the first seed matters!

The other day I encountered the following SAS DATA step for generating three normally distributed variables. Study it, and see if you can discover what is unnecessary (and misleading!) about this program:

data points;
drop i;
do i=1 to 10;
   x=rannor(34343);
   y=rannor(12345);
   z=rannor(54321);
   output;
end;
run;

The program creates the POINTS data set. The data set contains three variables, each containing random numbers from the standard normal distribution. I'm guessing that the author of the program thinks that using rannor(12345) to define the y variable makes y independent from the x variable, which is defined by rannor(34343).

Sorry, but that is not correct.

The x, y, and z variables are, indeed, independent samples from a normal distribution, but that fact does not depend on using different seeds in the RANNOR function. In fact, in this DATA step, all random number seeds except the first one are completely ignored! Don't believe me? Run the following DATA step and compare the two data sets, as follows:

data points2;
drop i;
/* change all random number seeds except the first */
x=rannor(34343); y=rannor(1); z=rannor(2); output;
do i=2 to 10;
   x=rannor(10+i);
   y=rannor(100+i);
   z=rannor(1000+i);
   output;
end;
run;
 
proc compare base=points compare=points2;
run;
                           The COMPARE Procedure
                Comparison of WORK.POINTS with WORK.POINTS2
                               (Method=EXACT)
                               
NOTE: No unequal values were found. All values compared are exactly equal.

All values compared are exactly equal. Every observation, every variable, down to the last bit. But except for the first observation of the x variable, the second DATA step uses completely different random number seeds! How can the POINTS2 data set be identical to the POINTS data set?

As I explained in a previous post on random number seeds in SAS, the random number seed for a DATA step (or SAS/IML program) is set by the first call. SAS ignores subsequent seeds within the same DATA step or PROC step. In my previous post, I used the newer (and better) STREAMINIT function and the RAND function instead of the older RANNOR function, but the fact remains that first random number seed determines the random number stream for the entire DATA step.

For further details, see the SAS documentation, which shows an example similar to mine in which three data sets (imaginatively named A, B, and C) contain the same pseudorandom numbers.

Now that I've ranted against using different random number seeds, I will reveal that the DATA step at the beginning of my post is from an example in the SAS Knowledge Base! Yes, even experienced SAS programmers are sometimes confused by the subtleties of random number streams. There is nothing wrong with a program that uses multiple seeds, but such a program makes the reader think that all those seeds are actually doing something. They’re not.

Are you someone who uses different random number seeds for each variable in the same DATA step or PROC IML program? If so, you can safely stop. Multiple seeds do not make your random variables any more "random." Only the first seed matters.
Post a Comment

Detecting outliers in SAS: Part 2: Estimating scale

In a previous blog post on robust estimation of location, I worked through some of the examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. I showed that SAS/IML software and PROC UNIVARIATE both support the robust estimators of location that are mentioned in the paper. Today's post looks at the robust estimators of scale that are mentioned in the same paper and works through more examples in the paper. The paper uses the following five measurements, which contain one outlier:
6.25, 6.27, 6.28, 6.34, 63.1

Robust scale statistics in SAS/IML software

SAS/IML software contains several functions for robust estimation of scale. For estimating scale, the MAD function is often used. The MAD statistic is an acronym for "median of all absolute deviations from the median." The MAD statistic is often multiplied by a constant in order to make it unbiased for data that are normally distributed. The constant is 1.483, but you don't need to remember that value because the MAD function has the "NMAD" option that automatically includes the multiplication factor, as shown in the following example:

proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
mad = mad(x, "NMAD"); 
print mad;

Rousseeuw and Hubert briefly mention two other robust measures of scale: the Qn estimator (Rousseeuw and Croux, JASA, 1993) and the interquartile range (IQR), which is well-known from the Tukey box plot. You can compute both of these estimators in SAS/IML software, as follow:

Qn = mad(x, "QN"); 
call qntl(q, x, {0.25 0.75}); /* compute 25th and 75th percentile */
IQR = q[2] - q[1];
print Qn IQR;

The three robust estimates of scale are similar. They range from 0.04 (MAD) to 0.07 (IQR). The IQR is sometimes divided by 1.349 in order to estimate the scale of normally distributed data. If you divide 0.07 by 1.349, you get 0.052, which make the estimates even more similar.

The connection with outlier detection

All this discussion of robust estimation of location and scale is closely related to detecting outliers. In practice, outliers are often detected using a rule or formula. The classical rule is to compute z-scores, which are just the normalized values zi = (xi - x̄)/s, where is the sample mean and s is the sample standard deviation. An outlier is defined as any observation for which |zi| exceeds some cutoff value, typically 2.5 or 3.

This rule fails when there is a large outlier in the data. For example, the following SAS/IML statements compute the classical z-scores for the Rousseeuw and Hubert example:

/* rules to detect outliers */
z = (x - mean(x)) / std(x);
print z;

Because the mean and standard deviation are both influenced by the outlier, no observation has a large z-score, and therefore none is flagged as an outlier. However, using robust estimators in the z-score formula does successfully identify the outlier, as shown in the following statements:

zRobust = (x - median(x)) / mad(x, "NMAD");
print zRobust;

The outlier has a HUGE "robust score." Of course, you don't have to print out the scores and inspect them. The following SAS/IML statements use the LOC function (the most useful function that you've never heard of!) to find all of the data for which the robust z-score exceeds 2.5, and prints only the outliers:

outIdx = loc(abs(zRobust)>2.5);
if ncol(outIdx)>0 then 
   outliers = x[outIdx];
else 
   outliers = .;
print outliers;

Robust Estimates in the UNIVARIATE Procedure

The UNIVARIATE procedure also supports robust estimates of scale. The ROBUSTSCALE option on the PROC UNIVARIATE statement computes the robust estimates in the Rousseeuw and Hubert article, as well as others. The documentation for the UNIVARIATE procedure includes a section that describes the robust estimates of scale. The following example computes robust estimates of scale:

data a;
input x @@;
datalines;
6.25 6.27 6.28 6.34 63.1
;
run;
 
proc univariate data=a robustscale;
   var x;
   ods select RobustScale;
run;

Notice that the output from PROC UNIVARIATE includes two columns. The first column is an unadjusted robust estimate. The second column estimates the standard deviation for normally distributed data, which can be derived from the first column.

Post a Comment