When is a correlation matrix not a correlation matrix?


This article is an excerpt from my forthcoming book Simulating Data with SAS.

Not every matrix with 1 on the diagonal and off-diagonal elements in the range [–1, 1] is a valid correlation matrix. A correlation matrix has a special property known as positive semidefiniteness. All correlation matrices are positive semidefinite (PSD), but not all estimates are guaranteed to have that property. For example, robust estimators and matrices of pairwise correlation coefficients are two situations in which an estimate might fail to be PSD.

A third situation can occur when a correlation matrix is estimated based on forecasts. For example, an analyst might conjecture that the correlations between certain currencies (such as the dollar, yen, and euro) will have certain values in the coming year:

  • the first and second currencies will have correlation R12 = 0.6.
  • the first and third currencies will have correlation R13 = 0.9.
  • the second and third currencies will have correlation R23 = 0.9.

Unfortunately, the resulting matrix of pairwise correlations is not positive semidefinite and therefore does not represent a valid correlation matrix. How can you tell? Positive semidefinite matrices always have nonnegative eigenvalues. As shown by the output of the following program, this matrix has a negative eigenvalue:

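proc iml;
/* conjectured pairwise correlations: R12=0.6, R13=0.9, R23=0.9 */
R = {1.0 0.6 0.9,
     0.6 1.0 0.9,
     0.9 0.9 1.0};
eigval = eigval(R);   /* eigenvalues of the symmetric matrix R */
print eigval;         /* the smallest eigenvalue is about -0.0077 */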

So there you have it: a matrix of correlations that is not a correlation matrix. Mathematically, the problem is that the various correlations between variables are not independent, which means that an analyst cannot choose pairwise correlations arbitrarily. If R is a correlation matrix, then the correlations must satisfy the condition det(R) ≥ 0. For a 3 x 3 matrix, this implies that the correlation coefficients satisfy the inequality:

(R12)² + (R13)² + (R23)² - 2 R12 R13 R23 ≤ 1

The set of (R12, R13, R23) triplets that satisfy the inequality forms a convex subset of the cube [–1, 1]³, as shown in the following image, which is from Rousseeuw and Molenberghs (The American Statistician, 1994).

If you substitute the values R12 = 0.6 and R13 = R23 = 0.9, the left-hand side is 0.36 + 0.81 + 0.81 - 2(0.6)(0.9)(0.9) = 1.008, which is greater than 1, so these three values do not satisfy the inequality. The triplet of pairwise correlations is outside of the convex region shown in the figure.
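The substitution can also be checked numerically. The following is a minimal SAS/IML sketch; the variable names lhs and detR are illustrative and do not come from the book. It uses the fact that, for a 3 x 3 matrix with unit diagonal, det(R) equals 1 minus the left-hand side of the inequality, so the two conditions are the same test.

```sas
proc iml;
R12 = 0.6;  R13 = 0.9;  R23 = 0.9;

/* left-hand side of the 3 x 3 inequality; valid triplets give lhs <= 1 */
lhs = R12##2 + R13##2 + R23##2 - 2*R12*R13*R23;

/* build R from the three correlations and compute det(R) = 1 - lhs */
R = (1   || R12 || R13) //
    (R12 || 1   || R23) //
    (R13 || R23 || 1  );
detR = det(R);

print lhs detR;
quit;
```

A value of lhs greater than 1 (equivalently, a negative determinant) indicates that the triplet lies outside the convex region.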

A matrix that is not positive semidefinite can cause problems in multivariate analyses and simulation studies. But what can you do about it? One solution is to try to find a valid correlation matrix that is closest (in some sense) to your estimate.

In my book, I provide SAS/IML functions that implement an algorithm due to Nick Higham that finds the closest correlation matrix by projecting the estimate onto the surface of the convex region. The algorithm works in arbitrary dimensions.
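To give a feel for the idea, here is a minimal SAS/IML sketch of alternating projections with Dykstra's correction, which is the general scheme that Higham's method follows. It is not the code from the book. The module names (ProjPSD, ProjUnitDiag, NearestCorr) and the arguments maxIter and tol are hypothetical choices made for this illustration.

```sas
proc iml;
/* Project a symmetric matrix onto the set of PSD matrices:
   set negative eigenvalues to zero and reassemble the matrix */
start ProjPSD(A);
   B = (A + A`) / 2;              /* guard against rounding asymmetry */
   call eigen(D, U, B);           /* B = U*diag(D)*U` */
   D = D <> 0;                    /* elementwise maximum with 0 */
   return( U * diag(D) * U` );
finish;

/* Project onto the set of matrices that have a unit diagonal */
start ProjUnitDiag(A);
   return( A - diag(vecdiag(A)) + I(nrow(A)) );
finish;

/* Alternate the two projections; dS is Dykstra's correction term */
start NearestCorr(A, maxIter, tol);
   Y    = A;
   dS   = j(nrow(A), ncol(A), 0);
   maxd = 1;
   do iter = 1 to maxIter while(maxd > tol);
      Yold = Y;
      Z    = Y - dS;              /* remove the previous correction */
      X    = ProjPSD(Z);
      dS   = X - Z;               /* update the correction */
      Y    = ProjUnitDiag(X);
      maxd = max(abs(Y - Yold));  /* largest change in any element */
   end;
   return( Y );
finish;

R = {1.0 0.6 0.9,
     0.6 1.0 0.9,
     0.9 0.9 1.0};
C = NearestCorr(R, 100, 1e-8);
eigC = eigval(C);
print C, eigC;
quit;
```

Because the result sits on the boundary of the convex region, its smallest eigenvalue is expected to be zero up to the convergence tolerance and rounding error.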

This is a good time to remind SAS users that by default PROC CORR computes pairwise correlations. If your variables contain missing values, the resulting matrix of correlations might not be PSD. If you intend to use the PROC CORR output for simulation or as input for a regression or multivariate analysis, be sure to specify the NOMISS option on the PROC CORR statement! This option excludes observations with missing values and always results in a positive semidefinite estimate of correlation.
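As a sketch of that advice, the following call uses the NOMISS option and writes the correlations to a data set that can then be checked in SAS/IML. The data set names MyData and CorrOut and the variables x1-x3 are hypothetical placeholders.

```sas
/* MyData, CorrOut, and x1-x3 are hypothetical names */
proc corr data=MyData nomiss outp=CorrOut noprint;
   var x1 x2 x3;
run;

proc iml;
/* read only the rows of the OUTP= data set that hold correlations */
use CorrOut where(_TYPE_="CORR");
read all var {"x1" "x2" "x3"} into R;
close CorrOut;

eigval = eigval(R);   /* all eigenvalues should be nonnegative */
print eigval;
quit;
```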


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

8 Comments

  1. A general approach to these problems is to use semidefinite optimization. The nearest correlation matrix estimation is one of the typical applications. The alternating projections might work well if the original matrix is nearly PSD, but I think it will be slow to converge otherwise. Also, the semidefinite optimization approach allows one to use different distance functions without changing the algorithm.

    Support for this is planned for the future in SAS/OR.
