*This article is an excerpt from my forthcoming book* Simulating Data with SAS.

Not every matrix with 1 on the diagonal and off-diagonal elements in the range [–1, 1] is a valid correlation matrix. A correlation matrix has a special property known as positive semidefiniteness. All correlation matrices are positive semidefinite (PSD), but not all estimates are guaranteed to have that property. For example, robust estimators and matrices of pairwise correlation coefficients are two situations in which an estimate might fail to be PSD.

A third situation can occur when a correlation matrix is estimated based on forecasts. For example, an analyst might conjecture that the correlations between certain currencies (such as the dollar, yen, and euro) will have certain values in the coming year:

- the first and second currencies will have correlation *R*_{12} = 0.6.
- the first and third currencies will have correlation *R*_{13} = 0.9.
- the second and third currencies will have correlation *R*_{23} = 0.9.

Unfortunately, the resulting matrix of pairwise correlations is not positive semidefinite and therefore does not represent a valid correlation matrix. How can you tell? Positive semidefinite matrices always have nonnegative eigenvalues. As shown by the output of the following program, this matrix has a negative eigenvalue:

    proc iml;
    R = {1.0 0.6 0.9,
         0.6 1.0 0.9,
         0.9 0.9 1.0};
    eigval = eigval(R);
    print eigval;

So there you have it: a matrix of correlations that is not a correlation matrix.

Mathematically, the problem is that the various correlations between variables are not independent, which means that an analyst cannot choose pairwise correlations arbitrarily. If *R* is a correlation matrix, then the correlations must satisfy the condition det(*R*) ≥ 0. For a 3 x 3 matrix, this implies that the correlation coefficients satisfy the inequality:

$$R_{12}^{2} + R_{13}^{2} + R_{23}^{2} - 2 R_{12} R_{13} R_{23} \le 1$$

The set of (*R*_{12}, *R*_{13}, *R*_{23}) triplets that satisfy the inequality forms a convex subset of the unit cube, as shown in the following image, which is from Rousseeuw and Molenberghs (*TAS*, 1994).

If you substitute the values *R*_{12}=0.6 and *R*_{13} = *R*_{23} = 0.9, you discover that these three values do not satisfy the inequality. The triplet of pairwise correlations is outside of the convex region shown in the figure.
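As a quick check of the arithmetic, substituting the three forecast values into the left side of the inequality gives:

$$0.6^{2} + 0.9^{2} + 0.9^{2} - 2(0.6)(0.9)(0.9) = 0.36 + 0.81 + 0.81 - 0.972 = 1.008 > 1$$

Equivalently, det(*R*) = 1 - 1.008 = -0.008 < 0, which is consistent with the matrix having a negative eigenvalue.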

A correlation estimate that is not PSD can cause problems in multivariate analyses and simulation studies. But what can you do about it? One solution is to try to find a valid correlation matrix that is closest (in some sense) to your estimate.

In my book, I provide SAS/IML functions that implement an algorithm due to Nick Higham that finds the closest correlation matrix by projecting the estimate onto the surface of the convex region. The algorithm works in arbitrary dimensions.
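To give a feel for the idea, here is a minimal SAS/IML sketch of an alternating-projections iteration in the spirit of Higham's method. It is not the code from the book: the module names (ProjPSD, ProjUnitDiag, NearestCorrSketch), the iteration limit, and the tolerance are all hypothetical choices made for illustration, and the stopping rule is a simple one. The sketch alternates between projecting onto the set of PSD matrices (by zeroing negative eigenvalues) and projecting onto the set of symmetric matrices that have a unit diagonal, with a correction term carried between iterations.

```sas
proc iml;
/* Hypothetical helper: project a symmetric matrix onto the PSD cone
   by replacing negative eigenvalues with zero. */
start ProjPSD(A);
   call eigen(D, Q, A);            /* D = eigenvalues, Q = eigenvectors   */
   D = D <> 0;                     /* elementwise maximum with 0          */
   X = Q * diag(D) * T(Q);
   return( (X + T(X)) / 2 );       /* guard against rounding asymmetry    */
finish;

/* Hypothetical helper: keep the off-diagonal elements and
   reset the diagonal to 1. */
start ProjUnitDiag(A);
   return( A - diag(vecdiag(A)) + I(nrow(A)) );
finish;

/* Sketch of the alternating-projections loop with a correction term (dS).
   maxIter and tol are illustrative parameters, not values from the book. */
start NearestCorrSketch(A, maxIter, tol);
   Y = A;
   dS = j(nrow(A), ncol(A), 0);
   delta = 1;
   do k = 1 to maxIter while(delta > tol);
      Rk = Y - dS;                 /* apply the correction                */
      X = ProjPSD(Rk);             /* project onto PSD matrices           */
      dS = X - Rk;                 /* update the correction               */
      Ynew = ProjUnitDiag(X);      /* project onto unit-diagonal matrices */
      delta = max(abs(Ynew - Y));  /* largest change in any element       */
      Y = Ynew;
   end;
   return( Y );
finish;

/* Apply the sketch to the forecast matrix from this article */
R = {1.0 0.6 0.9,
     0.6 1.0 0.9,
     0.9 0.9 1.0};
C = NearestCorrSketch(R, 100, 1e-8);
ev = eigval(C);
print C, ev;
```

Because the last step of each iteration restores the unit diagonal, the smallest eigenvalue of the result can still be slightly negative, at the level of the tolerance. Check the printed eigenvalues before you use the matrix in a simulation.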

This is a good time to remind SAS users that by default **PROC CORR computes pairwise correlations**. If your variables contain missing values, the resulting matrix of correlations might not be PSD. If you intend to use the PROC CORR output for simulation or as input for a regression or multivariate analysis, **be sure to specify the NOMISS option** on the PROC CORR statement! This option excludes observations with missing values and always results in a positive semidefinite estimate of correlation.
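A minimal sketch of such a call follows. It assumes the Sashelp.Heart sample data set as a stand-in for your own data; the variable list and the output data set name (CorrOut) are illustrative choices, not part of the original discussion.

```sas
/* NOMISS: use only observations that are complete for ALL listed variables,
   so that every correlation is computed from the same set of rows. */
proc corr data=Sashelp.Heart nomiss noprint
          outp=CorrOut;            /* Pearson statistics written to CorrOut */
   var Height Weight Cholesterol Systolic Diastolic;
run;

/* The rows with _TYPE_='CORR' contain the correlation matrix */
proc print data=CorrOut(where=(_TYPE_='CORR')) noobs;
run;
```

The trade-off is that NOMISS discards every observation that has a missing value in any of the listed variables, so the estimate might be based on substantially fewer observations than the pairwise version.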

## One Comment

A general approach to these problems is to use semidefinite optimization. Estimating the nearest correlation matrix is one of its typical applications. The alternating projections method might work well if the original matrix is nearly PSD, but I think it will be slow to converge otherwise. Also, the semidefinite optimization approach allows one to use different distance functions without changing the algorithm.

Support for this is planned for the future in SAS/OR.

## 6 Trackbacks

[...] It is a mathematical fact that the constant correlation matrix that has all off-diagonal entries equal to ρ is positive semidefinite for all values of ρ in the range [0,1]. Notice that this correlation matrix is a special case of a Toeplitz correlation structure. For example, toeplitz({1 .2 .2 .2}) is a valid correlation matrix. This is good to keep in mind, because sometimes it is hard to generate a positive definite matrix [...]

[...] this problem occurs is in simulating multivariate normal data. I have previously written about why an estimated matrix of pairwise correlations is not always a valid correlation matrix. This article discusses what to do about it. The material in this article is taken from my [...]

[...] to problems in statistical computing. I have previously written about this phenomenon in my article "When is a correlation matrix not a correlation matrix." Specifically, consider the symmetric array whose elements are pairwise correlations between [...]

[…] matrix is also sometimes used in numerical computations that involve estimated covariance matrices. Sometimes an estimate is not positive definite, and ridging is one approach to try to obtain an estimate for a covariance matrix that has this […]

[…] is not easy to write down a valid correlation matrix. I've previously written about the fact that not every symmetric matrix with a unit diagonal is a correlation matrix. Correlation matrices are symmetric and positive definite (PD), which means that all the eigenvalues […]

[…] All positive numbers have square roots, and mathematicians, who love to generalize everything, have defined a class of matrices with properties that are reminiscent of positive numbers. They are called positive definite matrices, and they arise often in statistics because every covariance and correlation matrix is symmetric and positive definite (SPD). […]