Testing data for multivariate normality

I've blogged several times about multivariate normality, including how to generate random values from a multivariate normal distribution. But given a set of multivariate data, how can you determine if it is likely to have come from a multivariate normal distribution?

The answer, of course, is to run a goodness-of-fit (GOF) test to compare properties of the data with theoretical properties of the multivariate normal (MVN) distribution. For univariate data, I've written about the usefulness of the quantile-quantile (Q-Q) plot to model the distribution of data, and it turns out that there is a similar plot that you can use to assess multivariate normality. There are also analytic GOF tests that can be used.

To see how these methods work in SAS, we need data. Use the RANDNORMAL function in SAS/IML software to generate data that DOES come from a MVN distribution, and use any data that appears nonnormal to examine the alternative case. For this article, I'll simulate data that is uniformly distributed in each variable to serve as data that is obviously not normal. The following SAS/IML program simulates the data:

proc iml;
N = 100; /* 100 obs for each distribution */
call randseed(1234);
 
/* multivariate normal data */
mu = {1 2 3};
Sigma = {9 1 2,
       1 6 0,
       2 0 4 };
X = randnormal(N, mu, Sigma);
 
/* multivariate uniform data */
v = j(N, ncol(mu));         /* allocate Nx3 matrix*/
call randgen(v, "Uniform"); /* each var is U[0,1] */
v = sqrt(12)*(v - 1/2);     /* scale to mean 0 and unit variance */
U = mu + T(sqrt(vecdiag(Sigma))) # v; /* same mean and var as X */
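As a quick sanity check that is not part of the original program, you could compare the sample moments of the two data sets. The sketch below assumes that it runs in the same PROC IML session, after X and U are defined. The sample means and variances of both matrices should be close to mu and to the diagonal of Sigma. Because the columns of U are generated independently, the off-diagonal elements of its sample covariance should be near zero, unlike those of X.

```sas
/* sketch: compare sample moments of X and U (same PROC IML session) */
meanX = mean(X);   varX = var(X);   /* column means and variances of MVN data */
meanU = mean(U);   varU = var(U);   /* column means and variances of uniform data */
print meanX, meanU, varX, varU;

covX = cov(X);     /* off-diagonals should be near those of Sigma */
covU = cov(U);     /* off-diagonals should be near zero */
print covX, covU;
```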

A graphical test of multivariate normality

If you want a quick check to determine whether data "looks like" it came from a MVN distribution, create a plot of the squared Mahalanobis distances versus quantiles of the chi-square distribution with p degrees of freedom, where p is the number of variables in the data. (For our data, p=3.) As I mentioned in the article on detecting outliers in multivariate data, the squared Mahalanobis distance has an approximate chi-square distribution when the data are MVN. See the article "What is Mahalanobis distance?" for an explanation of Mahalanobis distance and its geometric interpretation.

I will use a SAS/IML function that computes Mahalanobis distances. You can insert the function definition into the program, or you can load the module from a SAS catalog if it was previously stored. The following program computes the Mahalanobis distance between the rows of X and the sample mean:

load module=Mahalanobis; /* or insert module definition here */
 
Mean = mean(X); /* compute sample mean and covariance */
Cov = cov(X);
md = mahalanobis(X, Mean, Cov);
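If the module has not been stored in a catalog, a definition along the following lines could be used instead of the LOAD statement. This is a minimal sketch and not necessarily the module that the article loads: the name MahalDist is hypothetical, the module handles only a single center (a row vector), and it assumes that the covariance matrix is positive definite.

```sas
/* sketch of a Mahalanobis distance module (hypothetical name MahalDist) */
start MahalDist(x, center, cov);
   v = x - center;        /* center each row; center is a 1 x p row vector */
   w = solve(cov, v`);    /* solve cov*w = v` instead of forming inv(cov) */
   d2 = (v # w`)[ ,+];    /* row sums give the squared distances */
   return( sqrt(d2) );    /* N x 1 vector of distances */
finish;

md = MahalDist(X, Mean, Cov);   /* same role as the call shown above */
```

Solving the linear system rather than inverting the covariance matrix is a common numerical choice, but either approach gives the same distances for a well-conditioned matrix.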

For MVN data, the square of the Mahalanobis distance is asymptotically distributed as a chi-square with three degrees of freedom. (Note: for a large number of variables you need a very large sample size before the asymptotic chi-square behavior becomes evident.) To plot these quantities against each other, I use the same formula that PROC UNIVARIATE uses to construct its Q-Q plots, as follows:

md2 = md##2;
call sort(md2, 1); 
s = (T(1:N) - 0.375) / (N + 0.25);
chisqQuant = quantile("ChiSquare", s, ncol(X));

If you plot md2 versus chisqQuant, you get the graph on the left side of the following image. Because the points in the plot tend to fall along a straight line, the plot suggests that the data are distributed as MVN. In contrast, the plot on the right shows the same computations and plot for the uniformly distributed data. These points do not fall on a line, indicating that the data are probably not MVN. Because the samples contain a small number of points (100 for this example), you should not expect a "perfect fit" even if the data are truly distributed as MVN.
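One way to draw such a plot is to write the two vectors to a data set and call PROC SGPLOT. The following is a sketch that assumes it runs in the same PROC IML session, after md2 and chisqQuant are computed. The data set name QQ and the axis labels are my own choices for illustration.

```sas
/* sketch: write the Q-Q coordinates to a data set and plot them */
create QQ var {"chisqQuant" "md2"};   /* QQ is a hypothetical data set name */
append;
close QQ;

submit;
proc sgplot data=QQ;
   scatter x=chisqQuant y=md2;
   lineparm x=0 y=0 slope=1;          /* identity line for reference */
   xaxis label="Chi-square quantiles (3 df)";
   yaxis label="Squared Mahalanobis distance";
run;
endsubmit;
```

For MVN data the points should lie near the identity line. To plot the uniform data, repeat the distance and quantile computations with U in place of X before creating the data set.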

Goodness-of-fit tests for multivariate normality

Mardia's (1974) test is a popular GOF test for multivariate normality. Mardia (1970) proposed two tests that are based on definitions of multivariate skewness and kurtosis. (See von Eye and Bogat (2004) for an overview of this and other methods.) It is easy to implement these tests in the SAS/IML language.
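For reference, a minimal sketch of such an implementation might look like the following. It uses the standard large-sample forms of Mardia's statistics with the maximum likelihood covariance estimate (divisor n). The module name MardiaTests and the column labels are hypothetical, and the results may differ slightly from software that applies small-sample corrections. The sketch assumes that it runs in the same PROC IML session, where X and U are defined.

```sas
/* sketch: Mardia's skewness and kurtosis tests (hypothetical module name) */
start MardiaTests(X);
   n = nrow(X);   p = ncol(X);
   Xc = X - mean(X);               /* center the columns */
   S = (Xc` * Xc) / n;             /* MLE covariance (divisor n) */
   D = Xc * inv(S) * Xc`;          /* N x N matrix of generalized inner products */
   b1p = sum(D##3) / n##2;         /* multivariate skewness */
   b2p = sum(vecdiag(D)##2) / n;   /* multivariate kurtosis */
   skewStat = n * b1p / 6;         /* approx chi-square under MVN */
   dfSkew = p*(p+1)*(p+2) / 6;
   pSkew = 1 - cdf("ChiSquare", skewStat, dfSkew);
   kurtStat = (b2p - p*(p+2)) / sqrt(8*p*(p+2)/n);   /* approx N(0,1) under MVN */
   pKurt = 2 * (1 - cdf("Normal", abs(kurtStat)));
   return( b1p || skewStat || pSkew || b2p || kurtStat || pKurt );
finish;

names = {"Skewness" "SkewChiSq" "SkewPValue" "Kurtosis" "KurtZ" "KurtPValue"};
resultX = MardiaTests(X);    /* simulated MVN data */
resultU = MardiaTests(U);    /* simulated uniform data */
print resultX[colname=names], resultU[colname=names];
```

Small p-values for either statistic are evidence against multivariate normality. Because D is an N x N matrix, this direct approach is practical only for moderate sample sizes.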

In practice, you do not need to code the tests yourself, because SAS provides the %MULTNORM macro, which implements Mardia's tests. The macro also plots the squared Mahalanobis distances of the observations to the mean vector against quantiles of a chi-square distribution. (However, it uses the older GPLOT procedure instead of the newer SGPLOT procedure.) The macro requires either SAS/ETS software or SAS/IML software. The following statements define the macro and call it on the simulated MVN data:

/* write data from SAS/IML to SAS data set */
varNames = "x1":"x3";
create Normal from X[c=varNames]; append from X; close Normal;
quit; 
 
/* Tests for MV normality */
%inc "C:\path of macro\multnorm.sas";
%multnorm(data=Normal, var=x1 x2 x3, plot=MULT);

The macro generates several tables and graphs, most of which are not shown here. The test results indicate that there is no reason to reject the hypothesis that the sample comes from a multivariate normal distribution. In addition to Mardia's tests of skewness and kurtosis, the macro also performs univariate tests of normality on each variable and another test called the Henze-Zirkler test.

Another graphical tool: Plot of marginal distributions

To convince yourself that the simulated data are multivariate normal, it is a good idea to use the SGSCATTER procedure to create a plot of the univariate distribution for each variable and the bivariate distribution for each pair of variables. Alternatively, you can use the CORR procedure, as shown in the following statements. The CORR procedure can also produce the sample mean and sample covariance, but these tables are not shown here.

/* create scatter plot matrix of simulated data */
proc corr data=Normal COV plots(maxpoints=NONE)=matrix(histogram);
   var x:;
   ods select MatrixPlot;
run;
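The SGSCATTER alternative mentioned above could look like the following sketch, which assumes the Normal data set created earlier. The choice to overlay a normal density on each histogram is mine.

```sas
/* sketch: scatter plot matrix with histograms on the diagonal */
proc sgscatter data=Normal;
   matrix x1 x2 x3 / diagonal=(histogram normal);   /* overlay a normal curve */
run;
```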

The scatter plot matrix shows (on the diagonal) that each variable is approximately normally distributed. The off-diagonal elements show pairwise distributions that are consistent with bivariate normality. This is characteristic of multivariate normal data: all marginal distributions are also normal. (This explains why the %MULTNORM macro includes univariate tests of normality in its test results.) The converse does not hold in general, because data can have normal marginals without being MVN, but nonnormal marginals rule out multivariate normality. Consequently, the scatter plot matrix is a useful graphical tool for investigating multivariate normality.

tags: Data Analysis, Statistical Programming

6 Comments

  1. bedasa mekonnon
    Posted November 22, 2012 at 6:59 am | Permalink

    Dear sir

    would you send syntax for Multivariate normality data test in SAS?

    Thank you
    bedasamoke@gmail.com

  2. stan
    Posted January 5, 2013 at 7:33 am | Permalink

    Dear Rick Wicklin, thanks for the interesting post. Some questions from a newbie in statistics.
    Outliers can affect the distribution of our data points (more precisely, residuals), but an extreme value may belong to the distribution's tail.
    1) Do we have to calculate residuals to check for multivariate normality (MVN), as we do in the univariate case, and submit them to the tests you mention here?
    2) Do we need to check MVN before testing for outliers, or vice versa?
    3) Does the number of variables determine the number of dimensions (d) in the MV normal distribution?

    Sincerely,
    Stan

    • Posted January 5, 2013 at 8:47 pm | Permalink

      There is no regression problem here. We have a set of N observations and d variables. We want to check whether these data are likely to have come from a MVN distribution.
      1) There is no response variable. The univariate analog is having N univariate data and asking whether a normal distribution fits the data. You can use PROC UNIVARIATE in the 1-D case.
      2) If there is an extreme outlier in the data, that will affect the sample mean and covariance, and correspondingly affect the fit. If some data point is obviously wrong (for example, a coding error or a known abnormality), you might want to exclude it before fitting the MVN distribution.
      3) Yes, the number of variables determines the dimension.

      • Stan
        Posted January 21, 2013 at 3:30 am | Permalink

        Dear Rick,

        thanks for your reply. Another short question: does the sentence "There is no regression problem here." mean that we have to consider raw data rather than residuals for testing a (general/mixed) linear model assumptions in cases where the independent variable(s) is(are) categorical rather than continuous (like here: http://stats.stackexchange.com/q/11887)?

        Thank you.

        • Posted January 21, 2013 at 9:50 am | Permalink

          No, my statement meant "In this article, I do not assume that there is a regression problem." For your application, apply the theory to the residuals. In other words, if you want to test whether a linear model captures all of the "signal" and leaves you with residuals that are MVN, then you would test the residuals.

