Sampling from the multivariate normal distribution

SAS/IML software is often used for sampling and simulation studies. For simulating data from univariate distributions, the RANDSEED and RANDGEN subroutines suffice to sample from a wide range of distributions. (I use the terms "sampling from a distribution" and "simulating data from a distribution" interchangeably.)

For multivariate simulations, the IMLMLIB library contains a series of modules for sampling from multivariate distributions.

Included in this module library is the RANDNORMAL module which enables you to sample from a multivariate normal distribution with a given mean and covariance structure. You need to specify three parameters:

  • N: the number of observations that you want to sample
  • Mean: the mean of the distribution (This is the "population mean.")
  • Cov: the covariance structure of the distribution. (This is the "population covariance.") The covariance structure is specified by using a matrix whose (i, j)th element specifies the covariance between the ith and jth variables. If you want to sample from variables that are uncorrelated, use a diagonal covariance matrix. To sample from identical uncorrelated variables, use an identity matrix.

Sample from the Multivariate Normal Distribution

The following statements generate 1,000 random observations from a multivariate normal distribution with a specified mean and covariance structure. The first five observations are displayed.

proc iml;
/** specify the mean and covariance of the population **/
Mean = {1, 2, 3};
Cov = {3 2 1, 
       2 4 0,
       1 0 5};
NumSamples = 1000;
call randseed(4321);  /** set seed for the RandNormal module **/
X = RandNormal(NumSamples, Mean, Cov);
print (X[1:5,])[label="First 5 Obs: MV Normal"];

How can you verify that the computation is correct? That's an interesting question and a complete answer will have to wait for a future post. However, you can convince yourself that the observations probably came from the specified distribution by computing the sample mean and the sample covariance matrix. These sample statistics should be close to the parameters of the population

Computing the Mean of the Sample

The following statements compute the mean vector of the sample:

/** check the sample mean **/
SampleMean = X[:, ];
print SampleMean;

The sample mean is, indeed, close to the mean of the population. You should expect the sample mean to approach the population mean as the number of samples gets large. (Recall that, in the univariate case, the sample mean is distributed as σ2/N, where σ2 is the population variance. The same idea holds for the multivariate mean vector.)

Computing the Covariance of the Sample

You can compute the sample covariance matrix by using the following statements:

/** check the sample covariance **/
C = X - SampleMean;
SampleCov = C` * C / (NumSamples-1);
print SampleCov;

For readers who have upgraded to SAS/IML 9.22, the following statement is an alternative way to compute the covariance matrix:

/** SAS/IML 9.22 and beyond: call the COV function **/
SampleCov2 = cov(X);

The sample covariance matrix is also close to the covariance parameters for the population. The sample covariance matrix should approach the population covariance as the number of samples gets large. (The distribution of the sample covariance matrix is called the Wishart distribution.)

Sampling When You Have a Correlation Matrix

A correlation matrix is just a covariance matrix for data that have been standardized. Consequently, it is acceptable to use a correlation matrix as the third argument to the RANDNORMAL function. Suppose that S is a covariance matrix and R is the corresponding correlation matrix. If you use R as the third argument to the RANDNORMAL function, the resulting sample will have the same general shape as if you had used S, but the shape will be linearly scaled in certain directions.

If you have correlation matrix instead of covariance, and you also have a set of variances, you can transform the correlation matrix into a covariance matrix.

tags: Simulation


  1. Harry
    Posted October 23, 2012 at 6:19 am | Permalink

    How could I sampling from the bivariate exponetial distribution and bivariate weibull distribution?

    • Posted October 23, 2012 at 10:42 am | Permalink

      That's a difficult question, and there are several parameterizations for those distributions. I recommend the book by Mark Johnson (1987) Multivariate Statistical Simulation. In it, he shows how to simulate from the Morganstern distribution with correlated exponential marginals. Unfortunately, it only models weakly correlated variables. For the bivariate Weibull, start with X1 and X2 that are bivarate exponential. Then define Y1 = X1^{1/alpha) and Y2 = X2^{1/alpha). Y1 and Y2 are bivaraite Weibull and correlated, although I'm not sure how to express the correlation analytically in terms of the corr(X1,X2).

  2. Raphael Fraser
    Posted December 13, 2012 at 2:25 am | Permalink

    What is the most efficient way to create multiple data sets (in longitudinal format) from the multivariate normal distribution? This is especially useful for running simulations.

    • Posted December 13, 2012 at 8:37 am | Permalink

      Let's say you want to simulate k MVN samples, where each sample contains N observations. You want an ID variable with values 1-k that identifies each sample, and variables x1-x_p that contain the MVN data.

      X = RandNormal(N*k, mu, Cov);     /* draw k samples of size N */
      ID = shape( repeat(T(1:k), 1, N), N*k );
      varNames = "ID" || ("x1":("x"+strip(char(ncol(Cov)))));
      Y = ID || X;
      create MVN from Y[colname=varNames]; append from Y; close MVN;
  3. Jon Dickens
    Posted September 8, 2014 at 8:07 am | Permalink

    I need to generate multivariate random samples from a probability distribution to use in a SAS Training Course but we do not have access to PROC IML.

    We use either SAS Enterprise Guide 5.1 or SAS Enterprise Miner 12.1 for Statistical Data Analysis.

    While I fully appreciate the functionality of IML and the advantages of using a Matrix Approach, I would like to know if it is possible to create the above samples in SAS Enterprise Miner.

    Alternatively is there a magic macro using BASE SAS that emulates matrix methodology?

    • Posted September 8, 2014 at 8:25 am | Permalink

      You can use the SIMNORMAL procedure in SAS/STAT software, as described on p. 136 of my book Simulating Data with SAS.

  4. Thidarat
    Posted April 20, 2015 at 12:02 pm | Permalink

    If I have many variables how I can test these variables that their distributions are multivariate normal.

  5. mark anthony
    Posted April 27, 2016 at 10:37 am | Permalink

    greetings sir rick! i am a BS in Statistics student in Central Luzon State University, Philippines, i am currently working on my thesis. my problem is how to simulate multiple predictors (p=25) that are correlated and as well as they are multivariate normally distributed.

  6. mark anthony
    Posted April 27, 2016 at 10:41 am | Permalink

    does SAS/IML or SAS/STAT have procedure options for my problem sir. thanks in advance to your response sir

8 Trackbacks

  1. [...] are different for each column of x. As I've discussed previously, when you want correlated vectors you can sample from a multivariate normal distribution by using the RANDNORMAL module. Each of the resulting vectors is univariate normal and they are correlated with each other. You [...]

  2. [...] Sampling from the multivariate normal distribution: This article describes how to obtain correlated multivariate normal vectors in SAS. [...]

  3. [...] blogged several times about multivariate normality, including how to generate random values from a multivariate normal distribution. But given a set of multivariate data, how can you determine if it is likely to have come from a [...]

  4. [...] about how to simulate various quantities. Random integers? Check! Random univariate samples? Check! Random multivariate samples? [...]

  5. [...] statistical task varies, but one place where this problem occurs is in simulating multivariate normal data. I have previously written about why an estimated matrix of pairwise correlations is not always a [...]

  6. [...] have previously shown how to use the RANDNORMAL function in SAS/IML to simulate multivariate normal data. Now suppose that you want to generate 10 samples, where each sample contains five observations [...]

  7. […] written about how to generate a sample from a multivariate normal (MVN) distribution in SAS by using the RANDNORMAL function in SAS/IML software. Last week a SAS/IML programmer showed […]

  8. […] X and Y." Usually this means that he wants data generated from a multivariate distribution, such as the bivariate normal distribution. However, an alternative scenario is to generate X by according to some univariate distribution and […]

Post a Comment

Your email is never published nor shared. Required fields are marked *


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>