Simulate multivariate normal data in SAS by using PROC SIMNORMAL

My article about Fisher's transformation of the Pearson correlation contained a simulation. The simulation uses the RANDNORMAL function in SAS/IML software to simulate multivariate normal data. If you are a SAS programmer who does not have access to SAS/IML software, you can use the SIMNORMAL procedure in SAS/STAT software to simulate data from a multivariate normal distribution.

The 'TYPE' of a SAS data set

Most SAS procedures read and analyze raw data. However, some SAS procedures read and write special data sets that represent a statistical summary of data. PROC SIMNORMAL can read a TYPE=CORR or TYPE=COV data set. Usually, these special data sets are created as an output data set from another procedure. For example, the following SAS statements compute the correlation between four variables from a sample of 50 Iris versicolor flowers:

proc corr data=sashelp.iris(where=(Species="Versicolor"))  /* input raw data */
          nomiss noprint outp=OutCorr;                     /* output statistics */
var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
proc print data=OutCorr; run;

Figure: SAS TYPE=CORR data set

The output data set contains summary statistics, including the means, standard deviations, and correlation matrix for the four variables in the analysis. PROC PRINT does not display the 'TYPE' attribute of this data set, but if you run PROC CONTENTS you will see a field labeled "Data Set Type," which has the value "CORR".
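
If you want to confirm the data set type yourself, the following sketch calls PROC CONTENTS on the OutCorr data set. The ODS SELECT statement is an optional convenience that restricts the output to the table of data set attributes, which is where the "Data Set Type" field appears.

```sas
/* display only the data set attributes for OutCorr */
ods select Attributes;
proc contents data=OutCorr;
run;
```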

You can also create a TYPE=CORR or TYPE=COV data set by using the DATA step as shown in the documentation for PROC SIMNORMAL.
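
As a minimal sketch of the DATA step approach, the following statements create a small TYPE=CORR data set by hand and pass it to PROC SIMNORMAL. The data set names (MyCorr, MySim), the variable names (X1, X2), and all numerical values are hypothetical and are chosen only for illustration. The TYPE= data set option marks the data set as a correlation data set; the _TYPE_ and _NAME_ variables identify the statistic in each row.

```sas
/* hypothetical example: two variables with correlation 0.5 */
data MyCorr(type=CORR);
   input _TYPE_ $ _NAME_ $ X1 X2;   /* a period is read as a missing (blank) _NAME_ */
   datalines;
MEAN . 0   10
STD  . 1   2
CORR X1 1.0 0.5
CORR X2 0.5 1.0
;

/* simulate a few observations from the distribution described by MyCorr */
proc simnormal data=MyCorr outsim=MySim numreal=5 seed=1;
   var X1 X2;
run;

proc print data=MySim; run;
```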

Use PROC SIMNORMAL to generate multivariate normal data

Recall that you can use the standard deviations and correlations to construct a covariance matrix: the covariance between the i_th and j_th variables is the product s_i*s_j*r_ij, where s_i and s_j are the standard deviations and r_ij is the correlation. When you call PROC SIMNORMAL, it internally constructs the covariance matrix from the information in the OutCorr data set and uses the mean vector and the covariance matrix to simulate multivariate normal data. The following call to PROC SIMNORMAL simulates 50 observations from a multivariate normal population. The DATA step combines the original and simulated data; the call to PROC SGSCATTER overlays the original and the simulated samples.

proc simnormal data=OutCorr outsim=SimMVN
               numreal = 50           /* number of realizations = size of sample */
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
/* combine the original data and the simulated data */
data Both;
set sashelp.iris(where=(Species="Versicolor")) /* original */
    SimMVN(in=sim);                            /* simulated */
Simulated = sim;
run;
 
ods graphics / attrpriority=none;   /* use different markers for each group */
title "Overlay of Original and Simulated MVN Data";
proc sgscatter data=Both;
   matrix PetalLength PetalWidth SepalLength SepalWidth / group=Simulated;
run;
ods graphics / attrpriority=color;  /* reset markers */

Figure: Overlay of original and simulated four-dimensional data

Notice that the original data are rounded whereas the simulated data are not. Except for that minor difference, the simulated data appear to be similar to the original data. Of course, the simulated data will resemble the original data only if the original data are approximately multivariate normal.
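
A graph is one way to compare the samples. As a rough numerical check, you can also compute the sample statistics of the simulated data and compare them with the values in the OutCorr data set. The following sketch assumes that the SimMVN data set from the previous step exists. Because the simulated sample is small (50 observations), expect the estimates to be close to, but not equal to, the original values.

```sas
/* sample means, standard deviations, and correlations of the simulated data */
proc corr data=SimMVN nomiss;
   var PetalLength PetalWidth SepalLength SepalWidth;
run;

/* the statistics that were used as parameters for the simulation */
proc print data=OutCorr; run;
```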

Simulate many samples from a multivariate normal distribution

The SIMNORMAL procedure supports the NUMREAL= option, which you can use to specify the size of the simulated sample. (NUMREAL stands for "number of realizations," which is the number of independent draws.) You can use this option to generate multiple samples from the same multivariate normal population. For example, suppose you are conducting a Monte Carlo study and you want to generate 100 samples of size N=50, each drawn from the same multivariate normal population. This is equivalent to drawing 50*100 observations where the first 50 observations represent the first sample, the next 50 observations represent the second sample, and so on. The following statements generate 50*100 observations and then construct an ID variable that identifies each sample:

%let N = 50;            /* sample size */
%let NumSamples = 100;  /* number of samples */
proc simnormal data=OutCorr outsim=SimMVN
               numreal = %sysevalf(&N*&NumSamples) 
               seed = 12345;          /* random number seed */
   var PetalLength PetalWidth SepalLength SepalWidth;
run;
 
data SimMVNAll;
set SimMVN;
ID = floor((_N_-1) / &N) + 1;   /* ID = 1,1,...,1, 2,2,...,2, etc */
run;
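
If you want to verify that the ID variable was constructed correctly, one possible check is to count the observations for each value of ID. The following sketch writes the counts to a data set (the name IDCounts is arbitrary) and summarizes them. For N=50 and NumSamples=100, PROC MEANS should report 100 distinct ID values, and the minimum and maximum counts should both equal 50.

```sas
/* sanity check: count the observations in each sample */
proc freq data=SimMVNAll noprint;
   tables ID / out=IDCounts;     /* COUNT = number of observations for each ID */
run;

proc means data=IDCounts n min max;
   var COUNT;
run;
```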

After adding the ID variable, you can efficiently analyze all samples by using a single call to a procedure. The procedure should use a BY statement to analyze each sample. For example, you could use PROC CORR with a BY ID statement to obtain a Monte Carlo estimate of the sampling distribution of the correlation for multivariate normal data.
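
The following statements are a minimal sketch of that idea for one pair of variables. The choice of PetalLength and PetalWidth, the data set names CorrOut and PearsonEst, and the variable name r are arbitrary choices made for this illustration. The PROC CORR step writes one set of statistics for each sample, the DATA step extracts the correlation estimate from each sample, and PROC MEANS summarizes the 100 estimates.

```sas
/* compute the Pearson correlation for each simulated sample */
proc corr data=SimMVNAll noprint outp=CorrOut;
   by ID;
   var PetalLength PetalWidth;
run;

/* keep one row per sample: the correlation between the two variables */
data PearsonEst;
   set CorrOut(where=(_TYPE_="CORR" and upcase(_NAME_)="PETALLENGTH"));
   r = PetalWidth;               /* corr(PetalLength, PetalWidth) for this sample */
   keep ID r;
run;

/* summarize the Monte Carlo distribution of the estimates */
proc means data=PearsonEst n mean std p5 p95;
   var r;
run;
```

The P5 and P95 statistics give a rough 90% interval for the correlation estimates in samples of size 50. You could also use PROC UNIVARIATE or PROC SGPLOT to draw a histogram of the r variable.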

In summary, although the SAS/IML language is the best tool for general multivariate simulation tasks, you can use the SIMNORMAL procedure in SAS/STAT software to simulate multivariate normal data. The key is to construct a TYPE=CORR or TYPE=COV data set, which is then processed by PROC SIMNORMAL.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
