This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount overlap between clusters.
This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and how to generate a random sample from a univariate mixture distribution.
A review of mixture distributions
The graph at the top of this article shows 100 random observation from a Gaussian mixture distribution. A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.
For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher's famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data. The means of the three component distributions are μ1 = {50, 34, 15, 2}, μ2 = {59, 28, 43, 13}, and μ3 = {66, 30, 56, 20}. The covariance matrices for the component distributions are shown later.
Simulate a Gaussian mixture in SAS
The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:
- Generate a random draw from the multinomial distribution with probability vector π. This gives the number of observations to sample from each component.
- For each component, simulate from a multivariate normal distribution.
A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, p. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.
proc iml; /* Simulate from mixture of K mulitvariate normal distributions in dimension d. OUTPUT variables: X : a SampleSize x d matrix of simulated observations ID : a column vector that contains the component from which each obs is drawn INPUT variables: pi : 1 x K vector of mixing probabilities, sum(pi)=1 mu : K x d matrix whose i_th row contains the mean vector for the i_th component Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component */ start SimMVNMixture(X, ID, /* output arguments */ SampleSize, pi, mu, Cov); /* input arguments */ K = ncol(pi); /* number of components */ d = ncol(mu); /* number of variables */ X = j(SampleSize, d); /* output: each row is observation */ ID = j(SampleSize, 1); /* ID variable */ N = RandMultinomial(1, SampleSize, pi); /* vector of samples sizes for components */ b = 1; /* b = beginning index for group i */ do i = 1 to K; e = b + N[i] - 1; /* e = ending index for group i */ ID[b:e] = i; /* set ID variable */ c = shape(Cov[i,], d, d); /* reshape Cov to square matrix */ X[b:e, ] = RandNormal(N[i], mu[i,], c); /* simulate i_th MVN sample */ b = e + 1; /* next group starts at this index */ end; finish; |
The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,..., NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.
After the module is defined, you can call it for a specific set of parameters. Assume you want to generate K=3 clusters of four-dimensional data (d=4). The following statements specify the mixing probabilities for a three-component model. The mu matrix is a K x d matrix whose rows are the mean vectors for the components. The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components. The following statements generate a total of 100 observations from the three-component mixture distribution:
/* specify input args; means/cov correspond to sashelp.iris data for species */ pi = {0.35 0.5 0.15}; /* mixing probs for K=3 groups */ mu = {50 34 15 2 , /* means of Group 1 */ 59 28 43 13 , /* means of Group 2 */ 66 30 56 20 }; /* means of Group 3 */ /* specify within-group covariances */ Cov = {12 10 2 1 10 14 1 1 2 1 3 1 1 1 1 1 , /* cov of Group 1 */ 27 9 18 6 9 10 8 4 18 8 22 7 6 4 7 4 , /* cov of Group 2 */ 40 9 30 5 9 10 7 5 30 7 30 5 5 5 5 8 }; /* cov of Group 3 */ /* run the simulation to generate 100 observations */ call randseed(12345); run SimMVNMixture(X, Group, 100, pi, mu, Cov); |
The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn. You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:
/* save simulated data to data set */ varNames="x1":"x4"; Y = Group || X; create Clusters from Y[c=("Group" || varNames)]; append from Y; close; quit; ods graphics / attrpriority=none; title "Multivariate Normal Components"; title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}"; proc sgscatter data=Clusters; compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12); run; |
The graph is shown at the top of this article.
Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.
In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMutinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.
7 Comments
Very interesting! Given a data cut, is there any way to detect how many clusters in it?
That is an important and difficult question. How well you can fit a Gaussian mixture model to data depends on the sample size, the distance between clusters, and how well you are able to bound and estimate the number of clusters. For some thought on these ideas for one-dimensional mixture distributions, see "The power of finite mixture models."
Maybe PCA can give you a little help to decide how many clusters should be .
proc princomp data=Clusters N=2 noprint out=ScorePlot(keep= Prin1 Prin2);
var x1-x4;
run;
ods graphics / width=800px height=800px;
proc sgplot data=ScorePlot aspect=1;
scatter x=Prin1 y=Prin2 / markerattrs=(symbol=CircleFilled);
refline 0 / axis=x;
refline 0 / axis=y;
run;
Thanks for this interesting topic, Rick!
I have a question when doing simulating using proc iml: what should I do if I want my sample to fall in a certain range, say 0-100?
I know I can use 'do until' if I simulate with data step. I'm just curious if there's a way to do the same job in proc iml, randnormal.
Thank you!
The SAS/IML language supports the DO UNTIL and DO WHILE statements. A common programming paradigm is to set FLAG=0 (which is false) and then
DO UNTIL(FLAG);
/* generate random numbers and set FLAG=1 if a condition is satisfied */
end;
If you are running an acceptance-rejection simulation, here are some efficiency tips.
thanks for the respose Rick!
I tried DO UNTIL loop at first place, but it took a very long time to run.
suppose I'm going to create a 5X10 multivariate matrix with range [0,100], I used the code below and it took SAS almost forever to run:
proc iml;
call randseed(123);
do until (X[,]>=0 & X[,]<=100);
X = RandNormal(1, mean, cov);
end;
Am I using the right code? would you please advice?
Thank you.
Probably not. As I said, depending on the values for MEAN and COV, there is no reason to expect a random variate to be in that range.
You can ask questions at the SAS Support Communities. There is a dedicated community for IML.