In my previous post, I blogged about how to sample from a finite mixture distribution. I showed how to simulate variables from populations that are composed of two or more subpopulations. Modeling a response variable as a mixture distribution is an active area of statistics, as judged by many talks on the topic at JSM 2011. When SAS 9.3 shipped, I promised to discuss my experiences with the new experimental FMM procedure, which fits finite mixture models. This article describes modeling univariate data as a mixture of normal distributions.
Initial thoughts and misconceptions
I confess: I don't know much about finite mixture models. I read some of the documentation and followed some of the examples, but I've barely begun to explore the capabilities of PROC FMM. (Thanks to my colleagues, Dave and John, for reviewing this article.)
The FMM procedure in SAS/STAT software is not a discrimination procedure. When you use the CANDISC and DISCRIM procedures, you must know the group membership for each observation in order to build a model that classifies the group membership of future observations. Similarly, most SAS regression procedures enable you to use a CLASS statement to build a model that accounts for group membership. The beauty of the FMM procedure is that you don't need this knowledge, and yet you can still build a model for response variables that are multi-modal. (You can, however, supply partial group information when you know it by using the PARTIAL option.) In my mind, I pretend that PROC FMM models the components as latent variables, since I'm familiar with that concept.
The way that the FMM procedure models a response variable reminds me of clustering procedures: You tell the FMM procedure that the response variable consists of k components, and the procedure tries to use the data to discover (and model) the group structure as part of building a regression model. This means that there are many parameters in this problem: there are the usual parameters for the regression effects, but there are also parameters that estimate the location, shape, and strength of each component.
A first example: Modeling the distribution of Scrabble® scores
As a first example, I want to see whether the FMM procedure can model a univariate distribution as a mixture of normal densities. The data are the scores for each word played in series of a Scrabble games between my mother and me.
As I wrote previously, the distribution of scores seems to have three components. Most scores are in the 12–15 point range, and correspond to standard plays. There is a small group of larger scores that correspond to triple word scores. For my mother's scores, there is a third group of small and negative scores that correspond to plays made at the end of a game. You can download the data for these Scrabble games.
For the games recorded here, my scores had few end-of-game penalties. Consequently, I assume that my scores fall into two groups. I use the FMM procedure with the K=2 option to model my scores as a mixture of two normal distributions. Notice that I specify the response variable in the MODEL statement, but do not specify any covariate effects, so this is a model for the mean of the SCORE variable:
proc fmm data=scrabble(where=(Player=2)); /* Rick's scores */ model Score = / k=2; run;
The analysis models the distribution of scores as a mixture of two normal densities. The first component has mean 12.7, which represents my average score on a "typical" word play. The variance of this distribution is 44.6, which corresponds to a standard deviation of 6.7 points per play. The second component has a mean of 37.3 and a variance of 8.8 (a standard deviation just under 3 points), and represents my triple word score plays.
The blue curve, which is almost not visible, is the overall density estimate. It is a weighted sum of the two component densities. The weights sum to unity, so there is one parameter, called the mixing parameter, that needs to be estimated by the FMM procedure in addition to the two location and scale parameters for each component. For these data, the mixing probability estimate is 2.86. This estimate is on a logit scale, which means it corresponds to a mixing probability of exp(2.86)/(exp(2.86) +exp(0)) = 0.946. This number is shown in the "Probability" column of the second table. Equivalently, the estimated PDF is 0.95*f1 + 0.05*f2, where f1 is the PDF for the first component and f2 is the PDF for the second component. These weights make sense because my mother always "steals" the triple word plays from me, which means that I have few opportunities to play a triple word score!
Now I'll try to estimate my mother's scores as a mixture of three normal densities by using the following call to the FMM procedure:
proc fmm data=scrabble(where=(Player=1)); /* Mom's scores */ model Score = / k=3; run;
Unfortunately, this approach doesn't work very well for my mother's data. The FMM procedure converges to a set of parameters that correspond to a local maximum of the likelihood function. This solution models the data as a mixture of three distributions that are practically identical in terms of location and scale.
Fortunately, you can give "hints" to the FMM procedure by specifying starting values for the parameter estimates. The following call specifies a guess for the location and scale parameters, based on my knowledge of the game of Scrabble:
proc fmm data=scrabble(where=(Player=1)); model Score = / k=3 parms(-5 4, 13 49, 33 9); /* means: -5, 13, 33 */ run;
This time, the FMM procedure converges to a model that satisfies my prior beliefs. The means of the three components are at -6.3, 12.6, and 27.4. The mixing probability estimates are shown in the "Probabilities" column. They are 0.04 and 0.82, which means that the third mixing probability estimate is 0.14. Notice that the triple-word-score component for my mother accounts for 14% of the probability, whereas it accounts for only 5% of my scores.
Conclusions and gotchas
The FMM procedure did a good job of finding intuitive components for Scrabble scores. For my scores, it found two normal components such that the density of scores is a weighted sum of the components. For my mother's scores, it found three components, although I had to supply a hint. Notice that the procedure needs to estimate eight parameters in order to solve the three-component problem!
There were three lessons I learned from this exercise. The first two are "gotchas" regarding the 9.3 syntax:
- Unlike some other SAS functions, the FMM procedure represents scale in terms of the variance. If you are used to specifying the scale parameter as a standard deviation in the RAND, PDF, and CDF functions and in the UNIVARIATE and SGPLOT procedures, this convention might cause momentary confusion.
- Unlike the HISTOGRAM statement in the UNIVARIATE procedure, in which you specify all of the means followed by all of the standard deviations, the syntax of the PARMS option in the FMM procedure requires that you specify the mean and variance of each component in sequence.
The third lesson concerns the limitations of the FMM procedure. If two components of the response distribution are close to each other, then it is unreasonable to expect the FMM procedure to be able to model them as separate components. This is similar to the fact that clustering procedures have difficulties finding poorly separated clusters. I'll discuss this issue further in a future article.