When I talk with the more analytically savvy users of SAS® Text Miner or SAS® Contextual Analysis, I am inevitably asked two questions: why does SAS take a completely different approach to topic generation than everybody else, and why should they trust that approach?
These are good questions. I first addressed them back in 2010 in a three-part series of blog posts titled The Whats, Whys, and Wherefores of Topic Management.
In that series, I talked about how generating a matrix decomposition, the singular value decomposition (SVD), of a term-by-document matrix could place both documents and terms as points in a multidimensional space. In this space, the closeness of any two points reflects how similar those particular documents and/or terms are to each other. Then, by rotating the axes in that space so that terms align with these axes, one brings to light interpretable topics. A document might line up well with a few of those topics, meaning it is “about” those topics. And the terms that are strongly aligned with those new axes give a semantic interpretation to those topics.
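To make the geometry concrete, here is a minimal sketch in Python with NumPy. It is not the SAS implementation. The toy counts, the choice of two dimensions, and the use of a varimax criterion for the rotation are all assumptions made for illustration; the description above only says that the axes are rotated so that terms align with them.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Return an orthogonal rotation that pushes each row of `loadings`
    toward a single axis (varimax criterion; an assumed choice here)."""
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    objective = 0.0
    for _ in range(max_iter):
        previous = objective
        rotated = loadings @ rotation
        col_sq = (rotated ** 2).sum(axis=0)
        # Gradient-like target of the usual SVD-based varimax iteration.
        target = rotated ** 3 - rotated * (col_sq / n_rows)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        objective = s.sum()
        if previous != 0.0 and objective / previous < 1.0 + tol:
            break
    return rotation

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["goal", "match", "team", "vote", "party", "election"]
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 1, 2, 2],
], dtype=float)

k = 2  # number of dimensions (topics) to keep; an arbitrary choice
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
term_coords = U[:, :k] * s[:k]   # each term as a point in k-dim space
doc_coords = Vt[:k].T * s[:k]    # each document as a point in the same space

rotation = varimax(term_coords)
term_topics = term_coords @ rotation  # terms expressed on the rotated axes
doc_topics = doc_coords @ rotation    # documents on the same rotated axes

for term, row in zip(terms, term_topics):
    print(f"{term:10s}", np.round(row, 2))
```

Because the rotation is orthogonal, it changes only the axes and not the distances between points, so the similarity structure produced by the SVD is preserved. In practice the SVD would be applied to weighted counts rather than raw ones.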
This method is very similar to factor analysis, developed back in the early 1900s to uncover latent aspects of something, for example the different kinds of intelligence a person might possess based on answers to questions on an IQ test. Factor analysis has remained widely used ever since. For example, the Myers-Briggs personality inventory places an individual on four different personality dimensions based on that person's answers to its questions.
At any rate, when we first decided to create topics, back in 2008, we compared the topics generated by this “rotated SVD” approach to those created by latent Dirichlet allocation (LDA), which was initially developed in 2003 and is the approach “everyone else uses.”
A term-by-document matrix stores, in each “cell,” the number of times a given term occurs in a given document. An SVD assumes that the values it works with follow a normal (bell curve) distribution, whereas LDA models the frequencies directly. Advantage: LDA.
However, it turns out that we don’t actually apply the SVD to the counts directly. We apply it to counts that have been weighted, typically using what is known as a tf-idf weighting. In most cases, we multiply the log of the number of times a term occurs in a document (the tf part) by a term weight calculated as the inverse of its frequency in the document collection (the idf part). This actually ends up mapping to a distribution that is close to a bell curve in practice, and it evens out the overall weight of all terms when viewed across an entire document collection. If you’re familiar with principal components analysis, the result is similar to subtracting out the mean and dividing by the standard deviation of each variable in a set of variables.
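As a rough illustration of that weighting, here is a small Python sketch. The exact formulas are assumptions: it uses log2(1 + count) for the tf part and log2(N / df) + 1 for the idf part, where N is the number of documents and df is the number of documents containing the term. The function name and toy counts are hypothetical, and the weighting used inside SAS may differ in detail.

```python
import math

def tfidf_weight(counts):
    """Weight a term-by-document count matrix.

    counts[i][j] is the raw count of term i in document j.
    Assumed formulas: tf = log2(1 + count), idf = log2(N / df) + 1.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        doc_freq = sum(1 for c in row if c > 0)  # documents containing the term
        idf = math.log2(n_docs / doc_freq) + 1.0 if doc_freq else 0.0
        weighted.append([math.log2(1 + c) * idf for c in row])
    return weighted

# Hypothetical counts: term 0 is common (in every document), term 1 is rare.
toy_counts = [
    [3, 2, 4, 1],
    [5, 0, 0, 0],
]
toy_weighted = tfidf_weight(toy_counts)
for row in toy_weighted:
    print([round(w, 2) for w in row])
```

The log dampens large raw counts, and the idf factor boosts terms that appear in few documents, so a rare term ends up with more weight than a common term that has a similar count.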
We tested our approach in 2008 by creating some artificial data that had a known topic structure, and determined that the rotated SVD approach was able to generate topics much closer to that known structure than LDA. There was no natural way in LDA to do term weighting on the raw frequencies. Once the frequencies are weighted, they are no longer frequencies, and the math behind LDA no longer applies. Furthermore, the rotated SVD approach is much faster than LDA, and LDA can generate different results every time you run it. So it was a no-brainer to use the rotated SVD.
Since 2008, though, the world has changed. Nowadays, if someone even mentions topic modeling, it is just assumed that they are using LDA. So it is natural to wonder why SAS doesn’t. Furthermore, LDA has been improved in the last seven years. Notably, most people using LDA today use a “burstiness” model, which tries to incorporate these term frequency weightings to generate better results.
So it is time for us to revisit the topic of topics: How does rotated SVD compare to these more modern LDA approaches? Is it still superior, or does LDA with burstiness and other innovations leave our approach gathering dust in the woodshed?
And now that we've reviewed the history, that is the topic for part 2 of this series. Stay tuned. We have done the comparisons, and the results may surprise you.