Topic modeling of documents is hot in the research community. Conferences are filled with different ways of determining topic models and how to apply them. The prestigious data mining conference KDD has in recent years had entire sessions on topic modeling. The leading algorithms all have three-letter acronyms and sound very complicated and sophisticated:
- NMF (Non-negative Matrix Factorization).
- CTM (Correlated Topic Model), and, of course, the current darling of the machine learning community,
- LDA (Latent Dirichlet Allocation, which, uhhh, is not to be confused with Linear Discriminant Analysis, which has the same acronym).

Generally, topic modeling means inferring the “latent” (or unobserved) topics represented in your document collection.
Interestingly, all of these algorithms work with the same definition of what a topic is: a topic is a set of terms, each with an associated weight. Determining whether a document contains a topic differs slightly among the procedures, but it is generally based on a weighted sum of the topic’s terms that occur in that document, normalized in some way by document length, so that long documents do not automatically contain many more topics than short ones. Furthermore, taking the top-weighted terms for a topic is a way to label the topic and give it a semantic interpretation.
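The scoring and labeling rules described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation; the topic weights and the document are hypothetical, and the length normalization shown (dividing by token count) is just one of the possible normalizations the text alludes to.

```python
from collections import Counter

def topic_score(doc_tokens, topic_weights):
    """Weighted sum of the topic's terms that occur in the document,
    normalized here by document length (one possible normalization)."""
    counts = Counter(doc_tokens)
    raw = sum(w * counts[term] for term, w in topic_weights.items())
    return raw / max(len(doc_tokens), 1)

def topic_label(topic_weights, k=3):
    """Label a topic by its top-k weighted terms."""
    top = sorted(topic_weights, key=topic_weights.get, reverse=True)[:k]
    return ", ".join(top)

# Hypothetical topic: a set of terms, each with an associated weight.
topic = {"bank": 0.9, "loan": 0.7, "interest": 0.5}
doc = "the bank approved the loan at a low interest rate".split()
print(topic_score(doc, topic))  # weighted sum / document length
print(topic_label(topic))       # "bank, loan, interest"
```

Note that only the terms actually carrying weight in the topic contribute to the score; every other term in the document is ignored.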
In Part One of this series, I mentioned the Singular Value Decomposition (SVD), which, when applied to a matrix of all terms by all documents, can determine an N-dimensional metric space that retains the maximum information possible for that dimensionality. Since its beginning, SAS Text Miner has computed an SVD on this term-document matrix as input to spatial clustering procedures and to predictive modeling, and it has worked very well for this purpose.
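As a concrete sketch of the idea, here is a truncated SVD of a toy term-by-document count matrix using NumPy. The matrix and the choice of N = 2 are made up for illustration; the point is only that each document gets coordinates in a reduced N-dimensional space.

```python
import numpy as np

# Toy term-by-document count matrix (rows = terms, columns = documents).
# Hypothetical counts, chosen only to illustrate the mechanics.
A = np.array([
    [2, 0, 1, 0],   # "bank"
    [1, 0, 2, 0],   # "loan"
    [0, 3, 0, 1],   # "river"
    [0, 1, 0, 2],   # "water"
], dtype=float)

# Full SVD, then truncate to the top N singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
N = 2

# Each document's coordinates in the reduced N-dimensional space:
doc_coords = np.diag(s[:N]) @ Vt[:N, :]
print(doc_coords.shape)  # (2, 4): N coordinates for each of the 4 documents
```

Keeping only the top N singular values is what retains the maximum information possible for that dimensionality.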
Interestingly, an SVD dimension is composed of a collection of terms with associated weights, just as with topic models. People keep asking us: what does an SVD dimension mean? Our answer has always been: nothing, by itself. It really just represents variance and does not necessarily have any semantic interpretation. This answer always seems unsatisfying to the customers who ask. In Part One, I discussed arranging correspondence in a room so that like items are near like items. The SVD is like that: the overall directions in the room are irrelevant.
Second, when computing a value for a document on an SVD dimension, there are weights for every term in the entire collection on every dimension. Thus, if you have a 50-dimensional space and, say, 20,000 terms in your collection, you have to retain a million weights to score every one of your new documents. Topics, on the other hand, are usually set up to have weights on only a subset of the terms in your collection. If you have fifty topics, and on average only two hundred terms have weights for a topic, you only need to retain ten thousand weights, only one percent of the total.
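The arithmetic behind that one-percent figure is worth making explicit. Using the numbers from the paragraph above:

```python
# Dense SVD scoring: a weight for every term on every dimension.
terms = 20_000
dims = 50
dense_weights = terms * dims            # 20,000 x 50 = 1,000,000

# Sparse topics: weights on only a subset of terms per topic.
topics = 50
avg_terms_per_topic = 200
sparse_weights = topics * avg_terms_per_topic  # 50 x 200 = 10,000

print(dense_weights)                    # 1000000
print(sparse_weights)                   # 10000
print(sparse_weights / dense_weights)   # 0.01, i.e. one percent
```

That hundredfold reduction is what makes sparse topics so much cheaper to retain and apply at scoring time.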
To cut past the mumbo-jumbo, we applied a variant of a 50-year-old technique to fix the two problems above, so that the SVD could be used for topic modeling. This technique was first discovered in the context of IQ tests. That fascinating story will be told in Part Three. And in Part Four, I will explain how this 50-year-old technique beat the pants off all these new-fangled approaches with the fancy acronyms that have been so evangelized in the last several years. Stay tuned.