Those of you who have seen the new version of SAS Text Miner know that we are transitioning from a “document clustering” approach towards a “topics in documents” approach. I have received a lot of questions about this, so I thought I would address some of them in a multi-part series on this blog. This is part One.
An obvious question is: How do document "topics" and "clusters" differ?
Well, thanks for asking! The simple answer is that documents are characterized by exactly one cluster, but they contain zero or more topics. Okay, let’s try an exercise: look around your room and take note of where everything is positioned. Now imagine that the room is totally empty. Assume you have a collection of documents, perhaps letters that you have received in the mail, and you can put each one anywhere in the room that you like, without worrying about gravity or anything else; they will always stay exactly where you place them. Furthermore, you want to arrange them so that similar documents are close to each other, while very different documents are far apart.
Now, the way you want to arrange the documents depends on the purpose of the arranging. One way is to set them up so that different parts of the room correspond to ideas characterizing each document in that part of the room. For example, all the documents in the left corner are letters to or from your sister Sally, all the documents in the upper right corner are bills that you need to pay, etc. This corresponds to a “Text Clustering” exercise. You are putting every document in a single category of interest. But things become complicated if you have letters that belong to more than one category. For example, if you are renting your home from your sister Sally and you receive a bill from her, where do you put that?
This leads to the alternative approach where you arrange the documents along “dimensions” or directions of interest. Assume that the room is rectangular, and one wall points directly North. In this case, there are three directions: East to West, North to South, and Floor to Ceiling. You might then decide to represent your documents by three dimensions, say: business correspondence goes North, personal correspondence goes East, and important correspondence goes near the floor (e.g., junk mail at the ceiling, a letter from the utility company threatening to cut off your water tomorrow at the bottom). So your rent bill from your sister Sally would go North-East near the bottom, while a personal letter from her might go South-East somewhere midway between floor and ceiling, since its importance is not all that relevant. This alternative approach corresponds to document topics.
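To make the contrast concrete, here is a minimal Python sketch of the room analogy. Everything in it is made up for illustration: the scores, the 0.6 cutoff, and the function names are assumptions, and this is not how SAS Text Miner computes clusters or topics. It only shows that the clustering view forces exactly one label per letter, while the topic view can return several labels, or none at all.

```python
# Hypothetical scores in [0, 1] along the three "room" directions from the analogy.
letters = {
    "rent bill from Sally": {"business": 0.9, "personal": 0.7, "importance": 0.8},
    "personal letter from Sally": {"business": 0.1, "personal": 0.9, "importance": 0.5},
    "junk mail flyer": {"business": 0.3, "personal": 0.1, "importance": 0.0},
}

THRESHOLD = 0.6  # assumed cutoff for saying a letter "contains" a topic


def assign_cluster(scores):
    """Clustering view: exactly one label, here the highest-scoring direction."""
    return max(scores, key=scores.get)


def assign_topics(scores, threshold=THRESHOLD):
    """Topic view: every direction whose score reaches the cutoff (possibly none)."""
    return [name for name, value in scores.items() if value >= threshold]


for title, scores in letters.items():
    print(title, "| cluster:", assign_cluster(scores), "| topics:", assign_topics(scores))
```

Notice that the junk mail still gets pushed into a cluster even though it does not really fit any of them, while the rent bill from Sally keeps all of its topics.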
Some of you may know that in SAS Text Miner, we have a way of semi-magically creating this space, which could have many more than the three dimensions in this toy example. It is called the Singular Value Decomposition (or SVD), and it has the mathematical property that, for a given number of dimensions N, it creates an N-dimensional space that is proven to lose as little information as possible when converting (or projecting) documents into this space. Furthermore, the SVD plays a prominent role both in document clustering (as in the original Text Miner Node) and in topic modeling (as in the Text Topic Node that is new in SAS Text Miner 4.2). But more about that in the next part in this series. Stay tuned. I'll post it sometime next week.
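For the curious, here is a rough sketch of the SVD idea using NumPy rather than SAS Text Miner. The tiny term-by-document count matrix, the term list, and the choice of two dimensions are all assumptions made for illustration, and the term weighting a real text mining tool would apply is left out. The sketch projects four documents into a 2-dimensional space and checks the "loses as little information as possible" property: the reconstruction error of the truncated SVD equals the size of the singular values that were dropped.

```python
import numpy as np

# Hypothetical term-by-document counts: rows are terms, columns are documents.
terms = ["rent", "invoice", "payment", "sally", "birthday", "love"]
counts = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [2, 0, 1, 0],
    [1, 2, 0, 0],
    [0, 2, 0, 0],
    [0, 1, 0, 1],
], dtype=float)

k = 2  # assumed number of dimensions to keep

# Thin SVD: counts = U @ diag(s) @ Vt, singular values in s sorted largest first.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Coordinates of each document in the k-dimensional space (one row per document).
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T

# Rank-k reconstruction and its Frobenius-norm error.
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
error = np.linalg.norm(counts - approx)

# No rank-k matrix can do better; the minimum error is set by the dropped singular values.
expected_error = np.sqrt(np.sum(s[k:] ** 2))

print(doc_coords.round(2))
print("reconstruction error:", round(error, 4), "| dropped singular values:", round(expected_error, 4))
```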