Part 3: Understanding Topic Discovery from an “historical” perspective

I mentioned last time that the technique we use to determine topics is a variant of something that has been around for fifty years. In this part I will talk about the intriguing history of this technique, and in the process, I hope to illuminate what we are doing and why.

We start in the mid-1800s with a half-cousin of Charles Darwin: Sir Francis Galton. Stimulated by his cousin’s work, Galton investigated how traits are passed down from generation to generation. He was among the first to attempt to apply the scientific method to psychological phenomena. In the process of his investigations, he created the concept of “correlation.”

Galton’s protégé and statistical “heir” was Karl Pearson. Pearson was the first person to establish “mathematical statistics.” Before him, statistics was primarily concerned with creating tabulations and summarizing counts. He established a good many of the basic statistical techniques still in use today, including the correlation coefficient (r), Chi-square tests, the notion of p values, the use of the normal distribution for modeling data, the method of moments, and together with Galton, linear regression.

Galton and Pearson introduced the use of questionnaires for gathering information on human capabilities, particularly mental capabilities. They were both active in the Eugenics movement: they believed that if one could identify the most intelligent people, those people could be encouraged to reproduce and those less intelligent discouraged (their work was unfortunately seized upon as a justification by the founders of the Nazi party in Germany, but I digress…).

Pearson struggled with the notion of how to use multiple items on a questionnaire to get at some general underlying factor of intelligence. He discovered that if you create a matrix of correlations between all the items on a questionnaire containing mental tasks, you can then “project” the results down to a single line that accounts for a significant amount of the total variance among all the answers in the questionnaire. The position that a given questionnaire is projected to on that line represents the intelligence of the person filling out that questionnaire. Pearson further determined that one could “project” any set of items down to any lower dimensionality in a way that “factored” the matrix. One real-life example of subspace “projecting” is how a camera projects the representation of our three-dimensional world into a two-dimensional photograph. Information is lost in that projection, but hopefully the most critical information is retained. The projection technique he developed, which we now call “Principal Components Analysis,” was published in “Philosophical Magazine” in 1901 (this paper is online at http://stat.smmu.edu.cn/history/pearson1901.pdf ).
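To make the projection idea concrete, here is a minimal sketch in Python with NumPy. It builds the correlation matrix of the questionnaire items, finds the single direction that captures the most variance, and places each respondent on that line. The function name `first_principal_component` and the simulated answers are assumptions made up for this illustration; they are not taken from Pearson's paper or from any real questionnaire.

```python
import numpy as np

def first_principal_component(responses):
    """Project each row (one questionnaire) onto the first principal
    component of the item correlation matrix."""
    X = np.asarray(responses, dtype=float)
    # Standardize each item so the analysis works on correlations
    # rather than on the raw scale of each question.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in
    # ascending order, so the last eigenvector is the leading direction.
    eigvals, eigvecs = np.linalg.eigh(corr)
    direction = eigvecs[:, -1]
    explained = eigvals[-1] / eigvals.sum()  # share of total variance
    scores = Z @ direction                   # position on the "single line"
    return scores, direction, explained

# Hypothetical data: 200 respondents, 5 items that all reflect one
# shared underlying ability plus item-specific noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
items = ability[:, None] + 0.5 * rng.normal(size=(200, 5))

scores, direction, explained = first_principal_component(items)
```

In this simulated setup the scores should track the hidden `ability` variable closely, though the sign of a principal component is arbitrary, so the line may come out "upside down."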

It did not take long for this concept to catch on. Charles Spearman refined this technique into what he called “factor analysis” in 1904, and the first widely-used intelligence test, the Binet-Simon scale, appeared in 1905 (revised in 1908); its American adaptation, the Stanford-Binet, followed in 1916. The Stanford-Binet has been updated over the years and is still in common use today. Over time, however, the notion of one, unitary concept of intelligence fell out of favor: perhaps there are multiple intelligences. For example, a person might have good mathematical skills but be poor verbally, or vice versa. But how do you get to the notion of what these separate intelligences are?

Let’s go back to the photograph example. Nothing is retained in the photograph about the orientation of the camera when the picture is taken. If the camera is tilted in some direction, then the photograph will not necessarily represent the same view that a person standing up would have of the same scene. But if I or somebody else views that photograph, since I have a general up-down left-right orientation in the world, I am usually able to rotate the photograph to restore that familiar orientation. We can say that these “up-down” and “left-right” dimensions are “latent” or implied dimensions or factors in the photograph.

Similarly, when trying to identify multiple intelligences from a principal components projection of, say, a set of responses to an IQ test, we are once again trying to extract the “latent” dimensions: those representing the different kinds of intelligence, which may not be known directly. The natural solution is to rotate the projection, just as we would rotate the photograph. Over the years, a variety of these “rotations” were developed to try to understand what these separate intelligences may be. If we assume that each separate question on our test primarily measures only one of these “types” of intelligence, then it turns out that the optimal rotation is the “Varimax” rotation, first proposed by Henry Kaiser in 1958. In the case of the IQ test, the Varimax rotation would rotate the axes to line up as closely as possible with the individual test questions.
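The rotation step can be sketched in a few lines. The code below uses a commonly seen SVD-based iteration for the Varimax criterion; it is an illustrative sketch, not necessarily the procedure Kaiser published. The loading values and the 30-degree "tilt" are made up for the example: we start from loadings where each item measures exactly one factor, mix the axes, and check that the rotation lines them back up with the items.

```python
import numpy as np

def varimax(loadings, max_iter=200, tol=1e-8):
    """Rotate an (items x factors) loading matrix under the Varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    L = np.asarray(loadings, dtype=float)
    n_items, n_factors = L.shape
    R = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # Gradient of the Varimax criterion with respect to the rotation.
        grad = L.T @ (LR**3 - LR * (LR**2).sum(axis=0) / n_items)
        # The orthogonal matrix nearest to the gradient is the next rotation.
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt
        current = s.sum()
        if current < previous * (1 + tol):
            break  # the criterion has stopped improving
        previous = current
    return L @ R, R

# Hypothetical loadings: each item measures exactly one of two factors.
simple = np.array([[0.9, 0.0], [0.6, 0.0], [0.3, 0.0],
                   [0.0, 0.8], [0.0, 0.5], [0.0, 0.4]])

# "Tilt the camera": mix the two factors with a 30-degree rotation.
theta = np.pi / 6
tilt = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
tilted = simple @ tilt

rotated, R = varimax(tilted)
```

After the rotation, each row of `rotated` should again have one sizable entry and one entry close to zero, although the factors may come back in a different order or with flipped signs. This iteration can converge slowly when all the loadings within a factor are nearly equal, which is one reason the example uses varied values.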

Our technique for creating topics from text is very similar to what is described above for discovering types of intelligence using an IQ test. We project the raw matrix that includes all the terms in the collection down to a much smaller dimensionality using a Singular Value Decomposition, which is closely related to a Principal Components Analysis of the covariances of the individual terms (the two coincide when each term’s counts are mean-centered). We then do a Varimax rotation to determine the “latent” topics represented in the documents. The topics are then represented by the axes of our rotated space, since they are lined up as closely as possible with specific terms in the document collection. Presto!
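Putting the pieces together, the whole recipe might look roughly like the sketch below. Everything in it is an assumption for illustration: the tiny document-by-term count matrix is invented, the helper names `varimax` and `discover_topics` are hypothetical, the SVD is applied to the raw counts without mean-centering, and the loadings are scaled by the singular values before rotation. A production system would differ in weighting, scale, and many other details.

```python
import numpy as np

def varimax(loadings, max_iter=200, tol=1e-8):
    """Orthogonal Varimax rotation via a common SVD-based iteration."""
    L = np.asarray(loadings, dtype=float)
    n_terms, n_topics = L.shape
    R = np.eye(n_topics)
    previous = 0.0
    for _ in range(max_iter):
        LR = L @ R
        grad = L.T @ (LR**3 - LR * (LR**2).sum(axis=0) / n_terms)
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt
        if s.sum() < previous * (1 + tol):
            break
        previous = s.sum()
    return L @ R

def discover_topics(doc_term, terms, n_topics, top_n=3):
    """Return the top_n terms for each rotated axis (one axis per topic)."""
    X = np.asarray(doc_term, dtype=float)
    # Thin SVD: rows of Vt are directions in term space, strongest first.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the leading directions; scale by singular values (assumption).
    loadings = Vt[:n_topics].T * s[:n_topics]   # shape: terms x topics
    rotated = varimax(loadings)
    topics = []
    for j in range(n_topics):
        # Signs are arbitrary, so rank terms by absolute loading.
        order = np.argsort(-np.abs(rotated[:, j]))[:top_n]
        topics.append([terms[i] for i in order])
    return topics

# Made-up corpus: rows are documents, columns are term counts.
terms = ["goal", "match", "team", "stock", "market", "price"]
doc_term = [
    [4, 2, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 1],
    [5, 2, 1, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 1, 4, 2, 1],
    [0, 0, 0, 2, 1, 2],
]

topics = discover_topics(doc_term, terms, n_topics=2)
```

Each entry of `topics` is the list of terms that load most heavily on one rotated axis. With this toy data one axis should gather the sports terms and the other the finance terms, in no guaranteed order.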

Jim

3 Comments

Faye Merrideth on May 13, 2010 11:27 am

I appreciated reading about the roots of topic discovery in text analytics, Jim. Thanks for making this mostly-easy to understand for your business-side readers.

JD on August 25, 2010 5:31 pm

So what about part 4?

Pingback: Top 6 reasons to start a company blog - Customer Analytics
