SAS® Text Miner 14.1: Faster!


New versions of SAS® Text Miner and SAS® High-Performance Text Mining are now available, and I want to demonstrate some of the performance improvements in this release. I'll use a topic analysis that discovers the main themes in a document collection. It consists of the following main steps:

  1. Parsing the observations (with complex natural language processing that includes stemming, part-of-speech tagging and noun group discovery).
  2. Summarizing the result into a weighted, sparse term-by-document frequency table.
  3. Factoring that table with the Singular Value Decomposition (SVD).
  4. Optimally rotating the SVD dimensions to produce 25 topics.
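
For readers who want to see the shape of these four steps outside of SAS, here is a minimal NumPy sketch. It is not how SAS Text Miner implements them, and several pieces are assumptions made for illustration: the tokenizer is a plain regular expression (no stemming, part-of-speech tagging or noun group discovery), the weighting is a simple log-frequency times inverse-document-frequency scheme, the table is dense rather than sparse, and varimax stands in for the rotation, which this post does not name. The function names `tokenize`, `term_by_document`, `varimax` and `topics` are hypothetical.

```python
import re
import numpy as np


def tokenize(text):
    # Step 1 stand-in: lowercase word tokens only. Real parsing would add
    # stemming, part-of-speech tagging and noun group discovery.
    return re.findall(r"[a-z']+", text.lower())


def term_by_document(docs):
    # Step 2: build a weighted term-by-document table.
    # Dense for brevity; at millions of documents this must be sparse.
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in tokenize(d):
            counts[index[t], j] += 1
    # Assumed weighting: log(1 + frequency) times an IDF-style term weight.
    doc_freq = (counts > 0).sum(axis=1)
    idf = np.log(len(docs) / doc_freq) + 1.0
    return vocab, np.log1p(counts) * idf[:, None]


def varimax(loadings, max_iter=100, tol=1e-8):
    # Step 4 stand-in: orthogonal varimax rotation of the SVD term loadings.
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ (rotated ** 3 - rotated @ col_ss / p))
        rotation = u @ vt
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break  # the criterion has stopped improving
        objective = new_objective
    return loadings @ rotation, rotation


def topics(docs, k=2, top_n=3):
    vocab, table = term_by_document(docs)
    # Step 3: factor the table with the SVD and keep k dimensions.
    u, s, _ = np.linalg.svd(table, full_matrices=False)
    rotated, _ = varimax(u[:, :k] * s[:k])
    # Describe each topic by its highest-magnitude terms.
    return [
        [vocab[i] for i in np.argsort(-np.abs(rotated[:, c]))[:top_n]]
        for c in range(k)
    ]


if __name__ == "__main__":
    comments = [
        "great service and fast shipping",
        "shipping was slow",
        "great product quality",
        "poor quality product",
    ]
    for number, terms in enumerate(topics(comments, k=2), start=1):
        print(number, terms)
```

The analysis in this post asks for 25 topics, which corresponds to `k=25` in the sketch; the toy input above only has enough text for two.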

The analysis is run on data sets of short customer comments, with the number of observations ranging from 1 million to 4 million. All of the SAS® Text Miner runs are done on the same 2.90 GHz Intel Xeon 5-2677 system, which is restricted to 8 threads and 8 GB of memory. SAS® High-Performance Text Mining runs on a grid with 144 nodes, each with two 2.7 GHz Intel Xeon E5-2680 CPUs and 256 GB of memory.


The Long and Short of It!

In the graph below, the timing of the topic calculation in SAS Text Miner 13.2 is shown in red and SAS Text Miner 14.1 in green.

You can see from the graph that the SAS® Text Miner 14.1 run time is roughly two-thirds of the run time of the previous release. Four million documents were analyzed in roughly an hour. Most of the speedup in SAS® Text Miner 14.1 comes from using multiple threads for several parts of the computation.

SAS® Text Miner and SAS® High-Performance Text Mining run times

Scaling Out!

In the figure above, do you see the blue line hovering near the x-axis? Next to the green and red SAS® Text Miner runs on a single machine, it is barely noticeable. That blue line represents the timing for the same analysis with our high-performance product in this new release. This is where we truly get the benefits of scaling out across a grid of machines. The analysis of the largest data set, 4 million documents, took less than four minutes on the grid.
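
To put those numbers side by side, here is a small back-of-the-envelope calculation. It only uses the approximate figures quoted in this post ("roughly an hour", "less than four minutes", "roughly two-thirds"), so the results are rough bounds rather than measurements. The helper names `speedup` and `throughput` are made up for this sketch.

```python
def speedup(baseline_seconds, new_seconds):
    # How many times faster the new run is than the baseline.
    return baseline_seconds / new_seconds


def throughput(documents, seconds):
    # Documents processed per second.
    return documents / seconds


DOCS = 4_000_000
single_machine_s = 60 * 60  # "roughly an hour" for 14.1 on one machine (approximate)
grid_s = 4 * 60             # "less than four minutes" on the grid (upper bound)

# Grid versus single machine on the 4 million document data set.
print(speedup(single_machine_s, grid_s))            # 15.0
print(round(throughput(DOCS, single_machine_s)))    # 1111
print(round(throughput(DOCS, grid_s)))              # 16667

# 14.1 versus 13.2 on one machine: two-thirds of the run time.
print(speedup(1.0, 2.0 / 3.0))
```

Because the grid time is an upper bound, the grid is at least about 15 times faster than the single machine on this data set, and the last line shows that two-thirds of the run time corresponds to a speedup of about 1.5.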

Get More Done!

I think we usually focus on the idea that these speed improvements will let us tackle larger and larger problems. While this is certainly true, it is not the only benefit. These improvements also enable a more exhaustive exploration of the model space on a problem of the same size: the time saved can go toward searching for better solutions. So I am ready for larger problems and better models in SAS® Text Miner 14.1!



About Author

Russ Albright

Principal Research Statistician Developer

Russ Albright is a research statistician developer at SAS. Over the past 15 years, his text mining work has involved algorithm development, statistical modeling, and high performance computing. He holds a PhD in Applied Mathematics from Clemson University and enjoys helping customers use SAS software on their text analytics problems.
