Don’t get me wrong. I have no doubts about the capabilities of our SAS products and solutions! But I wanted to get firsthand experience with our new solution for text analytics, SAS Contextual Analysis 14.1. And the result is very convincing!
But let’s start from the beginning.
Functions and capabilities of SAS® Contextual Analysis
If you take a look at the product description of SAS Contextual Analysis, you learn that you can use it to analyze large collections of text documents, identify sentiments, and create robust models to categorize and extract content. This allows you to automatically identify topics in your document collections and define categories and rules in natural language to assign documents to these categories.
The self-trial: Text analytics with my two SAS® Press books
To better understand the processes and the outcome of text analytics with SAS Contextual Analysis, I used a document collection that is close to my heart and that I know in great detail: the 59 chapters of my two SAS Press books, Data Preparation for Analytics Using SAS and Data Quality for Analytics Using SAS.
Sure, 59 documents hardly make a “big data problem,” and the SAS In-Memory Analytics engine can also deal with millions of documents. However, I was interested to see whether SAS Contextual Analysis could identify topics in my book chapters and which chapters it would combine into the same cluster, without using any a priori knowledge from me as the author.
Text analytics processing with SAS® Contextual Analysis
From a data mining point of view, this is a typical unsupervised analysis: only the data are presented to the analytic tool, and no additional information about segment assignment is available. SAS Contextual Analysis imports the data, one file per chapter, from a folder on my hard disk and runs through the entire process of text analytics:
- Document parsing and assigning the words to different entities (noun, verb, etc.).
- Synonym detection and the application of stop lists to remove redundant words like “the,” “and,” “of,” “with,” “we,” etc.
- The weighting of the terms and the identification of those terms that are important to define groups of documents.
- Automatic detection of underlying topics in the documents.
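To make the stop list and term weighting steps more concrete, here is a minimal sketch in plain Python. It is not how SAS Contextual Analysis works internally; it assumes a tiny hypothetical stop list and uses standard TF-IDF weighting as a stand-in, and the three sample "documents" are invented for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"the", "and", "of", "with", "we", "a", "in", "to", "is", "for"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented sample texts, not taken from the book chapters.
docs = [
    "Missing values and the imputation of missing values",
    "Simulation studies of missing values in predictive models",
    "The structure of the analytics data mart",
]
for doc_weights in tf_idf(docs):
    # Show the three highest-weighted terms per document.
    top = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print([term for term, _ in top])
```

Terms that occur in every document receive a weight of zero, which is the idea behind the weighting step: only terms that help to tell groups of documents apart count.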
It works! Eight clearly separated document clusters as a result
For better illustration, I used the weights of the automatically detected topics for each of the 59 documents to cluster them with SAS® Enterprise Miner™. Eight clusters were automatically detected, which are presented in the table below.
For better visualization, the chapters of the “Data Quality Book” are shown in green and the chapters of the “Data Preparation Book” are shown in yellow.
You can easily see how the chapters are grouped into clusters based on their content. Some clusters only contain chapters from one book:
- Cluster 1 contains those chapters from the Data Quality Book that deal with the topic of missing values.
- Cluster 7 contains the simulation studies that are described in chapters 15-23 of the Data Quality Book.
Some clusters contain chapters from both books:
- Cluster 8 contains chapters from the Data Preparation Book that deal with analytics data mart structures, together with Appendix E of the Data Quality Book, which is a summary of the content of these chapters. This is an impressive example of documents being grouped based on their content alone: chapter content that is considered to be “close” or “similar” is truly detected as such.
The differing numbers of documents per cluster also show that no fixed clustering scheme is used here, but that the document content defines how the groups are set up and how they are populated.
- Cluster 4 only contains a single chapter. This chapter is an introduction to a collection of case studies and obviously does not compare with other chapters in the books.
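The idea that the content, rather than a fixed scheme, decides the number and size of the clusters can be sketched in a few lines of Python. This is not the clustering method of SAS Enterprise Miner; it is a simple similarity-threshold grouping chosen for illustration, and the topic weights, the function name group_by_similarity, and the threshold of 0.9 are all made-up assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_similarity(vectors, threshold=0.9):
    """Single-linkage grouping: documents whose cosine similarity
    reaches the threshold end up in the same cluster."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up topic weights for five documents and three topics.
topic_weights = [
    [0.9, 0.1, 0.0],  # document 0
    [0.8, 0.2, 0.0],  # document 1
    [0.0, 0.1, 0.9],  # document 2
    [0.1, 0.0, 0.8],  # document 3
    [0.0, 1.0, 0.0],  # document 4
]
print(group_by_similarity(topic_weights))
```

With these invented weights, the last document is not similar enough to any other document and stays in a cluster of its own, much like the single introductory chapter in Cluster 4.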
Moving on to new business cases
These results convinced me even more that SAS Contextual Analysis allows you to gain insight into your document collections. You learn what your customers think and write about your company or organization. You see the topics that are contained in your documents and how you can automatically group them without having to read every single document.