Focusing your Text Mining with Search Queries


Recently, I have been thinking about how search can play a bigger part in discovery and exploration with SAS Text Miner. Unsupervised text discovery usually begins with a look at the frequent or highly weighted terms in the collection, perhaps includes some edits to the synonym and stop lists, and then performs a topic or cluster analysis of the entire collection. Search is often an afterthought in the analysis. But search can be a very useful technique, particularly if your data has no additional variables to train with or profile against. So, here are some things to consider the next time you're working on understanding the content of your collection. I hope you will add any suggestions or best practices that you have discovered as a comment to this post.

If you think creatively, there are a number of ways to form a query and identify a result set in SAS Text Miner. Some are obvious and some are not. I list three of those ways here and demonstrate some aspects of them on VAERS (Vaccine Adverse Event Reporting System) data:

  1. Interactive and Automated Search in the Text Filter Node
    This is the obvious one. The Text Filter node of SAS Text Miner provides an interface to query based on a term or set of terms, along with the ability to negate a term or match with a wildcard character. The query can be performed in the interactive window, shown in the figure below, or it can be automated with the node's Search Expression property. If you save the results of the search, the Text Filter node will output only the documents that match the query.

    Query and Query Results in the Text Filter Node's interactive window
  2. Pattern-Based Search in the Text Parsing Node
    The Text Parsing node, in combination with concept categorization rules, contains a powerful mechanism to identify patterns in strings rather than merely the strings themselves. These patterns can help identify when something like a medical diagnosis code occurs in the text or whenever two terms of interest such as “hand” and “rash” appear close to one another. While this approach does not directly create a subset of your documents that matches the pattern, it does add a feature to your term list that can be used to restrict the collection to those observations that match the pattern. Some code that would be helpful in processing your data is found at the end of this SAS Global Forum paper.
  3. Search as User Topics in the Text Topic Node
    The Text Topic node has a terrific feature, the User Topics property, that allows you to provide a set of terms to create user-defined topics. This set of terms can be thought of as a query, and the documents that belong to the topic are your query results. The output data from this node is not a strict subset of your documents, as it is with the Text Filter node. Instead, the node adds a new indicator variable to your documents that shows which documents match this “query”. You can create several of your own queries simultaneously with this method.
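To make the difference between the three query styles concrete, here is a minimal Python sketch. It is not how SAS Text Miner implements any of these nodes; the toy documents, the helper names (`matches_query`, `near`), the `*` wildcard handling, and the five-token window are all assumptions made up for illustration.

```python
import re
from fnmatch import fnmatchcase

# Toy documents standing in for report text (invented for this sketch).
docs = [
    "patient reported a rash on the left hand after vaccination",
    "severe headache and fever, no rash observed",
    "pain in the arm at the injection site",
    "rash spreading across the back; hands were swollen",
]

def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_query(tokens, include=(), exclude=()):
    """True if every include pattern matches some token and no exclude
    pattern matches any token. Patterns may use * as a wildcard."""
    def hit(pattern):
        return any(fnmatchcase(tok, pattern) for tok in tokens)
    return all(hit(p) for p in include) and not any(hit(p) for p in exclude)

def near(tokens, a, b, window=5):
    """True if terms a and b occur within `window` tokens of each other."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

tokenized = [tokenize(d) for d in docs]

# 1. Filter-style query: the result is a strict subset of the documents.
#    Here: documents that mention "rash" but not "fever".
filter_subset = [
    i for i, toks in enumerate(tokenized)
    if matches_query(toks, include=["rash"], exclude=["fever"])
]

# 2. Pattern-style query: the result is a new feature on every document.
#    Here: "hand" and "rash" appearing close to one another.
hand_rash_flag = [near(toks, "hand", "rash") for toks in tokenized]

# 3. User-topic-style query: the result is an indicator variable that
#    marks which documents contain any term from a hypothetical term set.
skin_terms = {"rash", "swollen", "skin"}
skin_flag = [int(any(tok in skin_terms for tok in toks)) for toks in tokenized]

print(filter_subset)   # -> [0, 3]
print(hand_rash_flag)  # -> [True, False, False, False]
print(skin_flag)       # -> [1, 1, 0, 1]
```

The first query drops documents, while the second and third keep every document and add a column to it. That is why the latter two can be combined or run side by side before you decide how to subset.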

 

Creating User Topics in the Text Topic Node

Now that you have the query results, how can you leverage them?
With any of the query approaches above, a second analysis can be run on the subset that makes up the retrieved documents. In essence, the query gives your exploration of the text a particular point of view by restricting the analysis to this subset. The observed relationships and patterns may be more meaningful and useful because the query has been used to focus on a particular aspect of the text. You may find it useful to consider how the clusters and topics change when you move from the whole collection to the retrieved subset. With multiple queries, a separate Text Topic node can be run for each query and the topics can be compared. Shown in the two tables below are the results of the two topic runs, one for the documents related to pain and the other for the documents related to skin. Note that I dropped the terms that composed the pain and skin queries. While this example mainly demonstrates the process, you can see the similarities and differences between the two sets of topics. With your data, this could provide valuable insight and help you understand the content.
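The shape of that comparison can be sketched in a few lines of Python. This is only a rough stand-in: simple term counts take the place of an actual Text Topic run, and the documents, stop list, query terms, and the `subset_term_counts` helper are all hypothetical.

```python
import re
from collections import Counter

# Toy documents (invented for this sketch).
docs = [
    "severe pain and swelling in the arm after injection",
    "itchy rash and redness on the arm",
    "pain in the leg with fever and chills",
    "skin rash with swelling and fever",
]

# A tiny hypothetical stop list.
STOP = {"and", "in", "the", "on", "with", "after"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP]

def subset_term_counts(docs, query_terms):
    """Restrict to documents containing any query term, drop the query
    terms themselves, and count the remaining terms."""
    counts = Counter()
    for doc in docs:
        toks = tokenize(doc)
        if query_terms & set(toks):
            counts.update(t for t in toks if t not in query_terms)
    return counts

pain_counts = subset_term_counts(docs, {"pain"})
skin_counts = subset_term_counts(docs, {"skin", "rash"})

# Terms that show up under both points of view, and under only one.
shared = set(pain_counts) & set(skin_counts)
pain_only = set(pain_counts) - set(skin_counts)
skin_only = set(skin_counts) - set(pain_counts)

print(sorted(shared))     # -> ['arm', 'fever', 'swelling']
print(sorted(pain_only))  # -> ['chills', 'injection', 'leg', 'severe']
print(sorted(skin_only))  # -> ['itchy', 'redness']
```

Dropping the query terms matters because every document in a subset contains at least one of them, so they would otherwise dominate the comparison without telling you anything new.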

An automatic topic run done on the documents that matched the user-defined "Pain" topic.

 

An automatic topic run done on the documents that matched the user-defined "Skin" topic.

Certainly there are other things that can be done. For one, the Text Profile node could be used to compare these subsets. Are there other approaches that you use or would like to see that incorporate search? I would love to hear from you!

 

 


About Author

Russ Albright

Principal Research Statistician Developer

Russ Albright is a research statistician developer at SAS. Over the past 15 years, his text mining work has involved algorithm development, statistical modeling, and high performance computing. He holds a PhD in Applied Mathematics from Clemson University and enjoys helping customers use SAS software on their text analytics problems.

5 Comments

  1. This is great. I am just starting to move my analytics work into unstructured data analysis. I've heard of libraries, and I'm wondering whether the available libraries can fit every text data analysis.

    • Russ Albright

      You really do have to dive in. Part of the fun of this kind of analysis is learning something new and unexpected.
