Recently, I have been thinking about how search can play more of a part in discovery and exploration with SAS Text Miner. Unsupervised text discovery usually begins with a look at the frequent or highly weighted terms in the collection, perhaps includes some edits to the synonym and stop lists, and then performs a topic or cluster analysis of the entire collection. Search is sometimes more of an afterthought in the analysis. But search can become a very useful technique, particularly if you have no additional variables on your data to train with or profile. So, here are some things to consider the next time you're working on understanding the content of your collection. I hope you can add some suggestions or best practices that you have discovered as a comment to this post.
If you think creatively, there are a number of ways to form a query and identify a result set in SAS Text Miner. Some are obvious and some are not. I list three of those ways here and demonstrate some aspects on VAERS data:
- Interactive and Automated Search in the Filter Node
This is the obvious one. The Text Filter node of SAS Text Miner provides an interface for querying on a term or set of terms, with the ability to negate a term or match with a wildcard character. The query can be run in the interactive window, shown to the right, or it can be automated with the node's Search Expression property. If you save the results of the search, the Text Filter node outputs only the documents that match the query.
- Pattern-Based Search in the Text Parsing Node
The Text Parsing node, in combination with concept categorization rules, contains a powerful mechanism to identify patterns in strings rather than merely the strings themselves. These patterns can help identify when something like a medical diagnosis code occurs in the text or whenever two terms of interest such as “hand” and “rash” appear close to one another. While this approach does not directly create a subset of your documents that matches the pattern, it does add a feature to your term list that can be used to restrict the collection to those observations that match the pattern. Some code that would be helpful in processing your data is found at the end of this SAS Global Forum paper.
- Search as User Topics in the Text Topic Node
The Text Topic node has a terrific feature, the User Topics property, that allows you to provide a set of terms to create user-defined topics. This set of terms can be thought of as a query, and the documents that belong to the topic are your query results. Unlike the Text Filter node, this node does not output a strict subset of your documents. Instead, it adds a new variable to your documents that indicates which of them match this “query”. You can create several of your own queries simultaneously with this method.
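To make the differences between these three query styles concrete, here is a minimal sketch in plain Python. It is not how SAS Text Miner implements any of the nodes, and none of the function names are SAS APIs; `term_query`, `near`, and `topic_flags` are hypothetical helpers, and the four documents are invented toy text rather than VAERS records. The sketch also simplifies user topics to a yes/no match, ignoring any term weights.

```python
import re
from fnmatch import fnmatchcase

# Toy stand-in for a document collection (invented text, not real VAERS data).
DOCS = [
    "patient reported rash on left hand after injection",
    "severe pain in arm and mild fever",
    "hand swelling noted; no rash observed elsewhere on the skin",
    "headache and joint pain lasting two days",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_query(docs, include, exclude=()):
    """Filter-style query: return only the documents that contain every
    `include` pattern and none of the `exclude` patterns.
    Patterns may use `*` as a wildcard (for example, "swell*")."""
    hits = []
    for doc in docs:
        tokens = tokenize(doc)

        def has(pattern):
            return any(fnmatchcase(tok, pattern) for tok in tokens)

        if all(has(p) for p in include) and not any(has(p) for p in exclude):
            hits.append(doc)
    return hits

def near(doc, a, b, window=3):
    """Pattern-style feature: True when terms `a` and `b` occur within
    `window` token positions of each other in the document."""
    tokens = tokenize(doc)
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def topic_flags(docs, topics):
    """User-topic-style query: keep every document, and attach one boolean
    per topic saying whether the document contains any of the topic's terms."""
    return [
        {name: bool(set(tokenize(doc)) & terms) for name, terms in topics.items()}
        for doc in docs
    ]

# 1. Filter-style: the result is a strict subset of the collection.
pain_without_fever = term_query(DOCS, include=["pain"], exclude=["fever"])

# 2. Pattern-style: the result is a new feature on each document.
hand_rash = [near(doc, "hand", "rash") for doc in DOCS]

# 3. Topic-style: several "queries" at once, expressed as flag variables.
flags = topic_flags(DOCS, {
    "pain": {"pain", "headache"},
    "skin": {"rash", "skin", "swelling"},
})
```

Notice that only the first style drops documents. The other two leave the collection intact and add a variable, which you would then use to select the matching observations for a follow-up analysis.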
Now that you have the query results, how can you leverage them?
With any of the query approaches above, a second analysis can be run on the subset of retrieved documents. In essence, the query gives your exploration of the text a particular point of view by restricting the analysis to this subset. The observed relationships and patterns may be more meaningful and useful because the query focuses the analysis on a particular aspect of the text. You may find it useful to consider how the clusters and topics change when you move from the whole collection to the retrieved subset. With multiple queries, a separate Text Topic node can be run for each query and the topics can be compared. Shown in the two tables below are the results of the two topic runs, one for the documents related to pain and the other for documents related to skin. Note that I dropped the terms that composed the pain and skin queries. While this example is mainly for demonstrating the process, you can see the similarities and differences between these two topics. With your data, this could provide valuable insight and help you understand the content.
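The comparison idea can be sketched outside of SAS as well. The snippet below is only an illustration under loud assumptions: it ranks terms by how many documents they appear in, which is a crude proxy and not the topic extraction that the Text Topic node performs. The two subsets are invented toy text, the stop list is made up for the example, and `top_terms` is a hypothetical helper. As in the runs described above, the query terms themselves are dropped before ranking.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not a SAS stop list).
STOP = {"a", "after", "and", "in", "near", "no", "of", "on", "the"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(docs, drop=(), k=5):
    """Return the k terms that appear in the most documents, skipping stop
    words and the `drop` terms (the terms that defined the query).
    Ties are broken alphabetically so the result is deterministic."""
    counts = Counter(
        term
        for doc in docs
        for term in set(tokenize(doc))  # count each term once per document
        if term not in STOP and term not in drop
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Invented query results standing in for the "pain" and "skin" subsets.
pain_docs = [
    "severe pain in arm and mild fever",
    "headache and joint pain lasting two days",
    "arm pain and fever after injection",
]
skin_docs = [
    "rash on left hand after injection",
    "skin rash and mild fever",
    "itchy skin near injection site",
]

pain_top = top_terms(pain_docs, drop={"pain"}, k=2)
skin_top = top_terms(skin_docs, drop={"skin", "rash"}, k=2)

shared = set(pain_top) & set(skin_top)      # terms prominent in both subsets
only_pain = set(pain_top) - set(skin_top)   # terms prominent only for pain
only_skin = set(skin_top) - set(pain_top)   # terms prominent only for skin

print("pain:", pain_top)
print("skin:", skin_top)
print("shared:", sorted(shared))
```

Dropping the query terms matters here for the same reason it does in the topic runs: by construction they appear in every retrieved document, so they would dominate each list without telling you anything new about the subset.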
Certainly there are other things that can be done. For one, the Text Profile node could be used to compare these subsets. Are there other approaches that you use or would like to see that incorporate search? I would love to hear from you!