Today’s natural language processing (NLP) systems can do some amazing things, including enabling the transformation of unstructured data into structured numerical and/or categorical data.
Why is this important? Because once the key information has been identified or a key pattern modeled, the newly created, structured data can be used in predictive models or visualized to explain events and trends in the world. In fact, one of the great benefits of working with unstructured data is that it is created directly by the people with the knowledge that is interesting to decision makers. Unstructured data directly reflects the interests, feelings, opinions and knowledge of customers, employees, patients, citizens, etc.
For example, if I am an automobile design engineer, and I have ready access to a good summary of what customers liked or didn’t like about the last year’s vehicle and competitor vehicles in a similar class, then I have a better chance of creating a superior and more popular design this year.
My previous article, “Behind the scenes in natural language processing: Is machine learning the answer?,” mentioned that the two most common approaches to NLP are rule-based (human-driven) systems and statistical (machine-driven or machine learning) systems. I began the discussion of rule-based systems by describing some benefits. But these systems also pose some challenges, which I will elaborate on here.
Challenges of rule-based systems:
- People – finding the right experts.
- Clarity – defining the goals of the system or model.
- Process – developing, testing and modifying the rules.
- Generalization – understanding and planning for limitations.
Let’s look at each of these.
People. Anticipating, correctly interpreting and understanding the particular types of challenges in language processing is critical to building an effective model or set of rules (terms used interchangeably here) for more complex language processing tasks. So the first challenge of implementing a rule-based system for NLP is finding the right experts and providing them with the right tools to build a set of rules that will be applied to existing and (potentially) new data, with high-quality results.
Finding an expert who can work in a technical system, but who is not afraid to read and analyze text for both meaning and structure, may seem daunting. However, the benefits of doing so will be worth the investment. I recommend linguists with NLP, corpus analysis or computational linguistics exposure, as well as data scientists with a text analysis focus. In other words, analysts who have analyzed text directly – not just applied prebuilt systems to text. Additionally, some technical writers or other linguistically oriented subject matter experts with experience in statistics or analytics are likely to be successful in building good models. The key skill this person brings is understanding how text data must be analyzed in order to get the results desired; this means using the right tools to build the most effective and efficient model.
The tools needed will vary based upon the task at hand and the business goals. This leads us into the second challenge.
Clarity. High-level text processing tasks that align with business needs include text mining (finding trends and patterns in text data), categorization or clustering (placing text data into groups), information extraction (finding specific types of information and pulling it from text data), and semantic processing (finding relationships between concepts in text data). SAS has a full suite of text analytics solutions that encompasses all of these tasks and easily feeds results into further predictive modeling and interactive visual analytics. It’s important to consider the goals of the system the linguistic rules will address so that the rules can be tailored to the specific business goals. Language variation makes modeling patterns difficult unless one can zero in on the patterns that matter for the given task.
For example, trying to classify customer calls into the categories “high risk” or “low risk” of losing the customer is quite different from understanding what customers thought of the latest product in their online review comments. For the former, it may be sufficient to have a list of key words or phrases to mark a call as high or low risk. For the latter, distinguishing sentences like those below will be necessary:
- I wanted to like my new XYZ dishwasher, but the controls were too confusing.
- The XYZ is my new best friend! I didn’t like replacing my old dishwasher, but now I am hooked on brand ABC.
- I would have liked the dishwasher better if it had a timer.
Each of these sentences has some form of the verb “like” and a mention of the product within a few words of each other, but they do not all convey positive sentiment. Also, the sentence where “like” is negated with “didn’t” is actually a positive review! SAS® Sentiment Analysis and SAS Contextual Analysis both provide the capability to create rules that are sensitive enough to make these types of distinctions.
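To make the problem concrete, here is a minimal Python sketch of a naive proximity rule: flag a review whenever a form of “like” appears within a few words of a product mention. The function name, word lists and window size are hypothetical choices for illustration only, not the rule syntax of any SAS product.

```python
import re

# The three review sentences discussed above.
REVIEWS = [
    "I wanted to like my new XYZ dishwasher, but the controls were too confusing.",
    "The XYZ is my new best friend! I didn't like replacing my old dishwasher, "
    "but now I am hooked on brand ABC.",
    "I would have liked the dishwasher better if it had a timer.",
]

# Hypothetical word lists, chosen only for this illustration.
LIKE_FORMS = {"like", "liked", "likes", "liking"}
PRODUCT_TERMS = {"xyz", "dishwasher"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_like_rule(text, window=5):
    """Return True if a form of 'like' occurs within `window` tokens of a product term."""
    tokens = tokenize(text)
    like_positions = [i for i, tok in enumerate(tokens) if tok in LIKE_FORMS]
    product_positions = [i for i, tok in enumerate(tokens) if tok in PRODUCT_TERMS]
    return any(
        abs(i - j) <= window for i in like_positions for j in product_positions
    )


for review in REVIEWS:
    print(naive_like_rule(review), "-", review)
```

The rule fires on all three sentences, even though only the second is a positive review. A usable rule would also need to account for context such as “wanted to,” “didn’t” and “would have ... if,” and for what the verb is actually applied to.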
The types of text inputs are also important to consider. What were the documents created for? Who created them? Are they edited or unedited? Are they homogeneous or heterogeneous? These questions are important because they reflect what types of language and language variation will be present in the data.
The data should also be aligned with the overall purpose of the analysis, and any data quality issues will need to be addressed. The result of considering these issues will be a better design, incorporating the level of complexity required of the rule set or text model and the best process for measuring quality.
Process. The method used to develop and test the text model must be disciplined and principled in order to assess and manage the quality of the output. Since the measurement of quality in different NLP systems and text analytics models is a complex topic, I will revisit it in more detail in a future article.
Maintenance, auditing and tracing behaviors are also a part of this challenge and are really the source of many complaints about rule-based systems being too unwieldy. The truth is that if we managed our rule-based systems the way we manage software code, the idea that these systems can’t be maintained in an orderly fashion would seem silly. I’ve seen sophisticated rule-based systems over the course of my career that were very robust and could accurately analyze both syntactic and semantic aspects of language. The most advanced ones were well-designed and had the proper testing, tracing and maintenance components. The only real deficit was that they involved complex processing, which was slow, but with today’s processors, speed is not such a big problem.
Today more data can be analyzed at faster speeds than ever before. The key is to balance speeds and depth of language analysis to match the types of business questions being asked. For example, if I need to stream data into a decision system while an interaction is taking place, then a simpler model will process data faster (detailed example is here). However, if the decisions being made are high risk and need to be very precise, it will be better to take the time to allow a more complex model to process the data.
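Returning to the idea of managing rules the way we manage software code, the sketch below shows one way that could look in Python: each rule has a name so that every decision can be traced, and a small set of regression cases is kept alongside the rules, much like unit tests. The rule names, patterns and example calls are all hypothetical and far simpler than a production rule set.

```python
import re

# Hypothetical rules for flagging customer calls. Each rule is named so
# that a classification can be traced back to the rule that produced it.
RULES = [
    ("cancel_intent", re.compile(r"\bcancel (my|the) (account|service)\b")),
    ("switch_intent", re.compile(r"\bswitch(ing)? to\b")),
]


def classify(text):
    """Return (label, fired_rule_names) so that each decision can be audited."""
    lowered = text.lower()
    fired = [name for name, pattern in RULES if pattern.search(lowered)]
    label = "high risk" if fired else "low risk"
    return label, fired


# Regression cases stored with the rules and rerun after every rule change.
REGRESSION_CASES = [
    ("I want to cancel my account today", "high risk"),
    ("I am thinking about switching to another provider", "high risk"),
    ("Thanks, the technician fixed everything", "low risk"),
]


def run_regression():
    """Return the cases whose predicted label differs from the expected one."""
    failures = []
    for text, expected in REGRESSION_CASES:
        predicted, fired = classify(text)
        if predicted != expected:
            failures.append((text, expected, predicted, fired))
    return failures


print(classify("I want to cancel my account today"))
print("regression failures:", run_regression())
```

Keeping the trace (which rules fired) and the regression cases under version control with the rules is what makes later changes auditable.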
Generalization. Once you’ve addressed the first three challenges above, then the last piece, applying the model appropriately, becomes substantially easier. Having the right expert with clear goals and the right process and tools allows you to know the strengths and weaknesses of your text model; improve the model and identify the data that will be used to do so; recognize language change and incorporate it into the model; and develop a schedule of ongoing testing and a test plan. The worst mistake in language analysis is blindly applying a model that was built on one type of data to a different type of data and expecting good results. The expert discussed above, and a good measurement process for quality, should safeguard against such errors.
Now that we have reviewed the pros and cons of rule-based systems, you may be thinking: Isn’t there an easier way?
Many have wondered the same thing, turning to statistical approaches to NLP for that “easy button.” Machine learning (ML), a subfield of artificial intelligence that employs various statistical methods, is commonly used in NLP, and many people are excited by the success of such applications in recent years. This success of ML approaches in more recent NLP systems is due to two changes in the supporting ecosystem. One is the acceleration of processors; what would have taken days or weeks of processing time 10 years or so ago takes only hours or minutes today. The other is the availability of data, including both tagged and untagged document collections.
The strength of statistical processing of text comes from the fact that language is inherently patterned on multiple levels. The same techniques we apply to other aspects of our world to uncover new patterns can also be successfully applied to language. Clustering, for example, can uncover inherent patterns by grouping texts into related sets; sometimes these sets correspond to meaningful topic areas or areas of human endeavor. This is an example of unsupervised learning applied to texts (using untagged data), which is quick and requires the least upfront knowledge of the data. This type of approach is best applied in situations where little is known about the data, and a high-level view is desired.
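As a rough illustration of unsupervised grouping, the Python sketch below clusters a handful of untagged texts by word overlap (Jaccard similarity), using a simple single-pass scheme. The documents, stopword list and threshold are invented for this example, and real clustering algorithms are considerably more sophisticated.

```python
import re

# Hypothetical untagged documents.
DOCS = [
    "the dishwasher controls are confusing",
    "confusing controls on this dishwasher",
    "delivery was late and the box was damaged",
    "the box arrived damaged and delivery was late",
]

# A tiny, illustrative stopword list.
STOPWORDS = {"the", "a", "an", "and", "are", "was", "on", "this", "is"}


def token_set(text):
    """Return the set of non-stopword tokens in the text."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def jaccard(a, b):
    """Share of distinct words that two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.5):
    """Assign each document to the first cluster whose seed document is similar enough."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        tokens = token_set(doc)
        for members in clusters:
            seed_tokens = token_set(docs[members[0]])
            if jaccard(tokens, seed_tokens) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters


print(cluster(DOCS))  # [[0, 1], [2, 3]]
```

No labels were supplied, yet the texts about confusing controls and the texts about delivery problems end up in separate groups. Deciding whether those groups are meaningful still takes a human.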
Supervised techniques, which are generally more powerful, are frequently used in applications for categorization, voice recognition, machine translation and sentiment analysis. These approaches both leverage and require a pre-tagged data set to be used as training, testing and validation data. In other words, the “supervision” part of machine learning is telling the computer what patterns are important, and providing examples and counter-examples for each distinction the model should make. With sufficient, representative and high-quality training data, such systems perform well across many different tasks in NLP. SAS Text Analytics solutions also enable the application of both unsupervised and supervised machine learning algorithms to text data.
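The supervised workflow can be sketched in a few lines of Python: train on tagged examples, then measure accuracy on held-out tagged examples. The sketch below uses a small multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The tagged sentences are invented, and a data set this small is for illustration and says nothing about real-world performance.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pre-tagged data, split into training and testing sets.
TRAIN = [
    ("I love this dishwasher, it is quiet and fast", "positive"),
    ("great product, works well and easy to use", "positive"),
    ("the controls are confusing and it broke quickly", "negative"),
    ("terrible, noisy, and it leaks", "negative"),
]
TEST = [
    ("quiet and easy to use", "positive"),
    ("confusing controls and it leaks", "negative"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label; these counts are the entire model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
accuracy = sum(predict(model, text) == label for text, label in TEST) / len(TEST)
print("accuracy on held-out data:", accuracy)
```

The tagged examples are the “supervision”: the model learns only the distinctions that the labels point out, so the quality and representativeness of that data set the ceiling on what the model can do.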
When applying machine learning techniques to NLP analyses, it’s frequently easy to find an algorithm that will build a model, and the process is also usually straightforward. You plug in training data, build the model with a button push or a few configuration steps, and then evaluate the result with your testing or evaluation data. Poof! You now have a model and also metrics on how accurate your model is. Unfortunately, that is not the whole story. In the third article of this series, I’ll describe some challenges of applying machine learning models to text data.
Have you run into challenges applying linguistic rules to your text data? How have you overcome them? For additional business examples of text analytics in use, check out these stories: