Voice of the customer analysis (Part 1)

This is the first of two articles looking at how to listen to what your customers are saying and act upon it – that is, how to understand the voice of the customer. Over the last few years, one of the big uses for SAS® Text Analytics has been to identify consumers’ perceptions and attitudes from the language they use. This is commonly called sentiment analysis, opinion mining or voice of the customer analysis.

Voice of the customer analysis can have significant value for organizations looking to listen to and understand the customer’s “voice” (e.g., from surveys, social media, complaints or web chat) to improve operations and help direct strategy. This approach can, ultimately, help improve customer satisfaction, net promoter score (NPS) and loyalty while reducing churn and dormancy, thus increasing revenues.

There are, however, a number of challenges to doing voice of the customer analysis well, especially as the focus often seems to be the “sentiment score.” The binary polarity of a positive/negative score is limiting for several reasons:

  • Customers have different personalities and emotions and communicate in very different ways. Should “amazing” mean a higher sentiment score than “good”? Should someone who swears or uses sarcasm get a more negative sentiment score than someone who says “I was rather disappointed you let me down”? The language a person uses is as much about personality as it is sentiment.
  • The polarity of the sentiment score is often too simplistic; it does not consider your business objectives and the brand experience you are aiming to deliver. Rather than producing a score, it’s often more useful to assess what people think of your brand’s products and services and their features. For example, for an airline, features might be check-in, on-board service, price, quality of food or timeliness. These opinions can then be weighted by your brand’s priorities if you want an overall score.
  • Ultimately thought needs to be given to how the sentiment score will be used. Is it for reporting, or could it be used for alerts when it rises or falls? The top-level score on its own is not particularly useful. If a benchmark is the objective, a net promoter score (NPS) may be a more useful metric.
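The weighting idea in the second point can be sketched in a few lines of Python. This is a minimal illustration, not a SAS feature: the feature names come from the airline example, while the opinion values (on an assumed -1 to 1 scale) and the priority weights are invented for the sketch.

```python
def weighted_opinion_score(opinions, priorities):
    """Combine per-feature opinions into one score using brand priorities.

    opinions:   feature -> average opinion in [-1, 1] (hypothetical scale)
    priorities: feature -> relative importance to the brand (positive number)
    """
    shared = [f for f in opinions if f in priorities]
    total_weight = sum(priorities[f] for f in shared)
    if total_weight == 0:
        raise ValueError("no weighted features to score")
    return sum(opinions[f] * priorities[f] for f in shared) / total_weight


# Illustrative numbers only; they are not taken from any real survey.
opinions = {"check-in": 0.6, "on-board service": 0.2, "food": -0.4, "timeliness": -0.1}
priorities = {"check-in": 2, "on-board service": 1, "food": 1, "timeliness": 4}

overall = weighted_opinion_score(opinions, priorities)
```

Because timeliness carries the largest weight here, a mildly negative opinion about delays pulls the overall score down more than a strongly negative opinion about food.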

With the voice of the customer analysis in hand, we can then focus on gaining insight into the root causes of satisfaction and dissatisfaction, for example, improved feedback on product design or refined handling of customer interactions.


Frequency of positive and negative feature mentions, color-coded by the average NPS

A simple example of this is shown in the screenshot, which shows the frequency of positive and negative feature mentions, color-coded by the average NPS. This shows a positive check-in experience is associated with higher NPS, and although customers are negative about food, this doesn't seem to affect customer NPS as much as delays or lost baggage.

Further analysis of feature sentiment can identify the causes of the low satisfaction. For example, poor check-in experience may be caused by perceived poor service quality; queuing time; design of automated check-in machines; problems with baggage allowances; or time for first-class passengers to get to the lounge.

For voice of the customer analysis, your source documents could include surveys (e.g., NPS), complaints, web chat or social media. It’s often worth starting with the internal sources, like surveys and complaints, as these can be directly attributed to a known customer. This will mean that you can also use structured data about these customers and their behavior in your text analysis; the combination will be more powerful than just the text alone. Even in an anonymous NPS survey, the structured questions will help with assessing the causes of dissatisfaction.

Whatever the document source, as part of the definition stage it’s worth reading a small sample of documents. This will give you an initial impression of how the documents fit with your analysis objective. At this stage you should just be aiming to assess how language is used. For example, how long are the documents? Are the documents written by customers or employees? Is the language formal or informal? Do they contain much sarcasm? How concise (or verbose) have the authors been? What’s the quality of spelling like? How technical is the language? Are many abbreviations used? If speech to text technology (or web chat) has been used, does the text differentiate between speakers?

So we have started to define our problem. Next time we’ll explore an approach to voice of the customer analysis that moves beyond the rather narrow view of sentiment polarity and focuses on listening to the voice of the customer so you can make decisions that will improve the customer experience.


Cognitive Computing - Part 1

Is cognitive computing an application of text mining?

If you have asked this question, you are not alone. In fact, lately I have heard it quite often. So what is cognitive computing, really? A cognitive computing system, as stated by Dr. John E. Kelly III, is one that has the ability to be trained to “… learn at scale, reason with purpose, and interact with humans naturally.”

What does that mean exactly? Perhaps the best known example is IBM’s Watson winning Jeopardy!. Watson was able to understand the questions asked by a human, with all of the intonations, puns and expressions inherent in the asking of the questions and the questions themselves; search for the appropriate answer; give its answer a confidence that would determine whether Watson would “buzz” in; and then provide the answer. Was it perfect? Of course not; just as there are no human experts who truly know everything in their fields, no cognitive computing system is perfect either. Every person, even the most knowledgeable expert, sometimes has to say, “I don’t know,” or “I think xyz, but I’m not completely sure.” The degree of certainty that a human has with knowledge is hard to quantify, and is based mostly on a gut feeling, but the degree of certainty that a cognitive computing system has for any answer is a confidence level. This can be thought of as a score that corresponds to the quality of the decision after the system has evaluated all of its options and information. Based on that number, the machine can decide whether to answer a question, or what the best answer is for a given question.
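To make the role of the confidence level concrete, here is a minimal Python sketch of the "answer or stay silent" decision described above. It is not how Watson is implemented; the function name, the 0-to-1 confidence scale and the threshold value are all assumptions made for illustration.

```python
def choose_answer(candidates, threshold=0.5):
    """Return the highest-confidence answer, or None to stay silent.

    candidates: list of (answer, confidence) pairs, confidence in [0, 1].
    threshold:  hypothetical minimum confidence needed to "buzz in".
    """
    if not candidates:
        return None
    answer, confidence = max(candidates, key=lambda pair: pair[1])
    return answer if confidence >= threshold else None
```

Raising the threshold makes the system answer less often but with more certainty, which is the same trade-off a human expert makes when deciding whether to say "I don't know."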


The second part of cognitive computing systems is that these systems are able to learn. This promises great things for the future, because not only can everyone interact with the system without having to learn a specific language or a specific interface, but also the machine learns how to interact better with the user over time. How does such a system learn? Typically after it is fed information, usually in large amounts, a cognitive computing system receives inputs and outcome sets. The inputs could include textual data, images, videos, speech, numbers, IoT data, etc. Next, as a human interacts with the system, grading the system’s responses and their accuracy levels, and providing feedback, the system refines its internal training model. To me, it’s analogous to a human studying for a test in a new subject, and using tools such as textbooks and practice exams (Q&A, essentially) as study aids, and perhaps also working with a tutor.

The third exciting component is that cognitive systems take advantage of what computers do really well. They can access and process enormous amounts of data very fast, and the data does not have to be specifically prepared for the processing. Cognitive systems are built to handle unstructured data; this extends the sources of information for these systems far beyond the realm of the traditional databases. The majority of data in the world is unstructured data; this makes sense because unstructured data is the natural way humans communicate with each other. After all, structured data is an artificial construct that was created to be able to apply analytical concepts to better understand the world. Given that unstructured data makes up the majority of the information in the world, that it is likely to keep expanding in volume at high rates, and that a cognitive computing system can “read” and digest this huge corpus of information, the result can be powerful. The system can even “see” images, in addition to its ability to “read” text and “hear” spoken language. Amazing! This addresses many of the challenges in text mining today by taking advantage of information that is often largely ignored.

So is Siri® a cognitive computing system?

Not quite, but it’s a start. What makes cognitive computing the next era of computing is that it is not a programmed compute system, the way personal computers, tablets, smartphones, and other gadgets are. It’s hard not to love Siri (after all, it can be fun to ask Siri such questions as “what is zero divided by zero?” or “what is the meaning of life?” – if you haven’t already asked one of these questions, take a break right now and try it), but Siri has been highly programmed, and provides scripted answers (I’m sure Apple employees had a lot of fun with this part of the development of Siri). It can learn a little from you, and most likely it will continue to evolve to learn even more from each individual user, but it is not truly a cognitive computing system. Some say it can be categorized as a “cognitive embedded capability,” which is important for cognitive computing. Any development efforts in the world of cognitive computing need to be able to be embedded in the technology and applications that users know and love, such as Siri being a part of the iPhone®. So while Siri has some capabilities that are on the road to a cognitive computing system, its emphasis is on its programmed functionality rather than on what it learns from a user. Instead of assessing information from multiple sources (including what it has been taught), offering hypotheses, and allotting confidence to potential answers, Siri is mostly programmed with canned responses or matches from its available information. After all, Watson beat former Jeopardy! champions, while Siri often cannot even find the nearby location to which I request directions.

Given all of this, what exactly can cognitive computing systems do today?

Stay tuned: This will be discussed in Part 2. Thanks for reading, and let us know your thoughts on cognitive computing. Will it change the world?


To data scientists and beyond! One of many applications of text analytics

Hi, there! First of all, let me introduce myself, as this is my first blog. I am Simran Bagga, and three weeks ago I became the Product Manager for Text Analytics at SAS. This role might be new to me, but text analytics is not. For the past 12 years, I have helped customers in government, health care, and small businesses understand the value and application of text analytics to enhance their existing business processes. From simple questions to complex application requirements, I have heard it all. And I have seen the field evolve over the years from many different perspectives.

Organizations in every industry have realized the potential of tapping into unstructured text and are embracing the power of this capability at a rapid rate. They want to leverage both internal and external information to solve a variety of different problems. The applications of text analytics are many, from enhancing the customer experience to gaining efficiencies in solving criminal investigations.

One of the misconceptions I often see is the expectation that it takes a data scientist, or at least an advanced degree in analytics, to work with text analytics products. That is not the case. If you can type a search into a Google toolbar, you can get value from text analytics.

An investigative agency recently asked me, “Our focus is closing cases quickly by connecting the dots and finding linkages between incidents. Can SAS help our crime analysts be more effective so they can drill into the incident narratives like a traditional business intelligence application and find that needle in the haystack?”

I love it when customers ask such easy questions. The answer is, “Absolutely.”

I am really excited about a SAS cloud-based offering that many people might not be aware of: the SAS® Text Exploration Framework. This provides an easy, search-based interface for all relevant information, presented in a compelling and visual way, to virtually any question you want to ask of the data.

SAS Text Exploration Framework - Search and Explore


The crime analysts wanted free-text search – a search that gave them smart results. Rather than sifting through hundreds of documents that had the term “firearm theft,” for example, they wanted to be prompted for the different crime areas and incident types where those terms were found. The SAS Text Exploration Framework allows them to do exactly that and more. They can focus on incidents that occurred within certain time frames, locations, etc. – and ones that are associated with specific crime categories or subcategories – so they can identify linkages across incidents. Terms of interest are highlighted so they can see where they are referenced and explore the data graphically.

The excitement in the room was palpable when the users saw this application of text analytics. A question came up: Can we actually visualize links between entities (criminals, gangs, weapons, etc.) and incidents with the ability to drill down to identify social networks, gang-related crimes, and so forth? The social network capability within the framework supports this type of visualization nicely:

SAS Text Exploration Framework - Social Network Analysis


Various user personas – analysts (data scientists), business analysts, computer scientists, decision makers – want to use analytics to make data-driven decisions, but in different ways. The challenge is how to appeal to all these personas and meet their expectations with individualized text analytics applications, from data-driven algorithms and business rules to applications solving specific problems and cognitive computing.

SAS certainly has the technology and expertise to meet each of these needs. But I am here to understand your needs, what matters to you, and to represent the voice of the customer: your voice – an aspect of my new role that I truly enjoy and am excited about. So don’t hesitate to reach out.


Come chat with us!

In today’s world of instant gratification, consumers want – and expect – immediate answers to their questions. Quite often, that help comes in the form of a live chat session with a customer service agent.

The logs from these chats provide a unique analysis opportunity. Like a call center transcript, there is two-way dialogue with a question/answer pattern similar to a tennis match, and a “politeness overlay” that can make sentiment analysis tricky. Yet people tend to use an abbreviated style of conversation in chats that is more akin to a tweet or a text message – short and sweet, with fewer words and less formal phrasing than you might see in a call transcript.

So how can we effectively mine these chats in order to:

  • Understand what questions are driving customers to this service channel?
  • Identify key paths to issue resolution (or nonresolution)?
  • Increase agent performance and customer satisfaction?

At SAS, we asked ourselves these very questions when faced with hundreds of thousands of chat logs from sas.com. These inbound chats span multiple years and are in support of customers in more than 35 countries.

Goals for analyzing chat logs

For us, the first objective was simply to understand what topics were occurring across all the chats. How can we characterize our customer inquiries? Are people looking for information on SAS events or conferences? Are they asking about training and education, or seeking technical support? And if so, for which products? What URLs and resources are most often promoted by the agents? Do these things vary by country, time of year, or even time of day?

Our goals influenced both how we structured the data and what techniques we applied to analyze it. Initially, we consolidated the full chat log into a single text field. This allowed us to use SAS® Text Miner on the entire conversation to identify key topics and clusters which naturally occur in the data. Once we discovered and explored these topics, we were able to re-organize them into classification hierarchies (i.e., taxonomies) that aligned well with our business objectives – one taxonomy for products, and another for inquiry types. Using SAS Contextual Analysis, we then created definitions for these categories using keywords, Boolean expressions and powerful linguistic operators.
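As a rough picture of what a keyword-and-Boolean category definition does, consider the Python sketch below. It is not SAS Contextual Analysis rule syntax, and it has none of the linguistic operators mentioned above; the category names echo the inquiry types discussed earlier, but the keyword lists and the tokenizer are invented for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


# Hypothetical inquiry-type rules: each rule is a Boolean test on the token set.
RULES = {
    "training": lambda t: bool(t & {"training", "course", "certification"}),
    "events": lambda t: bool(t & {"conference", "event", "webinar"}),
    # Keyword match combined with a negation (AND NOT training).
    "technical support": lambda t: bool(t & {"error", "install", "crash"})
    and "training" not in t,
}


def categorize(chat_text):
    """Return the sorted list of categories whose rule matches the chat."""
    t = tokens(chat_text)
    return sorted(name for name, rule in RULES.items() if rule(t))
```

A chat can match more than one category, which fits the idea of keeping separate taxonomies for products and for inquiry types.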

Using SAS® Visual Analytics, we are able to easily explore results and surface interactive reports to interested internal parties.  Below are just a few samples from our live chat reports:

Frequency and Duration of Chats by Inquiry Type



Faceted Search – Quickly Inspect a Subset of Chats That Meet Certain Criteria



Chats About Our Statistics-Specific Products Over Time


How can you use this information?

Through these explorations, we are continually learning more about our customers and how they want to do business with us.  In the future, our line of questioning might naturally also evolve to exploring the path of the conversation, answering questions like:

  • What are the most frequently asked question-and-answer pairs which yield a successful resolution to the chat?
  • What is the most effective sequence of questions agents should ask to troubleshoot a technical issue?
  • Does customer sentiment change throughout the course of the conversation?

To support these analyses, we would instead structure the data at an utterance level, where each individual speaker’s comment is stored in its own record, with sequence and speaker IDs. This is the approach one of our technology-sector clients is taking. Their volume of inbound service requests is increasing faster than they can scale, so by identifying the most effective question-and-answer pairs they hope to make these “troubleshooting tips and FAQs” readily available on less costly channels, such as the website and mobile app.
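The utterance-level structure might look something like the following Python sketch. The record fields mirror the description above (sequence and speaker IDs), but the `speaker: text` transcript format and all names are assumptions; real chat logs will need their own parsing.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    chat_id: str
    sequence: int   # position of the comment within the chat, starting at 1
    speaker: str    # e.g. "customer" or "agent"
    text: str


def split_chat(chat_id, transcript):
    """Turn 'speaker: text' lines into one record per utterance."""
    records = []
    for line in transcript.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        speaker, text = line.split(":", 1)
        records.append(
            Utterance(chat_id, len(records) + 1, speaker.strip(), text.strip())
        )
    return records


# Invented example transcript.
transcript = (
    "customer: How do I reset my password?\n"
    "agent: Go to the profile page: it is under Settings.\n"
    "customer: Thanks!"
)
records = split_chat("c1", transcript)
```

With one record per utterance, questions about conversation paths or sentiment at the start and end of a chat become simple filters on the sequence and speaker fields.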

Another client, a major financial services provider, is analyzing chat topics as well as sentiment at the beginning and end of a chat session to contribute to a “risk” score for that individual.  The idea is to use sentiment to help predict an outcome variable such as attrition/churn, or likelihood to file a formal complaint with regulators.

Learn more about SAS Text Analytics and how you can apply it to your organization’s unstructured data (chat or otherwise!).


Thankful Tuesday: It’s all about the people

Most Frequent Thankful Tuesday Terms

Most frequent terms in a collection of tweets with the hashtag #thankful.

Did you know that Tuesday, Nov. 24, was #ThankfulTuesday? This themed day was just two days before Thanksgiving and joined the ranks of newly invented opportunities such as #WorldPastaDay and #NationalMotherInLawDay to collectively post on social media about a single topic.

Appropriate to the season of giving thanks, many people took to Twitter to express what they are thankful for. We examined about 4,000 tweets with the hashtag #Thankful in search of an answer to the question: What are Twitter users thankful for on Thankful Tuesday?

Is it the material comforts enjoyed over the past year? Or perhaps the impending craziness of Black Friday shopping? Or the holiday itself, with its massive dinner and televised football?

Our analysis showed that the tweets were much more likely to be about intangible and sentimental things. Using SAS Contextual Analysis, we can see that the five most common words in the document collection, after our search term, “thankful,” were Thanksgiving, day, family, blessed and love.
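A frequency count like this one is easy to approximate outside of SAS Contextual Analysis. The Python sketch below is only a rough stand-in: the stop list and the sample tweets are invented, and it does none of the stemming or synonym handling a text analytics product would apply.

```python
import re
from collections import Counter

# Tiny, hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "for", "and", "to", "of", "my", "i", "so", "is", "am"}


def top_terms(tweets, search_term, n=5):
    """Count words across tweets, ignoring stop words and the search term."""
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in STOP_WORDS and word != search_term:
                counts[word] += 1
    return counts.most_common(n)


# Invented sample tweets.
sample = [
    "So #thankful for my family",
    "Thankful for family and friends",
    "Blessed and thankful",
]
top = top_terms(sample, "thankful")
```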

From the next visualization, you can see that the search term “thankful” was most prominently associated with the terms “happy,” “week,” “good” and “God.”

Thankful Tuesday Terms Visualization

Terms most often mentioned in connection with the term "thankful."

The next screenshot from SAS Contextual Analysis illustrates the most prominent topic clusters. In about 10 percent of the tweets, people mentioned Thanksgiving, holiday, week, feast and God alongside the hashtag #thankful. The second most prominent topic cluster, present in about 9 percent of the tweets, was time (week) with family, friends and food.

Thankful Tuesday Topic Clusters

Most frequent topic clusters in a collection of tweets with the hashtag #thankful.

We can conclude that most Twitter users participating in Thankful Tuesday were looking forward to the Thanksgiving holiday traditions of spending time feasting and enjoying the company of family and friends. All of the topic clusters contained predominantly positive sentiments, above 90 percent positive. From all of this data, we can infer that these Twitter users generally feel happy about these topics.

Using concept extraction in SAS Contextual Analysis, we can summarize that the top five things that people are thankful for are:

  1. Family.
  2. Friends.
  3. Students.
  4. People.
  5. Support.

The top 10 things people explicitly mentioned being thankful for also included partnerships, leadership and opportunities. Two interesting and related conclusions can be drawn. First, none of these items are material. Second, the results underscore the “social” or people part of “social media.”

About 40 percent of people used links to web pages as part of their tweets, and almost a third of those were links to Instagram. The Instagram photos included greeting-card-like stock images as well as personal photos.

Additionally, people used an average of two hashtags per tweet, and 40 percent of tweets were directed @someone, reinforcing the previous conclusion about the social aspect of social media.
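Figures like these can be derived with a few regular expressions. The following Python sketch shows one way to do it; the patterns are simplified assumptions (for instance, any `@name` counts as a directed tweet) and the sample tweets are invented.

```python
import re


def tweet_stats(tweets):
    """Average hashtags per tweet and share of tweets with a mention or a link."""
    if not tweets:
        raise ValueError("need at least one tweet")
    hashtags = [len(re.findall(r"#\w+", t)) for t in tweets]
    mentions = [bool(re.search(r"@\w+", t)) for t in tweets]
    links = [bool(re.search(r"https?://\S+", t)) for t in tweets]
    n = len(tweets)
    return {
        "avg_hashtags": sum(hashtags) / n,
        "share_with_mention": sum(mentions) / n,
        "share_with_link": sum(links) / n,
    }


# Invented sample tweets.
stats = tweet_stats([
    "#thankful for @sam #family",
    "So #thankful https://example.com/p",
    "#thankful #blessed",
    "thanks @alex #thankful",
])
```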

We join the sentiment of these tweets in being thankful for our many wonderful SAS customers, partners and colleagues around the world. We are also thankful for such powerful analytics products as SAS Contextual Analysis, which make it possible for us to efficiently analyze big data and provide answers to questions big and small.

What are you thankful for this holiday season?


Focusing your Text Mining with Search Queries

Recently, I have been thinking about how search can play more of a part in discovery and exploration with SAS Text Miner. Unsupervised text discovery usually begins with a look at the frequent or highly weighted terms in the collection, perhaps includes some edits to the synonym and stop lists, and then performs a topic or cluster analysis of the entire collection. Search is sometimes more of an afterthought in the analysis. But search can become a very useful technique, particularly if you have no additional variables on your data to train with or profile. So, here are some things to consider the next time you're working on understanding the content of your collection. I hope you can add some suggestions or best practices that you have discovered as a comment to this post.

If you think creatively, there are a number of ways to form a query and identify a result set in SAS Text Miner. Some are obvious and some are not. I list three of those ways here and demonstrate some aspects using VAERS data:

  1.  Interactive and Automated Search in the Filter Node
    This is the obvious one. The Text Filter node of SAS Text Miner provides an interface to query based on a term or set of terms along with the ability to negate a term or match with a wildcard character. The query can be performed in the interactive window, shown to the right, or it can be automated with the node's Search Expression property. If you save the results of the search, the Text Filter node will output only the documents that match the query.

    Query and Query Results in the Text Filter Node's interactive window


  2. Pattern-Based Search in the Text Parsing Node
    The Text Parsing node, in combination with concept categorization rules, contains a powerful mechanism to identify patterns in strings rather than merely the strings themselves. These patterns can help identify when something like a medical diagnosis code occurs in the text or whenever two terms of interest such as “hand” and “rash” appear close to one another. While this approach does not directly create a subset of your documents that matches the pattern, it does add a feature to your term list that can be used to restrict the collection to those observations that match the pattern. Some code that would be helpful in processing your data is found at the end of this SAS Global Forum paper.
  3. Search as User Topics in the Text Topic Node
    The Text Topic node has a terrific feature, the User Topics property, that allows you to provide a set of terms to create user-defined topics. This subset of terms can be thought of as a query set and the documents that belong to the topic are your query results. The output data from this node is not a strict subset of your documents, as it is with the Text Filter node. Instead it adds a new indicator variable to your documents that shows which documents match this “query”. You can create several of your own queries simultaneously with this method.
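To see how the three styles of query differ in what they return, here is a small Python sketch. It imitates the ideas only: none of this is SAS Text Miner syntax, the function names are invented, and the sample report text is made up rather than taken from VAERS.

```python
import re
from fnmatch import fnmatch


def words(text):
    """Lowercase word tokens in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_query(text, required=(), excluded=()):
    """Style 1: term query with negation and '*' wildcards; use it to subset."""
    w = words(text)

    def has(pattern):
        return any(fnmatch(x, pattern) for x in w)

    return all(has(p) for p in required) and not any(has(p) for p in excluded)


def near(text, first, second, max_gap=3):
    """Style 2: pattern match, true when two terms occur close together."""
    w = words(text)
    pos_a = [i for i, x in enumerate(w) if x == first]
    pos_b = [i for i, x in enumerate(w) if x == second]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)


def tag_topics(docs, user_topics):
    """Style 3: flag every document per user topic instead of subsetting."""
    tagged = []
    for doc in docs:
        w = set(words(doc))
        tagged.append(
            {name: any(t in w for t in terms) for name, terms in user_topics.items()}
        )
    return tagged


# Invented report text for illustration.
doc = "Patient developed a rash on the left hand after vaccination"
```

The first function yields a yes/no answer you would use to filter documents, the second yields a feature you could add to the term list, and the third keeps every document and simply adds one flag per topic.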


Creating User Topics in the Text Topic Node


Now that you have the query results, how can you leverage them?
With any of the query approaches above, a second analysis can occur on the subset that makes up the retrieved documents. In essence, the query gives your exploration of the text a particular point of view by restricting the analysis to this subset. The observed relationships and patterns may be more meaningful and useful because the query has been used to focus on a particular aspect of the text. You may find it useful to consider how the clusters and topics change when you transition from the whole collection to the retrieved subset. With multiple queries, a separate Text Topic node can be run for each query and the topics can be compared. Shown in the two tables below are the results of the two topic runs, one for the documents related to pain and the other for documents related to skin. Note that I dropped the terms that composed the pain and skin queries. While this example is mainly for demonstrating the process, you can see the similarities and differences between these two topics. With your data, this could provide valuable insight and help you understand the content.

An automatic topic run done on the documents that matched the user-defined "Pain" topic.



An automatic topic run done on the documents that matched the user-defined "Skin" topic.


Certainly there are other things that can be done. For one, the Text Profile node could be used to compare these subsets. Are there other approaches that you use or would like to see that incorporate search? I would love to hear from you!




Widening the use of unstructured data

Analyzing text is like a treasure hunt. It is hard to tell what you will end up with before you start digging and the things you find out can be quite unique, invaluable and in many cases full of surprises. It requires a good blend of instruments like business knowledge, language processing and advanced analytics capabilities. This variety and complexity reveals the hidden value for the organizations.

Ninety percent of digital information is unstructured, according to this IDC report. With the further development of speech-to-text technologies, the actionable volumes of data will get even bigger.

The good news is, the growing diversity of the unstructured data triggers innovative and interesting use cases for business.

Discovering personality types

At SAS Global Forum in Las Vegas this year, one of the presentations, from Deloitte & Touche LLP, was on discovering personality types through text mining. They presented the use of text mining to develop, attract and retain the right mix of consultants for their teams. Apparently career-oriented people tend to be more neurotic than open, and having a good mixture of personalities in your team makes success more likely.

Can Text Analytics help you to reveal what your customers really mean when they talk about your organization?

Net Promoter Score (NPS) has become one of the key metrics for measuring customer satisfaction. Your customer gives you a score out of 10 and then you know how happy that person is with your organization. However, it is always good to cross-check whether what your customer says or thinks about you is in line with how he or she scores you. A number could easily be perceived differently by people with different perspectives.

In our analysis of NPS data with scores and also comments of the customers, we could see examples of comments like “I cannot say that I am not pleased with the attitude of the lady at the branch but this was my third visit to make a simple money transfer. I will switch to another bank. They are much better at internet banking”. This feedback resulted in an NPS score of 8!? But is this really correct? Because of the nice lady serving? Or because the person was too polite to give a lower score?! Or maybe he or she just doesn’t care? Extracting the topics that describe the real experience of the customer could help to see the bigger picture which could look completely different.
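A very crude version of this cross-check can be written in a few lines of Python. This is only a sketch of the idea: the churn phrases and the score threshold are assumptions, and simple phrase matching is a poor substitute for the topic extraction described above.

```python
# Hypothetical phrases that suggest the customer intends to leave.
CHURN_PHRASES = ("switch to another", "close my account", "cancel")


def flag_mismatch(score, comment, min_score=7):
    """Flag a response whose score looks fine but whose comment signals churn."""
    text = comment.lower()
    return score >= min_score and any(p in text for p in CHURN_PHRASES)
```

Applied to the comment quoted above, a score of 8 combined with "I will switch to another bank" would be flagged for a closer look.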

More case studies on text analytics at A2015 in Rome

Text can help you in many different ways and I will be able to share some more customer stories next week at A2015 in Rome:

  • Royal Brompton & Harefield NHS Trust are using text analytics to support clinicians in discovering unusual sequence clusters and in the future diagnosis of new cases.
  • A US regulating agency is analyzing documents from banks to detect the risk factors that would potentially impact the future trends of the economy.
  • The World Bank, which has one of world’s largest electronic libraries, is categorising thousands of documents in minutes.
  • A major insurer in the UK is integrating their claims advisory notes into their existing fraud models and improving the correct detection rate by 20 percent and decreasing false alerts by 60 percent.
  • Alberta Parks are analyzing customer feedback data to detect major topics and sentiments and prioritize actions to improve customer satisfaction.

I will also highlight some potential use cases on speech mining to:

  • Detect the sentiment journey of the customers on the phone and improve the scripts accordingly for call center agents.
  • Use police interview records for cross referencing.
  • Analyze real-time streams of trader conversations in capital markets to detect rogue trading.

Also, A2015 will host another presentation on text analytics from British Airways. They will share their journey and the very interesting outcomes of their project on the classification of passenger complaints.

I think it is fair to say that with the improvement and penetration of the technology, the insight extracted from unstructured data gets more sophisticated and rewarding each year.

I am looking forward to Rome! And I cannot wait to hear more of these use cases in the future, especially on real-time text streaming and speech mining.


Topical advice about topics, redux

In my last post, I talked about why SAS utilizes a rotated Singular Value Decomposition (SVD) approach for topic generation, rather than using Latent Dirichlet Allocation (LDA).  I noted that LDA has undergone a variety of improvements in the last seven years since SAS opted to use the SVD method.  So, the time has come to ask:  How well does the rotated SVD approach hold up with these modern LDA variations?

For the purpose of this comparison, we used the HCA implementation of LDA models.  This is the most advanced implementation we could find for LDA today.  It is written in C (gcc specifically) for high speed, and can run in parallel across up to 8 threads on a multi-core machine.  It does various versions of topic modeling including LDA, HDP-LDA and NP-LDA, all with or without burstiness.  One of the difficult decisions when running LDA is determining good values for the hyper-parameters.  This software can automatically tune those hyper-parameters for you.

We chose three different real-world data sets to do the comparisons.

  1. A subset of the “newsgroup” data set, that contains 200 articles from each of three different usenet newsgroups (so 600 total) from the standard newsgroup-20 collection:  ibm.hardware, rec.autos, and sci.crypt.  We will call this the News-3 collection.
  2. A subset of the Reuters-21578 Text Categorization collection.  This collection contains articles that were on the Reuters newswire in 1987, together with 90-odd different categories (or tags) provided with those articles.  We have included only those that contain at least one of the ten most frequently occurring tags, and label that 9,248-document subset the Reuter-10 collection.
  3. The NHTSA consumer complaint database for all automotive consumer complaints registered with the National Highway Traffic Safety Administration during the year 2008.  Each complaint is coded by one or more “affected component” fields.  These fields have a multipart description (for example, brakes: disc).  For our purposes, we utilized only the first part, which generates 27 separate general components.  This data set has 38,072 observations.  We will call this the NHTSA-2008 collection.

Note that these three data sets vary widely in number of observations and number of natural categories.  Also, one thing about topic modeling as opposed to document clustering is that we want documents to be able to contain more than one topic.  In News-3, each document has only one of the categories; while in the other two data sets, multiple labels are often assigned to documents.

The natural criterion to use with these data sets is how well the computed topics correspond to the known category structure of the data.  To facilitate this, we first parsed the documents using SAS Text Miner.  The parsed results were fed into the Text Topic node in Text Miner to get the topics corresponding to the rotated SVD, and fed into the HCA implementations of standard LDA, LDA with burstiness, and HP-LDA with burstiness.  In all three LDA cases, hyper-parameter tuning was performed.

Regardless of which approach is tried, the number of topics is a user-defined input.  In order to explore the effect of this setting, we ran all the algorithms for each data set three times:

  1. One run was set to generate the same number of topics as categories (so 3 for News-3, 10 for Reuter-10, and 27 for NHTSA-2008).
  2. A second run generated # topics = 2 times (2x cat) the number of categories.
  3. The third used  # topics = 3 times (3x cat) the number of categories.

To measure how well the category structure was discovered, for each category we identified the topic most closely related, and computed two different measures often used for external validation of clustering techniques:  Normalized Mutual Information (NMI) and Purity.  The results for # topics = 2x cat are shown in the graphs below. Note that higher values are considered better for both these measures.
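As a rough illustration of the two measures, here is a minimal Python sketch. It assumes the simplest setting, where each document has exactly one category and is assigned to exactly one topic; the multi-label matching used in the experiment is not spelled out in this post, so this is not the exact procedure we ran. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def purity(topics, categories):
    """Fraction of documents that carry the majority category of their topic."""
    by_topic = {}
    for t, c in zip(topics, categories):
        by_topic.setdefault(t, Counter())[c] += 1
    majority_total = sum(max(counts.values()) for counts in by_topic.values())
    return majority_total / len(topics)


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def nmi(topics, categories):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(topics)
    joint = Counter(zip(topics, categories))
    t_counts = Counter(topics)
    c_counts = Counter(categories)
    mi = sum(
        (k / n) * math.log(k * n / (t_counts[t] * c_counts[c]))
        for (t, c), k in joint.items()
    )
    denom = (entropy(topics) + entropy(categories)) / 2
    # Assumed convention: two trivial one-group labelings agree perfectly.
    return mi / denom if denom > 0 else 1.0


# Toy data (hypothetical): six documents, two topics, two categories.
topics = [0, 0, 0, 1, 1, 1]
categories = ["autos", "autos", "crypt", "crypt", "crypt", "crypt"]
print(purity(topics, categories), nmi(topics, categories))
```

Both measures reach 1 when the topics reproduce the categories exactly; NMI falls to 0 when the topic assignment carries no information about the category.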



Although these graphs show results for 2x cat only, the patterns for 1x cat and 3x cat are the same.

One clear takeaway from the above graphs is that standard LDA was inferior to each of the other techniques in every case looked at, for both Purity and NMI measures.  LDA with burstiness generally did better than HP-LDA with burstiness.  LDA with burstiness got marginally better results for the News-3 and Reuter-10 data than rotated SVD, but rotated SVD got significantly better results for the NHTSA-2008 data.

Taking an average across the different data sets shows a slight edge to rotated SVD which is probably insignificant.  From these results, it appears that both rotated SVD and LDA with burstiness do an equally good job of capturing the category structure in the data.

Going beyond these measures, there are many advantages to the rotated SVD.  The SVD has what is called a convex solution, meaning that there is only one result that maximizes the objective.  If you run it on the same data, it will always get the same result.  LDA can generate different topics each time you run it. Furthermore, there are several hyper-parameters for LDA that have to be carefully tuned for the data.  How many hyper-parameters are there for rotated SVD?  Zero.  Nada.

So, how does that translate in practice?  It takes vastly longer to calculate LDA with burstiness, optimizing hyper-parameters, than it does to calculate SVD.  For example, running the NHTSA-2008 data through the Text Topic node for 2x cat in Text Miner took 47 seconds.  LDA with burstiness on the same data: 2,412 seconds.  You do the math.  We have run the Text Topic node on the entire million-document NHTSA collection without issue.  I shudder to even contemplate running LDA on that large a collection.

Please contact me if you are interested in the spreadsheet with complete results or the specific data sets we used in this experiment.   I would be happy to send them to you, and I can also address how you can go about replicating our results.

If you happen to be at the Analytics 2015 conference this week in Las Vegas, make sure you come to my talk on Tuesday, Oct. 27 at 11:30 am where I will go into considerable detail about these comparisons.

Ta-ta for now.


Topical advice about topics: comparing two topic generation methods

When I talk with more analytically savvy users of SAS® Text Miner or SAS® Contextual Analysis, I inevitably get asked why SAS uses a completely different approach to topic generation than anybody else, and why they should trust the approach SAS adopts.

These are good questions. I first addressed them back in 2010 in a three-part series of blog posts titled The Whats, Whys, and Wherefores of Topic Management. 

In that series, I talked about how generating a matrix decomposition – the singular value decomposition (SVD) – of a term-by-document matrix could place both documents and terms as points in a multidimensional space. In this space, the closeness of any two points reflects how similar those particular documents and/or terms are to each other. Then, by rotating the axes in that space so that terms align with these axes, one brings to light interpretable topics. One document might line up well with a few of those topics, meaning it is “about” those topics. And the terms that are strongly aligned with those new axes give a semantic interpretation to those topics.
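In symbols, the idea can be sketched as follows. This is a generic statement of a truncated SVD followed by an orthogonal rotation, not the exact SAS formulation; the symbols A, k and R are introduced here for illustration, and the criterion used to pick R (for example, varimax from factor analysis) is an assumption that this post does not specify.

```latex
% A: m x n weighted term-by-document matrix (m terms, n documents)
% k: number of dimensions kept
A \approx U_k \Sigma_k V_k^{\top}

% For any orthogonal k x k matrix R (so R R^{\top} = I):
U_k \Sigma_k V_k^{\top} = (U_k R)\,(R^{\top} \Sigma_k V_k^{\top})
```

Because R is orthogonal, the rotation leaves the approximation of A unchanged and only changes the axes. R is chosen so that each column of U_k R has a few large term loadings and many near-zero ones, which is what makes each rotated axis readable as a topic.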

This method is very similar to factor analysis, developed back in the early 1900s to uncover latent aspects of something – for example, different kinds of intelligence a person might possess based on answers to questions on an IQ test. In fact, factor analysis has been of prime importance over the years. For example, the Myers-Briggs personality inventory aligns an individual on four different personality traits based on answers to a personality inventory.

At any rate, when we first decided to create topics, back in 2008, we compared the topics generated by this “rotated SVD” approach to those created by latent Dirichlet allocation (LDA), which was initially developed in 2003, and is the approach “everyone else uses.”

A term-by-document matrix stores the number of times each term occurs in each document in each “cell” of the matrix. An SVD assumes that the values it works with are distributed as a normal bell curve, whereas the LDA models frequencies directly. Advantage: LDA.

However, it turns out that we don’t actually apply the SVD to the counts directly. We apply it to counts that have been weighted, typically using what is known as a tf-idf weighting. In most cases, we multiply the log of the number of times a term occurs in a document (the tf part) by a term weight calculated as the inverse of its frequency in the document collection (the idf part). This actually ends up mapping to a distribution that is close to a bell curve in practice, and it evens out the overall weight of all terms when viewed across an entire document collection. If you’re familiar with principal components analysis, the result is similar to subtracting out the mean and dividing by the standard deviation of each variable in a set of variables.
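To make the weighting concrete, here is a minimal Python sketch of one common textbook variant: log2(1 + count) for the tf part and log2(N / document frequency) for the idf part. The exact formulas used inside SAS Text Miner are not given in this post, so treat these two choices, the function name tfidf and the toy documents as assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            # Assumed variant: log-damped count times log inverse document frequency.
            term: math.log2(1 + tf) * math.log2(n_docs / df[term])
            for term, tf in counts.items()
        })
    return weighted


# Toy complaints (hypothetical), already tokenized.
docs = [
    "brake pedal failed brake light".split(),
    "engine stalled on highway".split(),
    "brake noise on highway".split(),
]
for row in tfidf(docs):
    print(row)
```

In the first toy document, "brake" occurs twice but also appears in another document, so it ends up with a lower weight than "pedal", which occurs once in the whole collection. A term that appears in every document gets a weight of zero under this variant.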

We tested our approach in 2008 by creating some artificial data that had a known topic structure, and determined that the rotated SVD approach was able to generate topics much closer to that known structure than the LDA. There was no natural way in LDA to do term weighting on the raw frequencies. Once the frequencies are weighted, they are no longer frequencies, and the math behind the LDA no longer applies. Furthermore, the rotated SVD approach is much faster than LDA, and the LDA can generate different results every time you run it. So it was a no-brainer to use the rotated SVD.

Since 2008, though, the world has changed. Nowadays, if someone even mentions topic modeling, it is just assumed that they are using LDA. So it is natural to wonder why SAS doesn’t. Furthermore, LDA has been improved in the last seven years. Notably, most people using LDA today use a “burstiness” model, which tries to incorporate these term frequency weightings to generate better results.

So it is time for us to revisit the topic of topics: How does rotated SVD compare to these more modern LDA approaches? Is it still superior, or does LDA with burstiness and other innovations leave our approach gathering dust in the woodshed?

And now that we've reviewed the history, that is the topic for part 2 of this series. Stay tuned. We have done the comparisons, and the results may surprise you.


Speaking the same language in SAS® Text Analytics

The first text analytics product SAS released to the market in 2002 was SAS® Text Miner to enable SAS users to extract insights from unstructured data in addition to structured data.  In 2009, in quick succession, SAS released two new products:  SAS® Enterprise Content Categorization and SAS® Sentiment Analysis.  These products filled niches that SAS® Text Miner did not address: namely tools for people to build and support rule-based taxonomies:  SAS® Enterprise Content Categorization for categories and concepts, and SAS® Sentiment Analysis for tone, or sentiment.

We soon learned that there was overlap between the needs of those writing rules for building taxonomies and those wanting to use SAS® Text Miner to learn or discover relationships in the data.  But alas, the three products did not have an easy mechanism to communicate between them.  One thing we did implement to support integration was to enable the import of concepts built in SAS® Enterprise Content Categorization into the Text Parsing node in SAS® Text Miner.  With this we provided limited communication; much like having an interpreter between two people not speaking the same language.

We learned from this and created SAS® Contextual Analysis, which was first released two years ago.  This product allows users to build rules for concepts and categories within the interface, but also create topics and use machine learning techniques to automatically create category rules.  SAS Contextual Analysis has been hugely successful with users, but we have also found that SAS users can benefit from both SAS Text Miner and SAS Contextual Analysis.  SAS Text Miner provides more flexibility to the experienced user and can be used to build predictive models using not just text, but all the other structured data available.  However, it is a tool that requires more analytical sophistication from users than SAS® Contextual Analysis.

So, many customers use both products.  But they really want them to talk to each other.  If you are such a customer, we now have a solution for you.  We are now providing a downloadable SAS Enterprise Miner node that you can utilize in any project to pull in the categories, concepts, and sentiment score code from a model built in SAS Contextual Analysis, and utilize them in exploration, clustering, or predictive modeling in the SAS Enterprise Miner / SAS Text Miner interface easily.

What?  Your license for SAS Contextual Analysis is on a different machine than your license for SAS Text Miner?  No problem: included in the documentation is a convenient way to copy the SAS Contextual Analysis model files to your SAS Text Miner installation.

Check out the new node, installation documentation, and User’s Guide in this zip file. And take a look at a Text Analytics Community posting that gives more detail, including the documentation, if you want to look at that before downloading the node.

Of course, we must add some “small print”:  This node is provided as experimental at this time, so is not directly supported by SAS Technical Support.

Thanks for tuning in, and let me know your experience with the node!
