Widening the use of unstructured data

Analyzing text is like a treasure hunt. It is hard to tell what you will end up with before you start digging, and the things you find can be unique, invaluable and in many cases full of surprises. It requires a good blend of instruments: business knowledge, language processing and advanced analytics capabilities. This variety and complexity is what reveals the hidden value for organizations.

According to this IDC report, 90 percent of digital information is unstructured. With the further development of speech-to-text technologies, the volume of actionable data will grow even larger.

The good news is that the growing diversity of unstructured data is triggering innovative and interesting use cases for business.

Discovering personality types

At SAS Global Forum in Las Vegas this year, one of the presentations, from Deloitte & Touche LLP, was on discovering personality types through text mining. The presenters showed how they use text mining to develop, attract and retain the right mix of consultants for their teams. Apparently, career-oriented people tend to be more neurotic than open, and having a good mixture of personalities in your team makes success more likely.

Can Text Analytics help you to reveal what your customers really mean when they talk about your organization?

Net Promoter Score (NPS) has become one of the key metrics for measuring customer satisfaction. Your customer gives you a score out of 10, and that score tells you how happy the person is with your organization. However, it is always good to cross-check whether what your customer says or thinks about you is in line with how he or she scores you. A number can easily be perceived differently by people with different perspectives.
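For readers who want the mechanics behind the score: in the standard NPS scheme, respondents scoring 9–10 count as promoters, 7–8 as passives, and 0–6 as detractors, and

\[
\mathrm{NPS} = \%\,\mathrm{promoters} - \%\,\mathrm{detractors}
\]

so a customer scoring an 8 is a passive and moves the score neither up nor down.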

In our analysis of NPS data containing both scores and customer comments, we saw comments like “I cannot say that I am not pleased with the attitude of the lady at the branch, but this was my third visit to make a simple money transfer. I will switch to another bank. They are much better at internet banking.” This feedback came with an NPS score of 8!? But is this really correct? Because of the nice lady serving? Or because the person was too polite to give a lower score? Or maybe he or she just doesn’t care? Extracting the topics that describe the customer’s real experience helps you see the bigger picture, which can look completely different.

More case studies on text analytics at A2015 in Rome

Text can help you in many different ways and I will be able to share some more customer stories next week at A2015 in Rome:

  • Royal Brompton & Harefield NHS Trust is using text analytics to support clinicians in discovering unusual sequence clusters and in the future diagnosis of new cases.
  • A US regulatory agency is analyzing documents from banks to detect risk factors that could potentially impact future trends in the economy.
  • The World Bank, which has one of the world’s largest electronic libraries, is categorizing thousands of documents in minutes.
  • A major insurer in the UK is integrating its claims advisory notes into its existing fraud models, improving the correct detection rate by 20 percent and decreasing false alerts by 60 percent.
  • Alberta Parks is analyzing customer feedback data to detect major topics and sentiments and to prioritize actions that improve customer satisfaction.

I will also highlight some potential use cases on speech mining to:

  • Detect the sentiment journey of customers on the phone and improve call center agents’ scripts accordingly.
  • Use police interview records for cross referencing.
  • Analyze real-time streams of trader conversations in capital markets to detect rogue trading.

Also, A2015 will host another presentation on text analytics, from British Airways. They will share their journey and the very interesting outcomes of their project on the classification of passenger complaints.

I think it is fair to say that as the technology improves and its adoption spreads, the insight extracted from unstructured data gets more sophisticated and rewarding each year.

I am looking forward to Rome! And I cannot wait to hear more of these use cases in the future, especially on real-time text streaming and speech mining.


Topical advice about topics, redux

In my last post, I talked about why SAS utilizes a rotated Singular Value Decomposition (SVD) approach for topic generation, rather than using Latent Dirichlet Allocation (LDA). I noted that LDA has undergone a variety of improvements in the seven years since SAS opted for the SVD method. So the time has come to ask: How well does the rotated SVD approach hold up against these modern LDA variations?

For this comparison, we used the HCA implementation of LDA models, the most advanced implementation we could find today. It is written in C (gcc specifically) for high speed, and it can run in parallel across up to 8 threads on a multi-core machine. It supports several versions of topic modeling, including LDA, HDP-LDA and NP-LDA, all with or without burstiness. One of the difficult decisions when running LDA is choosing good values for the hyper-parameters; this software can tune those hyper-parameters automatically.

We chose three different real-world data sets to do the comparisons.

  1. A subset of the “newsgroup” data set containing 200 articles from each of three different usenet newsgroups (600 total) from the standard newsgroup-20 collection: ibm.hardware, rec.autos and sci.crypt. We will call this the News-3 collection.
  2. A subset of the Reuters-21578 Text Categorization collection. This collection contains articles that appeared on the Reuters newswire in 1987, together with 90-odd different categories (or tags) provided with those articles. We included only the articles that contain at least one of the ten most frequently occurring tags, and we label that 9,248-document subset the Reuter-10 collection.
  3. The NHTSA consumer complaint database of all automotive consumer complaints registered with the National Highway Traffic Safety Administration during 2008. Each complaint is coded with one or more “affected component” fields. These fields have a multipart description (for example, brakes: disc). For our purposes, we used only the first part, which yields 27 separate general components. This data set has 38,072 observations.

Note that these three data sets vary widely in the number of observations and the number of natural categories. Also, one difference between topic modeling and document clustering is that we want documents to be able to contain more than one topic. In News-3, each document has only one category, while in the other two data sets, multiple labels are often assigned to a document.

The natural criterion for these data sets is how well the computed topics correspond to the known category structure of the data. To facilitate this, we first parsed the documents using SAS Text Miner. The parsed results were fed into the Text Topic node in Text Miner to get the topics corresponding to the rotated SVD, and into the HCA implementations of standard LDA, LDA with burstiness, and HDP-LDA with burstiness. In all cases, hyper-parameter tuning was performed.

Regardless of which approach is used, the number of topics is a user-defined input. To explore the effect of this setting, we ran all the algorithms three times for each data set:

  1. One run was set to generate the same number of topics as categories (3 for News-3, 10 for Reuter-10 and 27 for NHTSA-2008).
  2. A second run generated twice as many topics as categories (2x cat).
  3. A third run generated three times as many topics as categories (3x cat).

To measure how well the category structure was discovered, we identified, for each category, the topic most closely related to it, and computed two measures often used for external validation of clustering techniques: Normalized Mutual Information (NMI) and Purity. The results for # topics = 2x cat are shown in the graphs below. Note that higher values are better for both measures.
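For reference, with computed topics \(\Omega = \{\omega_1, \dots, \omega_K\}\) and known categories \(C = \{c_1, \dots, c_J\}\) over \(N\) documents, these measures are commonly defined as

\[
\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert,
\qquad
\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{\bigl(H(\Omega) + H(C)\bigr)/2},
\]

where \(I\) is the mutual information between the topic and category assignments and \(H\) is entropy. Both range up to 1, with higher values indicating closer agreement with the known categories.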



Although these graphs show results for 2x cat only, the patterns for 1x cat and 3x cat are the same.

One clear takeaway from these graphs is that standard LDA was inferior to each of the other techniques in every case we examined, for both the Purity and NMI measures. LDA with burstiness did better than HDP-LDA with burstiness in all cases. LDA with burstiness got marginally better results than rotated SVD on the News-3 and Reuter-10 data, but rotated SVD got significantly better results on the NHTSA-2008 data.

Averaging across the data sets shows a slight edge for rotated SVD, which is probably insignificant. From these results, it appears that rotated SVD and LDA with burstiness do an equally good job of capturing the category structure in the data.

Going beyond these measures, rotated SVD has many advantages. The SVD has what is called a convex solution, meaning there is only one result that maximizes the objective: run it on the same data and you will always get the same result. LDA can generate different topics each time you run it. Furthermore, LDA has several hyper-parameters that must be carefully tuned to the data. How many hyper-parameters does rotated SVD have? Zero. Nada.

So, how does that translate in practice? It takes vastly longer to calculate LDA with burstiness, optimizing hyper-parameters, than it does to calculate the SVD. For example, running the NHTSA-2008 data through the Text Topic node for 2x cat in Text Miner took 47 seconds. LDA with burstiness on the same data: 2,412 seconds. You do the math – that is more than 50 times longer. We have run the Text Topic node on the entire million-document NHTSA collection without issue. I shudder to even contemplate running LDA on a collection that large.

Please contact me if you are interested in the spreadsheet with complete results or the specific data sets we used in this experiment.   I would be happy to send them to you, and I can also address how you can go about replicating our results.

If you happen to be at the Analytics 2015 conference this week in Las Vegas, make sure you come to my talk on Tuesday, Oct. 27 at 11:30 am where I will go into considerable detail about these comparisons.

Ta-ta for now.


Topical advice about topics: comparing two topic generation methods

When I talk with more analytically savvy users of SAS® Text Miner or SAS® Contextual Analysis, I inevitably get asked why SAS uses a completely different approach to topic generation than anybody else, and why they should trust the approach SAS adopts.

These are good questions. I first addressed them back in 2010 in a three-part series of blog posts titled The Whats, Whys, and Wherefores of Topic Management. 

In that series, I talked about how generating a matrix decomposition – the singular value decomposition (SVD) – of a term-by-document matrix can place both documents and terms as points in a multidimensional space. In this space, the closeness of any two points reflects how similar those particular documents and/or terms are to each other. Then, by rotating the axes in that space so that terms align with the axes, one brings to light interpretable topics. A document might line up well with a few of those topics, meaning it is “about” those topics. And the terms that are strongly aligned with those new axes give a semantic interpretation to the topics.
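To make the mechanics concrete, here is a minimal sketch of that recipe in Python – not SAS’s actual implementation, just the textbook pattern (SVD of a weighted term-by-document matrix followed by a classic varimax rotation), run on random placeholder data:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Classic varimax: find an orthogonal rotation that concentrates the
    # variance of squared loadings, so each axis ends up with a few large
    # term loadings and many near-zero ones (i.e., interpretable topics).
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        new_objective = s.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return loadings @ rotation

# Toy stand-in for a weighted term-by-document matrix (terms x documents);
# random data here just to make the sketch runnable end to end.
rng = np.random.default_rng(0)
X = rng.random((500, 200))
k = 10  # number of topics requested

U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_loadings = U[:, :k] * S[:k]      # terms as points in a k-dimensional space
topics = varimax(term_loadings)       # rotate axes so terms align with them
top_terms = np.argsort(-np.abs(topics), axis=0)[:10]  # top term indices per topic
```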

This method is very similar to factor analysis, developed back in the early 1900s to uncover latent aspects of something – for example, the different kinds of intelligence a person might possess, based on answers to questions on an IQ test. In fact, factor analysis has remained important over the years. For example, the Myers-Briggs personality inventory places an individual on four different personality traits based on his or her answers.

At any rate, when we first decided to create topics, back in 2008, we compared the topics generated by this “rotated SVD” approach to those created by latent Dirichlet allocation (LDA), which was initially developed in 2003 and is the approach “everyone else uses.”

A term-by-document matrix stores, in each “cell,” the number of times each term occurs in each document. An SVD assumes that the values it works with are distributed like a normal bell curve, whereas LDA models the frequencies directly. Advantage: LDA.

However, it turns out that we don’t actually apply the SVD to the counts directly. We apply it to counts that have been weighted, typically using what is known as tf-idf weighting. In most cases, we multiply the log of the number of times a term occurs in a document (the tf part) by a term weight calculated as the inverse of the term’s frequency in the document collection (the idf part). In practice, this maps the values to a distribution that is close to a bell curve, and it evens out the overall weight of terms across the entire document collection. If you’re familiar with principal components analysis, the effect is similar to subtracting out the mean and dividing by the standard deviation of each variable in a set of variables.
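In symbols, one common variant of this weighting (the exact options offered in SAS Text Miner differ slightly) is

\[
w_{t,d} = \log\bigl(f_{t,d} + 1\bigr) \cdot \log\frac{N}{n_t},
\]

where \(f_{t,d}\) is the number of times term \(t\) occurs in document \(d\), \(N\) is the number of documents in the collection, and \(n_t\) is the number of documents containing term \(t\).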

We tested our approach in 2008 by creating artificial data with a known topic structure, and we determined that the rotated SVD approach generated topics much closer to that known structure than LDA did. There was no natural way in LDA to do term weighting: LDA works on the raw frequencies, and once the frequencies are weighted, they are no longer frequencies, so the math behind LDA no longer applies. Furthermore, the rotated SVD approach is much faster than LDA, and LDA can generate different results every time you run it. So it was a no-brainer to use the rotated SVD.

Since 2008, though, the world has changed. Nowadays, if someone even mentions topic modeling, it is just assumed that they are using LDA. So it is natural to wonder why SAS doesn’t. Furthermore, LDA has been improved in the last seven years. Notably, most people using LDA today use a “burstiness” model, which tries to incorporate term frequency weighting to generate better results.

So it is time for us to revisit the topic of topics: How does rotated SVD compare to these more modern LDA approaches? Is it still superior, or does LDA with burstiness and other innovations leave our approach gathering dust in the woodshed?

And now that we've reviewed the history, that is the topic for part 2 of this series. Stay tuned. We have done the comparisons, and the results may surprise you.


Speaking the same language in SAS® Text Analytics

The first text analytics product SAS released to the market in 2002 was SAS® Text Miner to enable SAS users to extract insights from unstructured data in addition to structured data.  In 2009, in quick succession, SAS released two new products:  SAS® Enterprise Content Categorization and SAS® Sentiment Analysis.  These products filled niches that SAS® Text Miner did not address: namely tools for people to build and support rule-based taxonomies:  SAS® Enterprise Content Categorization for categories and concepts, and SAS® Sentiment Analysis for tone, or sentiment.

We soon learned that there was overlap between the needs of those writing rules to build taxonomies and those wanting to use SAS® Text Miner to learn or discover relationships in the data. But alas, the three products had no easy mechanism for communicating with each other. One thing we did implement to support integration was the ability to import concepts built in SAS® Enterprise Content Categorization into the Text Parsing node in SAS® Text Miner. This provided limited communication, much like having an interpreter between two people who don’t speak the same language.

We learned from this and created SAS® Contextual Analysis, first released two years ago. This product allows users to build rules for concepts and categories within the interface, and also to create topics and use machine learning techniques to automatically create category rules. SAS Contextual Analysis has been hugely successful with users, but we have also found that SAS users can benefit from both SAS Text Miner and SAS Contextual Analysis. SAS Text Miner provides more flexibility to the experienced user and can be used to build predictive models using not just text, but all the other structured data available. However, it requires more analytical sophistication from users than SAS® Contextual Analysis.

So, many customers use both products. But they really want them to talk to each other. If you are such a customer, we now have a solution for you: a downloadable SAS Enterprise Miner node that you can use in any project to pull in the category, concept, and sentiment score code from a model built in SAS Contextual Analysis, and easily use it for exploration, clustering, or predictive modeling in the SAS Enterprise Miner / SAS Text Miner interface.

What? Your license for SAS Contextual Analysis is on a different machine than your license for SAS Text Miner? No problem: the documentation includes a convenient way to copy the SAS Contextual Analysis model files to your SAS Text Miner installation.

Check out the new node, installation documentation, and Users Guide in this zip file. And take a look at a Text Analytics Community posting that gives more detail including the documentation, if you want to look at that before downloading the node.

Of course, we must add some “small print”: This node is provided as experimental at this time, so it is not directly supported by SAS Technical Support.

Thanks for tuning in, and let me know your experience with the node!


SAS® Text Miner 14.1: Faster!

A new version of SAS® Text Miner and SAS® High-Performance Text Mining has recently been released, and I want to demonstrate some of the performance improvements you can gain with it. I’ll use a topic analysis that discovers the main themes in a document collection and consists of the following main steps:

  1. Parsing the observations (with complex natural language processing that includes stemming, part-of-speech tagging and noun group discovery).
  2. Summarizing the result into a weighted, sparse term-by-document frequency table.
  3. Factoring that table with the Singular Value Decomposition (SVD).
  4. Optimally rotating the SVD dimensions to produce 25 topics.

The analysis is run on a data set of short customer comments, with the number of observations ranging from 1 million to 4 million. All of the SAS® Text Miner runs are done on the same 2.90 GHz Intel Xeon 5-2677 system, restricted to 8 threads and 8 GB of memory. SAS® High-Performance Text Mining runs on a grid of 144 nodes, each with two 2.7 GHz Intel Xeon E5-2680 CPUs and 256 GB of memory.


The Long and Short of It!

In the graph below, the timing of the topic calculation in SAS Text Miner 13.2 is shown in red and SAS Text Miner 14.1 in green.

You can see from the graph that the SAS® Text Miner 14.1 run time is roughly two-thirds of the run time of the previous release. Four million documents were analyzed in roughly an hour. Most of the speed-up in SAS® Text Miner 14.1 comes from the use of multiple threads for several parts of the computation.

SAS® Text Miner and SAS® High-Performance Text Mining run times

Scaling Out!

In the figure above, do you see the blue line hovering near the x-axis? Compared to the green and red SAS® Text Miner runs on a single machine, it is barely noticeable. That blue line represents the timing for the same analysis with our high-performance product in this new release. It is here that we truly get the benefits of scaling out across a grid of machines: the analysis of the largest data set, 4 million documents, took less than four minutes on the grid.

Get More Done!

I think we all usually focus on the notion that these speed improvements will allow us to tackle larger and larger problems. While this is certainly true, it is not the only benefit. These improvements also enable a more exhaustive exploration of the model space on the same-sized problem; the time saved can be spent searching for improved solutions. So I am ready for larger problems and better models in SAS® Text Miner 14.1!



Towering Insights

The benefits of big data often depend on taming unstructured data. However, in international contexts, customer comments, employee notes, external websites, and the social media labyrinth are not exclusively written in English, or any single language for that matter. The Tower of Babel lives and it is in your unstructured data.

However, while the Babel fish remains a figment of The Hitchhiker’s Guide to the Galaxy, there ARE proven approaches for extracting information from text across multiple languages. I’ll discuss a few of these and illustrate where appropriate with screenshots from the recently released SAS Contextual Analysis 14.1.

One approach focuses on automatically translating all documents into a common language, then performing the analysis on the translated texts. Typically, this does not produce ‘readable’ texts (just try Google-translating your plea for a raise to your Italian boss and sending that off without a proof-read :-). Nonetheless, this approach can be used to glean important insights.

Let’s say we want to work on customer complaints for a product. Generally, those complaints are going to be similar across borders and languages. We would create a project in English as that is the language into which all the complaints have been automatically translated. From that now single-language corpus, the topics below are derived.



These topics can then be ‘promoted’ into categories, which are then used to score incoming complaints. Before scoring, these categories can be refined right down to the syntax level to make sure that the resulting classifications are context appropriate.



When finalized, these categories can be used to score the incoming complaints for direction to the appropriate corporate department downstream.

IMPORTANT TO REMEMBER in this approach: Be sure not to lose track of the mappings to the original documents. In any text analysis, this mapping must be maintained in case the analyst wants to read a representative or otherwise key sample of documents in the original language.

A second, somewhat simpler, and often better approach is to use a targeted start list for the domain. In text analytics, a start list tells the system to treat only the terms in the list and ignore all others. This start list should contain key terms relevant to the analysis topic. For example, say a pharmaceutical company or a public health organization wants to look for a select group of symptoms and treatments in a large, international pool of doctor notes.

Such a start list could look like this.


Now, this start list can be followed up with a synonym list that identifies the same words in the different analysis languages (in this case the TERMS are in the original languages of English, French and Dutch, and the PARENTS are all in English).
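To make the idea concrete, here is a minimal Python sketch of what such a start list plus synonym list accomplishes; the terms are a tiny hypothetical sample, not the actual lists from the screenshots:

```python
start_list = {"fever", "cough", "antibiotic"}      # English parent terms

# synonym -> parent mapping across English, French and Dutch
synonyms = {
    "fièvre": "fever", "koorts": "fever",
    "toux": "cough", "hoest": "cough",
    "antibiotique": "antibiotic", "antibioticum": "antibiotic",
}

def normalize(tokens):
    # Keep only start-list terms, mapping foreign-language synonyms
    # to their English parents; everything else is ignored.
    kept = []
    for tok in tokens:
        term = synonyms.get(tok.lower(), tok.lower())
        if term in start_list:
            kept.append(term)
    return kept

print(normalize("Le patient a une fièvre et une toux".split()))
# -> ['fever', 'cough']
```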



The resulting analysis will then reveal topics that group the documents according to the co-occurrence prevalence of these symptoms and treatments in the texts (in this case, regardless of whether the original document was in English, French or Dutch).

Of course, a third possibility remains: conducting separate analyses for the different languages. SAS Contextual Analysis 14.1 now offers support for Chinese, Dutch, Finnish, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish and Turkish.


So don’t let the cat get your tongue when it comes to multi-language text analysis. Use one of these approaches to turn the Tower of Babel into a babbling tower of insights!



Detect the expected and discover the unexpected - Text analytics in health care

When I ask people what they know about Denmark, they often mention Hans Christian Andersen. He was born in Denmark in 1805 and is one of the most adored children’s authors of all time. Many of his fairy tales are known worldwide, having been translated into more than 125 languages. His writing is colorful and picturesque, often with a hidden moral or criticism of society. He wanted the reader to detect the expected and discover the unexpected in his fairy tales.

In my business career I also work with detecting the expected and discovering the unexpected. I focus on health care, where Denmark is known worldwide for keeping health care data in electronic medical records (EMR). Unfortunately, reading EMRs isn’t like reading fairy tales, even though the language is both exotic – with Latin phrases – and modern, with text message jargon, medical slang, acronyms and abbreviations. The amount of text and data for doctors to manage is increasing from minute to minute, and the content is hard for clinicians to consume during already busy days.

Highly complex language – combined with more and more laboratory analysis, X-ray descriptions, medication, guidelines, etc. – creates a situation where the clinician’s tight schedule, combined with the speed of human reading and understanding, becomes inadequate. Therefore, there is a need for advanced methods to extract value from text and data to ensure operational efficiency and reduced patient risk.

Five years ago, Hospital Lillebaelt in Denmark came to the same conclusion. The amount of data was simply too large for a normal person to manage. Especially when it came to patient quality initiatives, it was an impossible task to review every patient’s data and to do it in a consistent way.

With that in mind, management at Hospital Lillebaelt started a text analytics initiative in 2010 together with SAS. Hospital Lillebaelt became the first hospital in Denmark to begin a journey to discover hidden insights in its massive amount of structured and unstructured data. Just as H.C. Andersen was a pioneer with his colorful fairy tales, Hospital Lillebaelt was a pioneer with text analytics in the health care industry.

Innovative doctors like Chief Orthopedic Surgeon Sten Larsen, Dr.Med. Ulrik Gerdes, and microbiology professor Jens Kjølseth Møller have seen the value that text analytics can bring to their field. These solutions have a wide range of uses, including determining diagnostic coding from EMR notes, automating the audit process to identify hospital adverse events in EMR notes, and uncovering which patients have a hospital-acquired infection.

SAS Text Analytics

Importance of transparency

These solutions have more than health care in common. They all provide clinicians with transparency in the results – a type of clinical stewardship that empowers doctors to make decisions based on all of a patient’s data. There’s no black-box technology. Clinicians can monitor the number of infections, adverse events, etc., at either the hospital or ward level. They can even drill down to the actual findings for a single patient and see both the structured and unstructured data presented in a way that enables fast root-cause analysis without reading pages and pages of patient information.

The simplicity, mobility and reusability of text analytics have been important for these projects from the beginning. When the projects started, we used text mining to explore the structures in the language: word frequency, abbreviations, word associations, clusters and variations. This work gave us a fast and deep understanding of two years of EMR notes that we probably never would have achieved any other way.

SAS Text Miner

With text mining allowing us to explore the data and understand associations between specific words, we decided to switch to a Boolean categorization technique. This was to ensure full transparency in the results.

From the beginning, we decided on an approach with modules and vocabularies (word lists). The word lists contain nothing but the identified words and synonyms – no Boolean logic. This ensures easy editing of the vocabulary. For example, two word lists could be PAIN (pain, painful, hurts, sore, etc.) and KNEE (knee, patella, femoro-patellar, etc.). A module could then be KNEE_PAIN. A simple Boolean rule requiring that knee and pain appear in the same sentence within a distance of five words could look like this: (SENT,(DIST_5,KNEE,PAIN)). As the figure to the right indicates, the modules can become very advanced when negations, word order and time come into play.
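For intuition, here is a rough Python illustration of what a rule like (SENT,(DIST_5,KNEE,PAIN)) checks – this is not SAS syntax or internals, just the logic of the match:

```python
KNEE = {"knee", "patella", "femoro-patellar"}
PAIN = {"pain", "painful", "hurts", "sore"}

def knee_pain(note, max_dist=5):
    # Match when a KNEE term and a PAIN term occur in the same
    # sentence, at most max_dist tokens apart.
    for sentence in note.split("."):          # naive sentence split
        tokens = sentence.lower().split()
        knee_pos = [i for i, t in enumerate(tokens) if t in KNEE]
        pain_pos = [i for i, t in enumerate(tokens) if t in PAIN]
        if any(abs(i - j) <= max_dist for i in knee_pos for j in pain_pos):
            return True
    return False

print(knee_pain("Patient reports the left knee is sore after walking."))  # True
```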

Regular expressions (regex) are another technique that is very convenient in many cases. In health care, they can be used to determine thresholds for fever, blood pressure, etc., and to discover drug doses. Health care business rules composed of a combination of Boolean operators, modules and word lists ensure a solution that is mobile and easy to build upon.
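A hedged sketch of that kind of extraction (the patterns are illustrative, not taken from the actual solution):

```python
import re

note = "Temp 39.2 C this morning; started amoxicillin 500 mg x 3 daily."

# Temperature readings, flagged against a fever threshold of 38.0 C.
for m in re.finditer(r"\b(\d{2}(?:\.\d)?)\s*C\b", note):
    if float(m.group(1)) >= 38.0:
        print("fever:", m.group(1))          # -> fever: 39.2

# Drug doses of the form <name> <number> <unit>.
for m in re.finditer(r"\b([a-z]+)\s+(\d+)\s*(mg|ml|g)\b", note, re.IGNORECASE):
    print("dose:", m.group(1), m.group(2), m.group(3))   # -> dose: amoxicillin 500 mg
```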


Simplicity and mobility

This combination and its simplicity are so important because treatment methods and medical slang vary from hospital to hospital within the same country. When moving a solution from one country to another (from Denmark to Sweden, for example), simple word lists are more convenient to translate. (Then a lot of other exciting differences come into play, e.g., morphology or semantics.)

These vocabularies and modules would probably never be translated into hundreds of languages, like H.C. Andersen’s fairy tales. However, this type of innovation leads to new ideas – to new innovation. In my next post, I will share how unstructured text can be transformed into something measurable that can be included in another computer science discipline – machine learning.

If you have ideas or comments about how these vocabularies and modules should be handled and versioned, you are welcome to post a comment or write directly to me.


Detect the expected and discover the unexpected!


"Analytics": meaning and use

As a linguist, I am fascinated with words, their meanings and use. So when I recently saw the words “learning analytics” in a conference paper title, I started thinking about the prevalence of the word “analytics” itself.

In the last decade, we have preceded “analytics” with many modifiers referring to concepts that are relevant to each of us as producers and consumers in the 21st century: data analytics, web analytics, marketing analytics, business analytics, predictive analytics, advanced analytics, text analytics, visual analytics – and now learning analytics. “Analytics” seems omnipresent: in emails, on social media, print ads, commercials, all around us, and ever-growing in popularity. Add to that SAS’ own recent release of Analytics 14.1 and I began to wonder: When did the term first start being used and in what context? How have the meaning and context changed over the years?

Dictionary.com puts the origins of the word in the 1590s (so the term itself is not as new as you may have surmised!). It means, according to Merriam-Webster, a “method of logical analysis.”

Stop right there.

Clearly, Merriam-Webster is behind the times, because it means a lot more to us today than just a method of analysis. So let’s turn to other sources for a more current definition. Dictionary.com and Wikipedia acknowledge the original meaning but add the object of analysis: data – often big data – and the purpose: deriving meaningful patterns. But even that does not quite capture the full meaning; there is more to the term “analytics” than statistical jargon – like logistic regression, for example. After all, I don’t know of any print ads, commercials or movies about the usefulness of logistic regressions, but I thoroughly enjoyed the movie Moneyball, which touts the value of analytics over gut feeling.

The most important part of the definition is what Wikipedia states as the purpose of analytics – “to describe, predict, and improve business performance” – since the most common application of analytics is for business data. The connection between analytics and its business uses is evident in this bubble chart, built with SAS Contextual Analysis from a sample corpus of online blogs, news, reviews and tweets mentioning analytics. In the contexts where “analytics” is mentioned, “business” and “business analytics” also figure prominently alongside “Google”/”web analytics” and “predictive analytics.”


Bubble chart of words and phrases most commonly occurring in the context of analytics on the web

But, I would argue, analytics has a much broader usage than just for business performance – it has come to be applied to performance in every sense of the word, as the phrases “sports analytics,” “performance analytics” and “learning analytics” surely prove. If you think about it, any area where optimal performance is desired could potentially benefit from “analytics,” i.e., data-analysis-driven decision making. To capture the meaning of the term in this day and age, I would propose looking to SAS’ own definition of analytics, which captures all of the crucial elements of why analytics is being easily adapted to nearly every domain: algorithms (methods of analysis), data and a purpose: solving problems and making the best decisions possible.

I would go a step further and make the claim that this purpose is easier to achieve with data visualization, which takes the old adage “a picture is worth a thousand words” to heart and illustrates complex statistical results with comprehensible images (read more about visualizations in this recent blog entry).

An example of translating complex analytics into a meaningful image is the word cloud below, created with SAS Visual Analytics, which shows the top 100 words from my analysis of Internet documents referring to analytics. The size of the font reflects the relative prominence of the term in the data (the corpus of documents).

As you can see by the words highlighted in yellow, this word cloud reinforces the idea that the value of analytics is to provide intelligence to model, track, predict, learn, know, understand, improve – in other words, to make better decisions for one’s company, organization, enterprise or industry. (As a fun challenge, try to locate concepts from the previous sentence in the word cloud).


Word cloud of terms commonly used in the context of analytics on the web

Another method linguists use to trace language change, in addition to comparing formal definitions and corpus analysis methods illustrated above, is to zero in on how thought leaders in a domain use the language. One look at the recent Analytics Experience conference agenda also confirms that analytics is all about “transforming data into business value” and that visualization plays a large role in that transformational process.

How have you seen the term analytics applied and used recently? Have you noticed a shift in meaning from a method of analysis to a decision-making tool?


Streaming Text Analytics: Finding value in real-time events

As technology and analytics continue to evolve, we're seeing new opportunities not only in the way we analyze data, but also in deployment options. More specifically, real-time deployment of analytical algorithms enables organizations to detect and respond to security threats, offer timely incentives to customers, and mitigate risk by spotting compliance or safety issues, all in real time.

Text analytics is utilized in varying ways across organizations. At a high level, text analytics may involve:

  • Identification of data-driven topics and clusters across collections of text.
  • Automatic categorization of textual data to tag categories and sub-categories.
  • Extraction of entities (such as names, currencies, ID numbers, company names, or complex facts). This may involve simple keyword tagging, or more advanced matching based on regular expressions, taxonomies, linguistic/NLP patterns, or a combination of these.
  • Sentiment analysis, which is used to understand the polarity of a comment at the document level as well as the category/feature level.
  • ...and more depending on the analytical maturity and business needs of the organization.

In many organizations, these algorithms are applied against historical data in batch mode. Depending on the business requirements, this may be exactly what is needed. But for others, real-time scoring opens up new opportunities and creates additional value for the organization and their customers.

So what is SAS Event Stream Processing? This technology enables organizations to apply business logic, pattern matching, and statistical algorithms/predictive models to real-time data streams. The data may come from operational transactions, server or network logs, call center conversations, sensors, or a variety of other sources.
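As a generic illustration of the deployment pattern (plain Python, not the SAS Event Stream Processing API; the compliance pattern is hypothetical):

```python
import re
from typing import Iterator

# Hypothetical phrases a compliance team might want flagged.
COMPLIANCE = re.compile(r"guaranteed\s+returns|off\s+the\s+record", re.IGNORECASE)

def score_stream(events: Iterator[dict]) -> Iterator[dict]:
    # Score each event as it arrives, rather than in a nightly batch.
    for event in events:
        if COMPLIANCE.search(event["text"]):
            yield {**event, "alert": "compliance", "priority": "high"}

calls = iter([
    {"id": 1, "text": "I can get you guaranteed returns, off the record."},
    {"id": 2, "text": "Please update my mailing address."},
])
for alert in score_stream(calls):
    print(alert)   # only event 1 is flagged, the moment it arrives
```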

Here are a few use cases where customers have seen value by integrating text analytics with event stream processing technology:

Voice of the Customer

Monitoring customer contact channels in real time enables organizations to quickly identify emerging trends, respond to customer concerns, and escalate critical issues as they occur. Today, many organizations analyze call center notes long after the call has ended. I've seen examples where compliance issues and high-value customer complaints went undetected, or the event was detected too late to be of any value.

E-Surveillance and Fraud Detection

Monitoring both internal and external communication is valuable (and sometimes required) within organizations. In regulated environments, communications around insider trading, collusion, and other fraudulent events can cause reputational and financial damage. Undetected, these events can have huge implications; just as important, a delayed response can bury the information and further complicate the investigation.

Compliance and Safety

In many industries, early detection of adverse events and safety issues can save millions. This information comes in many forms: standard customer complaints, internal communications, and even social media, to name a few. When it comes to safety, a real-time response is not only critical; a delayed response is drastically devalued or, worse yet, has no value at all.

The three use cases above are just a few of many, but they will hopefully help you identify areas beneficial to your organization. Ultimately, the areas listed below are where real-time analytics is critical and where organizations can expect to see significant value and ROI:

  • Safety (safety of patients or customers).
  • Security (security around cyber threats, reputation threats, etc.).
  • Personalization (we've seen over a 20 percent increase in customer acceptance rates when the message is timed appropriately; this applies in call center and marketing settings).
  • Risk (various types of risk across organizations need to be responded to and acted upon immediately).

Within your organization, where do you see text analytics and event stream processing creating value and opening up new opportunities?

To learn more about SAS' Text Analytics technology, visit SAS Contextual Analysis, and SAS Event Stream Processing.


Text analytics through linguists’ eyes: When is a period not a full stop?

~ This article is co-authored by Biljana Belamaric Wilsey and Teresa Jade, both of whom are linguists in SAS' Text Analytics R&D.

When I learned to program in Python, I was reminded that you have to tell the computer everything explicitly; it does not understand the human world of nuance and ambiguity. This is as valuable a lesson in text analytics as in programming.

When I share with new acquaintances that we have a team of linguists at our analytics company, they are often puzzled as to what our job entails. I explain that we use our scientific understanding of language to ensure that the computer interprets the symbols of human language correctly; for example, what a word is or where a sentence ends. You might think these are easy tasks; after all, even young children have answers to these questions. But, in fact, teaching a computer the seemingly simple task of finding where a sentence ends, across a wide range of human language texts, quickly becomes complex, because a period is not always a full stop.

Take, for example, abbreviations like “Mr.” and “Mrs.” in English, “Dipl.-Ing.” in German, “par ex.” in French, “г.” in Russian, etc. In all of these cases and across most languages, the period does not necessarily signify the end of the sentence. Instead, it means information has been left out that we, as humans, can guess from context: “Mr.” really means “mister,” “Mrs.” refers to a married woman (did you know it is short for “mistress”?), “Dipl.-Ing.” stands for “Diplom-Ingenieur” (an engineering degree), “par ex.” stands for “par exemple” (“for example”) and “г.” most often stands for “год” (“year”) or “город” (“city”). You might think telling the computer to ignore the period in these cases is a good way to avoid interpreting the period as the end of the sentence. But that won’t work everywhere – just consider the first sentence of this paragraph, where the period comes after the abbreviation “etc.” but it also doubles as a sentence ender!
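A toy Python illustration of why this matters for sentence splitting (the abbreviation list is a tiny hypothetical sample):

```python
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc."}

def split_sentences(text):
    # Treat a period as a full stop only when the token carrying it
    # is not a known abbreviation.
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith met Mrs. Jones. They talked."))
# -> ['Mr. Smith met Mrs. Jones.', 'They talked.']
```

Note that this naive rule exhibits exactly the flip side described above: it would refuse to end a sentence at a final “etc.”, so real systems need additional context.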

The situation is no less complex with numerals. In some parts of the world, including the US, South Asia and Australia, periods are used to separate the decimals from the integer and commas are used to separate thousands, for example: “100,000.25.” But in other parts of the world, including Europe and most of South America, convention dictates that the roles of the period and comma are reversed: commas mark decimals, whereas periods separate thousands, for example: “100.000,25.” In these cases, the entire numeral needs to be interpreted as one unit, and thousands of units of currency might be at stake.
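A small sketch of normalizing the two conventions (heuristic, and it assumes the writer's locale is known or can be guessed):

```python
def parse_number(raw, locale="US"):
    # Parse '100,000.25' (US-style) or '100.000,25' (European-style).
    if locale == "US":
        return float(raw.replace(",", ""))       # commas group thousands
    # European convention: periods group thousands, the comma marks decimals.
    return float(raw.replace(".", "").replace(",", "."))

print(parse_number("100,000.25", locale="US"))   # 100000.25
print(parse_number("100.000,25", locale="EU"))   # 100000.25
```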

