Detect the expected and discover the unexpected - Text analytics in health care

When I ask people what they know about Denmark, they often mention Hans Christian Andersen. He was born in Denmark in 1805 and is one of the most adored children’s authors of all time. Many of his fairy tales are known worldwide, as they have been translated into more than 125 languages. His writing is colorful and picturesque, often with a hidden moral or criticism of society. He wanted the reader to detect the expected and discover the unexpected in his fairy tales.

In my business career I also work with detecting the expected and discovering the unexpected. I focus on health care, where Denmark is known worldwide for keeping health care data in electronic medical records (EMR). Unfortunately, reading EMRs isn’t like reading fairy tales, even though the language is both exotic – with Latin phrases – and modern, with text message jargon, medical slang, acronyms and abbreviations. The amount of text and data for doctors to manage is increasing from minute to minute, and the content is hard to consume for the clinicians during already busy days.

Highly complex language – combined with more and more laboratory analysis, X-ray descriptions, medication, guidelines, etc. – creates a situation where the clinician’s tight schedule, combined with the speed of human reading and understanding, becomes inadequate. Therefore, there is a need for advanced methods to extract value from text and data to ensure operational efficiency and reduced patient risk.

Five years ago, Hospital Lillebaelt in Denmark came to the same conclusion. The amount of data was simply too large for a normal person to manage. Especially when it came to patient quality initiatives, it was an impossible task to review every patient’s data and to do it in a consistent way.

With that in mind, management at Hospital Lillebaelt started a text analytics initiative in 2010 together with SAS. As the first hospital in Denmark, Hospital Lillebaelt began a journey to discover hidden insights in the massive amount of structured and unstructured data it had. Just as H.C. Andersen was a pioneer with his colorful fairy tales, Hospital Lillebaelt was a pioneer with text analytics in the health care industry.

Innovative doctors like Chief Orthopedic Surgeon Sten Larsen, Dr.Med., Ulrik Gerdes and microbiology professor Jens Kjølseth Møller have seen the value that text analytics can bring to their field. These solutions have a wide range of uses, including determining diagnostic coding from EMR notes, automating the audit process to identify hospital adverse events in EMR notes, and uncovering which patients have a hospital-acquired infection.

SAS Text Analytics

Importance of transparency

These solutions have more than health care in common. They all provide the clinicians with transparency in the results – a type of clinical stewardship that empowers doctors to make decisions based on all the patients’ data. There’s no black box technology. Clinicians can monitor the number of infections, adverse events, etc., on either a hospital or ward level. They can even drill down to the actual findings on a single patient and get both the structured and unstructured data presented in a way that enables them to do fast root-cause analysis without reading pages and pages of patient information.

The simplicity, mobility and reuse of text analytics have been important from the beginning for these projects. When the projects started, we used text mining to explore the structures in the language, word frequency, abbreviations, word association, clusters and variations. This work gave us a fast and deep understanding of two years of EMR notes that we probably never would have accomplished in another way.

SAS Text Miner

With the text mining approach allowing us to explore data and get an understanding of associations between specific words, we decided to switch to a Boolean categorization technique. This was to ensure full transparency in the results.

From the beginning, we decided on an approach with modules and vocabularies/word lists. The word lists contain nothing but the identified words and synonyms – no Boolean logic – which keeps the vocabulary easy to edit. For example, two word lists could be PAIN (pain, painful, hurts, sore, etc.) and KNEE (knee, patella, femoro-patellar, etc.). A module could then be KNEE_PAIN. A simple Boolean rule determining that knee and pain must be in the same sentence and within a distance of five words could look like this: (SENT,(DIST_5,KNEE,PAIN)). As the figure to the right indicates, the modules can become very advanced when negations, word order and time come into play.
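To make the rule concrete, here is a minimal Python sketch of what (SENT,(DIST_5,KNEE,PAIN)) expresses. It is not SAS syntax or the SAS implementation: the word lists, the naive sentence splitter and the way distance is counted (difference in token positions) are all assumptions made for illustration.

```python
import re

# Hypothetical word lists in the spirit of the PAIN and KNEE examples.
WORD_LISTS = {
    "PAIN": {"pain", "painful", "hurts", "sore"},
    "KNEE": {"knee", "patella", "femoro-patellar"},
}


def sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(sentence):
    """Lowercase word tokens; hyphenated terms stay together."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())


def positions(toks, list_name):
    """Token indexes at which any word from the named list occurs."""
    words = WORD_LISTS[list_name]
    return [i for i, tok in enumerate(toks) if tok in words]


def sent_dist(text, list_a, list_b, max_dist=5):
    """True if a word from each list occurs in one sentence, at most max_dist tokens apart."""
    for sentence in sentences(text):
        toks = tokens(sentence)
        for i in positions(toks, list_a):
            for j in positions(toks, list_b):
                if abs(i - j) <= max_dist:
                    return True
    return False


note = "Patient fell yesterday. The left knee is swollen and very sore."
print(sent_dist(note, "KNEE", "PAIN"))
```

Because the vocabulary lives in plain word lists and the logic lives in the rule, a synonym can be added to PAIN or KNEE without touching the matching code.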

Regular expressions (regex) are another technique that is very convenient in many cases. In health care, they could be used to determine thresholds for fever, blood pressure, etc., and to discover drug doses. Health care business rules composed of a combination of Boolean operators, modules and word lists ensure a solution that is mobile and easy to build upon.
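As an illustration of the regex idea, the sketch below pulls temperatures and drug doses out of a free-text note. The note format, the patterns and the 38.0 °C cut-off are assumptions chosen for the example, not clinical guidance or rules from the hospital projects.

```python
import re

# Assumed temperature format: "39.2 C", "38,5°C" (decimal point or comma).
TEMP_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°?\s*C\b")
# Assumed dose format: a drug name followed by a number and a unit.
DOSE_RE = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:[.,]\d+)?)\s*(mg|g|ml)\b", re.IGNORECASE)

FEVER_THRESHOLD_C = 38.0  # illustrative threshold only


def to_float(number):
    """Accept both decimal point and decimal comma."""
    return float(number.replace(",", "."))


def find_fevers(text):
    """Return all temperatures in the text at or above the threshold."""
    temps = [to_float(m.group(1)) for m in TEMP_RE.finditer(text)]
    return [t for t in temps if t >= FEVER_THRESHOLD_C]


def find_doses(text):
    """Return (drug, amount, unit) tuples found in the text."""
    return [
        (m.group(1).lower(), to_float(m.group(2)), m.group(3).lower())
        for m in DOSE_RE.finditer(text)
    ]


note = "Temp 39.2 C this morning, 37,4 C at noon. Given paracetamol 500 mg."
print(find_fevers(note))
print(find_doses(note))
```

Real notes would need far more variants (units, spacing, abbreviations), which is one reason to keep such patterns in small, separately editable rules.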


Simplicity and mobility

The reason that this combination and its simplicity are so important is that the treatment methods and medical slang vary from hospital to hospital in the same country. When moving a solution from one country to another (from Denmark to Sweden, for example), simple word lists are more convenient for translation. (Then a lot of other exciting differences can come into play, e.g., morphology or semantics).

These vocabularies and modules would probably never be translated into hundreds of languages, like H.C. Andersen’s fairy tales. However, this type of innovation leads to new ideas – to new innovation. In my next post, I will share how unstructured text can be transformed into something measurable that can be included in another computer science discipline – machine learning.

If you have ideas or comments about how these vocabularies and modules should be handled and versioned, you are welcome to post a comment or write directly to me.


Detect the expected and discover the unexpected!


"Analytics": meaning and use

As a linguist, I am fascinated with words, their meanings and use. So when I recently saw the words “learning analytics” in a conference paper title, I started thinking about the prevalence of the word “analytics” itself.

In the last decade, we have preceded “analytics” with many modifiers referring to concepts that are relevant to each of us as producers and consumers in the 21st century: data analytics, web analytics, marketing analytics, business analytics, predictive analytics, advanced analytics, text analytics, visual analytics – and now learning analytics. “Analytics” seems omnipresent: in emails, on social media, print ads, commercials, all around us, and ever-growing in popularity. Add to that SAS’ own recent release of Analytics 14.1 and I began to wonder: When did the term first start being used and in what context? How have the meaning and context changed over the years? Etymological sources put the origins of the word in the 1590s (so the term itself is not as new as you may have surmised!). It means, according to Merriam-Webster, a “method of logical analysis.”

Stop right there.

Clearly, Merriam-Webster is behind the times, because it means a lot more to us today than just a method of analysis. So let’s turn to other sources for a more current definition. Other references, Wikipedia among them, acknowledge the original meaning but add the object of analysis: data – often big data – and the purpose: deriving meaningful patterns. But even that does not quite capture the full meaning; there is more to the term “analytics” than statistical jargon – like logistic regression, for example. After all, I don’t know of any print ads, commercials or movies about the usefulness of logistic regressions, but I thoroughly enjoyed the movie Moneyball, which touts the value of analytics over gut feeling.

The most important part of the definition is what Wikipedia states as the purpose of analytics – “to describe, predict, and improve business performance” – since the most common application of analytics is for business data. The connection between analytics and its business uses is evident in this bubble chart, built with SAS Contextual Analysis from a sample corpus of online blogs, news, reviews and tweets mentioning analytics. In the contexts where “analytics” is mentioned, “business” and “business analytics” also figure prominently alongside “Google”/”web analytics” and “predictive analytics.”

Bubble chart of words and phrases most commonly occurring in the context of analytics on the web

But, I would argue, analytics has a much broader usage than just for business performance – it has come to be applied to performance in every sense of the word, as the phrases “sports analytics,” “performance analytics” and “learning analytics” surely prove. If you think about it, any area where optimal performance is desired could potentially benefit from “analytics,” i.e., data-analysis-driven decision making. To capture the meaning of the term in this day and age, I would propose looking to SAS’ own definition of analytics, which captures all of the crucial elements of why analytics is being easily adapted to nearly every domain: algorithms (methods of analysis), data and a purpose: solving problems and making the best decisions possible.

I would go a step further and make the claim that this purpose is easier to achieve with data visualization, which takes the old adage “a picture is worth a thousand words” to heart and illustrates complex statistical results with comprehensible images (read more about visualizations in this recent blog entry).

An example of translating complex analytics into a meaningful image is the word cloud below, created with SAS Visual Analytics, which shows the top 100 words from my analysis of Internet documents referring to analytics. The size of the font reflects the relative prominence of the term in the data (the corpus of documents).

As you can see by the words highlighted in yellow, this word cloud reinforces the idea that the value of analytics is to provide intelligence to model, track, predict, learn, know, understand, improve – in other words, to make better decisions for one’s company, organization, enterprise or industry. (As a fun challenge, try to locate concepts from the previous sentence in the word cloud).

Word cloud of terms commonly used in the context of analytics on the web
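The numbers behind a word cloud like this are simple relative frequencies. As a rough sketch (plain Python rather than SAS Visual Analytics, with a tiny made-up corpus and stopword list), the top terms and their share of all counted words could be computed like this:

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with", "on"}


def top_terms(documents, n=100):
    """Return the n most frequent non-stopword terms with their relative frequency."""
    counts = Counter()
    for doc in documents:
        for tok in re.findall(r"[a-z]+", doc.lower()):
            if tok not in STOPWORDS:
                counts[tok] += 1
    total = sum(counts.values())
    return [(term, count / total) for term, count in counts.most_common(n)]


corpus = [
    "Predictive analytics helps business decisions.",
    "Business analytics and web analytics improve business performance.",
]
for term, share in top_terms(corpus, n=2):
    print(term, share)
```

A word cloud renderer would then scale each term's font size by its share.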

Another method linguists use to trace language change, in addition to comparing formal definitions and corpus analysis methods illustrated above, is to zero in on how thought leaders in a domain use the language. One look at the recent Analytics Experience conference agenda also confirms that analytics is all about “transforming data into business value” and that visualization plays a large role in that transformational process.

How have you seen the term analytics applied and used recently? Have you noticed a shift in meaning from a method of analysis to a decision-making tool?

Post a Comment

Streaming Text Analytics: Finding value in real-time events

As technology and analytics continue to evolve, we're seeing new opportunities not only in the way that we analyze data, but also in deployment options. More specifically, real-time deployment of analytical algorithms enables organizations to detect and respond to security threats, offer timely incentives to customers, and mitigate risk by detecting compliance or safety issues, all in real time.

Text analytics is utilized in varying ways across organizations. At a high level, text analytics may involve:

  • Identification of data-driven topics and clusters across collections of text.
  • Automatic categorization of textual data to tag categories and sub-categories.
  • Extracting entities (such as names, currencies, ID numbers, company names, or complex facts). This may involve simple keyword tagging, or more advanced matching based on regular expressions, taxonomies, linguistic/NLP patterns, or a combination of these in order to extract information.
  • Sentiment analysis, which is used to understand the polarity of a comment at the document level as well as the category/feature level.
  • ...and more depending on the analytical maturity and business needs of the organization.

In many organizations, these algorithms are applied against historical data in batch mode. Depending on the business requirements, this may be exactly what is needed. But for others, real-time scoring opens up new opportunities and creates additional value for the organization and their customers.

So what is SAS Event Stream Processing? This technology enables organizations to integrate business logic, pattern matching, and statistical algorithms/predictive models against real-time data streams. This data may come from operational transactions, server or network logs, call center conversations, sensors, or a variety of other sources.

Here are a few use cases where customers have seen value by integrating text analytics with event stream processing technology:

Voice of the Customer

Monitoring customer contact channels in real time enables organizations to quickly identify emerging trends, respond to customer concerns, and escalate critical issues as they occur. Today, many organizations analyze call center notes long after the call has ended. I've seen examples where compliance issues and high-value customer complaints have gone undetected or the event was detected too late to be of any value.

E-Surveillance and Fraud Detection

Monitoring both internal and external communication is valuable (and sometimes required) within organizations. In regulated environments, communications around insider trading, collusion, and other fraudulent events can cause reputational and financial damage. Undetected, these events can have huge implications, but just as important, a delayed response can bury the information and further complicate the investigation.

Compliance and Safety

In many industries, early detection of adverse events and safety issues can save millions. This information comes in many forms: standard customer complaints, internal communications and maybe even social media, to name a few. When it comes to safety, real-time response is critical; a delayed response is drastically devalued or, worse yet, has no value at all.

The three use cases above are just a few examples, but they will hopefully help you identify areas beneficial to your organization. Ultimately, the areas listed below are where real-time analytics is critical and where organizations can expect to see significant value and ROI:

  • Safety (Safety of patients or customers.)
  • Security (Security around cyber threats, reputation threats, etc.)
  • Personalization (We've seen over a 20% increase in customer acceptance rates when the message is timed appropriately. This is applicable in call center and marketing settings within organizations.)
  • Risk (Across organizations various types of risk need to be responded to and acted upon immediately.)

Within your organization, where do you see text analytics and event stream processing creating value and opening up new opportunities?

To learn more about SAS' Text Analytics technology, visit SAS Contextual Analysis and SAS Event Stream Processing.

Text analytics through linguists’ eyes: When is a period not a full stop?

~ This article is co-authored by Biljana Belamaric Wilsey and Teresa Jade, both of whom are linguists in SAS' Text Analytics R&D.

When I learned to program in Python, I was reminded that you have to tell the computer everything explicitly; it does not understand the human world of nuance and ambiguity. This is as valuable a lesson in text analytics as in programming.

When I share with new acquaintances that we have a team of linguists at our analytics company, they are often puzzled as to what our job entails. I explain that we use our scientific understanding of language to ensure that the computer interprets the symbols of human language correctly; for example, what a word is or where a sentence ends. You might think these are easy tasks; after all, even young children have answers to these questions. But, in fact, teaching a computer the seemingly simple task of where a sentence ends across a wide range of human language texts quickly becomes complex, because a period is not always a full stop.

Take, for example, abbreviations like “Mr.” and “Mrs.” in English, “Dipl.-Ing.” in German, “par ex.” in French, “г.” in Russian, etc. In all of these cases and across most languages, the period does not necessarily signify the end of the sentence. Instead, it means information has been left out that we, as humans, can guess from context: “Mr.” really means “mister,” “Mrs.” refers to a married woman (did you know it is short for “mistress”?), “Dipl.-Ing.” stands for “Diplom-Ingenieur” (an engineering degree), “par ex.” stands for “par exemple” (“for example”) and “г.” most often stands for “год” (“year”) or “город” (“city”). You might think telling the computer to ignore the period in these cases is a good way to avoid interpreting the period as the end of the sentence. But that won’t work everywhere – just consider the first sentence of this paragraph, where the period comes after the abbreviation “etc.” but it also doubles as a sentence ender!
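A toy sentence splitter shows why a simple "ignore the period after abbreviations" rule is not enough. The sketch below is an assumption-laden illustration, not how SAS does it: it uses two small, hypothetical abbreviation lists and treats “etc.” as a sentence end only when the next word starts with a capital letter.

```python
# Hypothetical abbreviation lists for English only.
TITLE_ABBREVIATIONS = {"mr.", "mrs.", "dr."}    # assumed never to end a sentence
OTHER_ABBREVIATIONS = {"etc.", "e.g.", "i.e."}  # may or may not end a sentence


def split_sentences(text):
    """Split on ., ! or ? while applying the abbreviation heuristics above."""
    tokens = text.split()
    sentences, current = [], []
    for index, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        next_token = tokens[index + 1] if index + 1 < len(tokens) else ""
        lowered = token.lower()
        if lowered in TITLE_ABBREVIATIONS:
            continue  # "Mr." is followed by a name, so keep going
        if lowered in OTHER_ABBREVIATIONS and not next_token[:1].isupper():
            continue  # "etc." mid-sentence: next word is lowercase
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Smith bought apples, pears, etc. He paid Mrs. Jones in cash."
for sentence in split_sentences(text):
    print(sentence)
```

The capitalization heuristic still fails for cases like “etc. Smith”, which is exactly the kind of ambiguity that makes this task harder than it looks.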

The situation is no less complex with numerals. In some parts of the world, including the US, South Asia and Australia, periods are used to separate the decimals from the integer and commas are used to separate thousands, for example: “100,000.25.” But in other parts of the world, including Europe and most of South America, convention dictates that the roles of the period and comma are reversed: Commas are used for decimals whereas periods separate thousands, for example: “100.000,25.” In these cases, the entire numeral needs to be interpreted as one unit, and thousands of units of currency might be at stake.
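A short sketch makes the stakes visible. The helper below is hypothetical and assumes the caller already knows which convention a text follows; applying the wrong convention silently turns one hundred thousand into roughly one hundred.

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse a numeral given its decimal and thousands separators."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


us_style = parse_number("100,000.25")
eu_style = parse_number("100.000,25", decimal_sep=",", group_sep=".")
misread = parse_number("100.000,25")  # European numeral read with US defaults

print(us_style, eu_style, misread)
```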

Why I’m not worried by double negatives?

Double negatives seem to be everywhere; I have noticed them a lot in music recently, from Pink Floyd's "We don't need no education" to Rihanna's "I wasn’t looking for nobody when you looked my way". My own favourite song with a double negative is "I can't get no sleep" (Faithless).

This last one is maybe the most appropriate, because I have been thinking about double negatives a lot recently. The Oxford Dictionary definition of double negative is as follows:

  1. A negative statement containing two negative elements (for example "he didn't say nothing")
  2. A positive statement in which two negative elements are used to produce the positive force, usually for some particular rhetorical effect, for example "there is not nothing to worry about"

However, this definition misses the point that their usage can be extremely nuanced. Double negatives are often used in litotes, a figure of speech in which an understatement is used to emphasise a point by stating a negative to affirm a positive. Context is everything: the phrase "not bad" can be used to indicate a range of opinions from just average to brilliant. Negation can also diminish the harshness of an observation: "the service wasn’t the best" might be used by some people as a politer way of saying "the service was bad".

I've been thinking about double negatives and negation in language because I have recently worked on several projects with business-to-consumer companies, analysing their complaint and NPS survey feedback data. Maybe it's a particularly English trait, but my countrymen seem to use negation in this type of survey feedback a lot. An airline customer will say "their meal wasn't bad" or a bank customer is "not worried that their interest payment is delayed". Of course they negate their positives too, so the airline customer "is not pleased they had to queue at check in" and the bank customer "isn't happy with the mistake setting up a power of attorney".

This type of language usage can sometimes cause a problem for some text analysis solutions, because their primary approach is to summarize the words in documents mathematically. For example, SAS Text Analytics uses a “bag-of-words” vector representation of the documents. This is a powerful statistical approach that uses term frequency, but it ignores the context of the term. So if the same words are used in very different contexts, such as negations, there is a risk that documents might be classified in the wrong topic.

Fortunately SAS Text Analytics also provides an extremely effective feature, which allows you to treat terms differently according to context. The approach is described further in this white paper “Discovering What You Want: Using Custom Entities in Text Mining”.

I used this approach to define a library of approximately 6,500 positive and negative words and treated these differently if they were negated. You can almost think of this as a new user-defined ‘part of speech’. This gives more information to the mathematical summarisation of the documents and ultimately discovers more useful topics with fewer false positives.
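The idea can be sketched in a few lines of plain Python. This is not the SAS custom-entity feature itself, just an illustration under simple assumptions: a tiny hypothetical lexicon, a two-token look-back window, and a "NOT_" prefix that turns a negated sentiment word into a separate term before the bag of words is built.

```python
import re
from collections import Counter

# Tiny illustrative lexicons; the library described above held about 6,500 words.
SENTIMENT_WORDS = {"bad", "good", "pleased", "happy", "worried"}
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase tokens, expanding "wasn't"-style contractions into "was not"."""
    text = re.sub(r"n['’]t\b", " not", text.lower())
    return re.findall(r"[a-z]+", text)


def negation_aware_bag(text, window=2):
    """Bag of words in which negated sentiment words become separate NOT_ terms."""
    toks = tokenize(text)
    bag = Counter()
    for i, tok in enumerate(toks):
        preceding = toks[max(0, i - window):i]
        negated = tok in SENTIMENT_WORDS and any(t in NEGATORS for t in preceding)
        bag["NOT_" + tok if negated else tok] += 1
    return bag


print(negation_aware_bag("The meal wasn't bad"))
print(negation_aware_bag("The meal was bad"))
```

With this change, "wasn't bad" and "was bad" no longer share the term "bad", so a frequency-based model has a chance to place the two comments in different topics.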

I’m embarrassed to admit I only speak English, but interestingly I learnt whilst researching this blog post that double negation is not used the same way across different languages. For example, it is extremely uncommon in Germanic languages; in some languages, like Russian, it is required whenever anything is negated; and in other languages double negatives actually intensify the negation. However, aside from handling negations, this hybrid approach combining linguistic rules with algorithms can be used in lots of other ways too, for example dealing with homonyms (same pronunciation, same spelling, different meaning, e.g. “lean” (thin) vs. “lean” (rest against)) or heteronyms (different pronunciation, same spelling, different meaning, e.g. “close” (shut) vs. “close” (near)), if these are used a lot in your corpus of documents.

Possibly the most beneficial use of all is to differentiate between language usage that may be specific to your corpus of documents. For example, an insurance assessor may be taking notes about an accident and write:

“… the customer works on an industrial estate and would like us to assess the damage on the car there”


“ … the accident happened early evening on the southern industrial estate”

In this example I could build linguistic rules to identify the time and location of accidents. This may improve model accuracy to detect insurance fraud if there is a correlation between crash for cash accidents, locations and times. There are lots of very specific examples like this, where this hybrid text mining approach, which combines linguistic rules with machine learning, can significantly improve our text analysis results.


Behind the scenes in natural language processing: Overcoming key challenges of rule-based systems

A while ago, I started this series on natural language processing (NLP), and discussed some of the challenges of computers interpreting meaning in human language based on strings of characters. I also mentioned that today’s NLP systems can do some amazing things, including enabling the transformation of unstructured data into structured numerical and/or categorical data.

Why is this important? Because once the key information has been identified or a key pattern modeled, the newly created, structured data can be used in predictive models or visualized to explain events and trends in the world. In fact, one of the great benefits of working with unstructured data is that it is created directly by the people with the knowledge that is interesting to decision makers. Unstructured data directly reflects the interests, feelings, opinions and knowledge of customers, employees, patients, citizens, etc.

For example, if I am an automobile design engineer, and I have ready access to a good summary of what customers liked or didn’t like about last year’s vehicle and competitor vehicles in a similar class, then I have a better chance of creating a superior and more popular design this year.

My previous article, “Behind the scenes in natural language processing: Is machine learning the answer?,” mentioned that the two most-common approaches to NLP are rule-based (human-driven) systems or statistical (machine-driven or machine learning) systems. I began the discussion of rule-based systems by describing some benefits. But these systems also pose some challenges, which I will elaborate on here.


Event Stream Processing with Text Analytics

Is text analytics part of your current analytical framework?

For many SAS customers, the answer is yes, and they've uncovered significant value as a result.

As text data continues to explode both in volume and the rate at which it's being generated, SAS Event Stream Processing can be used to analyze not only high-velocity structured data, but also the text (by using text models in stream).

In some cases, standard batch processing delivers the analytical insight sufficient for organizations. Yet, what about those other situations where taking action immediately, as an event is happening, is critical? These sub-second actions and real-time alerts can save or make millions of dollars for a company.

Below, I describe techniques that highlight streaming analysis of text data (many of these elements apply to structured data as well). My hope is this will trigger ideas and use cases for you to think about within your company.

1.)   Data Quality and Cleansing

Anyone who has worked with social data (or any text data for that matter) understands that it can be cluttered with noise, encoding issues, abbreviations, misspellings, etc. If not corrected, this can lead to inaccurate results and even processing errors. So why not deploy Event Stream Processing to correct and transform variables before they hit your database? As you’d expect, not every data quality issue can be resolved on the frontend of data collection, but by applying known corrections upfront, you have the ability to enrich your data and enhance the value of data sitting within your database.
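To illustrate the idea of correcting text before it reaches the database, here is a minimal Python sketch. It uses a plain generator as a stand-in for a stream and does not use the SAS Event Stream Processing API; the event shape and the correction table are assumptions.

```python
import re
import unicodedata

# Hypothetical table of known abbreviations and misspellings.
KNOWN_CORRECTIONS = {"pls": "please", "thx": "thanks", "recieve": "receive"}


def clean_text(text):
    """Normalize Unicode forms, collapse whitespace and apply known corrections."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    words = [KNOWN_CORRECTIONS.get(word.lower(), word) for word in text.split(" ")]
    return " ".join(words)


def cleanse_stream(events):
    """Yield each event with its text field cleaned, one event at a time."""
    for event in events:
        yield {**event, "text": clean_text(event["text"])}


incoming = [{"id": 1, "text": "pls  send   thx"}, {"id": 2, "text": "did not recieve card"}]
for event in cleanse_stream(incoming):
    print(event)
```

Only exact whole-word matches are corrected here, so a token such as "thx!" passes through unchanged; that is in line with the point above that not every issue can be fixed up front.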

Image 1: Diagram of an Event Stream Processing flow, integrating text analytics, pattern detection, and predictive modeling.

2.)   In-Stream Sentiment Analysis and Categorization

SAS has a powerful set of text analytics technologies that customers have been using for over 10 years. In the latest release of SAS Event Stream Processing (version 3.1, which comes out in May), customers who currently license SAS Sentiment Analysis, SAS Content Categorization, or SAS Contextual Analysis can now deploy these models against streaming data. This opens a window of opportunity to tag unstructured data on the fly (such as sentiment scoring, classifying documents, or extracting entities). These results are then inputs to event stream models for additional scoring, or to generate alerts, prompts, or to take a specific action. To learn more about SAS Text Analytics, check out SAS Contextual Analysis, SAS Text Miner, and SAS Sentiment Analysis.

3.)   Embedded Modeling

In text analytics, the goal is to convert unstructured data into some structured format, such as flags, scores, categories, and entities. For many applications, these new variables are most valuable when they are used to enhance predictive models, trigger alerts, create risk scores, enrich content, and ultimately to track and report. Through embedded analytics, SAS DS2 code can be deployed within event stream processing flows (functions in C++, XML, and regular expressions may also be used), which means real-time scoring of both structured data and unstructured text using regression models, decision trees, and more.

4.)   Integrated Data Sources

In many situations, insights from streaming data can only be realized when multiple data streams are integrated together. SAS Event Stream Processing allows users to join and merge data in stream, so that the calculations and models may be applied to the comprehensive dataset. For example, a large call center has streaming data in the form of customer complaints and service-related questions. Once a customer comment is received, SAS Event Stream Processing can extract the customer name and/or customer ID and match it to transactional history for that customer, while also categorizing the reason(s) for the complaint or question. This in turn can trigger a prompt to the agent to adopt a retention strategy or potentially upsell them to a new product or service.

Image 2: SAS Event Stream Processing Streamviewer
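The call center example can be sketched as an enrichment step applied to each incoming comment. Everything here is hypothetical and written in plain Python rather than SAS Event Stream Processing: the customer ID format, the in-memory history table and the keyword-based categories are stand-ins for real lookups and models.

```python
import re

# Hypothetical in-memory stand-in for a transactional history lookup.
CUSTOMER_HISTORY = {
    "C1042": {"name": "A. Jensen", "open_orders": 2, "lifetime_value": 1850.0},
}

# Hypothetical keyword lists per complaint category.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "charge", "refund"},
    "service": {"outage", "slow", "disconnected"},
}

CUSTOMER_ID_RE = re.compile(r"\bC\d{4}\b")  # assumed ID format: C plus four digits


def enrich(event):
    """Attach customer ID, history and complaint categories to one event."""
    text = event["text"]
    match = CUSTOMER_ID_RE.search(text)
    customer_id = match.group(0) if match else None
    words = set(re.findall(r"[a-z]+", text.lower()))
    categories = sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws)
    return {
        **event,
        "customer_id": customer_id,
        "history": CUSTOMER_HISTORY.get(customer_id),
        "categories": categories,
    }


def enrich_stream(events):
    """Enrich events one at a time as they arrive."""
    for event in events:
        yield enrich(event)


comment = {"text": "Customer C1042 wants a refund for a double charge."}
print(enrich(comment))
```

The enriched record is what a downstream rule would inspect to decide whether to prompt the agent with a retention or upsell offer.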

5.)   Emerging Issue Detection

As data floods into your organization, it is sometimes difficult to spot emerging trends and issues. Currently, many organizations run batch jobs to detect and resolve these issues. Because SAS Event Stream Processing can be both stateful and stateless, aggregations and advanced models can be used to identify emerging topics, categories, sentiment and other indicators in real time. These emerging issues can be detected using sophisticated pattern matching that supports detecting patterns based on the relationship of one event to one or more other events within a defined period of time. Thresholds can be set and events can be used to determine the relevancy and immediacy of any associated instruction/action. This changes the process from being reactive to proactive in the sense that an emerging issue can be monitored in real time.
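A very small version of "threshold within a defined period of time" is a sliding-window counter per category. The class below is a plain Python sketch, not the pattern-matching engine of SAS Event Stream Processing; the window length, the threshold and the assumption that timestamps arrive in non-decreasing order are all illustrative.

```python
from collections import defaultdict, deque


class EmergingIssueDetector:
    """Flag a category once enough events fall inside a sliding time window."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = defaultdict(deque)  # category -> recent event times

    def observe(self, timestamp, category):
        """Record one event; return True when the category reaches the threshold."""
        recent = self._timestamps[category]
        recent.append(timestamp)
        # Drop events that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.threshold


detector = EmergingIssueDetector(window_seconds=60, threshold=3)
events = [(0, "outage"), (10, "outage"), (20, "billing"), (30, "outage"), (200, "outage")]
alerts = [detector.observe(ts, category) for ts, category in events]
print(alerts)
```

Because state is kept per category, the detector is stateful in the sense described above: each new event is judged against the events that came before it.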

Real-time systems such as SAS Event Stream Processing are used for a variety of purposes. By integrating this technology as a front end to key, time-sensitive deployments, organizations gain a competitive advantage in both time and quality.

To learn more about SAS Event Stream Processing, check out the following links, and feel free to contact us if you'd like more information.

Also, if you're out in Dallas at SAS Global Forum next week, be sure to stop by and check out SAS Event Stream Processing.


What’s it take to be a data scientist?

In February of this year, the Washington Business Journal reported that the US Government had appointed its first Chief Data Scientist, DJ Patil. With this, I think it’s now safe to say that data science is officially sanctioned as a new mode in organizations. The work of those who can apply the necessary finesse, along with business acumen, to make sense of big data has formalized into a ‘new’ profession.

I talked to one of our own to find out his thoughts on what it takes to be a data scientist. And true to his ilk, SAS’s Adam Pilz applied text analytics to figure out what skills were being sought to fill this coveted role.

Adam Pilz, SAS

Crawling just over 7,000 public postings from a job website, Adam investigated the key elements companies were looking for in a data scientist. Candidates must be highly educated to land a job: a master's degree or higher was listed as a requirement in 81% of the advertised jobs, compared with the 12% of the American population holding one. Indeed, there is a clear distinction between the level of scholarship obtained by the general public and that required of a data scientist.

As for the prowess of data scientists, he found that the top 10 most desirable analytical skills mentioned by prospective employers were:


Adam suspects that the first two categories (machine learning and optimization) may simply be popular buzzwords added to job postings, and perhaps optimization may be the Human Resources department’s way of describing how to make things better, rather than the mathematical method. If that holds true, then it’s possible that text analytics is the most sought-after skill in the data scientist market. At a minimum, it’s in the top three.

He saw that text analytics and forecasting were the fastest growing desirable skills. And of course, as with all text analysis, various synonyms were captured for each of the terms seen above. For example, content analysis, NLP, sentiment analysis, text classification, topic extraction, etc. are all included in the term ‘text analysis’.
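The synonym roll-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Adam's actual code: the synonym map reuses the variants named in the text, while the function names and the four sample postings are made up.

```python
import re
from collections import Counter

# Variant phrases rolled up to one skill category (variants taken from the text above).
SYNONYMS = {
    "text analysis": "text analysis",
    "text analytics": "text analysis",
    "content analysis": "text analysis",
    "nlp": "text analysis",
    "sentiment analysis": "text analysis",
    "text classification": "text analysis",
    "topic extraction": "text analysis",
    "forecasting": "forecasting",
}


def skills_in_posting(text):
    """Return the set of skill categories mentioned in one posting."""
    lowered = text.lower()
    found = set()
    for phrase, category in SYNONYMS.items():
        # Word boundaries keep short terms like "nlp" from matching inside other words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(category)
    return found


def skill_share(postings):
    """Share of postings that mention each category; a posting counts once per category."""
    counts = Counter()
    for posting in postings:
        counts.update(skills_in_posting(posting))
    return {skill: n / len(postings) for skill, n in counts.items()}


# Made-up postings, for illustration only.
postings = [
    "Seeking data scientist with NLP and sentiment analysis experience.",
    "Forecasting and topic extraction skills required.",
    "Strong SQL skills; text classification a plus.",
    "Experience with demand forecasting preferred.",
]
shares = skill_share(postings)
print(shares)
```

Counting each category at most once per posting keeps a posting that lists several synonyms from inflating the share for that skill.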

‘Data wrangling’ is a fun term. It conjures up romantic notions of the wild west, within (no doubt) the Text Frontier - wrestling big data beasties, captured by casually (and similarly) dressed cowboys who are methodical in their approach (big buckle bragging rights will be seen at this year’s SAS Global Forum in Dallas, as a matter of fact).

Breaking this out further, Adam compared:

  • “lower level skills” = those that are lower in importance as educational attainment increases, relative to
  • “higher level skills” = those that become more important to have as educational attainment increases.

In the two charts below, the skills are in ranked order of importance.

Importance of ‘lower level’ analytical skills with educational attainment

As education level increases (from left to right), skills like data wrangling, data visualization and basic statistics are featured less prominently as required skills for data scientists, since Masters, and even more so PhDs, are expected to focus their time on more sophisticated types of analysis.

Importance of ‘higher level’ analytical skills with educational attainment

Text analytics, on the other hand, jumps significantly in importance ranking as skill level rises, possibly because the outputs of such an analysis are highly sensitive to the methods used and thus impacted by subject matter expertise. Linear regression and design of experiments both become more important with increasing education, and generalized linear models show up as a required skill for PhDs.

I also asked Adam if he has seen any trends in the usage of the term ‘data scientist’. He said that “the level of education required to be a data scientist has remained the same for the last year, but there are important geographical differences”. Backing this up, he pointed to differences in the highest level of education required in the job postings. Looking at the entire US, he found that Bachelor’s was the least sought-after degree for data science positions, seen in only 19% of the job postings, while Masters was the most cited educational requirement, at 54% of the advertised positions. PhDs claimed the remaining 27%.

Geographical differences in required skill level were found in Silicon Valley relative to the rest of the country. He saw that inside Silicon Valley, PhDs were required for 50% of the jobs listed (and Masters were required for 36%). This was in contrast to jobs outside of Silicon Valley, where PhDs and Masters were identified for 33% and 55% of job postings, respectively.

It’s been said that SAS has more PhDs on staff than does any single university in the United States. And if you’re using open source code for example, perhaps you do need more PhDs on staff to make sure that algorithms are behaving correctly. I know here at SAS we build that expertise right into the software.

I asked Adam what software he used to do this analysis. He initially used Base SAS® and found that once he’d written the code to tag terms, he was able to find them in the postings. However, he soon moved to SAS® Contextual Analysis. The difference? SAS Contextual Analysis highlighted the words tagged by each category, so he was able to search for specific terms and see what else people were talking about. He found that the text analytics software gave him insight into how different postings were saying similar things, in addition to suggesting what other phrases he might want to investigate, concluding that the text analytics approach was “... more enlightenment than discovery”.

Adam did this research before coming to SAS – in his search for a new career. We are thrilled to have him and his data scientist prowess as part of the SAS family.

Regardless of title (Adam is described as a Solutions Architect), the skills attributed to data science have been held by those in the analytic field for some time.

Do you see yourself as a data scientist?


Don’t Second Guess – Depend on Prescriptive Analytics

I don’t know why I’m on this medical theme lately – maybe it’s because my parents are aging. They talk about bits falling off, take lots of naps and describe how body parts don’t work like they used to. They’ve gone to pre-packaged pills, with their medications divided up by day and time of day by the local pharmacist. It’s helped a lot. I’ve got a lot more confidence that my Dad won’t (again!) take a sleeping pill first thing in the morning, before he gets in the car to drive. Ugh.

Confidently knowing what action needs to be taken because it’s pre-packaged is very appealing to many aspects of business too.  Wouldn’t it be nice to know that front-line workers make all their decisions in a particular situation based on expert advice that includes organizational policies and requirements? Even when it’s not people, but other systems or even devices making decisions.  Like in the Internet of Things, isn’t it necessary for those things to base action on situational understanding - triggering a specific (and appropriate) action for a particular scenario? Yes, particularly when we turn our decisions over to machines to make them for us.

Pre-packaged pills

Taking prescriptive actions includes the benefits of:

  • Consistency – under the same scenario conditions the same action is taken
  • Repeatability – when the same situation arises, you can reuse the same logic
  • Efficiency – no additional energy is spent investigating the action to be taken, it’s prescribed.

And together, these things reduce the risk of the wrong decision being made and an inappropriate action being taken.
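As a rough illustration of how a prescribed action can be codified so that it is consistent and repeatable, here is a small Python sketch of an ordered rule list. The rule names, scenario fields and actions are all hypothetical and borrow from the pill example only for flavor; this is not how any particular decision management product works.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]  # when the rule applies
    action: str                        # the prescribed action


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule(
        "no_sleeping_pill_in_morning",
        lambda s: s["medication"] == "sleeping pill" and s["time_of_day"] == "morning",
        "withhold",
    ),
    Rule(
        "scheduled_dose",
        lambda s: s["time_of_day"] in s["scheduled_times"],
        "dispense",
    ),
]


def prescribe(scenario, default="refer to pharmacist"):
    """Return the prescribed action for a scenario, or the default if no rule applies."""
    for rule in RULES:
        if rule.condition(scenario):
            return rule.action
    return default


morning_sedative = {
    "medication": "sleeping pill",
    "time_of_day": "morning",
    "scheduled_times": ["evening"],
}
print(prescribe(morning_sedative))
```

Because the rules are data rather than judgment calls, the same scenario always yields the same action, and anything the rules do not cover falls through to an explicit default instead of a guess.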

So where do you get the expertise in the prescribed action? My parents get it from a subject matter expert - the pharmacist, who, based on his training, directions from the doctor and knowledge of current medications, defines which pills go into which sealed envelope. In fact, their pharmacist was able to decrease their medications – simply because of his perspective on the buffet of pills they’d been prescribed over the years.

The Wrinklies

Organizations get the expertise from a few places. From their analytical experts who examine operational data to assess the pros and cons of different factors influencing behaviors and outcomes – summarized in advanced analytic models. From business analysts, who consider situational conditions, organizational policies, regulatory controls and analytical model scores in relation to decision objectives. They develop the business rules that define the conditions under which an analytical model is relevant. From their IT departments who have spent time collecting, cleansing and normalizing operational data to ensure currency, accuracy and availability. And from corporate executives who determine organizational policies and mandates to align stakeholder and compliance requirements.

In a recent IIA Research Brief, the difference between predictive and prescriptive analytics is detailed. The paper also goes into more depth on how you gain (likely untapped) prescriptive insight from unstructured text data – it’s amazing how much direction is often included in narrative. Going beyond the data discussion, it describes how prescriptive actions are codified using the discipline of enterprise decision management. And lastly, it explores the impact of big data in the form of streaming data – necessitating more operational and tactical decision discipline.

The Wrinklies (my nickname for my folks) have taken some of the guess-work out of their routine and we all agree, they are better off for it. To give you more confidence, what operational activities in your organization would you like to see prescribed?

Seth Grimes with More on Text Analytics

Perhaps it’s the same for you - it’s getting harder to get to all the conferences I’d like to attend. One of the benefits of getting out there is a chance to learn about different perspectives in an industry. When someone has a broad perspective, particularly if they’ve been in an area for a number of years, their focal lens can often see unique trends.

Earlier this year, I was able to catch up with Seth Grimes and get his perspective on:

Seth Grimes, Alta Plana Corporation

  • To what extent has text analytics become an essential part of data analysis?
  • Why have some organizations not realized positive ROI – and how can they improve?
  • What’s unique about text analytics?
  • What are the hottest issues in the text analytics market over the next few years?

Check out the free recording of our discussion so that you can hear what he had to say!

Looking at this field since 2002, and with his market survey running since 2009, Seth Grimes of Alta Plana Corporation, an industry expert in text analytics, has seen the field grow from an interesting concept into an applied discipline. He released his most recent study this year.

Unstructured text analysis is expected to grow even more next year – particularly in organizations that have started to understand big data. To date, many have focused on the more traditional structured side of big data. The next chapter is to understand the unstructured – text being a large part of that.

So as Seth says we can only expect “more”. What does more text analytics mean to you?
