~ Contributed by Fiona McNeil ~
With almost 2 billion internet users worldwide at the end of 2009, extensive document stores in most organizations, and widespread duplication of materials, the problem of thoroughly evaluating text in a consistent, efficient manner is daunting, to say the least.
That’s where Text Analytics comes in: making sense of it all so that you can act on the information, fast.
As a concept, text is pretty straightforward: free-form words, sentences and paragraphs contained in blogs, tweets, web sites, reports, commentaries, claims, applications and all kinds of other documents. Unlike numerical data, which is considered “structured”, text doesn’t have rows and columns to define relationships and is thus termed “unstructured”. Common estimates suggest 70 – 80% of all data is unstructured. That definition includes voice and images.
Without understanding this text data, organizations are making decisions based on only 20 – 30% of the data available. Is that good enough, or even wise?
At its most basic level, “analytics” refers to using technology and methods to speed up and improve the process of discovery, insight and action – identifying patterns or connections the human brain may struggle to see. Done right, it improves consistency, efficiency and effectiveness, providing more time for employees to think strategically. Text Analytics is the automation of the human component: having machines read, interpret and act on unstructured data.
The subject is fraught with terminology and definitions that can easily confuse. For example, terms like taxonomy and ontology are often used interchangeably. They complement each other in automating business processes, but their purpose, use and capabilities are very different. No matter what your current state of analyzing text, and regardless of the terminology used, there are three simple rules to keep in mind.
Rule 1: Text data is messy
Text data, by its inherent free-form nature and an individual’s writing style, is plagued by misspellings, acronyms, clipped text (e.g. ttfn), emoticons, etc. Data pre-processing is required. As with any analytic project, the results depend on the quality of the input and on how well that data is explored and then modified to best suit the analysis.
If you are looking to identify consistent patterns across a document collection, as AFA Insurance has in its workplace and injury claims assessments, differences in the input text need to be addressed (by creating synonyms, for example) in order to find new insights across the collection that simply would not be found by looking at reports in isolation. On the other hand, it may be those very differences that you want to find amongst the text collection, as is often the case when examining claims for suspicious activity, abuse and fraud, potentially saving millions of dollars in the process.
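To make the pre-processing idea concrete, here is a minimal Python sketch of the kind of clean-up described above: lowercasing, tagging emoticons, correcting known misspellings, collapsing synonyms and expanding clipped text. The lookup tables and the `normalize` function are hypothetical, invented purely for illustration; a real project would build its own lists from its own documents, and commercial text analytics tools handle this far more thoroughly.

```python
import re

# Hypothetical lookup tables, for illustration only.
CLIPPED = {"ttfn": "ta ta for now", "asap": "as soon as possible"}
EMOTICONS = {":)": "emoticon_happy", ":(": "emoticon_sad"}
MISSPELLINGS = {"recieve": "receive", "accidnet": "accident"}
SYNONYMS = {"automobile": "car", "vehicle": "car", "hurt": "injured"}


def normalize(text):
    """Return a list of cleaned tokens for one free-form text string."""
    text = text.lower()
    # Tag emoticons before tokenizing, otherwise the punctuation is dropped.
    for symbol, label in EMOTICONS.items():
        text = text.replace(symbol, f" {label} ")
    # Keep runs of letters, underscores and apostrophes; digits are ignored.
    tokens = re.findall(r"[a-z_']+", text)
    cleaned = []
    for token in tokens:
        token = MISSPELLINGS.get(token, token)  # fix known misspellings
        token = SYNONYMS.get(token, token)      # map synonyms to one term
        # Clipped text may expand into several words.
        cleaned.extend(CLIPPED.get(token, token).split())
    return cleaned


print(normalize("Driver was HURT in the automobile accidnet :( ttfn"))
```

Whether you collapse differences like this depends on the goal. For pattern-finding across a collection it helps; for fraud detection you may prefer to keep the raw variations, since they can be the signal.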
So yes, there’s some initial work with the data. However, the resulting models are embedded into operational systems, automating this pre-processing, which in turn saves substantial manual effort. When ChinaHR.com applied their analysis, they were able to deal with documents in different formats, writing styles and dialects – automatically matching résumés to job postings with 95% accuracy in a fraction of the time.
Rule 2: Text models change over time
It may be that your evaluation of customer reviews sparked a new social media branding campaign, or your ability to identify potential product issues before customers even notice has led to proactively predicting emerging problems, as Sub-Zero has successfully done.
Business analytics solves problems, and once they are solved, behavior changes: your customers are now more satisfied and, hopefully, their perceptions of your company have improved. The way you operate may have changed too. All of which means the content or words found in tweets, blogs and other forms may also change, by design.
And so must the models. New concepts will emerge, neologisms will develop, more involved analysis will be done as each issue is resolved, and new sources of input will be included. This is the benefit of text analytics: it continues to improve the business over time, refining your insights every step of the way. You’ll need an open management environment to test and validate models, one that permits users to override, modify and extend models, rules and taxonomies. These will need to be managed and controlled with explicit administration rights and audit capabilities within the system. As you begin to deal with more languages, acquire a new company, extend to new research areas and augment your document management systems, scalable, flexible technology becomes critical.
Realize that your needs and insights will demand multiple ways to examine text in order to improve different aspects of your organizational operations over time; there’s no one-size-fits-all approach. For example, AGATA found that by using a variety of integrated methods they effectively created a new social lending environment, bringing together both structured and unstructured data in a financial social network.
Rule 3: Collect metrics from model implementation
Text models and rules need to be tested and validated to ensure that input data has not changed significantly enough to warrant new modeling coefficients, and that classifications are still within acceptable standards. But even more importantly in a business context, you’ll need to measure the improvements gained.
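As a rough illustration of the validation described above, the Python sketch below checks two things: whether classifications are still accurate enough on a labeled sample, and whether incoming text contains many words the model has never seen (one simple sign that the input has changed). The function names and the thresholds (90% accuracy, 20% unseen words) are assumptions chosen for the example, not recommended standards.

```python
def accuracy(predicted, actual):
    """Share of predicted labels that match the known, hand-checked labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def out_of_vocabulary_rate(new_tokens, known_vocabulary):
    """Share of tokens in new input that the model never saw when it was built."""
    if not new_tokens:
        return 0.0
    unseen = sum(1 for token in new_tokens if token not in known_vocabulary)
    return unseen / len(new_tokens)


def needs_review(predicted, actual, new_tokens, known_vocabulary,
                 min_accuracy=0.90, max_oov_rate=0.20):
    """Flag the model when accuracy drops or too much of the input is unfamiliar.

    The default thresholds are illustrative assumptions only.
    """
    too_inaccurate = accuracy(predicted, actual) < min_accuracy
    too_unfamiliar = out_of_vocabulary_rate(new_tokens, known_vocabulary) > max_oov_rate
    return too_inaccurate or too_unfamiliar


# Example with made-up data: a small labeled sample and a batch of new tokens.
vocabulary = {"claim", "injury", "car"}
flag = needs_review(
    predicted=["fraud", "ok", "ok", "ok"],
    actual=["fraud", "ok", "fraud", "ok"],
    new_tokens=["claim", "injury", "drone", "car"],
    known_vocabulary=vocabulary,
)
print(flag)
```

Running a check like this on a schedule, and logging the numbers each time, also gives you the trend data needed to show whether the models are holding up.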
Ongoing monitoring will inform any new actions that are required. By analyzing reporting practices, the University of Louisville has been able to improve both patient care and hospital accreditation rankings. Improvements and efficiencies gained through the automation of document materials can be measured in hard numbers: resource cost reduction, dollars saved, fewer hospital accidents and workplace injuries. Decreases in the number of complaints, in product defects and in time spent searching for relevant materials can all be quantified.
The Hong Kong Efficiency Unit, in its public service application, has been able to help improve service delivery, inform better decisions and ultimately improve public satisfaction with the government. Such measures provide the evidence that improvements are being made, and that investments in the technology, new operational practices and policies are justified. You’ll be able to prove the value of that 70 – 80% unstructured data.
If you’d like to learn more, check out our Text Analytics 101 Webinar where we outline how Natural Language Processing (NLP) works – the method used by machines to evaluate the words, the meaning constructed in sentences, and how that is extracted from electronic text. Or follow us on our dedicated Text Frontier blog.
Ref: 1 = World Internet Usage Statistics: http://www.internetworldstats.com/stats.htm, viewed June 23, 2010.