Hybrid approach to text analytics

Double negatives seem to be everywhere; I have noticed them a lot in music recently, from Pink Floyd's "We don't need no education" to Rihanna's "I wasn't looking for nobody when you looked my way". My own favourite song with a double negative is "I can't get no sleep" by Faithless.

This last one is maybe the most appropriate, because I have been thinking about double negatives a lot recently. The Oxford Dictionary definition of double negative is as follows:

A negative statement containing two negative elements (for example "he didn't say nothing")
A positive statement in which two negative elements are used to produce the positive force, usually for some particular rhetorical effect, for example "there is not nothing to worry about"

However, this definition misses the point that their usage can be extremely nuanced. Double negatives are often used in litotes, a figure of speech in which an understatement is used to emphasise a point by stating a negative to affirm a positive. Context is everything: the phrase "not bad" can indicate a range of opinions, from just average to brilliant. Negation can also soften the harshness of an observation: some people might say "the service wasn't the best" as a politer way of saying "the service was bad".

I've been thinking about double negatives and negation in language because I have recently worked on several projects with business-to-consumer companies, analysing their complaint and NPS survey feedback data. Maybe it's a particularly English trait, but my countrymen seem to use negation a lot in this type of survey feedback. An airline customer will say "their meal wasn't bad", or a bank customer is "not worried that their interest payment is delayed". Of course they negate their positives too, so the airline customer "is not pleased they had to queue at check in" and the bank customer "isn't happy with the mistake setting up a power of attorney".

This type of language usage can cause a problem for some text analysis solutions, because their primary approach is to summarise the words in documents mathematically. For example, SAS Text Analytics uses a “bag-of-words” vector representation of the documents. This is a powerful statistical approach based on term frequency, but it ignores the context of each term. So if the same words are used in very different contexts, such as negations, there is a risk that documents might be classified in the wrong topic.
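To make the risk concrete, here is a minimal sketch of a bag-of-words summary in plain Python. It is not how SAS Text Analytics is implemented; the function name `bag_of_words` and the simple tokeniser are assumptions made purely for illustration. It shows that two sentences with opposite meanings can produce exactly the same term counts.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, discarding word order and context.

    Hypothetical helper for illustration only.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


praise = bag_of_words("The meal was good, not bad")
complaint = bag_of_words("The meal was bad, not good")

# Both sentences contain the same six words once each, so the two
# bags are indistinguishable even though the opinions are opposite.
print(praise == complaint)
```

Any method that only sees these counts has no way of telling the satisfied customer from the dissatisfied one.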

Fortunately, SAS Text Analytics also provides an extremely effective feature that allows you to treat terms differently according to context. The approach is described further in the white paper “Discovering What You Want: Using Custom Entities in Text Mining”.

I used this approach to define a library of approximately 6,500 positive and negative words and treated these differently if they were negated. You can almost think of this as a new user-defined ‘part of speech’. It gives more information to the mathematical summarisation of the documents and ultimately discovers more useful topics with fewer false positives.
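The idea of treating negated sentiment words as their own category can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS custom entity syntax: the tiny word lists stand in for the real library of about 6,500 words, and the three-token look-back window and the tag names are hypothetical choices.

```python
import re

# Tiny stand-in lexicons (assumption: the real library is far larger).
POSITIVE = {"good", "pleased", "happy", "best"}
NEGATIVE = {"bad", "worried", "delayed"}
NEGATORS = {"not", "no", "never"}
WINDOW = 3  # hypothetical: how many preceding tokens to check for a negator


def tag_sentiment(text: str) -> list[str]:
    """Replace sentiment words with a tag that records whether they are negated."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tagged = []
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = "POSITIVE"
        elif token in NEGATIVE:
            polarity = "NEGATIVE"
        else:
            tagged.append(token)  # ordinary words pass through unchanged
            continue
        preceding = tokens[max(0, i - WINDOW):i]
        # Treat "not", "no", "never" and contractions such as "wasn't" as negators.
        negated = any(p in NEGATORS or p.endswith("n't") for p in preceding)
        tagged.append(("NEGATED_" if negated else "") + polarity)
    return tagged


print(tag_sentiment("the meal wasn't bad"))
print(tag_sentiment("the meal was bad"))
```

Because "bad" and negated "bad" now become different terms, a term-frequency summary built on the tagged tokens can keep the two kinds of feedback apart.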

I’m embarrassed to admit I only speak English, but whilst researching this blog post I learnt that double negation is not used the same way across different languages. For example, it is extremely uncommon in Germanic languages; in some languages, like Russian, it is required whenever anything is negated; and in other languages double negatives actually intensify the negation.

However, aside from handling negations, this hybrid approach of combining linguistic rules with algorithms can be used in lots of other ways too. For example, it can deal with homonyms (same pronunciation, same spelling, different meaning, e.g. “lean” (thin) vs “lean” (rest against)) or heteronyms (different pronunciation, same spelling, different meaning, e.g. “close” (shut) vs “close” (near)), if these are used a lot in your corpus of documents.

Possibly the most beneficial use of all is to differentiate between language usages that may be specific to your corpus of documents. For example, an insurance assessor may be taking notes about an accident and write:

“… the customer works on an industrial estate and would like us to assess the damage on the car there”

“… the accident happened early evening on the southern industrial estate”

In this example, both notes mention an industrial estate, but only the second says where and when the accident happened. I could build linguistic rules to identify the time and location of accidents. This may improve the accuracy of models that detect insurance fraud, if there is a correlation between crash-for-cash accidents, locations and times. There are lots of very specific examples like this, where this hybrid text mining approach, which combines linguistic rules with machine learning, can significantly improve our text analysis results.
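As a rough illustration of such a linguistic rule, the sketch below uses a Python regular expression rather than SAS rule syntax. The pattern, the list of time phrases and the name `extract_accident_facts` are all assumptions made for this example; a real rule set would need to cover far more phrasings.

```python
import re

# Hypothetical rule: only treat "industrial estate" as an accident location
# when it follows a phrase such as "accident happened ... on the ...".
ACCIDENT_RULE = re.compile(
    r"accident\s+(?:happened|occurred)\s+"
    r"(?:(?P<time>(?:early|late)\s+(?:morning|afternoon|evening|night))\s+)?"
    r"(?:on|at|in|near)\s+the\s+"
    r"(?P<location>(?:[a-z]+ )*?industrial estate)",
    re.IGNORECASE,
)


def extract_accident_facts(note: str):
    """Return the accident time and location from a note, or None if the rule does not fire."""
    match = ACCIDENT_RULE.search(note)
    if match is None:
        return None
    return {"time": match.group("time"), "location": match.group("location")}


work_note = "the customer works on an industrial estate and would like us to assess the damage on the car there"
accident_note = "the accident happened early evening on the southern industrial estate"

print(extract_accident_facts(work_note))
print(extract_accident_facts(accident_note))
```

The extracted time and location could then be passed to a statistical model as extra structured fields, alongside the usual term-frequency features.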

3 Comments

Tuba Islam on June 26, 2015 8:54 am

Very interesting topic on double negatives. And great insight. Thank you Matthew. I am curious to hear your thoughts on sarcasm which also "might not be very uncommon" for Englishmen 🙂
- matthewstainer on June 26, 2015 10:30 am
  
  Thank you Tuba, I think you are right some English people, might consider sarcasm a national aptitude. Although interestingly it was the American author Donn Rittner said 'I speak two languages, English and Sarcasm'.
pranav waila on June 26, 2015 1:44 pm

Very interesting point, although i feel in case of sentiment analysis it won't affect adversely.
