The benefits of big data often depend on taming unstructured data. However, in international contexts, customer comments, employee notes, external websites, and the social media labyrinth are not exclusively written in English, or any single language for that matter. The Tower of Babel lives and it is in your unstructured data.
However, while the Babel fish remains a figment of The Hitchhiker’s Guide to the Galaxy, there ARE proven approaches for information extraction from text across multiple languages. I’ll discuss a few of these and illustrate where appropriate with screenshots from the recently released SAS Contextual Analysis 14.1.
One approach is to automatically translate all documents into a common language, then perform the analysis on the translated texts. Typically, this does not produce ‘readable’ texts (just try Google-translating your plea for a raise to your Italian boss and sending it off without a proofread :-). Nonetheless, this approach can be used to glean important insights.
Let’s say we want to work on customer complaints for a product. Generally, those complaints are going to be similar across borders and languages. We would create a project in English as that is the language into which all the complaints have been automatically translated. From that now single-language corpus, the topics below are derived.
These topics can then be ‘promoted’ into categories, which are then used to score incoming complaints. Before scoring, these categories can be refined right down to the syntax level to make sure that the resulting classifications are context appropriate.
When finalized, these categories can be used to score the incoming complaints for direction to the appropriate corporate department downstream.
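To make the scoring-and-routing idea concrete, here is a minimal Python sketch. It is not how SAS Contextual Analysis defines or applies category rules; the category names, keyword sets and department names are all hypothetical, and a real category rule would be far more context-aware than a simple keyword overlap.

```python
import re

# Hypothetical categories: each has a set of trigger terms and a target department.
CATEGORY_RULES = {
    "billing": {"terms": {"invoice", "charge", "refund"}, "department": "Finance"},
    "delivery": {"terms": {"late", "shipping", "delivery"}, "department": "Logistics"},
}


def score_complaint(text):
    """Count how many trigger terms of each category appear in the complaint."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {name: len(tokens & rule["terms"]) for name, rule in CATEGORY_RULES.items()}


def route(text, default="Customer Service"):
    """Send the complaint to the department of its best-scoring category."""
    scores = score_complaint(text)
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return default  # no category matched at all
    return CATEGORY_RULES[best]["department"]


print(route("I want a refund for this invoice"))
print(route("The delivery was late"))
```

A complaint that matches no category falls back to a default department, so nothing gets silently dropped.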
IMPORTANT TO REMEMBER in this approach: Be sure not to lose track of the mapping from each translated text back to its original document. In any text analysis, this mapping must be maintained, in case the analyst actually wants to read a representative or otherwise key sample set of documents in the original language.
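One simple way to keep that mapping is to store the original and the translation side by side under a shared document ID. The sketch below assumes a hypothetical record layout and uses a stand-in `fake_translate` function in place of a real machine translation service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslatedDoc:
    doc_id: str
    source_language: str
    original_text: str
    translated_text: str


def build_corpus(raw_docs, translate):
    """Translate each (doc_id, language, text) triple, keeping the original alongside."""
    corpus = {}
    for doc_id, language, text in raw_docs:
        corpus[doc_id] = TranslatedDoc(doc_id, language, text, translate(text, language))
    return corpus


def originals_for(doc_ids, corpus):
    """Look up the original-language texts for a set of analysed documents."""
    return [corpus[doc_id].original_text for doc_id in doc_ids]


def fake_translate(text, language):
    # Stand-in for a real translation call; it only knows one canned sentence.
    canned = {"Le produit est arrivé cassé.": "The product arrived broken."}
    return canned.get(text, text)


raw = [
    ("c1", "en", "The product arrived broken."),
    ("c2", "fr", "Le produit est arrivé cassé."),
]
corpus = build_corpus(raw, fake_translate)
print(originals_for(["c2"], corpus))
```

Whatever the analysis does with the translated text, the document ID always leads back to the text as the customer wrote it.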
A second, somewhat simpler, and often better approach is to use a targeted start list for the domain. In text analytics, a start list tells the system to treat only the terms in the list and ignore all others. This start list should contain key terms relevant to the analysis topic. For example, say a pharmaceutical company or a public health organization wants to look for a select group of symptoms and treatments in a large, international pool of doctor notes.
Such a start list could look like this.
Now, this start list can be followed up with a synonym list, which is used to identify the same words in the different analysis languages (in this case the TERMS are in the original languages of English, French and Dutch, and the PARENTS are all in English).
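As a rough illustration of how a start list and a synonym list work together, here is a small Python sketch. The terms and parents are my own examples rather than the contents of the lists above, and the tokenizer is deliberately naive (single words only, no stemming).

```python
import re
import unicodedata

# Hypothetical synonym list: TERM (English, French or Dutch) -> PARENT (English).
SYNONYMS = {
    "fever": "fever", "fièvre": "fever", "koorts": "fever",
    "cough": "cough", "toux": "cough", "hoest": "cough",
    "nausea": "nausea", "nausée": "nausea", "misselijkheid": "nausea",
    "aspirin": "aspirin", "aspirine": "aspirin",
}

# The start list is simply every term we are willing to look at.
START_LIST = set(SYNONYMS)


def extract_parents(note):
    """Keep only start-list terms from a note and map each one to its English parent."""
    # Normalize so accented letters are single characters that \w can match.
    text = unicodedata.normalize("NFC", note).lower()
    tokens = re.findall(r"\w+", text)
    return [SYNONYMS[token] for token in tokens if token in START_LIST]


print(extract_parents("Fièvre et toux depuis trois jours."))
print(extract_parents("Patiënt heeft koorts en hoest, aspirine gegeven."))
```

Everything outside the start list is ignored, so notes in any of the three languages end up described by the same small English vocabulary.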
The resulting analysis will then reveal topics that group the documents according to how often these symptoms and treatments co-occur in the texts (in this case, regardless of whether the original document was in English, French or Dutch).
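Co-occurrence itself is easy to picture with a few lines of Python. The sketch below only counts how many documents mention each pair of parent terms; actual topic derivation involves considerably more than pair counting, and the sample documents are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs_terms):
    """Count, for each pair of terms, the number of documents containing both."""
    counts = Counter()
    for terms in docs_terms:
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Each inner list holds the English parent terms found in one doctor note,
# whatever language the note was originally written in.
docs = [
    ["fever", "cough"],
    ["cough", "fever", "aspirin"],
    ["nausea"],
]
counts = cooccurrence_counts(docs)
print(counts.most_common(1))
```

Because the counting happens on the English parents, a French note and a Dutch note about the same symptoms contribute to the same pair.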
Of course, a third possibility remains: conducting separate analyses for the different languages. SAS Contextual Analysis 14.1 now offers support for Chinese, Dutch, Finnish, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish and Turkish.
So don’t let the cat get your tongue when it comes to multi-language text analysis. Use one of these approaches to turn the Tower of Babel into a babbling tower of insights!