This article is co-authored by Biljana Belamaric Wilsey and Teresa Jade, both of whom are linguists in SAS' Text Analytics R&D.
When I learned to program in Python, I was reminded that you have to tell the computer everything explicitly; it does not understand the human world of nuance and ambiguity. This is as valuable a lesson in text analytics as in programming.
When I share with new acquaintances that we have a team of linguists at our analytics company, they are often puzzled as to what our job entails. I explain that we use our scientific understanding of language to ensure that the computer interprets the symbols of human language correctly; for example, what a word is or where a sentence ends. You might think these are easy tasks; after all, even young children have answers to these questions. But, in fact, teaching a computer the seemingly simple task of where a sentence ends across a wide range of human language texts quickly becomes complex, because a period is not always a full stop.
Take, for example, abbreviations like “Mr.” and “Mrs.” in English, “Dipl.-Ing.” in German, “par ex.” in French, “г.” in Russian, etc. In all of these cases and across most languages, the period does not necessarily signify the end of the sentence. Instead, it means information has been left out that we, as humans, can guess from context: “Mr.” really means “mister,” “Mrs.” refers to a married woman (did you know it is short for “mistress”?), “Dipl.-Ing.” stands for “Diplom-Ingenieur” (an engineering degree), “par ex.” stands for “par exemple” (“for example”) and “г.” most often stands for “год” (“year”) or “город” (“city”). You might think telling the computer to ignore the period in these cases is a good way to avoid interpreting the period as the end of the sentence. But that won’t work everywhere: just consider the first sentence of this paragraph, where the period comes after the abbreviation “etc.” but also doubles as a sentence ender!
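To make the “etc.” problem concrete, here is a minimal Python sketch of an abbreviation-aware sentence splitter. It is only an illustration, not how any SAS product works: the two abbreviation lists and the “next word starts with a capital letter” rule are assumptions chosen for this example, and real text breaks them quickly.

```python
# Hypothetical, deliberately tiny abbreviation lists for illustration only.
TITLE_ABBREVIATIONS = {"mr.", "mrs."}   # normally followed by a name
OTHER_ABBREVIATIONS = {"etc.", "ex."}   # may or may not end a sentence


def split_sentences(text):
    """Split text into sentences on whitespace-separated tokens ending in '.'.

    Assumed rules:
    - a title abbreviation never ends a sentence;
    - any other listed abbreviation ends a sentence only when it is the last
      token or the next token starts with an uppercase letter;
    - every other token ending in a period ends a sentence.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        lowered = token.lower()
        if lowered in TITLE_ABBREVIATIONS:
            continue
        if (lowered in OTHER_ABBREVIATIONS
                and next_token is not None
                and not next_token[0].isupper()):
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Smith bought pens, paper, etc. Mrs. Jones paid."))
print(split_sentences("We sell pens, paper, etc. in bulk."))
```

The first call yields two sentences, because “etc.” is followed by a capitalized word, while the second stays as one. The capitalization rule is itself fragile: “etc.” followed by a proper noun in the middle of a sentence would be split incorrectly, which is exactly the kind of nuance the computer has to be told about.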
The situation is no less complex with numerals. In some parts of the world, including the US, South Asia and Australia, periods separate the decimals from the integer and commas separate thousands, for example: “100,000.25.” But in other parts of the world, including Europe and most of South America, convention dictates that the roles of the period and comma are reversed: commas are used for decimals whereas periods separate thousands, for example: “100.000,25.” In these cases, the entire numeral needs to be interpreted as one unit, and thousands of units of currency might be at stake.
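A short Python sketch shows what is at stake. The helper name parse_amount and its decimal_sep parameter are made up for this example; the point is only that the same string yields very different values depending on which convention the program assumes.

```python
from decimal import Decimal


def parse_amount(text, decimal_sep):
    """Parse a numeral, given which character marks the decimals.

    decimal_sep must be "." or ","; the other character is treated as the
    thousands separator and removed. (Hypothetical helper for illustration.)
    """
    if decimal_sep not in {".", ","}:
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."
    normalized = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(normalized)


# Each numeral read with its own convention: both are 100000.25.
print(parse_amount("100,000.25", decimal_sep="."))
print(parse_amount("100.000,25", decimal_sep=","))

# The second numeral read with the wrong convention: 100.00025.
print(parse_amount("100.000,25", decimal_sep="."))
```

Note that the sketch has to be told the convention up front. Guessing it from the numeral alone is not always possible: a string such as “1.000” could be one thousand or exactly one.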
Another example of extensive use of periods in the middle of a coherent string of text (also called a token) is in digital locations like URLs, paths and email addresses. For example, the simple URL “https://www.sas.com” has two periods in its hierarchical part.
In a path name to a specific file, you may find periods at the beginning indicating the current directory, multiple periods indicating a parent directory or a period near the end that indicates the file extension. Other periods may also occur in a full path, as in this example of a relative path: “../../this_system/src-1.0/about/.contact.me.txt.” In email addresses, periods may occur before the “@” sign, as well as afterwards. The latter usage is similar to the hierarchical part of a URL, as in “Mary.Poppins@sas.com.”
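One common way to keep such strings together is to let a tokenizer try the “digital location” patterns before it falls back to plain words and punctuation. The following Python sketch does this with a single regular expression. The patterns are simplified assumptions written for the examples in this article, not a complete grammar for URLs, email addresses or paths.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://\S*[\w/]                 # URL; must end in a word character or slash
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email address
    | (?:\.{1,2}/)+[\w./-]*\w           # relative path starting with ./ or ../
    | \w+                               # ordinary word
    | [^\w\s]                           # any single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return tokens, keeping URLs, emails and relative paths in one piece."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Write to Mary.Poppins@sas.com or see https://www.sas.com."))
print(tokenize("Open ../../this_system/src-1.0/about/.contact.me.txt."))
```

In both calls the final period comes out as its own token, while the periods inside the email address, URL and path stay attached. That works here because each pattern requires the token to end in a word character (or a slash, for URLs), so a trailing sentence period is left over for the punctuation rule.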
Many other patterns also include periods. Dates, times and telephone numbers frequently use periods to indicate abbreviations for the day of the week or the month, or as a separator equivalent of a space, slash or hyphen. Consider the following examples: “Feb. 15, 2015,” “Mon. June 15, 2015,” “15.06.2015” and “1.800.727.0025”.
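The numeric patterns above can be recognized in a similar spirit. This Python sketch labels a token as a date or a telephone number when it matches one of two simplified patterns. Both patterns and the label_token name are assumptions made for this example; they cover only the “15.06.2015” and “1.800.727.0025” shapes and do not validate that the date or number is real.

```python
import re

# Simplified, hypothetical patterns; checked in order.
PATTERNS = [
    ("date", re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")),      # e.g. 15.06.2015
    ("phone", re.compile(r"\d\.\d{3}\.\d{3}\.\d{4}")),     # e.g. 1.800.727.0025
]


def label_token(token):
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "other"


for token in ["15.06.2015", "1.800.727.0025", "Feb."]:
    print(token, "->", label_token(token))
```

A token such as “Feb.” falls through to “other” here, so it would still need the abbreviation handling discussed earlier.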
Software versions have periods as well, indicating a separation between major version, minor version and other maintenance or hot fix versions, such as in “SAS® 9.1.3” or “SAS 9.4m2” (Read more about SAS software versions). Other places you will see periods include outline formatting, measurements, equations, ellipses, library classification systems, and references to laws and parts of contracts. In all of these cases, treating the period as a full stop or sentence ender will give you undesirable results for text analysis.
With all these periods roaming around, it is quite amazing that computer systems processing human languages get it right as often as they do! Actually, determining how periods should be interpreted IS one of the easier language problems, because periods are either the end of a sentence or not, and they are either a word- or token-breaking character or not.
There are many other aspects of human languages across the world that are much more complex. It is a good thing that SAS Text Analytics customers don’t have to worry about these details, because there are linguists making sure that the underlying natural language processing (NLP) components do the right thing!
Do you have questions about language processing you have always wanted to ask a linguist? Or maybe you have data where periods play additional roles? We would be happy to hear from you in the comments section below.