Last month at the SIGIR meeting in Boston, one of the presentations given by a Teragram customer attracted notice in a Twitter post.
The NY Times automated topic tagging for its website by implementing software that builds its indexes automatically. However, as the tweet points out, the machine has NOT replaced Man, because the newspaper continues to rely on MANUAL entries by the people who build and maintain the New York Times Index, a more traditional index.
Stephen Arnold wondered on his blog why an organization might continue to require human labor for a task a machine can now perform. Could it be political resistance to change? Or perhaps the machine sometimes fails? Or perhaps employees who lack the skills to be reassigned are in fact first in line for a "pink slip" in the next round of budget cuts?
Mr. Arnold's ideas are all valid possibilities, and I've seen cases of each in my experience transferring technology from the research lab into business production environments. Those who put a stake in the ground and step forward as the first role models for how text analytics can carry their business forward ought to pause and consider their own culture.
Since the original question was about a Teragram customer implementation, I asked Saratendu Sethi, the director of engineering at Teragram, to share what he has observed in his consulting engagements. Here is his response.
First of all, even if automatic categorization guarantees >99% accuracy, for a news company it is absolutely critical not to present wrong information in even the remaining 1% of cases. This can only be verified by having humans validate the categorization results. They are doing that on a subset of articles, e.g. front-page articles.
Secondly, new topics constantly emerge in the coverage of current events. Even the best text mining algorithms can’t achieve perfection in spotting emerging topics because these algorithms are usually based on processing past content. Also, the definition of emerging topics is based on human perception, which is affected by time, location and the type of entities involved in the event. Therefore, these topics have to be manually spotted and added to documents/taxonomy while they are emerging.
Having said that, the following are four benefits that Teragram categorization achieves for the New York Times:
(1) If two people are asked independently to suggest categories for the same document, they are always going to come up with different categories. Automatic categorization enforces consistency and removes human subjectivity by automatically suggesting categories to them.
(2) Automatic categorization saves time because it is easier to ask editors to select appropriate categories from an automatically generated list than to have them think the categories up themselves. With automatic categorization, I can spend just a few seconds, but with manual categorization I need a few minutes to read the content and decide on the appropriate topics.
(3) Entity extraction (e.g. identifying persons, locations, etc.), which doesn’t require much human input, is automated.
(4) Automatic categorization enables the New York Times to process all of its past archives. Currently the New York Times re-processes its past 25 years of content with updated taxonomies every few months.
4a. The human editors review only the current day's articles (~500-1000 articles/day), whereas the past archives might include 100K articles/year.
4b. If “swine flu” was only identified as a news topic in 2009, automatic categorization allows the NYT to find what other news about it appeared in the past.
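To make points (2) and (4) a little more concrete, here is a minimal Python sketch of the "machine suggests, human reviews" workflow. It is not Teragram's product or API. The `Article` class, the function names, the keyword-matching rule and the sample articles are all assumptions invented for illustration, and a real categorizer would use far richer rules than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical record; a real archive would carry much more metadata.
    title: str
    text: str
    year: int
    suggested: set = field(default_factory=set)


def suggest_categories(article, taxonomy):
    """Return every topic with at least one trigger phrase in the article.

    `taxonomy` maps a topic name to a list of trigger phrases. Plain
    substring matching stands in for a real categorization engine.
    """
    haystack = f"{article.title} {article.text}".lower()
    return {
        topic
        for topic, phrases in taxonomy.items()
        if any(phrase.lower() in haystack for phrase in phrases)
    }


def reprocess_archive(archive, taxonomy):
    """Re-tag the whole archive, e.g. after the taxonomy gains a new topic."""
    for article in archive:
        article.suggested = suggest_categories(article, taxonomy)


def review_queue(archive, current_year):
    """Only current articles go to human editors; the rest stays automatic."""
    return [a for a in archive if a.year == current_year]


# Made-up sample data, purely for illustration.
archive = [
    Article("Vaccination drive planned", "Officials discuss swine flu shots.", 1976),
    Article("New cases reported", "Hospitals report new H1N1 cases.", 2009),
    Article("City budget vote", "The council votes on Tuesday.", 2009),
]

taxonomy = {"Health": ["vaccination", "hospital"]}
reprocess_archive(archive, taxonomy)

# An editor spots an emerging topic and adds it to the taxonomy by hand.
taxonomy["Swine Flu"] = ["swine flu", "h1n1"]
reprocess_archive(archive, taxonomy)

for article in archive:
    print(article.year, article.title, sorted(article.suggested))
print("For human review:", [a.title for a in review_queue(archive, 2009)])
```

The division of labor is the point of the sketch: a person adds the emerging topic, the machine applies it consistently across every year of the archive, and editors only confirm or correct suggestions on the current articles.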
So what do you conclude from this post? How would YOU answer the question posed in the title of this entry? Is it cost-effective to apply Man "and" Machine together, or has the science progressed enough to replace Man? Is it time to choose a Man "or" Machine approach when deciding how to become more efficient?
Saratendu answers with the "AND" operator, and that's the answer I prefer too, because I'm not comfortable letting those sci-fi robots and machines take over my world.
How about you?