In this blog series, I am exploring whether it’s wise to crowdsource data improvement, and whether the power of the crowd can enable organizations to incorporate better enterprise data quality practices.
In Part 1, I provided a high-level definition of crowdsourcing and explained that while it can be applied to a wide range of projects and activities, applying crowdsourcing to data improvement involves three aspects: type of data, kind of crowd, and form of improvement. Part 2 focused on type of data and kind of crowd. Part 3 focuses on form of improvement, which can be divided into two categories: data quality and data enrichment.
Crowdsourcing data quality
This form of improvement typically asks the crowd to check for defects in provided data samples in order to discover existing data quality issues and, if possible, resolve them. A few examples include the following:
- Would you like to play a game? — Luis von Ahn, who is considered one of the pioneers of crowdsourcing, remarked, “For the first time in history, we can get one hundred or two hundred million people all working on a project together. If we can use their brains for even ten or fifteen seconds, we can create lots of value.” He called this harnessing brain cycles, and he did so by creating online games. One of them, the ESP Game, was licensed to Google in 2006. Two Internet players are shown the same image, and if they type in the same word to describe it, another image pops up. The goal is to match descriptions on 15 images in under 3 minutes. The metadata created by this game helped Google improve image search results. Another von Ahn creation, acquired by Google in 2009, was reCAPTCHA, which presents an image of squiggly letters as a Turing test: users verify access to a website or complete an online purchase by correctly typing the distorted letters. An added benefit was that, since most reCAPTCHA images came from scanned pages of old books, users helped Google digitize, word by wiggly word, vast libraries of world literature.
- Quality through quantity — Crowdsourcing is a good big data example of how quality can sometimes be achieved through quantity. Spelling errors and other mistakes in web search terms illustrate this well. Google alone receives 100 billion searches per month. With that many queries, the correct spelling can be inferred from the searches that succeed, without using a spell-checking algorithm, and search term recommendations can be made that help users find what they’re looking for with greater speed and accuracy.
- Quality assurance — Automated data quality processes rely on the periodic review of random data samples by humans. A common example is data matching (e.g., identifying duplicate records), where groups of matched records are reviewed to verify the matching algorithm. With crowdsourcing, this time-consuming and resource-constrained activity can be performed faster, more frequently, and with larger data samples. Adding social (e.g., ability to see the comments from other reviewers) and ranking (e.g., up/down votes or five-star ratings) components can provide detailed feedback for improving the process as well as the data.
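To make the quality-through-quantity idea a little more concrete, here is a minimal Python sketch of how a log of searches might be used to suggest a correction. It is only an illustration under stated assumptions, not a description of how any search engine actually works: the function names, the thresholds, and the sample log are all hypothetical, every logged query is assumed to have been a successful search, and the standard library's `difflib` string similarity stands in for whatever method would group near-identical queries in practice.

```python
from collections import Counter
from difflib import get_close_matches


def build_query_counts(successful_queries):
    """Count how often each normalized query appears in the log."""
    return Counter(q.strip().lower() for q in successful_queries)


def suggest(query, counts, cutoff=0.8, min_ratio=5):
    """Return a far more popular near-identical query, or the query unchanged.

    cutoff and min_ratio are illustrative thresholds, not tuned values.
    """
    query = query.strip().lower()
    # Candidates are logged queries whose text closely resembles the input.
    candidates = get_close_matches(query, list(counts), n=5, cutoff=cutoff)
    if not candidates:
        return query
    best = max(candidates, key=lambda c: counts[c])
    # Only suggest when the crowd overwhelmingly prefers the alternative.
    if best != query and counts[best] >= min_ratio * max(counts[query], 1):
        return best
    return query


# Hypothetical log: the crowd mostly spells the term correctly.
log = ["crowdsourcing"] * 50 + ["crowdsourcng"] * 2 + ["data quality"] * 30
counts = build_query_counts(log)
print(suggest("crowdsourcng", counts))  # crowdsourcing
```

No dictionary appears anywhere in this sketch. The only source of "correct" spellings is what the crowd typed most often, which is why the approach depends on having a very large number of queries.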
Crowdsourcing data enrichment
This form of improvement typically asks the crowd to generate reference data by providing their own examples of common data errors, known data variants, or supporting information. A few examples include the following:
- Crowd carpooling — Waze is a GPS-supported smartphone app providing turn-by-turn directions enhanced with user-submitted data providing route details, travel times, and location-based alerts (e.g., speed traps). What differentiated this app was its enriched social data, which also made its developers rich when Google bought Waze for one billion dollars in 2013.
- ¿Cómo se dice? — While enterprises often have a global reach, expertise is always local, especially linguistic expertise. Your globally dispersed customers can add local expertise to your data. Examples include helping you linguistically localize the descriptions of your products and services, and providing multilingual translations of common terms (e.g., señor/monsieur/herr and calle/rue/straße).
- Algorithmic enrichment — Despite our increasing reliance on algorithms, human intelligence is still needed to enrich data for machine learning by adding appropriateness to relevance, such as with web ad placement. While guns are relevant, for example, to a news story about a mass murder shooting spree, an ad selling guns appearing alongside this article would not be appropriate.
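As a rough sketch of how crowd-generated reference data might be consolidated, the Python below tallies translations submitted by contributors and keeps only those that enough of the crowd agrees on. Everything here is an assumption made for illustration: the `(term, locale, translation)` submission format, the function name, and the agreement thresholds are hypothetical, and a real program would probably also weigh contributor reputation.

```python
from collections import Counter, defaultdict


def build_reference_table(submissions, min_submissions=3, min_share=0.6):
    """Keep the translation most contributors agree on per (term, locale).

    submissions: iterable of (term, locale, translation) tuples.
    Returns (accepted, needs_review), where accepted maps
    (term, locale) -> translation and needs_review lists the keys
    with too few submissions or too little agreement.
    """
    tallies = defaultdict(Counter)
    for term, locale, translation in submissions:
        tallies[(term.lower(), locale)][translation.strip().lower()] += 1

    accepted, needs_review = {}, []
    for key, counter in tallies.items():
        total = sum(counter.values())
        translation, votes = counter.most_common(1)[0]
        if total >= min_submissions and votes / total >= min_share:
            accepted[key] = translation
        else:
            needs_review.append(key)
    return accepted, needs_review


# Hypothetical crowd submissions for one common term.
submissions = (
    [("street", "es", "calle")] * 4
    + [("street", "es", "cale")]        # a typo from one contributor
    + [("street", "de", "straße")] * 3
    + [("street", "fr", "rue")] * 2     # too few submissions so far
)
accepted, needs_review = build_reference_table(submissions)
print(accepted)
print(needs_review)
```

Routing low-agreement entries to a review list rather than discarding them keeps a human in the loop for exactly the cases where the crowd is thin or divided.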
Add your voice to improve data’s crowd
If you have an experience or perspective to share about crowdsourcing data improvement, or a specific question that was not addressed by this blog series, then please post a comment below.
I agree that crowdsourcing definitely has an advantage for tasks such as data quality assessment and improvement. My colleagues and I performed a crowdsourcing data quality assessment experiment specifically for Linked Data Quality. While our experiments confirm that the use of "crowds" is cost-effective and less time-consuming, there are certain tasks that the crowd could not perform. These tasks were mainly concerned with the structure of Linked Data, which the crowds are perhaps unaware of. Nevertheless, the use of crowds not only to find and verify data quality problems but also to fix them is definitely an advantage.