Crowdsourcing data improvement: Part 1

James Surowiecki wrote a book about The Wisdom of Crowds. Jeff Howe, who co-coined the term crowdsourcing, wrote a book about Why the Power of the Crowd Is Driving the Future of Business. In this blog series, I explore whether it’s wise to crowdsource data improvement, and whether the power of the crowd can help organizations adopt better enterprise data quality practices.

Let’s start with a definition. Crowdsourcing is obtaining services, ideas, or content by soliciting contributions from a large group of people, most often via the Internet, rather than from traditional employees or suppliers. Contributors to crowdsourcing projects may be unpaid volunteers (e.g., contributing to Wikipedia) or paid freelancers (often via websites such as Mechanical Turk or oDesk), and may have relevant experience or vetted expertise, but more often they have little experience and limited qualifications (one aspect that makes crowdsourcing cost-effective).

While crowdsourcing can be applied to a wide range of projects and activities, this blog series focuses on crowdsourcing data improvement, specifically looking at the following aspects:

Type of data: Crowdsourcing doesn’t work for all types of data. It’s limited to the data the crowd can access, typically via the Internet, so many categories of enterprise data are not suited to an external crowd. Internal crowds of employees can be given more data access, but security and privacy concerns will still keep sensitive data from being crowdsourced (e.g., customer data containing personal financial information or patient data containing personal health information).
Kind of crowd: Crowd composition is about more than external versus internal. The mix of experts and novices, and how many of each to include, is tricky to balance. Enterprise data management seems like it should be left to subject matter experts, and domain expertise is occasionally essential, but the aggregate judgment of a large, and largely amateur, crowd often meets, or even exceeds, expert performance (one of the key points that Surowiecki’s book emphasized and explained well).
Form of improvement: Crowdsourcing can be used for data quality, by distributing data to the crowd to check for and correct errors, and also for data enrichment, by generating reference data. You can provide data samples for the crowd to review, or you can ask the crowd to provide their own data examples. With the former, you are asking the crowd to discover existing data quality errors. With the latter, you are asking the crowd to provide examples of common data errors or known data variants.
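
To make the idea of aggregating a crowd's judgment a little more concrete, here is a minimal Python sketch of one simple approach: majority voting over the corrections that several contributors submit for the same record. It is only an illustration, not a method this series prescribes. The function name aggregate_corrections, the min_agreement threshold of 0.6, and the sample records are all hypothetical.

```python
from collections import Counter


def aggregate_corrections(submissions, min_agreement=0.6):
    """Pick the most common crowd-submitted value for each record.

    submissions maps a record id to the list of values proposed by
    individual contributors (hypothetical input shape). The result maps
    each record id to (value, agreement); value is None when the share
    of contributors who agree falls below min_agreement.
    """
    results = {}
    for record_id, values in submissions.items():
        # Light normalization so trivial variants count as the same answer.
        normalized = [v.strip().lower() for v in values if v and v.strip()]
        if not normalized:
            results[record_id] = (None, 0.0)
            continue
        value, count = Counter(normalized).most_common(1)[0]
        agreement = count / len(normalized)
        # Accept the crowd's answer only when enough contributors agree.
        accepted = value if agreement >= min_agreement else None
        results[record_id] = (accepted, agreement)
    return results


# Hypothetical sample: four contributors on one record, three on another.
crowd = {
    "rec-1": ["New York", "new york", "NYC", "New York "],
    "rec-2": ["Boston", "Bostn", "Austin"],
}
consensus = aggregate_corrections(crowd)
```

In this sample, three of the four contributors agree on rec-1, so the crowd's value is accepted. No two contributors agree on rec-2, so it is left unresolved, which is the kind of case where a subject matter expert might still need to step in.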

Add your voice to improve data’s crowd

If you have an experience or perspective to share about crowdsourcing data improvement, or a specific question you want to see answered by this blog series, then please post a comment below.

Learn more about data improvement practices and techniques in the TDWI report, Data Quality Challenges and Priorities.