I've spent much of my career managing the quality of data after it was moved from its sources to a central location, such as an enterprise data warehouse. Nowadays we not only have a lot more data – a lot of it is also in motion. One of the biggest data movers is the Internet of Things (IoT), which is composed of machines with embedded software, sensors and connectivity that enable them to collect and exchange data. IoT produces a mixture of machine-generated and human-generated data in motion. Since we have traditionally waited until data stopped moving before assessing its quality, how much quality should we demand from data in motion?
Data quantity is more important (according to AutoCorrect)
An everyday example of data in motion is the use of mobile devices, especially smartphones. Human-generated data, even when the humans are not in motion, is often rife with defects (i.e., data quality issues). In-motion humans generate emails, texts and social media status updates, many containing typos. This is why most of us have a love-hate (mostly hate) relationship with AutoCorrect. It corrects our spelling and/or automatically completes words as we type. Although a useful feature, it can produce correctly spelled but contextually incorrect data. For example, my devices like to change “data quality” to “data quantity” – perhaps indicating my tablet and smartphone are a part of the big data conspiracy theory that quantity trumps quality.
Sensors are not sentient (so defects don’t hurt their feelings)
Machine-generated data can be better, but that doesn’t mean it will be defect-free. For example, defective data can be caused by a machine’s poorly calibrated sensor. If that sensor is monitoring oxygen levels on a spacecraft moving humans through outer space (e.g., on a mission to Mars), then defective data could make the astronauts dead on arrival. Other issues, such as a power loss, mechanical failure, loss of connectivity or software crash, could also cause poor sensor data quality. Sensor data, therefore, may be mostly human-free – but it’s not defect-free.
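To make the point concrete, here is a minimal sketch (my illustration, not from any spacecraft system) of a plausibility check for machine-generated sensor data: readings outside a physically plausible range are flagged as suspect rather than trusted blindly. The oxygen range and the sample readings are assumptions for illustration.

```python
# Hypothetical safe oxygen range, in percent (illustrative assumption).
PLAUSIBLE_O2_PCT = (19.5, 23.5)

def flag_defects(readings, low, high):
    """Split readings into (valid, suspect) based on a plausible range."""
    valid, suspect = [], []
    for r in readings:
        (valid if low <= r <= high else suspect).append(r)
    return valid, suspect

# A miscalibrated or failing sensor might report impossible values:
readings = [20.9, 21.0, 0.0, 20.8, 99.9]
valid, suspect = flag_defects(readings, *PLAUSIBLE_O2_PCT)
print(valid)    # [20.9, 21.0, 20.8]
print(suspect)  # [0.0, 99.9]
```

A check like this won’t repair a badly calibrated sensor, but it keeps obviously defective readings from flowing silently into downstream decisions.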
Data quality is a moving target
David Loshin blogged about the characteristics of IoT data quality, explaining how some data quality dimensions (accuracy, consistency, completeness, timeliness) take on slightly different meanings when applied to data in motion, since most of it is used in aggregate. In other words, data in motion often comes to a rolling stop when it’s used. Loshin emphasized a focus on end-user data usability. As more data moves into aggregate metrics, different uses will have different expectations, or concerns, about data quality. Data quality, therefore, is a moving target.
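The aggregate framing above can be sketched in a few lines (my illustration, not Loshin’s method): instead of judging each in-motion reading on its own, we judge whether a window of readings is complete enough for a particular consumer’s tolerance. Missing readings arrive as `None`; the completeness thresholds are assumptions for illustration.

```python
def windowed_mean(window, min_completeness):
    """Return the mean of a window of readings, or None if too many are missing."""
    present = [r for r in window if r is not None]
    completeness = len(present) / len(window)
    if completeness < min_completeness:
        return None  # window not usable for this consumer
    return sum(present) / len(present)

# Three of five readings arrived, so the window is 60% complete:
window = [21.0, None, 20.8, 21.2, None]
print(windowed_mean(window, 0.5))   # 21.0 -- a trend dashboard may tolerate gaps
print(windowed_mean(window, 0.9))   # None -- a safety-critical use may not
```

The same window is good enough for one use and not good enough for another – which is exactly why data quality for data in motion is a moving target.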
Learn more in this paper: How Streaming Data Analytics Enables Real-Time Decisions