The anticipation of massive volumes of data streaming from automated sources has the big data community drooling over the opportunities for analysis. For example, as the energy utilities industry continues to deploy home-based smart meters in concert with additional sensors peppered across the grid, the expected once-a-month manual reading will give way to massive data sets automatically generated and communicated across the system. This will help improve energy distribution, manage the component lifecycle across the network, reduce costs and, ultimately, predict and react to grid events such as surges and outages.
Clearly, the predictive capabilities associated with these big data analytics applications depend on trustworthy data. In fact, you might say that the expectations map nicely onto the traditional dimensions of data quality: data values are expected to be complete, accurate, timely, current and consistent, among others. So one should take some comfort in knowing that this machine-to-machine data is not only automatically generated and transmitted but, thanks to its systemic isolation, also remains unsullied by human hands – those same hands that are so often the source of data issues.
That last statement raises an interesting question, though. If we expect that the data is always going to be correct, then we don’t need to monitor the data streams for validity. Of course, you say, one or more of the sensors might malfunction and begin to generate bad data, so we will need data quality measures. At the same time, the data streams will need to be monitored for behaviors that are outside expectations.
But what’s the difference? How do we know when a data value represents aberrant behavior that needs to be addressed versus an incorrect value generated by a failing device?
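To make the distinction concrete, here is a minimal Python sketch that reduces it to two checks. The kWh bounds, the z-score threshold and the function names are illustrative assumptions rather than anything a real metering system prescribes: a validity check catches values a healthy meter could never report, while a behavioral check flags plausible values that deviate sharply from recent history.

```python
import statistics

# Illustrative (assumed) plausibility limits for a household smart-meter
# interval reading in kWh; real bounds would come from the meter specification
# and the utility's validation rules.
VALID_MIN_KWH = 0.0
VALID_MAX_KWH = 50.0


def is_valid(reading_kwh: float) -> bool:
    """Data quality check: is the value even plausible for this device?"""
    return VALID_MIN_KWH <= reading_kwh <= VALID_MAX_KWH


def is_anomalous(reading_kwh: float, recent: list[float], z_threshold: float = 3.0) -> bool:
    """Behavioral check: is a plausible value far outside the recent pattern?"""
    if len(recent) < 2:
        return False  # not enough history to judge behavior
    mean = statistics.mean(recent)
    stdev = statistics.stdev(recent)
    if stdev == 0:
        return reading_kwh != mean
    return abs(reading_kwh - mean) / stdev > z_threshold


history = [1.2, 1.1, 1.3, 1.2, 1.0, 1.4]
for reading in (-7.0, 1.2, 42.0):
    if not is_valid(reading):
        print(reading, "-> invalid value: suspect the device or the transmission")
    elif is_anomalous(reading, history):
        print(reading, "-> valid but anomalous: could be a real grid event, or a drifting sensor")
    else:
        print(reading, "-> valid and within expected behavior")
```

The sketch also shows why the question is not trivial: a value can trip the behavioral check because a sensor is drifting out of calibration or because the grid really did surge, and the checks alone cannot tell you which.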