In my last post we started to look at two different Internet of Things (IoT) paradigms. The first involved only streaming automatically generated data from machines (such as sensor data). The second combined human-generated and machine-generated data, such as social media updates that are automatically augmented with geo-tag data by a mobile device.
In both of these cases, much of the data is automatically created – so what does it mean to talk about data quality? The answer requires two tasks: a reconsideration of the dimensions of data quality, and a focus on end-user data usability.
There are many possible dimensions of data quality, but let’s focus on four key ones: accuracy, consistency, completeness and timeliness. In a big data environment that supports an IoT framework, we are no longer just monitoring the quality of the data coming from a single source. Rather, quality characteristics have to be applied at the aggregate level. From this perspective, the dimensions take on slightly different meanings (a short code sketch after the list suggests how a few of these checks might look):
- Accuracy. Do the values that have been accumulated from across the network of IoT devices accurately reflect what was produced at each device? For example, if we have ten devices within the same room reporting the ambient temperature, are all of those devices reporting the same temperature, or at least temperatures within a reasonable deviation of each other?
- Consistency. Are the values logged within the big data environment consistent with the context in which the values were produced by each device? For example, if multiple events are reported by an app on a mobile device and they are tagged with a geolocation, are those geolocations the same or close to each other?
- Completeness. Have all the data values been accumulated at the big data environment? Are there any gaps in a series of reported events or sensor values that should have been captured?
- Timeliness. Are the values being captured within a reasonable time frame? If much of the data is streamed and coming from a wide variety of devices, are there monitoring points to ensure that the collective data set is synchronized?
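To make this aggregate-level framing a bit more concrete, here is a minimal sketch of what such checks might look like in code. The record layout (device_id, value, sensed_at, received_at), the tolerance, the expected reporting interval and the maximum lag are all illustrative assumptions, not a prescribed schema or tool.

```python
# A minimal sketch of aggregate-level quality checks over a batch of IoT
# readings. The record fields and thresholds below are illustrative
# assumptions, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Reading:
    device_id: str
    value: float            # e.g., ambient temperature reported by one device
    sensed_at: datetime     # when the device took the measurement
    received_at: datetime   # when the big data environment logged it

def accuracy_outliers(readings, tolerance=1.5):
    """Accuracy/consistency: devices whose values stray too far from the group median."""
    center = median(r.value for r in readings)
    return [r.device_id for r in readings if abs(r.value - center) > tolerance]

def completeness_gaps(readings, expected_interval=timedelta(minutes=5)):
    """Completeness: gaps in a device's series larger than the expected reporting interval."""
    by_device = {}
    for r in readings:
        by_device.setdefault(r.device_id, []).append(r.sensed_at)
    gaps = []
    for device_id, times in by_device.items():
        times.sort()
        for earlier, later in zip(times, times[1:]):
            if later - earlier > expected_interval:
                gaps.append((device_id, earlier, later))
    return gaps

def timeliness_laggards(readings, max_lag=timedelta(seconds=30)):
    """Timeliness: readings that arrived too long after they were sensed."""
    return [r.device_id for r in readings if r.received_at - r.sensed_at > max_lag]
```

Note that checks like these run against the accumulated data set rather than any single device, which is exactly the shift in perspective described above.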
These questions only scratch the surface. We can continue to drill down into each of these dimensions – and add some of the other dimensions – to build up an array of expectations regarding data usability. And that leads us to the second task of characterizing data quality in terms of end-user data usability.
While some IoT applications are driven by monitoring operational behavior, much attention is given to IoT analytics and how the results of analytical modeling and pattern analysis can identify business opportunities. Some examples include predictive maintenance (on the industrial side) and customer behavior analysis (on the mobile smart device side). In either case, the usability of the data is not measured in terms of source data quality, but rather in terms of how users interpret the data for its various blended uses.
And that is where self-service data preparation tools can add value. These tools bundle data quality functionality – profiling, standardization and transformations – into a single package driven by the user. By allowing users to explore what the data looks like (especially important as new device streams are added to the mix) and to define their own data quality criteria, you empower them to devise reports and analyses that meet their specific objectives without forcing their quality criteria on others using the same data. In turn, the data’s overall usability is increased.
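As a rough illustration (and not any particular vendor’s API), the sketch below shows the kind of profiling and user-defined quality rules such tools wrap up for end users. The column names, the sample data and the threshold values are assumptions made for the example.

```python
# A rough illustration of self-service-style profiling and a user-defined
# quality rule. This is not any specific tool's API; the columns, sample
# values and thresholds are assumptions for the example.
import pandas as pd

def profile_new_stream(df: pd.DataFrame) -> pd.DataFrame:
    """Profile a newly added device stream: basic statistics plus missing-value rates."""
    summary = df.describe(include="all").T
    summary["pct_missing"] = df.isna().mean()
    return summary

def apply_user_rule(df: pd.DataFrame, column: str, low: float, high: float) -> pd.DataFrame:
    """A user-defined usability rule: keep only values inside the analyst's chosen range."""
    return df[df[column].between(low, high)]

# Example usage: an analyst inspects a new temperature stream and applies a
# range rule that suits their own report, without changing the shared data.
stream = pd.DataFrame({"device_id": ["a", "b", "c"],
                       "temperature": [21.4, 22.0, 95.0]})
print(profile_new_stream(stream))
usable = apply_user_rule(stream, "temperature", low=-10.0, high=45.0)
```

Each analyst can apply rules like these to the shared stream without altering what anyone else sees, which is the sense in which such tools raise the data’s overall usability.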