I recently participated in an interesting recorded web seminar with Scott Chastain from SAS about the concept of “big data quality,” in which we discussed both the sources of big data streams and what data quality could mean for those big data sets. One conceptual source of big data is referred to as machine-to-machine (M2M) data, which includes data sets automatically generated by devices such as sensors and meters and then forwarded to other systems within a network.
Some examples of M2M data sources include energy meters (such as the emerging smart meters), pipe sensors (used in oil and gas transport), operational flight data generated by airplanes, continuous monitoring of vital signs in a hospital environment, and transportation sensors (such as fluid and pressure monitors across fleets of trucks or railroad cars). Here are some thoughts about abstractly describing the archetypical scenario (sketched in code after the list):
- A networked environment comprising a holistic system;
- One or more attached devices that automatically monitor specific measure(s) associated with one or more system activities;
- For each device, a defined period at which the measure is monitored and reported;
- One or more target devices for collecting reported data;
- A set of rules specifying the intended operation within a discrete set of expected behaviors; and
- A set of actions to be taken when the system does not operate within the discrete set of expected behaviors.
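To make the abstraction a bit more concrete, here is a minimal sketch in Python of the pieces named above. Every name in it (Device, Rule, System, and so on) is purely illustrative and not tied to any particular M2M product or protocol.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Device:
    """An attached device that monitors one measure at a defined period."""
    device_id: str
    measure: str              # e.g. "pulse" or "line_pressure"
    period_seconds: float     # how often the measure is sampled and reported
    target_id: str            # the collecting device this one reports to

@dataclass
class Rule:
    """Bounds the expected behavior of one measure, with a deviation action."""
    measure: str
    is_expected: Callable[[float], bool]         # True when a value is within expectations
    on_deviation: Callable[[str, float], None]   # action to take when it is not

@dataclass
class System:
    """The networked environment: devices plus the rules that govern them."""
    devices: List[Device] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)

    def evaluate(self, device_id: str, measure: str, value: float) -> None:
        """Apply every rule for this measure; fire the action on any deviation."""
        for rule in self.rules:
            if rule.measure == measure and not rule.is_expected(value):
                rule.on_deviation(device_id, value)

# Example: alert when a reported pulse leaves an assumed 40-160 bpm band.
system = System(
    devices=[Device("patient-7-pulse", "pulse", 1.0, "ward-repository")],
    rules=[Rule("pulse", lambda v: 40 <= v <= 160,
                lambda d, v: print(f"ALERT: {d} reported pulse {v}"))],
)
system.evaluate("patient-7-pulse", "pulse", 190)   # triggers the alert
```

Note that the rules and deviation actions live with the system rather than with the individual devices, which is what lets a central collector reason about behavior across the whole environment.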
Our examples above conform to these criteria. For example, in the hospital environment, you may have multiple (yet different) monitors, such as pulse, blood pressure, blood oxygen content, and rate of medication delivery, connected to each patient. All monitors may sample measures at their prescribed rates, and each can forward the measures to a centralized repository that scans both for individual health events requiring attention and for collections of measures indicative of a systemic event (such as a localized power failure) that also needs attention.
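A centralized repository in the hospital scenario might run a scan like the following hypothetical one. The measures, thresholds, and the “many monitors silent at once” heuristic for detecting a systemic event are assumptions for illustration only, not clinical or engineering guidance.

```python
from collections import defaultdict

EXPECTED_RANGES = {
    "pulse": (40, 160),          # beats per minute
    "blood_oxygen": (90, 100),   # percent saturation
}
SILENCE_LIMIT_S = 30             # a monitor quiet this long is suspect
SYSTEMIC_FRACTION = 0.5          # this share of quiet monitors => systemic event

last_seen = defaultdict(float)   # device_id -> timestamp of the last reading

def on_reading(device_id: str, measure: str, value: float, now: float) -> None:
    """Record the reading and flag an individual event if it is out of range."""
    last_seen[device_id] = now
    lo, hi = EXPECTED_RANGES.get(measure, (float("-inf"), float("inf")))
    if not lo <= value <= hi:
        print(f"ALERT: {device_id} {measure}={value} outside [{lo}, {hi}]")

def scan_for_systemic_event(all_device_ids: list, now: float) -> None:
    """If enough monitors go quiet together, suspect a systemic cause."""
    quiet = [d for d in all_device_ids if now - last_seen[d] > SILENCE_LIMIT_S]
    if all_device_ids and len(quiet) >= SYSTEMIC_FRACTION * len(all_device_ids):
        print(f"SYSTEMIC ALERT: {len(quiet)} of {len(all_device_ids)} monitors silent")

on_reading("patient-7-pulse", "pulse", 190, now=100.0)                 # individual alert
scan_for_systemic_event(["patient-7-pulse", "patient-8-spo2"], now=200.0)  # systemic alert
```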
Fortunately, the sets of rules that describe both expected and deviant behaviors also effectively describe the data expectations. That gives data practitioners a foundation for defining measures to verify that the generated data conforms to those expectations, as in the sketch below.
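As one hypothetical example of such a measure, this sketch computes the fraction of readings that fall within a rule-defined expected range for each measure; the ranges and sample values are illustrative assumptions, not drawn from any real system.

```python
EXPECTED = {"pulse": (40, 160), "blood_oxygen": (90, 100)}

def conformance(readings):
    """Return {measure: fraction of values within the expected range}."""
    totals, within = {}, {}
    for measure, value in readings:
        lo, hi = EXPECTED[measure]
        totals[measure] = totals.get(measure, 0) + 1
        within[measure] = within.get(measure, 0) + (lo <= value <= hi)
    return {m: within[m] / totals[m] for m in totals}

sample = [("pulse", 72), ("pulse", 190), ("blood_oxygen", 97)]
print(conformance(sample))   # {'pulse': 0.5, 'blood_oxygen': 1.0}
```

But does that really mean data quality? More on this in the next post.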