Streaming technologies have been around for years, but as Felix Liao recently blogged, the number and types of use cases that can take advantage of these technologies have now increased exponentially. I've blogged about why streaming is the most effective way to handle the volume, variety and velocity of big data. That's because it provides a faster way to gain business insights from big data than traditional store-it-first, analyze-it-later approaches typically deliver.
So, we can agree that streaming is beneficial for big data analytics. But could streaming also be beneficial for data quality?
I believe that it is. As I see it, using analytics to perform a rapid data quality assessment is one of the biggest overlaps between analytics and data quality – other than analytical models being better with better data. Indeed, this approach to assessing data quality has become a necessity now that we have so many data sources within and outside of the enterprise for business users to consider.
Even when you employ a reusable set of data management processes to manage data where it lives, so that data quality rules are consistently applied across all data sources, some sources will still have higher data quality than others. That's true in general, and it's true for specific use cases.
Using streaming for a data quality assessment requires us to realize that streaming does not have to be limited to what we normally think of as a streaming data source. Streaming sources are usually things like the semi-structured firehose of social media status updates (e.g., tweets), or the continuous querying of data as it flows through a system or application (i.e., event streams). But traditional structured data sources can also be streamed to evaluate which data source to use, or when searching for additional data to augment a primary data source.
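To make that concrete, here's a minimal sketch (in Python, with a hypothetical file name, field names and rules rather than any particular product's API) of profiling a traditional structured source incrementally, as a stream of records, so its quality can be assessed without landing it all first:

```python
import csv

# Hypothetical data quality rule for a customer source; a real assessment
# would use the organization's own reusable rules.
def is_valid_email(value):
    return "@" in value and "." in value.split("@")[-1]

def assess_stream(rows, required_fields=("customer_id", "email")):
    """Accumulate completeness and validity metrics one record at a time."""
    total = complete = valid_email = 0
    for row in rows:
        total += 1
        if all(row.get(f) for f in required_fields):
            complete += 1
        if row.get("email") and is_valid_email(row["email"]):
            valid_email += 1
    return {
        "records": total,
        "completeness": complete / total if total else 0.0,
        "email_validity": valid_email / total if total else 0.0,
    }

# A traditional structured source (here, a CSV file) streamed row by row.
with open("customers.csv", newline="") as f:
    print(assess_stream(csv.DictReader(f)))
```

The same assessment function could just as easily consume records from a message queue or a database cursor; the point is that the quality metrics accumulate as the data flows, rather than after it lands.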
Assessing data quality with streaming isn't fundamentally different from the way event stream processing (ESP) filters, normalizes, categorizes, aggregates, standardizes and cleanses big data before it's stored in (and could potentially pollute) a data lake. In this streaming use case, ESP leverages reusable data quality rules to filter event streams, make them ready for consumption, and potentially trigger real-time actions or alerts.
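As an illustration only (not the actual ESP implementation), a reusable rule set applied to an event stream before it lands in the lake might look something like this sketch, where the field names, rules and alert mechanism are all assumptions:

```python
# A minimal sketch of stream-side cleansing: standardize, filter and flag
# events before they are written to the lake. Field names are hypothetical.
def standardize(event):
    event["country"] = event.get("country", "").strip().upper()
    event["email"] = event.get("email", "").strip().lower()
    return event

def passes_rules(event):
    return bool(event.get("event_id")) and bool(event.get("email"))

def process_stream(events, alert):
    for event in events:
        event = standardize(event)
        if passes_rules(event):
            yield event      # clean, standardized events continue to the lake
        else:
            alert(event)     # rejects trigger a real-time alert instead

# Example: only the valid, standardized event reaches the "lake".
incoming = [
    {"event_id": "e1", "email": " Ana@Example.COM ", "country": "us"},
    {"event_id": "", "email": ""},
]
for clean in process_stream(incoming, alert=lambda e: print("ALERT:", e)):
    print("to lake:", clean)
```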
So, in much the same way that ESP can determine if big data is eventful, streaming can determine if a data source (or a blend of data from multiple sources) is of sufficient quality to meet the needs of a particular business use. This, in essence, enables the enterprise to stream to better data quality.
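And the decision itself can be as simple as a threshold check. Here's a closing sketch (with assumed metrics and cut-offs) of deciding whether candidate sources, profiled as they stream, meet a particular business use's quality bar:

```python
# Hypothetical quality profiles produced by streaming assessments of two
# candidate sources; the thresholds reflect one business use's needs.
profiles = {
    "crm_export":  {"completeness": 0.97, "email_validity": 0.97},
    "web_signups": {"completeness": 0.81, "email_validity": 0.62},
}
REQUIRED = {"completeness": 0.95, "email_validity": 0.90}

def sufficient(profile, required=REQUIRED):
    return all(profile.get(metric, 0.0) >= floor for metric, floor in required.items())

usable = [name for name, profile in profiles.items() if sufficient(profile)]
print("Sources meeting this use case's quality bar:", usable)  # ['crm_export']
```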
1 Comment
Nice post Jim, thanks. Event cleansing is important, and that can be done with ESP's relational window operators and/or with the Blue Fusion expression engine functions from DMP that are also in ESP.