In my last post I set the stage for data quality considerations for big data. Today, I’ll cover the following big data and data quality considerations:
- Data quality efforts should be "fit for purpose"
- Extend data quality by thinking “outside the box”
Data quality efforts should be "fit for purpose"
Your data quality approach should be designed with several factors in mind – it doesn’t make sense to apply one data quality approach for all data or information related projects. You should consider where the data came from, how the data will be used, how the data will be consumed, who will use the data, and perhaps most importantly, what decisions will be made with the data. Here are several considerations relative to big data:
Consider the type of data: The data quality requirements for different forms of data will vary and your approach should match the needs of the data. For example:
- Big data projects that relate to traditional forms of data like transaction data related to key entities like customers, products, etc., can leverage existing data quality support as long as it scales to meet the needs of massive volume.
- Big data from machines or sensors (e.g., RFID tags, manufacturing sensors, telco and utility equipment) is not prone to the input errors that affect human-entered data, but as more sensors come online, some of them may emit invalid data. Even if you trust your machine or sensor data, data quality capabilities for discovery, for linking the data with other systems, and for enriching the data may still be extremely important.
- Social media data from sources such as Twitter and Facebook is similar to machine data in that data quality issues resulting from user input, overlapping systems, etc., are not the primary concern. It’s also important to note that this information has a structured component: a tweet stream carries metadata describing each tweet along with the text string that contains its content. So, working with it involves a combination of entity matching, monitoring to ensure that the tweet stream is not interrupted, and text analysis, which brings in data quality considerations related to text data.
Not all analysis requires exactness: If you are attempting to identify a general pattern and you have a lot of data, extraneous data is not likely to affect the overall conclusion. For example, if you have a massive amount of clickstream data and you are looking for patterns (where people leave a site, which path is more likely to result in a purchase or conversion, etc.), the outliers will not change the overall conclusion. In this case, it’s more of an analytics process than a data quality process: the quality of the data is not in question, but its relevance is. For example, someone who accidentally ends up on your website isn’t really part of the population that you are concerned with (unless you are analyzing why they are there in the first place). The same applies to bots vs. actual users: bot traffic is not likely to be erroneous, but you can extend your data quality efforts to treat relevance as a quality dimension. Likewise for types of users, such as actual customers vs. competitors: it’s a segmentation topic, not a data quality topic.
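One way to treat relevance as a segmentation step is to tag each session rather than delete it. The sketch below is only an illustration: the session fields, the user-agent strings, and the `BOT_MARKERS` list are all assumptions made up for this example, and real bot detection is considerably more involved.

```python
from collections import Counter

# Hypothetical substrings that suggest automated traffic (an assumption for
# this sketch, not a reliable bot-detection rule).
BOT_MARKERS = ("bot", "crawler", "spider")

def tag_segment(session):
    """Return a copy of the session with a 'segment' label added."""
    agent = session["user_agent"].lower()
    segment = "bot" if any(marker in agent for marker in BOT_MARKERS) else "human"
    return {**session, "segment": segment}

# Made-up clickstream sessions.
sessions = [
    {"id": 1, "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"id": 2, "user_agent": "ExampleBot/2.1"},
    {"id": 3, "user_agent": "Mozilla/5.0 (Macintosh)"},
]

tagged = [tag_segment(s) for s in sessions]
by_segment = Counter(s["segment"] for s in tagged)
```

Because nothing is removed, the same data can still answer a question about the bot traffic itself later on.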
Don’t cleanse away analytical value: Outliers may actually indicate a risk or a breach. Unusual transactions should not be cleansed away just because they fall outside of the norm; they may represent fraud. Instead of using anomaly detection to determine data quality issues, use anomaly detection to identify meter problems, potential fraud, etc.
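As a rough sketch of flagging instead of cleansing, the example below marks unusual transaction amounts for review while keeping every record. The modified z-score based on the median absolute deviation is a technique chosen for this illustration, and the threshold of 3.5 and the sample amounts are assumptions.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Return (amount, is_outlier) pairs; no record is dropped."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)  # median absolute deviation
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return [(a, False) for a in amounts]
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [(a, abs(0.6745 * (a - med) / mad) > threshold) for a in amounts]

# Made-up transaction amounts; the last one is far outside the norm.
transactions = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9, 975.0]

flagged = flag_outliers(transactions)
for_review = [amount for amount, is_outlier in flagged if is_outlier]
```

The flagged amounts go to a fraud or meter-health review queue, and the full data set stays intact for analysis.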
Design the data quality process to map to the various stages of data usage: Processes up front in the analytical lifecycle like data discovery, data exploration, opportunity identification, data relationship research, etc., are better performed on the data prior to any cleansing taking place. For example, assessing the value of the various attributes by analyzing access frequency, detecting outliers, or discovering correlations between attributes may form the initial stages in understanding data distribution. Then, once the questions that you are driving towards, the type of analytics that will be leveraged, etc., are clear, you can make the proper determination about data quality. You may even leverage a gradual cleansing process as part of your strategy.
Extend data quality by thinking “outside the box”
The data quality discipline has matured rapidly over the last several years. Even with these advancements there are opportunities to leverage data quality principles in new ways. And it is possible to leverage analytics to make subjective decisions based on content that cannot be supported by a limited view of data quality. So thinking outside of the typical approach, here are some initial considerations relative to big data:
Extend data quality or monitoring capabilities to the analytical modeling process: Use data quality mechanisms to determine the impact of missing attributes on analytic algorithms. Also use data quality rules and monitoring capabilities to assess the accuracy of the model over time. For example, assess the potential degradation of analytics performance by measuring and alerting based on analytical model drift.
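A minimal sketch of alerting on model drift might compare accuracy in each monitoring window against a baseline. The function names, the weekly windows, the baseline of 0.9, and the tolerance of 0.05 are all assumptions for illustration; a real monitoring rule would depend on the model and the metric that matters for it.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the observed outcomes."""
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)

def drift_alerts(baseline_accuracy, windows, tolerance=0.05):
    """Return (label, accuracy) for each window that falls too far below baseline."""
    alerts = []
    for label, predictions, actuals in windows:
        acc = accuracy(predictions, actuals)
        if baseline_accuracy - acc > tolerance:
            alerts.append((label, acc))
    return alerts

# Made-up predictions and outcomes for two monitoring windows.
windows = [
    ("week 1", [1, 0, 1, 1, 0, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]),
    ("week 2", [1, 0, 1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]),
]

alerts = drift_alerts(0.9, windows)
```

Here the second window scores well below the baseline and is reported, while the first stays within tolerance.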
Using analytics to assess quality: With contextual data, mechanically based data quality is not sufficient. For example, organizations should be looking to extend their quality efforts to assess social data that has been self-reported. Information that people self-report about medication taken, time spent studying, etc., is often misrepresented by the user (they intentionally fabricate the amount of meds taken, time spent studying, etc.). In this case, traditional data quality approaches will be insufficient, but analytics can be used to provide some level of value assessment. The same goes for sentiment data. Consider transactional data and sentiment data relating to purchase behavior: if sentiment is negative and purchase behavior is positive, this could indicate a data quality problem, or it could mean that the customer is locked in without additional choices. Either way, it will take further analysis that will not be addressed by mechanical data quality efforts.
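The sentiment example can be sketched as a simple rule that surfaces customers whose signals disagree. The field names, the sentiment scale (assumed here to run from -1 to 1), and both thresholds are hypothetical; the rule only selects records for further analysis and does not decide whether the data is wrong.

```python
def conflicting_signals(customers, sentiment_floor=-0.2, min_purchases=3):
    """Return ids of customers with negative sentiment but active purchasing."""
    return [
        c["id"]
        for c in customers
        if c["sentiment"] < sentiment_floor and c["purchases"] >= min_purchases
    ]

# Made-up records: sentiment score and number of recent purchases.
customers = [
    {"id": "c1", "sentiment": -0.6, "purchases": 5},  # unhappy but still buying
    {"id": "c2", "sentiment": 0.4, "purchases": 4},   # signals agree
    {"id": "c3", "sentiment": -0.7, "purchases": 0},  # signals agree
]

needs_analysis = conflicting_signals(customers)
```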
Use data quality capabilities to assess collection level of machine data: Is your data collection process reliable? Does the data represent the proper time frame? Is there something about the data that signals that the collection is missing data? For example, if data is missing from a device, it could represent a problem with the data trail, or it could show that the device was off-line and not generating data. Consider extending your data quality approach to help determine whether the reported data indicates a problem with the sensor infrastructure.
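A simple completeness check for machine data is to look for gaps between consecutive readings. The sketch below assumes a device that is expected to report once a minute; the interval, the `slack` factor, and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, slack=1.5):
    """Return (previous, current) pairs whose spacing exceeds the allowed limit."""
    ordered = sorted(timestamps)
    limit = expected_interval * slack
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        if cur - prev > limit
    ]

# Made-up readings: one per minute, with nothing between minute 3 and minute 9.
start = datetime(2024, 1, 1, 0, 0)
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 3, 9, 10)]

gaps = find_gaps(readings, expected_interval=timedelta(minutes=1))
```

A gap found this way does not say whether the device was off-line or the data was lost in transit; it only marks where to look.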
Use data quality to ensure summarization efforts are valid: If the system leverages a summarization technique as a mechanism for dealing with extreme volume, consider applying data quality approaches to the summarized data as a means to validate the summary. This could be used in situations where a device or distributed component summarizes the data that is being returned, or where data processing functions, like MapReduce, summarize data for downstream processing or analysis.
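One possible way to validate a summary is to check it for internal consistency and, where a sample of the raw data is available, to reconcile it against a recomputed summary. The summary fields (`count`, `total`, `min`, `max`) and the sample values below are assumptions for this sketch.

```python
import math

def summarize(values):
    """Build the kind of summary a device or reducer might return."""
    return {"count": len(values), "total": sum(values),
            "min": min(values), "max": max(values)}

def summary_problems(summary, raw_values=None, rel_tol=1e-9):
    """Return a list of problems found in the summary (empty if none)."""
    problems = []
    if summary["count"] <= 0:
        problems.append("count must be positive")
        return problems
    # Internal consistency: the mean has to lie between min and max.
    mean = summary["total"] / summary["count"]
    if not summary["min"] <= mean <= summary["max"]:
        problems.append("mean falls outside [min, max]")
    # Reconciliation: compare against a summary recomputed from raw data.
    if raw_values is not None:
        for key, value in summarize(raw_values).items():
            if not math.isclose(summary[key], value, rel_tol=rel_tol):
                problems.append(f"{key} does not match raw data")
    return problems

# Made-up sensor values, a correct summary, and one with a wrong count.
raw = [20.5, 21.0, 19.8, 22.1]
good = summarize(raw)
bad = dict(good, count=3)
```

Calling `summary_problems(good, raw)` returns an empty list, while the summary with the wrong count fails both the internal check and the reconciliation.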
Check back tomorrow for my final post on big data and data quality. I’ll cover the following topics: