In my last post, I raised three questions about data preparation (as opposed to “data cleansing”) for ensuring optimal results from analytics. I can summarize these three questions as:
- Should you clean data by correcting or transforming values that appear to be incorrect?
- Should you hesitate to remove questionable records from the data set?
- How are data quality specifications defined?
In essence, these are all actually different facets of the same question: When it comes to managing data quality for analytics, how does one characterize the criteria for data usability in an environment where different data consumers have different levels of expectation for data utility?
The real challenge here is to consider the desired outcome. That is, data quality for any specific analytics process must be defined by how much potential data issues would affect the decisions made from your analytics. This suggests a process for instituting data trust:
- Specify the measures for optimal desired outcomes.
- Consider the types of data issues that might occur.
- Speculate how each data issue would affect the desired outcomes.
- Determine whether there is a means for improving the quality of the data through cleansing/transformation.
- Speculate whether each data cleaning task would potentially skew the analysis.
- Speculate whether the potential for analysis skew would affect the desired outcomes.
This process should expose the types of data issues that would have the greatest business impact if ignored, as well as identify the data corrections that would have unintended negative consequences.
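The steps above can be sketched as a small triage routine. This is only an illustration under stated assumptions: the `DataIssue` fields, the 0-to-1 scores, and the decision rule are hypothetical placeholders for whatever estimates your own team produces, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class DataIssue:
    name: str
    impact_if_ignored: float  # hypothetical 0..1 estimate of harm to the desired outcome
    fix_available: bool       # is there a cleansing/transformation option at all?
    skew_risk_of_fix: float   # hypothetical 0..1 estimate that the fix distorts the analysis


def triage(issues):
    """Rank issues by estimated impact and suggest an action for each one."""
    ranked = sorted(issues, key=lambda i: i.impact_if_ignored, reverse=True)
    report = []
    for issue in ranked:
        if not issue.fix_available:
            action = "investigate root cause"
        elif issue.skew_risk_of_fix >= issue.impact_if_ignored:
            # The correction could hurt the outcome more than the issue itself.
            action = "leave as-is"
        else:
            action = "cleanse"
        report.append((issue.name, action))
    return report
```

The point of the sketch is the ordering of questions, not the numbers: impact on the outcome comes first, and a correction is only applied when its risk of skewing the analysis is smaller than the harm of leaving the issue alone.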
Data issues: An example
Let's consider the example of data analytics for an eCommerce recommendation. Presumably, this analysis process looks at many eCommerce transactions associated with a collection of customer profiles and automatically generates product purchase suggestions customized for each customer. The desired outcome is to maximize the volume of sales generated as a result of automated recommendations. Let’s assume that there are two kinds of data issues that might occur:
- Errors in the customer identifying information that prevent a match to a customer profile. In this case, you cannot definitively link a website visitor to a known customer record. A potential impact is that your system cannot determine the customer’s purchasing profile, and therefore cannot look at the associated transaction patterns that would enable automated recommendation. As a result, the analysis generates no increased sales.
- Errors in the sequence of transactions. In this case, erroneous transaction patterns (such as which links were clicked through, or which mouse actions were performed) will be used to create the recommendation models. But because the patterns are not correct, they may not match to transaction patterns of other similar customers. The result would be that the wrong predicted recommendations would be presented.
We can further refine our consideration to say that providing an imprecise product recommendation is better than *no* recommendation, since the visitor might still take that suggestion (even if it were not the best one). In Case Number 1, anything that can be done to improve the linkage of the website visitor to a known customer record would lead to improved results because there would be some product suggestion (even if the wrong customer profile were found). This suggests the need for data standardization and cleansing.
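As a minimal sketch of what standardization for Case Number 1 might look like, assume the visitor is identified by an email address and that profiles are keyed by a standardized form of it. The function names and the choice of email as the identifier are assumptions made for illustration; real matching usually involves more fields and fuzzier rules.

```python
def standardize_email(raw):
    """Trim surrounding whitespace and lowercase the address before matching."""
    return raw.strip().lower()


def match_customer(visitor_email, profiles):
    """Return the customer id for a visitor, or None when no profile matches.

    `profiles` is assumed to map standardized email -> customer id.
    """
    return profiles.get(standardize_email(visitor_email))


# Hypothetical profile store for illustration.
profiles = {"pat@example.com": 101}
customer_id = match_customer("  Pat@Example.COM ", profiles)
```

Without the standardization step, the lookup above would miss the profile and the visitor would receive no recommendation at all.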
Case Number 2 is a bit different – continued reliance on incorrect transaction sequences will lead to many incorrect analytical models that typically make imprecise recommendations. With Case 1, we get imprecise suggestions for an occasional visitor, but Case 2 leads to imprecise suggestions for many visitors. This indicates a much greater urgency to identify the root cause of the data issue and fix it. Yet it's not clear whether there are any ways to “correct” or “cleanse” those erroneous transactions.
This example shows how a data trust process might work. In upcoming posts we can look at more examples and then see how to formalize the process and integrate it with the governance program.