The increased interest in supporting self-service data access for data scientists has been driven by the idea that a standardized data warehouse representation imposes numerous constraints on the methods – and consequently the conclusions – of data analysis. We've become accustomed to inserting the IT department into the process, making the IT staff the custodians and therefore the gatekeepers of the data. In that role, the typical approach has been to enforce standardizations and transformations that convert the source data into a regulated set of dimensional models, often with many aspects of the original source data “washed out” along the way.
The availability of a growing array of tools that can analyze both structured and unstructured data has whetted the analyst’s appetite for access to data sets in their original format. To address this, organizations are instituting data lakes and dumping their raw data sets there as the data scientists plan their methods of analytical attack.
Collaborative – or not?
In this scenario, the downstream consumers no longer have the luxury of depending on data warehouse standardizations and transformations. Because they are on their own for data preparation, they rely on the emerging class of data wrangling or data preparation tools to profile, scan and assess the data, and then to apply whatever custom transformations make it usable for their specific analytical purposes. The challenge, however, is that when no standard is applied to the data in a general way, each analyst is free to make their own choices about the transformations applied to the different data values.
When this happens in a virtual vacuum, the lack of communication among analysts allows inconsistent transformations to be applied. In other words, Analyst A might use one set of business rules to standardize data values while Analyst B applies a completely different (and conflicting) set of rules to the same source data. Their results, based on these conflicting sets of rules, may also end up being inconsistent. At best this is embarrassing; at worst it distorts the business decisions that senior managers make in light of the analysts’ conclusions.
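To make the Analyst A versus Analyst B problem concrete, here is a minimal Python sketch. The sample values, the field being standardized and both rule functions are invented for illustration; they are assumptions, not rules taken from any real tool or data set. The point is only that two reasonable-looking rules applied to the same raw values produce different counts.

```python
from collections import Counter
from typing import Optional

# Hypothetical raw values for a "state" field as they might land in a data lake.
raw_states = ["NY", "ny", "New York", "N.Y.", "CA", "Calif.", "california", ""]

def analyst_a_rule(value: str) -> str:
    """Analyst A (hypothetical): strip periods, uppercase, map known names
    to two-letter codes, and label blanks as UNKNOWN."""
    cleaned = value.replace(".", "").strip().upper()
    names = {"NEW YORK": "NY", "CALIFORNIA": "CA", "CALIF": "CA"}
    if not cleaned:
        return "UNKNOWN"
    return names.get(cleaned, cleaned)

def analyst_b_rule(value: str) -> Optional[str]:
    """Analyst B (hypothetical): keep only values that are already
    two-letter codes and discard everything else."""
    cleaned = value.strip().upper()
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned
    return None

# Same source data, two different preparation rules.
counts_a = Counter(analyst_a_rule(v) for v in raw_states)
kept_b = [analyst_b_rule(v) for v in raw_states]
counts_b = Counter(code for code in kept_b if code is not None)

print("Analyst A:", dict(counts_a))
print("Analyst B:", dict(counts_b))
```

Under these assumed rules, Analyst A attributes four records to NY while Analyst B attributes two, even though both started from the same eight raw values. Neither rule is wrong in isolation; the inconsistency comes from the rules never being compared.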
So from a naïve perspective, self-service data preparation does not necessarily improve collaboration. That being said, it would be ill-advised for any organization to allow self-service data preparation to be performed in an ungoverned way. In fact, implementing collaboration among data consumers regarding their data assessments and subsequent applications of transformations is an emerging data governance practice.
A better approach to self-service data preparation
Data preparation tasks need to be coordinated. Assign a data steward to oversee the creation of data preparation workflows and to recognize when similar activities are being performed. Many data preparation tools provide a means for analysts to share their workflows; use that framework to establish a collaboration network around each data source and its corresponding data elements. Allow the analysts to share their thoughts about the best ways to prepare the data, and encourage a healthy discourse that helps crowdsource standards the analysts adopt voluntarily. Analytical outcomes will then be more semantically aligned, with fewer inconsistencies overall.
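One way to picture the steward's role is a shared catalog in which analysts publish their preparation rules per data source and data element. The sketch below is a hypothetical illustration in Python: `PrepRule`, `RuleCatalog`, the `crm_export` source name and both sample rules are assumptions made for this example and do not correspond to any particular data preparation product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (data source, data element)

@dataclass(frozen=True)
class PrepRule:
    """A preparation rule an analyst chooses to share (hypothetical structure)."""
    source: str
    element: str
    author: str
    description: str
    transform: Callable[[str], str]

class RuleCatalog:
    """Shared place where rules are published and compared."""

    def __init__(self) -> None:
        self._rules: Dict[Key, List[PrepRule]] = defaultdict(list)

    def share(self, rule: PrepRule) -> None:
        self._rules[(rule.source, rule.element)].append(rule)

    def overlaps(self) -> Dict[Key, List[PrepRule]]:
        """Elements that more than one analyst is preparing."""
        return {
            key: rules
            for key, rules in self._rules.items()
            if len({r.author for r in rules}) > 1
        }

    def disagreements(self, key: Key, samples: List[str]) -> List[str]:
        """Sample values for which the shared rules give different results."""
        rules = self._rules.get(key, [])
        return [s for s in samples if len({r.transform(s) for r in rules}) > 1]

catalog = RuleCatalog()
catalog.share(PrepRule(
    "crm_export", "state", "analyst_a", "map state names to codes",
    lambda v: {"NEW YORK": "NY"}.get(v.strip().upper(), v.strip().upper()),
))
catalog.share(PrepRule(
    "crm_export", "state", "analyst_b", "uppercase only",
    lambda v: v.strip().upper(),
))

overlapping = catalog.overlaps()
conflicts = catalog.disagreements(("crm_export", "state"), ["NY", "New York"])
print(list(overlapping), conflicts)
```

In this sketch the steward would see that two analysts are preparing the same element and that their rules disagree on the value "New York", which is the kind of finding that can start the discussion toward a shared standard.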
So from a more thoughtful perspective, under the right circumstances, self-service data preparation can be made to improve collaboration. And when that collaboration leads to grass-roots specification and observance of standards, the overall quality and usability of the data will increase dramatically.