Does self-service data preparation improve collaboration?


The increased interest in supporting self-service data access for data scientists has been driven by the idea that a standardized data warehouse representation imposes numerous constraints on the methods – and consequently the conclusions – of data analysis. We've become accustomed to inserting the IT department into the process, making the IT staff the custodians and therefore the gatekeepers of the data. However, our typical approach has been to enforce standardizations and transformations that convert the source data into a regulated set of dimensional models, often with many aspects of the original source data “washed out” along the way.

The availability of a growing array of tools that can analyze both structured and unstructured data has whetted the analyst’s appetite for access to data sets in their original format. To address this, organizations are instituting data lakes and dumping their raw data sets there as the data scientists plan their methods of analytical attack.

Collaborative – or not?

In this scenario, the downstream consumers no longer have the luxury of depending on data warehouse standardizations and transformations. As they are on their own for data preparation, the emergence of data wrangling or data preparation tools gives them the ability to profile, scan, assess and then apply custom transformations to the data as they see fit to make the data usable for their specific analytical purposes. The challenge, however, is that when there is no standard applied to the data in a general way, each analyst is free to make his/her own choices about the transformations applied to the different data values.

When this happens in a virtual vacuum, the lack of communication among analysts allows inconsistent transformations to be applied. In other words, Analyst A might use one set of business rules to standardize data values while Analyst B applies a completely different (and conflicting) set of rules to the same source data. Their results, based on these conflicting sets of rules, may also end up being inconsistent. At best this is embarrassing; at worst it skews the business decisions that senior managers make in light of the analysts’ conclusions.
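To make the Analyst A / Analyst B scenario concrete, here is a minimal Python sketch. The sample values, the field they represent, and both rule functions are hypothetical and invented purely for illustration; they are not taken from any particular data preparation tool.

```python
from collections import Counter
from typing import Optional

# Hypothetical raw values for a "state" field, as they might sit in a data lake.
raw_states = ["CA", "ca", "Calif.", "California", "N/A", ""]


def analyst_a_rule(value: str) -> str:
    """Analyst A (hypothetical): map known spellings to 'CA'; anything else is 'UNKNOWN'."""
    cleaned = value.strip().upper().rstrip(".")
    if cleaned in {"CA", "CALIF", "CALIFORNIA"}:
        return "CA"
    return "UNKNOWN"


def analyst_b_rule(value: str) -> Optional[str]:
    """Analyst B (hypothetical): keep only two-letter codes; discard everything else."""
    cleaned = value.strip().upper()
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned
    return None


# Each analyst standardizes the same source data with their own rule.
counts_a = Counter(analyst_a_rule(v) for v in raw_states)
counts_b = Counter(
    result
    for result in (analyst_b_rule(v) for v in raw_states)
    if result is not None
)

print("Analyst A:", dict(counts_a))
print("Analyst B:", dict(counts_b))
```

From the same six source values, Analyst A counts four California records and Analyst B counts two. Neither rule is wrong on its own terms, but any report built on these counts will disagree.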

So from a naïve perspective, self-service data preparation does not necessarily improve collaboration. That being said, it would be ill-advised for any organization to allow self-service data preparation to be performed in an ungoverned way. In fact, implementing collaboration among data consumers regarding their data assessments and subsequent applications of transformations is an emerging data governance practice.

A better approach to self-service data preparation

There is a need to coordinate the sequences of data preparation tasks. Assign a data steward to oversee the creation of data preparation workflows and to recognize when similar activities are being performed. Many data preparation tools provide a means for analysts to share their workflows; use that framework to establish a collaboration network around each data source and its corresponding data elements. Let the analysts share their thoughts about the best ways to prepare the data, and encourage a healthy discourse that helps crowdsource standards that analysts comply with voluntarily. As a result, analytical outcomes will be more semantically aligned, with fewer inconsistencies overall.
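One way a data steward might recognize when similar activities are being performed is to scan the shared workflows for data elements that more than one rule touches. The sketch below assumes a simple, hypothetical record of shared preparation steps; the `PrepStep` class, its fields, and `find_conflicts` are illustrative names, not the API of any specific product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PrepStep:
    """One shared preparation step (hypothetical structure)."""
    analyst: str
    source: str   # data source the step reads from
    element: str  # data element (column) being transformed
    rule: str     # short description of the transformation


def find_conflicts(steps: List[PrepStep]) -> Dict[Tuple[str, str], List[PrepStep]]:
    """Return the (source, element) pairs that have more than one distinct rule."""
    by_element: Dict[Tuple[str, str], List[PrepStep]] = defaultdict(list)
    for step in steps:
        by_element[(step.source, step.element)].append(step)
    return {
        key: group
        for key, group in by_element.items()
        if len({s.rule for s in group}) > 1
    }


# Hypothetical steps that analysts have shared.
shared_steps = [
    PrepStep("Analyst A", "customers", "state", "map all spellings to two-letter code"),
    PrepStep("Analyst B", "customers", "state", "drop values that are not two letters"),
    PrepStep("Analyst A", "customers", "email", "lowercase and trim"),
    PrepStep("Analyst C", "customers", "email", "lowercase and trim"),
]

conflicts = find_conflicts(shared_steps)
for (source, element), group in conflicts.items():
    analysts = ", ".join(s.analyst for s in group)
    print(f"{source}.{element}: differing rules from {analysts}")
```

Here only `customers.state` is flagged, since both analysts working on `email` already apply the same rule. The flagged elements give the steward a short list of places to start a discussion about a common standard.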

So from a more thoughtful perspective, under the right circumstances, self-service data preparation can be made to improve collaboration. And when that collaboration leads to grass-roots specification and observance of standards, the overall quality and usability of the data will increase dramatically.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, with numerous books, white papers, and web seminars to his name. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders. David is also the author of The Practitioner’s Guide to Data Quality Improvement.
