I'm a very fortunate woman. I have the privilege of working with some of the brightest people in the industry. But when it comes to data, everyone takes sides.
Do you “govern” the use of all data, or do you let the analysts do what they want with the data to arrive at conclusions that could change the business? These are hard decisions that require many conversations.
Let’s examine this dilemma by starting with some definitions.
- Data preparation is the act of cleansing, formatting, integrating and loading data into a data store for consumption by other applications, reporting or analytics.
- Data wrangling could also be called “composting.” We land the data and apply cleansing, formatting and integration at the application layer (ELT), not in the database. This requires a data scientist or data analyst to merge and manage the data in whatever way a specific analysis needs it (both approaches are sketched in code below).
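To make the contrast concrete, here is a minimal Python/pandas sketch of both approaches. The table and column names are invented for illustration, and SQLite stands in for whatever data store you actually run.

```python
import sqlite3
import pandas as pd

# Raw feed as it arrives: messy names, numbers stored as text, a missing value.
raw = pd.DataFrame({
    "customer": [" Alice ", "BOB", None],
    "spend": ["100", "250", "75"],
})

con = sqlite3.connect(":memory:")

# Data preparation (ETL): cleanse, format and integrate *before* loading, so
# every downstream consumer reads the same governed, quality-checked table.
prepared = (
    raw.dropna(subset=["customer"])          # cleanse: drop incomplete rows
       .assign(
           customer=lambda d: d["customer"].str.strip().str.title(),  # format
           spend=lambda d: d["spend"].astype(float),                  # type it
       )
)
prepared.to_sql("customers_curated", con, index=False)

# Data wrangling ("composting"): land the data untouched and transform at the
# application layer, per analysis. Here one analyst keeps the incomplete row
# and only coerces the numbers, because that is all this analysis needs.
raw.to_sql("customers_raw", con, index=False)
landed = pd.read_sql("SELECT * FROM customers_raw", con)
wrangled = landed.assign(spend=landed["spend"].astype(float))
```

The difference is where the cleansing lives: once, upstream, for everyone (preparation), or per analysis, downstream, by whoever needs it (wrangling).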
Most organizations have some sort of operational data store as well as a data warehouse, and most of these data stores have been managed and governed with specific patterns for read, insert, update and delete functions. With the introduction of the data lake, though, how do we manage and govern the data? Or does it even matter?
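As a toy illustration of what those “specific patterns for read, insert, update and delete” can look like, here is a hypothetical role-based gate in Python. Real warehouses enforce this with database privileges, and the roles and grants below are made up; a data lake, by contrast, often lands files with no such checkpoint.

```python
# Hypothetical role-to-operation grants for a governed data store.
GRANTS = {
    "analyst": {"read"},
    "etl_job": {"read", "insert", "update"},
    "dba":     {"read", "insert", "update", "delete"},
}

def execute(role: str, operation: str, table: str) -> str:
    """Allow an operation only if the role has been granted it."""
    if operation not in GRANTS.get(role, set()):
        raise PermissionError(f"{role} may not {operation} on {table}")
    return f"{operation} on {table} allowed for {role}"

print(execute("analyst", "read", "customers_curated"))
# execute("analyst", "delete", "customers_curated")  # raises PermissionError
```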
One of my first thoughts was: “Heck, I have all this data, so why not just send it to the data lake from the data warehouse and/or operational data store?” Sounds logical to me. But what if the analysis doesn't require good-quality data? Interesting, right? That's not how we have traditionally gathered requirements.
So, what do we need to do to make sure data is consumed the way it needs to be consumed, and from the “correct” data store?
Watch for Part 2 of this series where we'll continue this discussion.
Got 2 minutes? Watch a video to learn more about data preparation for analytics.