Data preparation and data wrangling, Part 2 (yippee, bring your lasso)

In Part 1 of this two-part series, I defined data preparation and data wrangling, then raised some questions about requirements gathering in a governed environment (i.e., ODS and/or data warehouse). Now – all of us very-managed people are looking at the horizon, and we see the data lake. How do we manage THAT?

As data lake usage continues to grow, we may need to modernize our thinking about different types of data and how each should be managed. The definition of data management, consequently, will take on some new aspects. For example, data in the data lake may not be checked for quality or integrated with other data, so it must not be used for governmental and/or external reporting. That’s a rule that can be put in place… So, yep, data lake: here we come!

Here are some other thoughts:

Data lakes may be used as a staging area for data discovery.
Data lakes must have a pattern of ELT where technical metadata is prevalent and easily exposed.
Data lakes are only used for raw, unformatted data to glean business knowledge that will be applied to the operational application, operational data store and the data warehouse.
Data lakes are not used for compliance reporting.
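
Rules like these are easier to live with when they are written down as metadata and checked, rather than remembered. Here is a minimal sketch of that idea in Python. Everything in it is hypothetical and for illustration only: the `Dataset` class, its `zone` and `quality_checked` fields, the purpose labels and the `allowed_for` function are my assumptions, not part of any product.

```python
from dataclasses import dataclass

# Hypothetical purpose labels (assumed names, for illustration only).
DISCOVERY = "discovery"
COMPLIANCE_REPORTING = "compliance_reporting"


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: str              # assumed labels: "lake" or "warehouse"
    quality_checked: bool  # has the data passed data quality checks?


def allowed_for(dataset: Dataset, purpose: str) -> bool:
    """Return True if the dataset may be used for the given purpose."""
    if purpose == COMPLIANCE_REPORTING:
        # Rule from the list above: data lake data is not used for
        # compliance reporting. This sketch also requires a quality check.
        return dataset.zone != "lake" and dataset.quality_checked
    # Discovery and other exploratory uses are open to any dataset.
    return True


# Made-up example datasets.
raw_clicks = Dataset("raw_clicks", zone="lake", quality_checked=False)
gl_balances = Dataset("gl_balances", zone="warehouse", quality_checked=True)
```

With these made-up datasets, `allowed_for(raw_clicks, DISCOVERY)` is true, while `allowed_for(raw_clicks, COMPLIANCE_REPORTING)` is false. The point is not the code itself but that the rule lives in one place where it can be reviewed and changed.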

Data scientists spend up to 80% of their time formatting data for their specific needs. Does that change in the data lake environment? It’s a reminder that not everyone needs the data in the same form (sorry, my IT friends). Governance, usage monitoring and metadata are all areas of concern on our new data lake platforms, so we need to be actively involved in how the data lake is implemented and used. That involvement will pay off in the long run.

For example, why write a ton of Java code (hello, 1999) for data manipulation, only to have to maintain it? How would I even find the metadata to do impact analysis for a change or enhancement? All of these are questions your organization needs to address.
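
To make the impact analysis question concrete: if lineage metadata is captured as "this dataset feeds that one," then finding everything a change could touch is a simple graph walk. The sketch below assumes a hypothetical `LINEAGE` mapping with made-up dataset names; in practice that information would have to come from whatever metadata your tools actually expose.

```python
from collections import deque

# Hypothetical lineage metadata: each key feeds the datasets in its list.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["sales_mart", "churn_features"],
    "sales_mart": ["revenue_report"],
    "churn_features": [],
    "revenue_report": [],
}


def downstream_of(source, lineage):
    """Return every dataset reachable from `source` (what a change could affect)."""
    impacted = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

Calling `downstream_of("clean_orders", LINEAGE)` returns the mart, the feature set and the report that depend on it. If the lineage is buried in hand-written code instead of exposed as metadata, there is nothing to walk.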

Just remember – some data is not meant to be lassoed. Some data should be governed based on requirements.

Download a paper about 5 data management for analytics best practices.