Data preparation and data wrangling, Part 2 (yippee, bring your lasso)


In Part 1 of this two-part series, I defined data preparation and data wrangling, then raised some questions about requirements gathering in a governed environment (i.e., ODS and/or data warehouse). Now all of us very-managed people are looking at the horizon, and we see the data lake. How do we manage THAT?

As data lake usage continues to grow, we may need to modernize our thinking about the different types of data and how each should be managed. The definition of data management, consequently, will take on some new aspects. For example, data in the data lake may not be checked for quality or integrated with other data, but in that case it may not be used for governmental and/or external reporting. That’s a rule that can be put in place… So, yep, data lake: here we come!

Here are some other thoughts:

  • Data lakes may be used as a staging area for data discovery.
  • Data lakes must follow an ELT (extract, load, transform) pattern in which technical metadata is captured and easily exposed.
  • Data lakes are only used for raw, unformatted data to glean business knowledge that will be applied to the operational application, operational data store and the data warehouse.
  • Data lakes are not used for compliance reporting.
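
A rule like "no compliance reporting from the lake" is easier to enforce when it is written down as a check rather than a memo. Here is a minimal sketch of that idea. The `Dataset` fields, the zone labels and the purpose names are all hypothetical, made up for illustration; your own catalog will have its own vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: str              # hypothetical labels: "lake" or "warehouse"
    quality_checked: bool  # has the data passed quality checks?


# Hypothetical purposes that count as compliance reporting.
COMPLIANCE_PURPOSES = {"regulatory_reporting", "external_reporting"}


def may_use(dataset: Dataset, purpose: str) -> bool:
    """Return True if the dataset may be used for the stated purpose."""
    if purpose in COMPLIANCE_PURPOSES:
        # Compliance reporting needs governed, quality-checked data,
        # so anything still sitting in the lake is ruled out.
        return dataset.zone != "lake" and dataset.quality_checked
    # Discovery and other exploratory uses are allowed everywhere.
    return True


clickstream = Dataset("raw_clickstream", zone="lake", quality_checked=False)
ledger = Dataset("general_ledger", zone="warehouse", quality_checked=True)

print(may_use(clickstream, "data_discovery"))
print(may_use(clickstream, "regulatory_reporting"))
print(may_use(ledger, "regulatory_reporting"))
```

The point is not the code itself but that the rule becomes something you can test and monitor.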

Data scientists are often said to spend up to 80% of their time formatting data for their specific needs, which implies that not everyone needs the data in the same way (sorry, my IT friends). Does that change in the data lake environment? Governance, usage monitoring and metadata are all areas of concern on our new data lake platforms. We need to be actively involved in the implementation and use of the data lake. This will help in the long run.

For example, why write, and then have to maintain, a ton of Java code (hello, 1999) for data manipulation? How would I even find the metadata to do impact analysis for a change or enhancement? All of these are questions your organization needs to address.
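
To make the impact analysis question concrete: if lineage metadata is captured somewhere you can query, finding everything affected by a change is a simple graph walk. The sketch below assumes a hypothetical lineage map (the dataset names and the `LINEAGE` structure are invented for illustration) and is not tied to any particular metadata tool.

```python
from collections import deque

# Hypothetical lineage metadata: each key feeds the datasets in its list.
LINEAGE = {
    "lake.raw_orders": ["ods.orders"],
    "ods.orders": ["dw.fact_sales", "dw.dim_customer"],
    "dw.fact_sales": ["report.monthly_revenue"],
    "dw.dim_customer": [],
    "report.monthly_revenue": [],
}


def downstream_impact(source, lineage):
    """Breadth-first walk returning every dataset downstream of `source`."""
    seen = set()
    order = []
    queue = deque(lineage.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


print(downstream_impact("lake.raw_orders", LINEAGE))
```

When the transformations live in hand-written code with no exposed metadata, there is no map like this to walk, and that is exactly the problem.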

Just remember: some data is not meant to be lassoed. Some data should be governed based on requirements.



About Author

Joyce Norris-Montanari

President of DBTech Solutions, Inc

Joyce Norris-Montanari, CBIP-CDMP, is president of DBTech Solutions, Inc. Joyce advises clients on all aspects of architectural integration, business intelligence and data management. Joyce advises clients about technology, including tools like ETL, profiling, database, quality and metadata. Joyce speaks frequently at data warehouse conferences and is a contributor to several trade publications. She co-authored Data Warehousing and E-Business (Wiley & Sons) with William H. Inmon and others. Joyce has managed and implemented data integrations, data warehouses and operational data stores in industries like education, pharmaceutical, restaurants, telecommunications, government, health care, financial, oil and gas, insurance, research and development and retail. She can be reached at
