Many organizations have implemented big data solutions, and that data is often stored in a data lake for economic reasons. As data lakes continue to grow in popularity across the enterprise ecosystem, a question arises: Should the data inside these nonrelational systems be subject to data governance? The simple answer is yes: You do need data lake governance whenever the data is used for business decision-making. Let’s explore the value of governing these nonrelational systems.
To set a foundation, data governance is a defined data management process that an organization follows to ensure high-quality data is available throughout the data's full life cycle. So, whenever the organization uses the data inside a data lake to help guide its decisions, that data needs to be governed.
Data governance provides the same value for a data lake that it provides for traditional relational systems. Here are a few business benefits of data lake governance:
- Identified data ownership helps you understand who to go to when there are questions about the data.
- Data definitions and standards allow you to know the right values for a data element.
- Remediation processes provide workflow and escalation procedures for correcting inaccurate data inside the data lake.
- Quick assessment of the data's usability for a specific business process reduces the likelihood of inconsistent reporting results.
- Security of sensitive data improves as you implement controls on who can access the data. This lowers the chances of data theft and other cybercrime, while helping you adhere to regulatory requirements.
- Data is traceable, so you can understand the entire life cycle of the information residing in the data lake – this includes metadata management and lineage visibility.
- Monitored health of the data inside the lake helps ensure that it adheres to governance standards.
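To make a few of these benefits concrete, here is a minimal Python sketch of how ownership, a data standard and a health check might fit together for a single data element. Everything in it is hypothetical and for illustration only: the ElementStandard class, the health_report function, the "country" element, its allowed values and the steward address are assumptions, not part of any particular governance tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementStandard:
    """Hypothetical governance record for one data element."""
    name: str                  # field name as it appears in lake records
    owner: str                 # who to go to with questions about the data
    allowed_values: frozenset  # the "right values" for this element


def health_report(records, standard):
    """Measure how well raw lake records conform to one element standard."""
    # A record violates the standard if the field is missing or its
    # value is not in the allowed set.
    violations = [
        r for r in records
        if r.get(standard.name) not in standard.allowed_values
    ]
    total = len(records)
    conforming = total - len(violations)
    return {
        "element": standard.name,
        "owner": standard.owner,
        "conformance": conforming / total if total else 1.0,
        "violations": violations,  # candidates for a remediation workflow
    }


# Illustrative standard and sample records (all values are made up).
country = ElementStandard(
    name="country",
    owner="sales-data-steward@example.com",
    allowed_values=frozenset({"US", "CA", "MX"}),
)
records = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "usa"},  # nonstandard value
    {"id": 3},                    # missing value
    {"id": 4, "country": "CA"},
]
report = health_report(records, country)
print(report["conformance"], [r["id"] for r in report["violations"]])
```

In a sketch like this, the violations list is what a remediation process would route to the named owner, and the conformance ratio is the kind of figure a monitoring dashboard could track over time.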
Data lake governance – Tools and approaches
The good news is that if your company already has a data governance program in place, a data lake is simply another source to be governed. You can use the same data governance framework and council to govern your data lake. The main difference may be in the tools you'll need to ensure that the data in the lake adheres to the data governance program.
When you're choosing tools to apply data governance standards inside the data lake, make sure your choice is driven by the data governance program's business requirements. The tools should never drive the data governance program. Another important consideration is whether the data lake's metadata can be integrated with the metadata in your current governance tool set, so that the program's lineage, metadata management, traceability and transparency requirements are still met.
Learn more in this white paper: The SAS Data Governance Framework