As the application stack supporting big data has matured, it has demonstrated the feasibility of ingesting, persisting and analyzing potentially massive data sets that originate both within and outside of conventional enterprise boundaries. But what does this mean from a data governance perspective?
Most aspects of data governance for internally created data sets are not significantly impacted. Because the processes are managed internally, the definition and implementation of policies governing these data sets should be directly integrated into the development life cycle. But data sets that are acquired from outside the organization’s boundaries pose challenges when there is little or no metadata or information about their provenance.
As interest in exploiting big data analytics explodes, it's valuable to establish good practices early on that prevent rampant, uncontrolled data downloads and ingestion. Not doing so creates a risk of conflicting interpretations of data semantics, and that can lead to undesired analytical outcomes.
The importance of process
We must consider establishing procedures for introducing new data sources and artifacts into the environment, procedures that simplify integration efforts while harmonizing inferred semantics and interpretations to ensure consistent use. An overall process would analyze ingested data sets to determine the best ways to align their content with a managed and governed internal information architecture.
An example process might include these types of procedures:
- Initial assessment. This procedure performs an objective analysis of the acquired data asset (a minimal profiling sketch follows this list). If the data set is structured, data profiling tools could be used to:
- Statistically analyze the values in each column.
- Do type inferencing (i.e., attempt to infer whether the values are character strings, numeric values, dates, etc.).
- Match column names (if provided) to known data types.
- Identify columns and reference domains that are undetermined so they can be investigated further.
- Metadata validation. The inferred metadata must be reviewed by subject matter experts to determine whether the inferences are correct and to assess alignment with data sets already managed within the context of data policies (a reconciliation sketch follows this list).
- Determination of downstream touch points. Ensuring usability of an acquired data set requires knowing the contexts in which that data set will be used. Part of the acquisition and ingestion process is to engage potential data consumers and schedule time to discuss potential use cases.
- Documentation of usage models. The ways the data might be used direct the design of an internal target data model, as well as the methods to be applied in staging and transformation. The usage models describe how the new data set will augment information delivery to an existing or planned business process, along with the expectations for the data set's quality and accessibility.
- Requirements analysis for data transformation. This is the design stage for developing the transformation sequences that map data from its input format to the desired target representations (a combined mapping-and-validation sketch follows this list).
- Transformation development. In developing the transformations, test and validate that the ingested data sets can actually be used.
- Integration testing. Here, you should ensure that downstream consumers can incorporate the acquired data into their business processes (a contract-check sketch follows this list).
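
To make the initial assessment concrete, here is a minimal profiling sketch in Python with pandas. The column names, reference domains and sample values are hypothetical, and a dedicated profiling tool would go much further; the sketch simply shows the flavor of the per-column statistics, type inferencing and domain matching described above.

```python
import pandas as pd

# Hypothetical reference domains the organization already manages.
KNOWN_DOMAINS = {"customer_id", "order_date", "status_code"}

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect basic per-column statistics and inferred types."""
    report = []
    for col in df.columns:
        series = df[col]
        report.append({
            "column": col,
            # Type inferencing: character strings, numeric values, dates, etc.
            "inferred_type": pd.api.types.infer_dtype(series, skipna=True),
            "null_pct": round(series.isna().mean() * 100, 1),
            "distinct_values": series.nunique(),
            # Match the column name against known reference domains.
            "known_domain": col in KNOWN_DOMAINS,
        })
    return pd.DataFrame(report)

raw = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "order_date": pd.to_datetime(["2020-01-05", "2020-02-11", None]),
    "mystery_col": ["A1", "B2", "C3"],  # undetermined: flag for investigation
})
print(profile(raw))
```

Columns that report `known_domain` as false (like `mystery_col` above) are exactly the ones the initial assessment should queue for further investigation.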
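
Metadata validation is ultimately a human review, but the reconciliation that feeds it can be automated. The sketch below compares inferred metadata against a governed data dictionary and builds a worklist for subject matter experts; the dictionary entries and type labels are assumptions for illustration.

```python
# Hypothetical governed data dictionary: column -> agreed type and steward.
MANAGED_DICTIONARY = {
    "customer_id": {"type": "integer", "steward": "crm-team"},
    "order_date": {"type": "date", "steward": "sales-ops"},
}

def review_worklist(inferred: dict) -> list:
    """List the items subject matter experts need to rule on."""
    items = []
    for column, inferred_type in inferred.items():
        entry = MANAGED_DICTIONARY.get(column)
        if entry is None:
            items.append(f"{column}: not in the dictionary; assign a steward")
        elif entry["type"] != inferred_type:
            items.append(f"{column}: inferred '{inferred_type}' conflicts "
                         f"with governed '{entry['type']}'")
    return items

print(review_worklist({
    "customer_id": "integer",   # matches the dictionary; nothing to review
    "order_date": "string",     # type conflict: needs an SME ruling
    "mystery_col": "string",    # unknown column: needs an owner
}))
```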
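
For the two transformation procedures, a sketch like the following captures the design as a declarative source-to-target mapping, applies it, and runs a simple validation pass. The field names, casts and normalization rules are hypothetical.

```python
import pandas as pd

# Design artifact from requirements analysis: for each target field, the
# source field it comes from and the rule that derives it.
MAPPING = {
    "customer_id": ("cust_no", int),                          # rename + cast
    "full_name": ("name", lambda s: str(s).strip().title()),  # normalize
}

def transform(source: pd.DataFrame) -> pd.DataFrame:
    """Map the data from its input format to the target representation."""
    target = pd.DataFrame()
    for target_col, (source_col, rule) in MAPPING.items():
        target[target_col] = source[source_col].map(rule)
    return target

def validate(target: pd.DataFrame) -> None:
    """Spot checks that the ingested data set can actually be used."""
    assert target["customer_id"].notna().all(), "customer_id must be populated"
    assert (target["full_name"] == target["full_name"].str.strip()).all()

source = pd.DataFrame({"cust_no": ["7", "8"],
                       "name": [" ada lovelace ", "ALAN TURING"]})
result = transform(source)
validate(result)
print(result)
```

Keeping the mapping declarative separates the requirements artifact from the code that executes it, so the mapping itself can be reviewed under the same governance policies as any other data asset.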
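
Finally, integration testing can start from an explicit contract with each downstream consumer. This sketch checks delivered output against an expected schema; the schema is an assumption standing in for a real consumer agreement.

```python
import pandas as pd

# Hypothetical contract agreed with a downstream consumer.
EXPECTED_SCHEMA = {"customer_id": "int64", "full_name": "object"}

def check_consumer_contract(df: pd.DataFrame) -> list:
    """Report any way the delivered data breaks the consumer's expectations."""
    problems = []
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: got {df[column].dtype}, expected {dtype}")
    return problems

delivered = pd.DataFrame({"customer_id": [7, 8],
                          "full_name": ["Ada Lovelace", "Alan Turing"]})
issues = check_consumer_contract(delivered)
print("ready for downstream use" if not issues else issues)
```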
These are just a few ideas for procedures to follow when instituting a governed process for data acquisition. By formally developing data acquisition policies, you can ensure consistency in transformation as well as in use.