Hadoop has driven an enormous amount of data analytics activity lately. And this poses a problem for many practitioners coming from the traditional relational database management system (RDBMS) world.
Hadoop is well known for accommodating a wide variety of data structures. But it's fair to say there is just as much variation in the control and quality of the information chains that feed into and out of a typical Hadoop installation.
I spoke to an analytics business leader at a recent data management event. He confided that managing and tracking the rivers of information flowing into the big data environment had become his company's biggest challenge, from both a regulatory and an administrative standpoint. They had opened the floodgates to enterprise analytics – and now they were retrospectively trying to add controls and measures, with varying degrees of success.
One of the problems with Hadoop is that it has lots of moving parts that are evolving rapidly. Hadoop doesn't have the cohesive interface that many with an RDBMS background will recognise. Historically, managing information that flows into and out of Hadoop environments has required extensive coding and plenty of data hand-offs between data providers and consumers. All of this creates more opportunities for data quality defects to creep into the process.
Formal data preparation processes can certainly help in this situation, for a variety of reasons:
- Data preparation removes hand-coding and manual manipulation of data and replaces it with a more user-friendly environment that reduces the reliance on IT.
- Data preparation controls introduce much-needed data quality measures, checks and balances into the big data landscape.
- Data preparation enables a multiphase approach, encompassing data discovery and mashing up data to find the optimal sources and formats, while delivering greater stability for the ongoing information chains required for high-quality, frequent analytics and management reporting.
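To make the second point concrete, here is a minimal sketch of the kind of quality check a data preparation step can apply before records land in a Hadoop environment. The field names and validation rules are illustrative assumptions, not part of any specific product or the environments described above.

```python
# Illustrative data quality gate for a data preparation step.
# Field names ("customer_id", "amount") and rules are assumptions
# chosen for the example, not a real schema.

def validate_record(record, required_fields=("customer_id", "amount")):
    """Return a list of quality issues found in a single record."""
    issues = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        issues.append("negative amount")
    return issues

def partition_by_quality(records):
    """Split records into clean rows (ready to load) and rejects with reasons."""
    clean, rejects = [], []
    for record in records:
        issues = validate_record(record)
        if issues:
            rejects.append({"record": record, "issues": issues})
        else:
            clean.append(record)
    return clean, rejects

rows = [
    {"customer_id": "C1", "amount": 25.0},
    {"customer_id": "", "amount": -5.0},
]
clean, rejects = partition_by_quality(rows)
# clean holds the first row; the second is rejected with both issues recorded.
```

The point of routing rejects to a separate stream, rather than silently dropping them, is that it gives the checks-and-balances visibility the article calls for: someone can inspect why records failed and fix the upstream source.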
In short, introducing data preparation into your Hadoop environment can deliver much simpler information management for your big data landscape. In turn – as you continue to tweak, upgrade and replace your individual Hadoop elements – your vital information chains should still hum along with far greater stability and quality.