Data Management has been the foundational building block supporting major business analytics initiatives from day one. Not only is it highly relevant, it is absolutely critical to the success of all business analytics projects.
Emerging big data platforms such as Hadoop and in-memory databases are disrupting traditional data architecture by changing the way organisations store and manage data. Furthermore, new techniques such as schema-on-read and persistent in-memory data stores are changing how organisations deliver data and drive the analytical life cycle.
This brings us to the question: how relevant is data management in the era of big data? At SAS, we believe that data management will continue to be the critical link between traditional data sources, big data platforms and powerful analytics. There is no doubt that where and how big data is stored will change and evolve over time. However, that doesn’t affect the need for big data to be subject to the same quality and control requirements as traditional data sources.
Fundamentally, big data cannot be used effectively without proper data management.
Data has always been more valuable and powerful when it is integrated, and this will remain true in the era of big data.
It is well known that whilst Hadoop is being used as a powerful storage repository for high-volume, unstructured or semi-structured information, most corporate data is still locked in traditional RDBMSs or data warehouse appliances. The true value of weblog traffic or meter data stored in Hadoop can only be unleashed when it is linked and integrated with the customer profile and transaction data stored in existing applications. Integrating high-volume, semi-structured big data with legacy transaction data will provide powerful business insights that can be game changing.
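As a minimal sketch of the kind of integration described above, the snippet below joins weblog events with customer profiles on a shared key and counts page views per customer segment. The records, field names and the `page_views_by_segment` function are hypothetical and exist only for illustration; in practice the events would come from Hadoop and the profiles from an existing application database.

```python
from collections import defaultdict

# Hypothetical weblog events, as they might be exported from Hadoop.
weblog_events = [
    {"customer_id": "C001", "page": "/pricing"},
    {"customer_id": "C002", "page": "/support"},
    {"customer_id": "C001", "page": "/checkout"},
    {"customer_id": "C999", "page": "/home"},  # no matching profile
]

# Hypothetical customer profiles, as they might be held in an RDBMS.
customer_profiles = {
    "C001": {"name": "Alice", "segment": "enterprise"},
    "C002": {"name": "Bob", "segment": "consumer"},
}


def page_views_by_segment(events, profiles):
    """Count page views per customer segment by joining on customer_id."""
    counts = defaultdict(int)
    for event in events:
        profile = profiles.get(event["customer_id"])
        # Events with no matching profile are grouped under "unknown".
        segment = profile["segment"] if profile else "unknown"
        counts[segment] += 1
    return dict(counts)


views = page_views_by_segment(weblog_events, customer_profiles)
print(views)
```

On its own the weblog only says which pages were viewed; the segment breakdown becomes available only once the two sources are linked.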
Big data platforms provide an alternative source of data within an organisation’s enterprise data architecture today, and therefore must be part of an organisation’s integration capability.
Just because data lives in and comes from a new source and platform doesn’t mean that high levels of quality and accuracy can be assumed. In fact, Hadoop data is notoriously poor in quality and structure, simply because of the lack of control over how data gets into a Hadoop environment and the ease with which it can do so.
Just like traditional data sources, raw Hadoop data needs to be profiled and analysed before it can be used. Issues such as non-standardised fields and missing data often become glaringly obvious when analysts try to tap into Hadoop data sources. Automated data cleansing and enrichment capabilities within the big data environment are critical to making the data more relevant, valuable and, most importantly, trustworthy.
As Hadoop gains momentum as a general-purpose data repository, there will be increasing pressure to adopt traditional data quality processes and best practices.
It should come as no surprise that policies and practices around data governance will need to be applied to new big data sources and platforms. The requirements of storing and managing metadata, understanding lineage and implementing data stewardship do not go away simply because the data storage mechanism has changed.
Furthermore, the unique nature of Hadoop as a highly agile and flexible data repository brings new privacy and security challenges in how data needs to be managed, protected and shared. Data Governance will play an increasingly important role in the era of big data as the need to better align IT and business increases.
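To make the profiling and cleansing step concrete, here is a small sketch that counts missing values per field and standardises an inconsistently entered field. The sample records and the `profile` and `standardise_state` helpers are assumptions made up for this example; they are not part of any SAS or Hadoop API.

```python
from collections import Counter

# Hypothetical raw records, as they might land in a Hadoop environment.
raw_records = [
    {"customer_id": "C001", "state": "nsw ", "email": "a@example.com"},
    {"customer_id": "C002", "state": "N.S.W.", "email": ""},
    {"customer_id": "C003", "state": "Vic", "email": None},
    {"customer_id": "C004", "state": "VIC", "email": "d@example.com"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile(records):
    """Return, per field, how many records have a missing value."""
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if is_missing(value):
                missing[field] += 1
    return dict(missing)


def standardise_state(value):
    """Normalise a state code: trim, drop full stops, upper-case."""
    return value.strip().replace(".", "").upper()


missing_counts = profile(raw_records)
cleaned = [{**r, "state": standardise_state(r["state"])} for r in raw_records]
distinct_states = sorted({r["state"] for r in cleaned})

print(missing_counts)
print(distinct_states)
```

Profiling first shows where the gaps are; standardising afterwards collapses four spellings of the state field into two consistent codes.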
Whilst the technology underpinning how organisations store their data is going through tremendous change, the need to integrate, govern and manage the data itself has not changed. If anything, the changes to the data landscape and the increase in types and forms of data repositories will make the tasks around data management more challenging than ever.
SAS recognises the challenge faced by our customers and has continued to invest in our extensive Data Management product portfolio by embracing big data platforms from leading vendors such as Cloudera and Hortonworks, as well as supporting new data architecture and data management approaches.
As this recent NY Times article appropriately called out, a robust and automated data management platform within a big data environment is critical to empowering data scientists and analysts, freeing them from “Data Janitor” work so that they can focus on high-value activities.