The topic of big data is everywhere in the world of data analysis, and for good reason – the sheer size and variety of data sources boggle the mind. Consider one oft-cited statistic: 90% of the data ever produced in human history has been produced in the last two years. This growth is forcing a fundamental shift in the assumptions behind data management in every organization, and C-level executives are struggling to make sense of, much less take advantage of, the onslaught of data.
Most of the discussion about big data has focused on technology – which tools work best, what they can do, and how they differ from “traditional” data management technologies. This discussion is useful, but it remains stuck in an assumption about data processing that big data has made obsolete:
There is no way that any organization can store and analyze all of the data available to it in the world of big data.
That statement is discouraging to someone like me, who has spent a career organizing data in traditional data architectures. But there is good news:
There is no need for any organization to store and analyze all of the data available to it in the world of big data.
Traditional data architectures assume that all data produced by source systems must be available for downstream analysis on a daily or ad-hoc basis. Data warehousing systems have traditionally facilitated this analysis, and the selection of relevant data was mostly left to the business analysts doing the analysis (in its simplest form, through the SQL queries they wrote). The source systems have the benefit of often being designed by the same staff that designs the organization’s other systems, and integration with outside systems is similarly controlled, with obvious integration points (such as a Customer ID used to integrate credit-scoring data).
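To make that “obvious integration point” concrete, here is a minimal sketch of the traditional pattern: two internal systems share a key, and the analyst’s query is a simple join on it. The table and column names (customers, credit_scores, customer_id) are hypothetical, chosen purely for illustration:

```python
import sqlite3

# Hypothetical traditional-architecture example: two internal systems
# share an obvious integration point, a customer ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE credit_scores (customer_id INTEGER, score INTEGER);
    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO credit_scores VALUES (1, 720), (2, 655);
""")

# The analyst's query, in its simplest form: a join on the shared key.
for name, score in conn.execute("""
    SELECT c.name, s.score
    FROM customers c
    JOIN credit_scores s ON s.customer_id = c.customer_id
"""):
    print(name, score)
```

Big data sources rarely arrive with so clean a key, which is exactly the shift the rest of this piece describes.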
Given the first statement above, it is physically impossible to store all data that might be used for analytics, for two reasons: the sheer volume of the data would exhaust almost any data storage infrastructure, and reporting tools would rapidly become inefficient, and eventually unusable, in such an environment. Big data is also variable in structure and inherently incomplete, offering even less of the control that organizations exercised over data in traditional architectures.
Today’s organizations must choose which data to analyze before it even enters the data architecture. This change follows from the second statement above – most organizations have no need for information about Lady Gaga’s fashion choices or the latest standings in the English Premier League, so this extraneous data should be filtered out before it enters the organization. Changing this assumption about data storage requires changes in how the organization defines and implements data acquisition, storage, and disposal requirements.
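What might filtering at the point of acquisition look like in practice? A minimal sketch follows; the relevance taxonomy and record structure are assumptions made purely for illustration:

```python
# Hypothetical sketch: drop extraneous records (celebrity or sports
# chatter) before they ever reach the organization's storage layer.
RELEVANT_TOPICS = {"our_brand", "our_products", "our_competitors"}  # assumed taxonomy

def ingest(stream):
    """Yield only records worth storing; everything else is discarded
    before it enters the data architecture."""
    for record in stream:
        if RELEVANT_TOPICS & set(record.get("topics", [])):
            yield record

incoming = [
    {"id": 1, "topics": ["our_brand"], "text": "Great service!"},
    {"id": 2, "topics": ["premier_league"], "text": "What a match!"},
]
print(list(ingest(incoming)))  # only record 1 survives the filter
```

The interesting work is not the code but deciding what belongs in the relevance set – which is precisely why the requirements discussion below matters.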
Let’s examine each of these areas for some of the necessary changes:
- Data Acquisition – Big data takes design control over the sources away from the organization, so business analysts must become more actively involved in defining what data is required and how it integrates with existing systems. Complicating matters, big data sources do not generally provide a complete entity such as Customer or Product, and they may or may not offer obvious integration points. This requires an enhanced partnership between IT staff, who can profile available sources and provide integration expertise, and business analysts, who provide business insight into data requirements.
- Data Storage – Both business and IT need to factor storage requirements into their system design activities, something that has not generally been necessary since data storage became a commodity over the past 10-15 years. Organizations also need to devise new and innovative methods of using data that is not stored within the organization.
- Data Disposal – Many organizations have never considered a data disposal strategy because their infrastructures had room for all the data they consumed. IT and business need to become more active in defining disposal requirements to keep the system efficient; a sketch of one such retention rule follows this list.
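As promised above, here is a minimal sketch of a disposal rule. The 90-day cutoff and the record shape are assumptions for illustration; real retention windows would come from the business and legal requirements the list describes:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention rule: raw records older than 90 days are
# disposed of. The cutoff is an assumed requirement, not a recommendation.
RETENTION = timedelta(days=90)

def apply_retention(records, now=None):
    """Return only the records still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["acquired_at"] <= RETENTION]

records = [
    {"id": 1, "acquired_at": datetime.now(timezone.utc) - timedelta(days=10)},
    {"id": 2, "acquired_at": datetime.now(timezone.utc) - timedelta(days=400)},
]
print(apply_retention(records))  # record 2 falls outside the window
```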
None of these challenges is fundamentally technical; they are rooted in communication. Tools such as SharePoint, wikis and metadata repositories can help with efficiency and scalability, but a genuine willingness for business and IT to communicate and work together as a team is the most important requirement for success in big data integration. Does your organization have what it takes to succeed?