Back before storage became so affordable, cost was the primary factor in determining what data an IT department would store. As George Dyson (author and historian of technology) says, “Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.”
But as storage costs have plummeted, data volumes have increased astronomically. I believe that cost is still a factor in data retention decisions, but risk, productivity, and the analytical purpose and intended use of the data need to come to the forefront.
Big data and Hadoop
At IT workshops and conferences, the topics of big data and Hadoop are a centerpiece of many conversations. Executives are excited about the possibilities of using big data to better serve customers, and they’re eager to quickly turn their new "big data insights" into action.
IT organizations, in turn, prepare a battery of technologies to host this data (mainly in Hadoop), focusing primarily on how to store it all. Analytics teams try to figure out how their current tools will help them extract insights and how they can rapidly turn those insights into tangible business improvements.
Not to dampen enthusiasm – but big data does not always equal big value.
To get real value out of your data, you should think through how you can harvest the relevant data that aligns with your analytical strategy. That strategy, by the way, should include more than traditional data retention considerations, like how long to keep data. It should also account for how you are going to manage big data in a way that makes it ready for analytics.
How much data should you store?
Obviously, you don’t need to store all of your data. Relevant data is a function of your analytical maturity. Consider, for example, analytical data structure requirements.
It’s complicated to determine how much transactional data (how many time periods) you need to develop an accurate forecasting model. That’s because the answer depends on how the events of interest correlate over time, and on how often trends or cyclical patterns recur.
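As a rough illustration, one way to gauge how much history matters is to check autocorrelation at candidate seasonal lags. This is a minimal sketch, assuming a monthly series held in pandas; the synthetic data and the lag choices are hypothetical, not a prescription.

```python
# Minimal sketch: gauge how much transactional history a forecast may need
# by checking autocorrelation at candidate seasonal lags.
# Assumes a monthly series; the data and lags below are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
months = pd.date_range("2018-01-01", periods=60, freq="MS")
# Synthetic monthly demand with a yearly cycle plus noise
demand = 100 + 20 * np.sin(2 * np.pi * months.month / 12) + rng.normal(0, 5, len(months))
series = pd.Series(demand, index=months)

# Strong autocorrelation at lag 12 would suggest keeping several full years
# of history; weak autocorrelation at all lags suggests less history is needed.
for lag in (3, 6, 12):
    print(f"lag {lag:>2} months: autocorr = {series.autocorr(lag=lag):.2f}")
```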
Many organizations that take advantage of Hadoop use a data landing zone as their single source of data. This is approaching big data from an acquisition perspective – not an analytics perspective.
I find this approach odd. It’s like filling my warehouse with inventory without knowing my customers' needs. Companies like this would benefit from mapping their Hadoop data stores to specific analytical solutions. (By the way, SAS tools can help you easily convert that data into valuable information.)
Data storage – a different perspective
The Internet of Things has led to all sorts of data being stored in Hadoop. This includes logs from website visits, location tracking from cell phones, vital signs from patients, data from amusement park customers with smart wristbands, and media consumption patterns from downloads and streaming. This data in its raw form is not ready for analytics.
In the case of website visits, data is usually restaged in different ways depending on the application it will support. The most common structures are star schemas, which allow for efficient reporting. But these structures are not well suited to advanced analytics, because they’re optimized for producing reports rather than for building models.
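As a small illustration of the difference: a star-schema style fact/dimension join supports aggregate reporting, while a model needs one analysis-ready row per entity instead. The table names and columns below are hypothetical, just a sketch of the two shapes.

```python
# Minimal sketch: a star-schema style fact/dimension join is great for
# aggregate reports, but it is not the one-row-per-entity table a model needs.
# Table names and columns are hypothetical.
import pandas as pd

fact_visits = pd.DataFrame({
    "visitor_key": [1, 1, 2, 3],
    "date_key": [20240501, 20240502, 20240501, 20240503],
    "page_views": [5, 2, 7, 1],
})
dim_visitor = pd.DataFrame({
    "visitor_key": [1, 2, 3],
    "region": ["EMEA", "AMER", "AMER"],
})

# Typical reporting use: aggregate measures by a dimension attribute.
report = (fact_visits.merge(dim_visitor, on="visitor_key")
          .groupby("region", as_index=False)["page_views"].sum())
print(report)
# Advanced analytics instead needs one row per visitor with many features,
# which is a different restaging of the same raw logs.
```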
In an attempt to capture the richness of the information, logs are mapped to features that correspond to the visit characteristics (e.g., site visited, timestamp, page color, device used to access the page). This practice is problematic from a content and structure perspective, because feature content needs to be extensively documented and categorized.
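To make that mapping concrete, here is a minimal sketch of turning raw web-log records into named visit features, along with the kind of feature dictionary that has to be maintained alongside them. The field names, codes and log records are hypothetical.

```python
# Minimal sketch: map raw web-log records to coded visit features and keep a
# feature dictionary describing each one. Field names and codes are hypothetical.
from datetime import datetime

raw_logs = [
    {"url": "https://example.com/products", "ts": "2024-05-01T10:15:00", "ua": "Mobile Safari"},
    {"url": "https://example.com/checkout", "ts": "2024-05-01T10:18:30", "ua": "Chrome Desktop"},
]

# Without a dictionary like this, coded features quickly become unintelligible.
feature_dictionary = {
    "F001": "site section visited",
    "F002": "hour of visit",
    "F003": "device class used to access the page",
}

def to_features(log):
    ts = datetime.fromisoformat(log["ts"])
    return {
        "F001": log["url"].rsplit("/", 1)[-1],                     # site section visited
        "F002": ts.hour,                                           # hour of visit
        "F003": "mobile" if "Mobile" in log["ua"] else "desktop",  # device class
    }

for log in raw_logs:
    print(to_features(log))
```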
I was involved in a project where there were more than 75,000 features. You can imagine how time-consuming it becomes to understand features and their relationships once they’re stored in a transactional manner where each feature is represented by a code. (For a review of this project, watch this video.)
Thinking from a structural perspective, a transactional table of features with timestamps grows to a very large number of rows. Each business question may require the data to be stored, organized, transformed and served to users in different ways. For example, if you want to quickly guess website visitors’ demographic characteristics based on visit patterns, you first have to determine the appropriate time window, how quickly that window may change, and how the “guessing” process will take place (real-time scoring or batch). Answering these questions will help reveal the best way to treat the data upstream.
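For instance, here is a minimal sketch of reshaping such a transactional (long) feature table into one row per visitor over a chosen time window, which is the kind of structure a “guessing” (scoring) process would consume. The table contents and the 30-day window are hypothetical assumptions.

```python
# Minimal sketch: pivot a transactional (long) table of coded features into one
# row per visitor over a chosen time window. Values and the 30-day window are
# hypothetical, not the project's actual data.
import pandas as pd

long_table = pd.DataFrame({
    "visitor_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-05-01", "2024-05-10", "2024-03-01", "2024-05-05", "2024-05-06"]),
    "feature_code": ["F001", "F002", "F001", "F001", "F003"],
    "value": [3, 1, 5, 2, 4],
})

# Keep only the window relevant to the business question (e.g., the last 30 days).
cutoff = pd.Timestamp("2024-05-15") - pd.Timedelta(days=30)
recent = long_table[long_table["timestamp"] >= cutoff]

# Wide, one-row-per-visitor structure ready for model training or scoring.
wide = recent.pivot_table(index="visitor_id", columns="feature_code",
                          values="value", aggfunc="sum", fill_value=0)
print(wide)
```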
We should also note that the data structures used to develop analytical models based on data mining or machine learning have different requirements from the structures that support analytic business applications. In the digital example, building the predictive model drew on more than 75,000 possible features, but the scoring process requires only 30 of them. That makes deploying the scoring process much simpler than building the analytical data marts.
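As a sketch of why deployment is simpler: the scoring step only needs the small subset of features the final model kept, so the upstream data preparation can be trimmed accordingly. The feature codes and coefficients below are hypothetical.

```python
# Minimal sketch: scoring only needs the small feature subset the final model
# kept, so upstream preparation can be far leaner than the training data mart.
# Feature codes and coefficients are hypothetical.
SCORING_FEATURES = {"F001": 0.8, "F002": -0.3, "F003": 1.2}  # e.g., 30 in practice

def score(visitor_features: dict) -> float:
    """Linear score over the retained features; missing features count as 0."""
    return sum(weight * visitor_features.get(code, 0.0)
               for code, weight in SCORING_FEATURES.items())

# Only the retained features have to be served at scoring time,
# not the 75,000+ candidates explored during model development.
print(score({"F001": 3, "F003": 1}))
```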
Finding the balance
So, what is the right balance between short- and long-term data storage in the world of big data? I rely on the following to decide:
- A practical perspective based on the main business uses for the data.
- The data structures needed for analytics.
- The relevance of the data over time (and its changing nature).
- Development and deployment needs.