Putting a price on data storage


One obvious result of the data explosion is that organizations now have more data to manage. There’s simply too much data pouring in to process, clean and store it all. Since it’s impossible to warehouse every single piece of data, it’s important to start thinking about the value of your data and how much you want to invest in it.

Let’s consider an extreme example: federal law enforcement agencies. In developed countries these organizations are challenged with huge data streams coming in from travel records, voice recordings, online transactions, official government applications and many other sources. As a result, it’s very hard to figure out what to keep, what to analyze now, what to save for later and what to throw away.

This isn’t an article about public safety, so don’t stop reading if you’re in a different industry. I think you’ll find by the end that the example relates to your business as well. One solution for data collection in law enforcement agencies is to monitor their data streams for the appearance of important concepts or trends that raise a red flag. Essentially, they would store data in raw formats until a topic of interest is identified, or they would store only a portion of the data up until that moment. But once the flag is raised, everything related to that topic would get stored and the data would be optimized for analysis.

If we think of the data stream as the surface of a body of water, law enforcement agencies are constantly monitoring that surface for a ripple. If they had sonars collecting and storing detailed data at multiple layers below the surface, they'd be gathering so much data that they couldn't afford to keep all the sonars running. So, the solution is to use one or two sonars until one picks up an important ripple, and then turn on as many related sonars as possible to start collecting and processing related data.

In other words, the agency only partially listens to conversations across the table until something piques its interest. When it hears something that sounds important, ideally, it could backtrack to some of the stored data and begin preparing that data for analysis.
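The "ripple" strategy described above can be sketched in a few lines: sample the stream cheaply until a trigger concept appears, back-fill from a small rolling buffer of recent raw data, then switch to keeping everything. This is a hypothetical illustration; the trigger function, sampling rate and buffer size are assumptions, not anything from a real system.

```python
from collections import deque

def monitor(stream, trigger, sample_every=10, buffer_size=100):
    """Yield the records worth storing from an incoming data stream.

    Before the flag is raised, store only a sample plus a cheap rolling
    buffer; once a record trips the trigger, back-fill the buffered
    context and keep everything that follows.
    """
    recent = deque(maxlen=buffer_size)   # rolling buffer of raw records
    flagged = False
    for i, record in enumerate(stream):
        recent.append(record)
        if flagged:
            yield record                 # after the flag: keep it all
        elif trigger(record):
            flagged = True
            yield from recent            # back-fill the buffered context
        elif i % sample_every == 0:
            yield record                 # pre-flag: store only a sample
```

The buffer is what lets the agency "backtrack" once something sounds important; its size is the knob that trades storage cost against how far back you can look.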

This example shows how mature organizations understand the value of learning from data and creating a process to manage and store data for different uses. One simple method I like to use is to look at your data in three categories:

  • Two-dollar data: A lot of traditional, historical reporting has been regulatory or financial in nature with requirements for data to be consistent and auditable. Since backup and recovery were important, enterprise data warehouses were built to ensure data was handled consistently. Data placed in a data warehouse can be thought of as two-dollar data after the notion that each byte costs around $2 to store and manage.
  • One-dollar data: More recently, organizations have data coming in from external feeds and from many other parts of the organization that aren’t strictly financial. This data is useful for analysis; however, it does not need to meet the stringent management and quality criteria of the two-dollar data. You can think of this as one-dollar data because you still care about the accuracy and reliability of the data but you don’t need to spend quite as much looking after it. You can store it in a data mart for regular updating and reporting, and even cleanse it and establish an audit trail for that price, but your standards and data governance processes can be a bit more lenient.
  • Fifty-cent data: Increasingly, organizations are looking at more and more transient or nonessential data that could still be used to make decisions. Examples include temperature control information in storage containers, location data from smart devices, instrumentation output on airplanes or medical devices, sensors and indicators in green buildings, and even information on social media sites. Some organizations use this kind of data to gain insight but don’t archive it or worry about the absolute “accuracy” and auditability of the data. Other organizations might store this type of data quickly and cheaply for use at a later date – like the law enforcement agencies discussed above. Since you can manage this data cheaply, we call it fifty-cent data.

It’s important to consider the different ways you can manage your data, because each category has different cost dynamics. Why would you put your fifty-cent data in an enterprise data warehouse? You wouldn’t. You just need to store it in a format that lets you retrieve it in bulk quickly when you need it.
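The three categories above amount to a routing rule: pick a tier from a record's governance requirements, then send it to a matching store. Here is a minimal sketch of that idea; the tier names follow the post, but the storage targets and the attributes used to classify a record are illustrative assumptions, not specific products or a real schema.

```python
# Map each tier to an assumed storage target and governance level.
TIERS = {
    "two-dollar": {"target": "enterprise_data_warehouse",
                   "audited": True, "cleansed": True},
    "one-dollar": {"target": "data_mart",
                   "audited": True, "cleansed": True},
    "fifty-cent": {"target": "bulk_object_store",
                   "audited": False, "cleansed": False},
}

def classify(record):
    """Pick a tier from coarse attributes of the record (assumed flags)."""
    if record.get("regulatory") or record.get("financial"):
        return "two-dollar"    # must be consistent, auditable, backed up
    if record.get("reliable_source"):
        return "one-dollar"    # useful for analysis, lighter governance
    return "fifty-cent"        # transient sensor or social data

def route(record):
    """Return the tier and the storage target a record should go to."""
    tier = classify(record)
    return tier, TIERS[tier]["target"]
```

The point of the sketch is simply that the routing decision is cheap to make at ingest time, while the cost difference it drives, warehouse versus bulk store, is large.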

TDWI has covered the shifting storage landscape recently in its paper Data Requirements for Advanced Analytics. The chart on page 4, which describes enterprise data warehouses, data marts and analytic databases, aligns pretty closely with the two-dollar, one-dollar and fifty-cent data concepts discussed in this blog post.

Businesses are struggling with what data to keep and how to store it, because there's simply too much of it to save everything. However, if you throw away most of your data, your ability to go back and find early predictors is very limited. Ultimately, organizations that only store two-dollar data will be missing out on opportunities, so low-cost storage initiatives should be a part of your data storage landscape.

About Author

Keith Collins

Senior Vice President and Chief Technology Officer

Keith is responsible for leading the Research and Development, Information Services and Technical Support Divisions at SAS. He fosters close working relationships with marketing and sales to ensure that SAS technologies are aligned with customer needs and market demand. He has been instrumental in leading SAS' evolution as a provider of industry-specific solutions that deliver the benefits of powerful analytic technologies into the hands of users. A graduate of North Carolina State University in computer science, Keith is a devoted supporter of the university. He is the founding member of the strategic advisory board of the department of computer science.

2 Comments

  1. Nice article; I liked the way Keith categorizes data. However, I'm wondering if there is a mis-print: Did he really mean $2 per *byte*? Seems like that would make for a very expensive EDW.

  2. Good point. I should have said "if you considered the relative value of data elements stored in your data warehouse as two dollar data".
    You're correct. If we were literally paying $2/byte we surely wouldn't be experiencing the explosion of data.
