Data Super Savers vs Data Science


“Dear Cat,
I got an email from my IT department that says:
[We are nearing capacity on the Flotsam Drive. Please clear data from any folders you are no longer using so we can save disk space.
The IT Department]

Doesn’t this strike you as a bit old-fashioned? I mean, isn’t disk space practically free now?

Dear DataLover,

My first reaction is: yes! You’re right! Disk space is practically free, so why worry about storing some extra files? I am a bit of a data hoarder, though, so perhaps my views deserve some scrutiny.

Certainly, the message coming out of providers of software and services for the Hadoop ecosystem is that a good data science citizen keeps everything. Long gone are the days when we had to carefully scrub the data, roll the files up to something compact, and get rid of the excess to free up storage space. There might be untapped value in unstructured logs, transactional databases, and other “clutter” files.

So, when is it better to keep versus eliminate data? I have a few thoughts about this.

  • If you can gain more from using the data than you spend to keep it, then by all means, keep the data. The sources might surprise you: data scientists make billions of dollars for their companies annually by building data products out of log files and other data that has historically been considered garbage or “exhaust.”
  • If the data get no use, and are old enough that data products would not benefit from them, then it is best to delete them. But I would ask: if the data are not used, should they be? There could be untapped value there.
  • If there is historical information about your company’s performance that can be tied to specific initiatives, then keep the data. There is something to learn here. As an example, if you can track marketing campaigns, staffing decisions, acquisitions and merger information, etc. then you can see which activities were followed by changes in revenue, customer reach, profit, market share, etc. This is not causal information, but it can direct you to your next business experiment in a hurry.
  • If the data can place your organization at risk, then it is prudent to eliminate. This is the case with personally identifiable information (PII), financial records that are no longer needed for audit trails, email records that may contain proprietary conversations with clients, and so on. In this case, there is more to lose from keeping the data than there is to gain from keeping them.
  • And finally, if the data include pictures of your boss at the last company picnic wearing that Hello Kitty costume and dancing the electric slide, it’s probably best to just let it go. Nobody needs to see that.
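The rules of thumb above can be read as a simple decision procedure. As a toy illustration only (not anything from SAS or the post, and with all names invented), it might be sketched like this:

```python
# Toy sketch of the keep-vs-delete rules of thumb from the post.
# All parameter names are invented for illustration; real decisions
# involve your data security, IT, and analyst teams.

def keep_or_delete(annual_value, annual_storage_cost,
                   used_recently, ties_to_initiatives, poses_risk):
    """Return 'keep' or 'delete' following the post's rules of thumb."""
    if poses_risk:                      # PII, stale financial records, etc.
        return "delete"
    if annual_value > annual_storage_cost:
        return "keep"                   # gain exceeds cost of keeping
    if ties_to_initiatives:             # historical performance data
        return "keep"
    if not used_recently:               # old, unused, no product benefit
        return "delete"
    return "keep"                       # when in doubt, the hoarder keeps it

# Old, unused data with no tie to business initiatives: delete.
print(keep_or_delete(annual_value=0, annual_storage_cost=100,
                     used_recently=False, ties_to_initiatives=False,
                     poses_risk=False))  # delete
```

Note the ordering: risk trumps value, which matches the post's point that some data are more dangerous to keep than they are worth.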

Our experiences are different, so I’d love to hear your thoughts about this in the comments below. And, if you would like to spend some quality time talking data hoarding with my colleagues and me, consider coming to one of our data scientist training courses. See you in class!

  • Strategies and Concepts for Data Scientists and Business Analysts
  • Data Science: Building Recommender Systems with SAS and Hadoop (Bogota)

About Author

Catherine (Cat) Truxillo

Director of Analytical Education, SAS

Catherine Truxillo, Ph.D. has written or co-written SAS training courses for advanced statistical methods, including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. She also teaches courses on leadership and communication in data science.

Comments

  1. Segun Peter Alade

    Good afternoon (Nigeria time).
    I am a research student at a Nigerian university. I came across a paper of yours titled “Comparison of Missing Data Handling Methods.” I cited the paper in my research, but it does not have a year of publication. Could you tell me the year it was published?

  2. Cat Truxillo

    Lots of good ideas here! Old and redundant data files are of little use. Scrimping on a few TB of disk space when data scientists need more space to work efficiently is a false economy. I dream of having my own little sandbox with a few Petabytes to play in, and four 50" monitors on the wall to look at it all! That would be living the dream.
    Thanks for reading!

  3. Gerhard Svolba

    Excellent discussion. One point that I would like to raise here (it also comes up every time I teach my Data Preparation class): we (data miners, statisticians, data scientists) are well known to be "hungry for data," and there are obvious reasons for that.
    However, we should also bear in mind that we stand in the spotlight if we are too demanding for data (keeping the detailed version of the data, requesting historic data and historic snapshots of data). With the help of senior management, we might push these data requests through and get the resources and effort from IT. But when we present the results of our models, we have to be prepared to answer whether that effort resulted in a more accurate model or better business value. I know it is hard, but always try to "forecast" the benefit of special data requirements.

  4. Great suggestions here, and I agree with Thomas: make sure there is no hoarding of useless duplicate data that was copied or migrated many years ago. If your data security, IT, and analyst teams agree that certain duplicate data has not been used at all in the last 3-5+ years and is very unlikely to be used in the future, then deleting those duplicate archives with no future value sounds like a smart move.

  5. I'm pretty sure that the picture of my boss in a Hello Kitty costume doing the Electric Slide is the very last thing I delete. And I print a copy for safekeeping. And email it to myself and you.

    Hopefully the file doesn't fill up your email storage on top of the Flotsam Drive.

  6. Some good points you raise, Catherine. It's the battle of data storage versus data and information value. Finding the balance with cooperative IT and business teams is key.

    Hope to see you in Sydney!

  7. Thomas Billings

    It is common for some files on a system to be redundant -- e.g., a large number of backups, or multiple versions of a file when only the latest is relevant. Some old test files may be useless (while others are important). So indeed, keep what is important, but there is usually a need for some cleanup.
