Data Super Savers vs Data Science

“Dear Cat,
I got an email from my IT department that says:
[We are nearing capacity on the Flotsam Drive. Please clear data from any folders you are no longer using so we can save disk space.
Thanks,
The IT Department]

Doesn’t this strike you as a bit old-fashioned? I mean, isn’t disk space practically free now?
Signed,
DataLover”

Dear DataLover,

My first reaction to this is that yes! You’re right! Disk space is practically free. Why are we worried about storing some extra files? I am a bit of a data hoarder, though, so perhaps my views require some analysis.

Certainly, the message coming out of providers of software and services for the Hadoop ecosystem is that a good data science citizen keeps everything. Long gone are the days when we had to carefully scrub the data, roll the files up to something compact, and get rid of the excess to free up storage space. There might be untapped value in unstructured logs, transactional databases, and other “clutter” files.

So, when is it better to keep versus eliminate data? I have a few thoughts about this.

  • If you can gain more from using the data than you spend to keep the data, then by all means, keep the data. Sources might be surprising. Data scientists make billions of dollars for their companies annually by making data products out of log files and other data that has historically been considered garbage or exhaust.
  • If the data get no use, and are old enough that data products would not benefit from them, then it is best to delete. But I would ask, if the data are not used, should they be? There could be value there.
  • If there is historical information about your company’s performance that can be tied to specific initiatives, then keep the data. There is something to learn here. For example, if you can track marketing campaigns, staffing decisions, merger and acquisition information, and so on, then you can see which activities were followed by changes in revenue, customer reach, profit, market share, and the like. This is not causal information, but it can direct you to your next business experiment in a hurry.
  • If the data can place your organization at risk, then it is prudent to eliminate them. This is the case with personally identifiable information (PII), financial records that are no longer needed for audit trails, email records that may contain proprietary conversations with clients, and so on. In this case, there is more to lose from keeping the data than there is to gain from them.
  • And finally, if the data include pictures of your boss at the last company picnic wearing that Hello Kitty costume and dancing the electric slide, it’s probably best to just let it go. Nobody needs to see that.
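
For readers who like their rules of thumb in code, the list above can be sketched roughly as a small decision function. This is only an illustration: the `DatasetProfile` fields, the `retention_decision` name, and the order in which the checks run are all assumptions made for the example, and the value and cost figures would be your own estimates.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical summary of one folder or dataset; every field is an estimate."""
    name: str
    annual_value: float          # estimated yearly benefit from using the data
    annual_storage_cost: float   # estimated yearly cost of keeping the data
    poses_risk: bool             # e.g. PII or records past their audit window
    tied_to_initiatives: bool    # history that can be linked to business initiatives
    in_use: bool                 # someone currently reads or models the data
    too_old_to_help: bool        # old enough that data products would not benefit


def retention_decision(d: DatasetProfile) -> str:
    """Return 'delete', 'keep', or 'review' following the rules of thumb above."""
    # Assumed ordering: risk outweighs any potential value.
    if d.poses_risk:
        return "delete"
    # Keep when the data earn more than they cost, or document company history.
    if d.annual_value > d.annual_storage_cost or d.tied_to_initiatives:
        return "keep"
    # Unused and stale data are candidates for deletion.
    if not d.in_use and d.too_old_to_help:
        return "delete"
    # Everything else deserves the question: should someone be using this?
    return "review"


web_logs = DatasetProfile(
    name="web_logs",
    annual_value=50_000.0,
    annual_storage_cost=1_200.0,
    poses_risk=False,
    tied_to_initiatives=False,
    in_use=True,
    too_old_to_help=False,
)
print(retention_decision(web_logs))  # keep
```

The "review" outcome is there on purpose: data that nobody uses yet may still hold value, so the sketch flags such folders for a conversation rather than deleting them automatically.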

Our experiences are different, so I’d love to hear your thoughts about this in the comments below. And, if you would like to spend some quality time talking data hoarding with my colleagues and me, consider coming to one of our data scientist training courses. See you in class!

Strategies and Concepts for Data Scientists and Business Analysts

Data Science: Building Recommender Systems with SAS and Hadoop


About Author

Catherine Truxillo

Catherine Truxillo, Ph.D. has been a Statistical Training Specialist at SAS since 2000 and has written or co-written SAS training courses for advanced statistical methods including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. Although she primarily works with advanced statistics topics, she also teaches SAS courses using SAS/IML (the interactive matrix language), SAS Enterprise Guide, SAS Enterprise Miner, SAS Forecast Studio, and JMP software. Before coming to SAS, Catherine completed her Ph.D. in Social Psychology with an emphasis in Statistics at The University of Texas at Austin. While at UT Austin, she completed an internship with the Math and Computer Science department's statistical consulting help desk and taught a number of undergraduate courses. While teaching and performing her own graduate research, she worked for a software usability design company conducting experiments to assess the ease-of-use of various software interfaces and website designs. Cat's personal interests include triathlon, hiking the woods near her home in North Carolina, and having tea parties with her two children.

7 Comments

  1. Segun Peter Alade

    Good afternoon (Nigeria time)
    I am a research student at a Nigerian university. I came across a paper of yours titled "Comparison of Missing Data Handling Methods." I referenced the paper in my research, but it does not list a year of publication. Could you tell me the year it was published?
    Thanks

  2. Cat Truxillo

    Lots of good ideas here! Old and redundant data files are of little use. Scrimping on a few TB of disk space when data scientists need more space to work efficiently is a false economy. I dream of having my own little sandbox with a few Petabytes to play in, and four 50" monitors on the wall to look at it all! That would be living the dream.
    Thanks for reading!

  3. Gerhard Svolba

    Excellent discussion. One point that I would like to raise here (it also comes up every time I teach my Data Preparation class): we (data miners, statisticians, data scientists) are well known to be "hungry for data," and there are obvious reasons for that (http://blogs.sas.com/content/subconsciousmusings/2013/12/03/the-hungry-statistician-or-why-we-never-can-get-enough-data/).
    However, we should also bear in mind that we stand in the spotlight if we are too demanding of data (keep the detailed version of the data, request historic data and historic snapshots of data). With the help of senior management, we might push these data requests through and get the resources and effort from IT. But when we present the results of our models, we have to be prepared to answer the question of whether this effort resulted in a more accurate model or better business value. I know it is hard, but always try to "forecast" the benefit of special data requirements.

  4. Great suggestions here and I agree with Thomas to make sure there is no hoarding of useless duplicated data that was copied or migrated many years ago. If your Data Security team, IT and Analyst teams agree to discard duplicate data that has not been used at all for the last 3-5+ years, and is very unlikely to be used in the future -- then it sounds like a smart move to delete duplicate archives with no future value.

  5. Dan Kelly

    I'm pretty sure that the picture of my boss in a Hello Kitty costume doing the Electric Slide is the very last thing I delete. And I print a copy for safekeeping. And email it to myself and you.

    Hopefully the file doesn't fill up your email storage on top of the Flotsam Drive.

  6. Michelle Homes

    Some good points you raise, Catherine. It's the battle of data storage versus data and information value. Finding the balance with co-operative IT and business teams is key.

    Hope to see you in Sydney!

  7. Thomas Billings

    It is common for some files to be redundant on a system, e.g., a large number of backups or multiple versions of a file when only the latest is relevant. Some old test files may be useless (while others are important). So indeed keep what is important, but there is usually a need for some cleanup.
