I got an email from my IT department that says:
[We are nearing capacity on the Flotsam Drive. Please clear data from any folders you are no longer using so we can save disk space.
The IT Department]
Doesn’t this strike you as a bit old-fashioned? I mean, isn’t disk space practically free now?
My first reaction is: yes, you’re right! Disk space is practically free, so why are we worried about storing some extra files? I am a bit of a data hoarder, though, so perhaps my views require some analysis.
Certainly, the message coming out of providers of software and services for the Hadoop ecosystem is that a good data science citizen keeps everything. Long gone are the days when we had to carefully scrub the data, roll the files up to something compact, and get rid of the excess to free up storage space. There might be untapped value in unstructured logs, transactional databases, and other “clutter” files.
So, when is it better to keep versus eliminate data? I have a few thoughts about this.
- If you can gain more from using the data than you spend to keep them, then by all means, keep the data. The sources of value might be surprising. Data scientists make billions of dollars for their companies annually by making data products out of log files and other data that have historically been considered garbage or exhaust.
- If the data get no use, and are old enough that data products would not benefit from them, then it is best to delete them. But I would ask: if the data are not used, should they be? There could be value there.
- If there is historical information about your company’s performance that can be tied to specific initiatives, then keep the data. There is something to learn here. As an example, if you can track marketing campaigns, staffing decisions, merger and acquisition information, and so on, then you can see which activities were followed by changes in revenue, customer reach, profit, market share, etc. This is not causal information, but it can direct you to your next business experiment in a hurry.
- If the data can place your organization at risk, then it is prudent to eliminate them. This is the case with personally identifiable information (PII), financial records that are no longer needed for audit trails, email records that may contain proprietary conversations with clients, and so on. In this case, there is more to lose from keeping the data than there is to gain from them.
- And finally, if the data include pictures of your boss at the last company picnic wearing that Hello Kitty costume and dancing the electric slide, it’s probably best to just let it go. Nobody needs to see that.
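For the more serious rules above, here is a rough sketch of how the keep-or-delete decision might look in Python. Everything in it is my own illustration: the `Dataset` fields, the `should_keep` function, and the order in which the rules are checked (risk first, then historical value, then cost versus benefit) are assumptions, and the dollar figures would have to come from your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a folder or data set under review."""
    name: str
    annual_value: float          # estimated yearly gain from using the data
    annual_storage_cost: float   # estimated yearly cost of keeping the data
    poses_risk: bool = False     # e.g. PII, or records past their audit window
    tied_to_initiatives: bool = False  # performance history linked to initiatives


def should_keep(ds: Dataset) -> bool:
    """Return True to keep the data, False to delete it.

    The precedence of the checks is an assumption, not a rule from the post.
    """
    if ds.poses_risk:
        # More to lose from keeping the data than there is to gain.
        return False
    if ds.tied_to_initiatives:
        # Historical performance data can point to the next experiment.
        return True
    # Otherwise, keep only if the gain exceeds the cost of storage.
    return ds.annual_value > ds.annual_storage_cost


if __name__ == "__main__":
    for ds in [
        Dataset("web_logs", annual_value=1000.0, annual_storage_cost=50.0),
        Dataset("old_scratch", annual_value=0.0, annual_storage_cost=50.0),
        Dataset("client_emails", 1000.0, 50.0, poses_risk=True),
    ]:
        print(ds.name, "keep" if should_keep(ds) else "delete")
```

The hard part, of course, is estimating `annual_value` for data nobody has used yet, which is exactly where the hoarder in me starts arguing.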
Our experiences are different, so I’d love to hear your thoughts about this in the comments below. And, if you would like to spend some quality time talking data hoarding with my colleagues and me, consider coming to one of our data scientist training courses. See you in class!