In my previous post, I discussed how data has an expiration date, after which it should at least be archived, or possibly even deleted.
Rob Karel added an excellent comment about the very real risk associated with data retention, and why data retention policies are necessary to “balance the desire to archive and purge unused or expired data to reduce storage costs and business risk, with business and legal discovery retention management requirements. These policies must clarify what data needs to be stored for how long, in what format, applying what rules, with what level of masking or encryption, and with what access guidelines. Big data evangelists must contend with this balance.”
My thoughts about this challenge, especially within the context of big data, are being influenced by the book I’m reading, The Half-life of Facts: Why Everything We Know Has an Expiration Date by Samuel Arbesman, which explains that facts change all the time. For example, smoking has gone from doctor-recommended to deadly, and we used to think that the Earth was the center of the universe and that Pluto was a planet. In short, what we know about the world is constantly changing. (Although we don’t always get the memo — for lots of examples, check out the Wikipedia list of common misconceptions).
“Knowledge is like radioactivity,” Arbesman explained. “Just as we know that a chunk of uranium can break down in a measurable amount of time — a radioactive half-life — facts, in aggregate, also have half-lives: We can measure the amount of time for half of a subject’s knowledge to be overturned.”
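To make the analogy concrete, here is a minimal sketch of the standard radioactive-decay formula Arbesman borrows, N(t) = N0 · (1/2)^(t/T), applied to a body of knowledge. The function name and the 45-year figure are illustrative assumptions of mine, not values from the book or this post.

```python
def remaining_fraction(elapsed_years: float, half_life_years: float) -> float:
    """Fraction of a field's facts still considered valid after elapsed_years,
    assuming knowledge decays like a radioactive sample: N(t) = N0 * (1/2)**(t/T)."""
    return 0.5 ** (elapsed_years / half_life_years)

# With a hypothetical half-life of 45 years:
print(remaining_fraction(45, 45))   # after one half-life, half is overturned
print(remaining_fraction(90, 45))   # after two half-lives, a quarter remains
```

The point of the sketch is only that a half-life, if we could estimate one for a given body of data, would give us a principled decay curve rather than an arbitrary cutoff date.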
This made me wonder if it would be possible to establish an expiration date for data by measuring its half-life. Such a measurement would have to be more sophisticated than simply using when the data was created or how long it’s been since we used the data. After all, old data can remain useful, just like old facts can remain true (e.g., water molecules are made of two hydrogen atoms and one oxygen atom), and, as Rob pointed out, unused data might need to be retained for valid business reasons.
What do you think?
Can we measure the half-life of data? Please share your thoughts by posting a comment below.
Great perspective on the lifetime of data. As Rob noted in his comment, it is really about the "usefulness" of the data. The same way an old bottle of water can still be used to water the flowers, and radioactive decay can be used to generate random numbers, data’s expiry date depends on what it is used for. The documented expiry date, however, if we want to be deterministic, would be the moment at which the data is no longer relevant (useful) for its primary packaging purpose.
To measure the half-life of data, we need to categorize the different uses of the data, and identify the triggers that move its usefulness value from one level to the next. That then becomes our measurement scale for the life of the data. The half-life is the moment when the data transitions into a state where it can no longer serve its primary packaging purpose.
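Gabriel's scheme above could be sketched as a small state machine: a scale of usefulness levels, with named triggers recording each downward transition. All the level names and trigger strings below are my own hypothetical illustrations, not terms from the comment.

```python
from enum import Enum
from dataclasses import dataclass, field

class Usefulness(Enum):
    # An assumed measurement scale, ordered from most to least useful.
    PRIMARY = 3    # still serves its primary packaging purpose
    SECONDARY = 2  # reused for analytics, reporting, audit, etc.
    RETAINED = 1   # kept only to satisfy legal/compliance retention
    EXPIRED = 0    # candidate for archival or deletion

@dataclass
class DataRecord:
    name: str
    state: Usefulness = Usefulness.PRIMARY
    history: list = field(default_factory=list)

    def trigger(self, event: str, new_state: Usefulness) -> None:
        """Record the business trigger that moved this data down the scale."""
        self.history.append((event, self.state, new_state))
        self.state = new_state

record = DataRecord("customer_order_2012_0042")  # hypothetical record name
# The first transition out of PRIMARY is the "half-life" moment in Gabriel's terms.
record.trigger("order fulfilled and invoiced", Usefulness.SECONDARY)
record.trigger("retention window elapsed", Usefulness.RETAINED)
```

The design choice worth noting is that the half-life is not a fixed date but an event: the trigger log makes the expiry deterministic and auditable, which speaks to Rob's point about discovery and retention requirements.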
Thanks for your comment, Gabriel.
I agree with your well-thought-out way to measure the half-life of data that has a well-known primary purpose. Of course, complexities arise when the primary purpose is not well-defined, as well as when the same data can be put to additional purposes, each potentially carrying a different definition of the data’s half-life.
Of course, as with many data-related discussions, the general does not apply well to the specific, by which I mean that we need to discuss not data in general but specific data, much of which could benefit from the half-life measurement you propose.