In my last post, I covered some of the first data preparation questions to ask when going down the analytics road. I was just getting started, though. There are plenty more things to consider in this vein.
What if the data is flawed or incomplete? What are the downsides?
Not all errors are created equal. Consider three mistakes emanating from errant data:
- Does an organization pay an established employee or vendor twice? That error can be easily fixed.
- Does a department hire a bunch of employees by accident because its attrition rate was incorrect? What about a university sending out more than 5,000 acceptance letters in error? These are certainly more embarrassing, but ultimately they are fixable.
- What if the consequences are far more severe? Pish posh, you say? In fact, data-based errors can be fatal.
As a general rule, the greater the risks and downsides, the more data preparation is required.
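To make the first bullet concrete, here's a minimal sketch of catching duplicate payments during data preparation. The records and field names (payee, amount, date) are hypothetical; a real ledger would need a more careful definition of "duplicate":

```python
from collections import Counter

def find_duplicate_payments(payments):
    """Flag payments that share the same payee, amount, and date.

    Each payment is a dict; the keys here are illustrative.
    """
    keys = [(p["payee"], p["amount"], p["date"]) for p in payments]
    counts = Counter(keys)
    return [p for p in payments
            if counts[(p["payee"], p["amount"], p["date"])] > 1]

payments = [
    {"payee": "Acme Corp", "amount": 500.00, "date": "2023-01-15"},
    {"payee": "Acme Corp", "amount": 500.00, "date": "2023-01-15"},  # duplicate
    {"payee": "Widget Co", "amount": 120.00, "date": "2023-01-16"},
]
dupes = find_duplicate_payments(payments)
print(len(dupes))  # 2 — both matching Acme Corp rows are flagged
```

Cheap checks like this are exactly why the first class of error is easy to fix: the data itself tells you something is wrong.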
How complex is the data?
Is the data very simple (read: structured)? Or is it more complex (read: semi-structured or completely unstructured)? The answers to these questions mean a great deal for data preparation.
As a general rule, more complex data requires more preparation than simpler, structured data.
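As a rough illustration of that gap, compare loading one row of a flat CSV file with flattening one nested JSON record. The field names below are made up; the point is how much more massaging the semi-structured record needs before it fits in a table:

```python
import csv
import io
import json

# Structured: a CSV row maps straight to columns. Nothing to prepare.
csv_text = "id,name,city\n1,Ana,Lisbon\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: nested JSON has to be flattened before analysis.
json_text = '{"id": 1, "name": "Ana", "address": {"city": "Lisbon", "zip": "1000"}}'

def flatten(record, parent=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, parent=full_key + "."))
        else:
            flat[full_key] = value
    return flat

flat_row = flatten(json.loads(json_text))
print(flat_row["address.city"])  # Lisbon
```

And this is the easy end of "complex" — free text and images demand far more work than one recursive helper.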
How old is the data? How frequently is it being updated?
The answers to these questions can vary tremendously. Does the data "arrive" on a periodic basis (read: daily, weekly or monthly)? Or is its genesis only mere seconds old, most likely born via streaming? Unfortunately, here it's tough to promulgate formal rules. I've seen both sides of this coin:
- Old, pristine data.
- Old, awful data.
- New, pristine data.
- New, awful data.
As for updates, I've seen them make data better and worse. Be prepared for both.
Who or what generates the data?
Even children know that people make mistakes, but what about computers? To be sure, no system is error-proof. Put differently, just because a machine generates data doesn't mean its output is necessarily accurate. Blindly trusting data is rarely a good idea.
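One cheap defense against misplaced trust is a battery of sanity checks run before any analysis. The sketch below assumes a sensor-style record; the field names and plausible ranges are my own examples, so tailor them to whatever the upstream system actually emits:

```python
def validate_reading(reading):
    """Return a list of problems found in one machine-generated record.

    The fields and the plausible temperature range are assumptions
    for illustration, not rules from any particular system.
    """
    problems = []
    if "timestamp" not in reading:
        problems.append("missing timestamp")
    temp = reading.get("temp_c")
    if temp is None:
        problems.append("missing temp_c")
    elif not -50 <= temp <= 60:
        problems.append(f"implausible temp_c: {temp}")
    return problems

good = validate_reading({"timestamp": "2023-05-01T12:00:00", "temp_c": 21.5})
bad = validate_reading({"temp_c": 999})
print(good)  # []
print(bad)   # ['missing timestamp', 'implausible temp_c: 999']
```

An empty result doesn't prove the data is right, of course — it only means the record passed the checks you thought to write.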
Retraining an error-prone employee may be simple or futile. Reprogramming a system to correctly generate and store data may require a ten-second coding change or a year-long effort by pricey consultants.
Is the data at risk of being unavailable?
Have you ever lost a key file or e-mail? How about having a spreadsheet corrupted? Ever had your laptop stolen?
As anyone who has experienced these things knows, data sources occasionally disappear. Events such as those aside, make no mistake: data available today may not remain so tomorrow – at least on a free basis. This is particularly true in the case of application programming interfaces (APIs). Case in point: Twitter closed its new search API to third-party apps. The move affected many developers and startups.
Brass tacks: data "at risk" may not exist to be prepared tomorrow. The absence of key data may make certain analytics, KPIs and other measures less meaningful or even moot.
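One mitigation for at-risk sources is to snapshot the data locally every time you pull it, so later analyses can fall back on the last good copy if the source vanishes or goes paid. In this sketch, `fetch_remote` is a hypothetical stand-in for a real API call:

```python
import json
import os
import tempfile

def load_with_snapshot(fetch_remote, snapshot_path):
    """Try the live source first; fall back to the last saved copy."""
    try:
        data = fetch_remote()
        with open(snapshot_path, "w") as f:
            json.dump(data, f)  # keep a local copy for the day the source vanishes
        return data
    except Exception:
        with open(snapshot_path) as f:  # source gone: use the snapshot
            return json.load(f)

path = os.path.join(tempfile.gettempdir(), "source_snapshot.json")
live = load_with_snapshot(lambda: {"mentions": 42}, path)  # succeeds, saves copy
offline = load_with_snapshot(lambda: 1 / 0, path)          # "API" fails, uses copy
print(live, offline)  # both: {'mentions': 42}
```

A snapshot won't keep a KPI current, but it does keep yesterday's numbers from evaporating along with the API.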
Keep these things in mind as you prepare data for analytic purposes.
What say you?