Data prep considerations for analytics, Part 2


In my last post, I covered some of the first data preparation questions to ask when going down the analytics road. I was just getting started, though. There are plenty more things to consider in this vein.

What if the data is flawed or incomplete? What are the downsides?

Not all errors are created equal. Consider three mistakes emanating from errant data:

  • Does an organization pay an established employee or vendor twice? That error can be easily fixed.
  • Does a department hire a bunch of employees by accident because its attrition rate was incorrect? What about a university sending out more than 5,000 acceptance letters in error? These are certainly more embarrassing, but ultimately they are fixable.
  • What if the consequences are far more severe? Pish posh, you say? In fact, data-based errors can be fatal.

As a general rule, the greater the risks and downsides, the more data preparation is required.
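As a small illustration of catching the first, easily fixable class of error: a duplicate-payment check can be as simple as counting repeated (vendor, invoice) pairs. This is a minimal sketch with made-up records; the field names and data are hypothetical.

```python
from collections import Counter

# Hypothetical payment records: (vendor_id, invoice_number, amount).
# A repeated (vendor, invoice) pair is a likely duplicate payment.
payments = [
    ("V-101", "INV-2001", 1250.00),
    ("V-102", "INV-2002", 480.50),
    ("V-101", "INV-2001", 1250.00),  # same invoice paid twice
]

counts = Counter((vendor, invoice) for vendor, invoice, _ in payments)
duplicates = [key for key, n in counts.items() if n > 1]

print(duplicates)  # [('V-101', 'INV-2001')]
```

Checks like this are cheap insurance; the higher-stakes errors in the list above justify far more elaborate validation.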

How complex is the data?

Is the data very simple (read: structured)? Or is it more complex (read: semi-structured or completely unstructured)? The answers to these questions mean a great deal for data preparation.

As a general rule, more complex data requires more preparation than simpler, structured data.
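To see why, consider the same fact stored two ways. This is a minimal sketch with a hypothetical customer record: the structured CSV row maps straight onto known columns, while the semi-structured JSON requires defensive digging through optional, nested keys before it is usable.

```python
import csv
import io
import json

# Structured: a CSV row maps directly onto known columns.
structured = io.StringIO("customer_id,city\n42,Phoenix\n")
row = next(csv.DictReader(structured))

# Semi-structured: the same fact buried in nested, optional JSON keys.
semi = json.loads('{"customer": {"id": 42, "address": {"city": "Phoenix"}}}')
city = semi.get("customer", {}).get("address", {}).get("city")

print(row["city"], city)  # both "Phoenix", but the second took more work
```

Multiply that extra effort across millions of records and dozens of feeds, and the prep burden of complex data becomes obvious.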

How old is the data? How frequently is it being updated?

The answers to these questions can vary tremendously. Does the data "arrive" on a periodic basis (read: daily, weekly, or monthly)? Or is it only mere seconds old, most likely born via streaming? Unfortunately, it's tough to promulgate formal rules here. I've seen both sides of this coin:

  • Old, pristine data.
  • Old, awful data.
  • New, pristine data.
  • New, awful data.

As for updates, I've seen them make data better and worse. Be prepared for both.
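Even without formal rules, it's worth knowing which feeds have gone quiet. Here's a minimal staleness check with hypothetical feed names, timestamps, and a seven-day threshold chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical last-updated timestamps; flag anything older than the window.
now = datetime(2020, 3, 1, tzinfo=timezone.utc)
feeds = {
    "daily_feed": datetime(2020, 2, 29, tzinfo=timezone.utc),
    "legacy_extract": datetime(2019, 6, 1, tzinfo=timezone.utc),
}
max_age = timedelta(days=7)

stale = [name for name, ts in feeds.items() if now - ts > max_age]
print(stale)  # ['legacy_extract']
```

A stale feed isn't automatically bad data, but it's a prompt to ask whether updates stopped by design or by accident.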

Who or what generates the data?

Even children know that people make mistakes, but what about computers? To be sure, no system is error-proof. Put differently, just because a machine generates data doesn't mean its output is necessarily accurate. Blindly trusting data is rarely a good idea.

Retraining an error-prone employee may be simple or futile. Reprogramming a system to correctly generate and store data may require a ten-second coding change or a year-long effort by pricey consultants.

Is the data at risk of being unavailable?

Have you ever lost a key file or e-mail? How about having a spreadsheet corrupted? Ever had your laptop stolen?

As anyone who has experienced these things knows, data sources occasionally disappear. Events such as those aside, make no mistake: data available today may not remain so tomorrow – at least on a free basis. This is particularly true in the case of application programming interfaces (APIs). Case in point: Twitter closed its new search API to third-party apps. The move affected many developers and startups.

Brass tacks: "at risk" data may not be around tomorrow to prepare. The absence of key data may make certain analytics, KPIs and other measures less meaningful or even moot.

Simon Says

Keep these things in mind as you prepare data for analytic purposes.


What say you?



About Author

Phil Simon

Author, Speaker, and Professor

Phil Simon is a keynote speaker and recognized technology expert. He is the award-winning author of eight management books, most recently Analytics: The Agile Way. His ninth will be Slack For Dummies (April 2020, Wiley). He consults organizations on matters related to strategy, data, analytics, and technology. His contributions have appeared in The Harvard Business Review, CNN, Wired, The New York Times, and many other sites. He teaches information systems and analytics at Arizona State University's W. P. Carey School of Business.
