I'm hard-pressed to think of a trendier yet more amorphous term today than analytics. It seems that every organization wants to take advantage of analytics, but few are really doing so – at least to the extent possible. This topic interests me quite a bit, and I hope to explore it more in the fall, when I'll be teaching Enterprise Analytics as I start my career as a college professor.
But analytics (in all of its forms) is predicated on, you know, data. And, as we know from this blog, data is often not ready to be released into the wild. Many times, machines and people need to massage it first – often to a significant degree. Against that backdrop, in this post and its successor, I'll list some key data preparation questions to think about as they pertain to analytics.
Where is the data coming from?
Who generates this stuff anyway? Does it come exclusively from sources that an organization directly controls? Internal enterprise sources typically include CRM and ERP applications – think lists of employees, customers, products, users and the like. Is the data strictly controlled via a master data management (MDM) application, sound data governance and a responsible culture? If so, odds are that it will require less preparation than if anarchy prevails.
Of course, this is not always the case. The data may lie outside of an organization's control. What about a bit of both? Hybrid situations are becoming increasingly common.
As a general rule, the more data sources involved, the more data preparation is required for analytics.
How is the data generated?
Do human beings actively generate the data? Does a machine or sensor passively generate it? Does it arrive via an extract, transform and load (ETL) job? What about via an application programming interface (API)? How about a combination of different methods?
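To make that concrete, here's a minimal sketch of pulling the same kind of record from two very different channels – a machine-generated API feed and a human-maintained CSV export – and normalizing both before any analysis runs. The endpoint, file name and column names here are hypothetical; the point is simply that each channel tends to need its own preparation step.

```python
import csv
import json
from urllib.request import urlopen

# Hypothetical sources -- stand-ins for whatever an organization actually combines.
API_URL = "https://api.example.com/v1/customers"
CSV_EXPORT = "crm_export.csv"

def customers_from_api(url=API_URL):
    """Machine-generated feed: arrives as JSON, usually well structured."""
    with urlopen(url) as resp:
        return json.load(resp)

def customers_from_csv(path=CSV_EXPORT):
    """Human-maintained export: arrives as text, often needs more cleanup."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def combine(api_rows, csv_rows):
    """Map both feeds onto one schema before any analytics see them."""
    normalized = []
    for row in api_rows + csv_rows:
        normalized.append({
            "customer_id": str(row.get("customer_id") or row.get("id") or "").strip(),
            "email": (row.get("email") or "").strip().lower(),
        })
    return normalized
```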
As a general rule, the more people and/or data sources involved, the more data preparation is required for analytics.
How much data is generated?
Are we talking about a relatively small list of customers, employees or sales? For preparation purposes, a simple SQL statement or even a pivot table in Excel may well identify problematic records – or at least most of them. That's fine, but what about billions of transactions from log files? The same tools that work for cleansing even a medium-sized data set often don't work with much larger ones.
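At the small end of that spectrum, a single query really can do the job. Here's a minimal sketch using Python's built-in sqlite3 module; the table, columns and rows are invented for illustration. One SQL statement flags missing and duplicate values when everything fits comfortably in a single table.

```python
import sqlite3

# Toy customer table small enough that plain SQL surfaces the problem records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "ann@example.com", "US"),
        (2, None, "US"),                # missing email
        (3, "ann@example.com", "CA"),   # duplicate of row 1
        (4, "bob@example.com", "US"),   # clean row
    ],
)

# Flag rows whose email is missing or appears more than once.
problem_rows = conn.execute("""
    SELECT id, email, country
    FROM customers
    WHERE email IS NULL
       OR email IN (
           SELECT email FROM customers
           GROUP BY email
           HAVING COUNT(*) > 1
       )
""").fetchall()

print(problem_rows)  # rows 1-3 are flagged; row 4 passes
```

The same check against billions of log-file transactions would need a very different toolchain, which is exactly the point.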
As a general rule, the more people and/or data involved, the more data preparation is required for analytics.
Feedback
In the second post of this series, I'll address some additional considerations.
In the meantime, what say you?