For nearly two decades now, I have called myself a golfer – although duffer and hacker are usually more apropos terms. Sure, I've hit plenty of good shots and even a few excellent ones in my years. For the most part, though, I struggle to put a good swing on the ball.
In golf, most players consider the driver to be the toughest club in the bag, and I am no exception. It certainly is the least forgiving stick, while the pitching or sand wedge is the easiest. As a result, if you can't hit a wedge consistently well, then you really have no business taking out "the big dog."
I thought of this analogy as I was penning this post. What is the role of the data scientist (the driver in my golf analogy) in a similarly difficult game (data integration)?
Understanding the data scientist
To be sure, data scientists sport a special set of skills. (Insert Taken reference.) Poseurs aside, true data scientists don't simply work with – and on – existing tidy and perfect data sets. Unlike traditional statisticians, they often have to cull data from myriad sources on the web. That's why many of them are adept at R, Python, and other powerful technologies and applications. What's more, they constantly ask themselves important business questions. That is, they don't just respond to basic queries from line-of-business (LOB) employees. They go deeper, using data visualization tools to go wherever the data takes them.
Ostensibly magical powers aside, however, data scientists are just like you and me in one fundamental way: They have a much easier time doing their jobs when they can rely upon complete and accurate data sets. That is, they need not spend as much time scraping, parsing and de-duplicating data – or at least some data. (Of course, it's imperative to remember that there's plenty of data that lies outside of an organization's walls and way beyond any one individual's control.)
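To make the de-duplicating chore a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `name` and `email` fields, the sample rows, and the rule that two records are duplicates when they match after trimming whitespace and ignoring case. Real matching rules are usually messier than this.

```python
def normalize(record):
    """Build a comparison key: collapse whitespace in the name, lowercase both fields."""
    name = " ".join(record["name"].split()).lower()
    email = record["email"].strip().lower()
    return (name, email)


def deduplicate(records):
    """Keep the first record seen for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical rows, as if merged from two poorly integrated sources.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace ", "email": "ADA@example.com "},
    {"name": "Grace Hopper", "email": "grace@example.com"},
]

cleaned = deduplicate(customers)
print(len(cleaned))  # prints 2
```

Even a toy version like this has to decide what counts as "the same" record, which is exactly the kind of work that solid data integration spares a data scientist.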
Returning to the world of golf, allow me to state the obvious: Players score better when all facets of their games are working well. When you're standing on the tee, it's nearly impossible to forget that your putter and irons are consistently failing you. When one aspect of your game is hurting you, it increases the pressure on the rest of them. Scrambling gets exhausting.
The same holds true with data scientists. Yes, they can often at least partially overcome an organization's dearth of data integration. They may be able to use effective proxies for missing or incomplete data.
Simon says: Possible <> desirable
Are these scenarios possible? Sure, but there's a world of difference between possible and desirable. What's more, CXOs who believe that they can substitute data scientists for real data integration are as foolish as the duffer who consistently uses the wrong club.
Feedback
What say you?