It's hard to get away from data these days, especially Big Data. The news is full of stories about how fast it's accumulating, about technologies for capturing and analyzing it, and about the creative ways organizations are using it. Pundits have even dubbed personal data the “new oil” that will fuel innovation and economic growth for the future.
Stephanie Thompson’s presentation at the SouthEastern SAS Users Group reminded me how the principles of basic science are fundamental to unlocking all of this potential. The success of any scientific enquiry depends on asking the right questions and acquiring the right data to provide an answer.
Thompson, whose data-mining and analytics career spans a variety of applications and industries, used a gold-mining metaphor to show how failing to recognize the importance of these principles can cause data-mining projects to go astray. Drawing on her personal experience, she observed that human oversight, rather than technical difficulty, is more often the cause of disappointing results. Too often, these oversights involve the data:
- Not collecting the right data or enough of it.
- Not ensuring the data are of sufficient quality and in the right format for the statistical tools you plan to use.
- Not confirming the data are giving correct or actionable results.
Whether you're looking for gold (to use Thompson's metaphor) or hoping to strike oil - whether it's Big Data or not-so-big data - success depends on asking the right questions. According to Thompson, these are the "big things you need to do before you go off to use Enterprise Miner":
Look at the map: Define the problem. What is the general problem and what do you want to mine? Are the data available to do what you want? Do you need all of the data available to you?
Check the pan: Run preliminary tests. How do you know if the data are going to tell you something? Will they give you the information you need? Are the data any good?
Check and recheck the map: Narrow the scope. What do you want to know? Has the problem been stated in the right context? Is it specific enough? Are you even asking the right question? Is the answer something that can be quantified?
Load the mule: Integrate data sources. Have you assembled all the data you need? Are they relevant? Is it possible to access the data and then pull them into a place where they can be mined?
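One way to make the "have you assembled all the data you need?" question concrete is to check how well the keys line up before merging sources. The sketch below is only an illustration in Python, not something taken from Thompson's paper, which is written for SAS users; the `orders` and `customers` records, the `customer_id` key, and the `key_coverage` helper are all hypothetical.

```python
# Hypothetical extracts from two sources that should share a customer key.
orders = [
    {"customer_id": "101", "amount": 25.0},
    {"customer_id": "102", "amount": 40.0},
    {"customer_id": "105", "amount": 12.5},
]
customers = [
    {"customer_id": "101", "region": "East"},
    {"customer_id": "102", "region": "West"},
    {"customer_id": "103", "region": "South"},
]


def key_coverage(left, right, key):
    """Report which key values appear in both sources and which appear in only one."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return {
        "matched": sorted(left_keys & right_keys),
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
    }


coverage = key_coverage(orders, customers, "customer_id")
print(coverage)
```

Keys that show up in only one source are worth investigating before any mining starts, since they usually mean either a missing extract or records that will silently drop out of a join.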
Setting up the sluice: Prepare the data. Did you inspect the data first? Are the variables defined correctly? Are data values, such as missing values or dates, handled appropriately? Are your observations arranged in a meaningful way for mining? Can you interpret the results of your mining? Are they reasonable?
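The inspection questions above can be turned into a quick check that runs before any modeling. The following is a minimal Python sketch, not Thompson's own method; the sample rows, the field names, the set of missing-value markers, and the expected date format are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; field names and values are illustrative only.
rows = [
    {"customer_id": "101", "signup_date": "2012-03-15", "income": "52000"},
    {"customer_id": "102", "signup_date": "", "income": "NA"},
    {"customer_id": "103", "signup_date": "15/03/2012", "income": "61000"},
    {"customer_id": "104", "signup_date": "2012-07-01", "income": "."},
]

# Assumed conventions for how missing values are encoded in the raw data.
MISSING_MARKERS = {"", "NA", "."}


def is_missing(value):
    """Treat None and any assumed marker string as a missing value."""
    return value is None or value.strip() in MISSING_MARKERS


def count_missing(rows):
    """Count missing values per field; fields with none are omitted."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if is_missing(value):
                counts[field] += 1
    return dict(counts)


def find_bad_dates(rows, field, fmt="%Y-%m-%d"):
    """Return non-missing values that do not parse with the expected format."""
    bad = []
    for row in rows:
        value = row[field]
        if is_missing(value):
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


missing_by_field = count_missing(rows)
bad_dates = find_bad_dates(rows, "signup_date")
print(missing_by_field)
print(bad_dates)
```

A report like this does not fix anything by itself, but it shows where missing values and inconsistent date formats need a decision before the data go into a mining tool.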
Calling it a day: Confirm your results. What if you don't get an answer? Is it really an actionable result? The answer to your question may not be in this particular set of data, or the data may point you in an inappropriate direction.
Read Thompson’s entire paper: “Where Should I Dig? What to do Before Mining Your Data” (SD-10).