John F. Elder IV, Ph.D., is President of Elder Research Inc. (ERI), a data mining consulting team. He has authored innovative data mining tools, is a frequent keynote speaker and was co-chair of the 2009 Knowledge Discovery and Data Mining conference in Paris. His courses on analysis techniques – taught at dozens of universities, companies and government labs – are noted for their clarity and effectiveness. John was honored to serve for five years on a US presidential panel to guide technology for national security. His book with Bob Nisbet and Gary Miner, Handbook of Statistical Analysis & Data Mining Applications, won the PROSE award for Mathematics in 2009. His book with Giovanni Seni, Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, was published in February 2010. We recently caught up with John to ask him a few questions as he prepares to teach his new course, Data Mining: Principles and Best Practices.
1. In your opinion, what have been the biggest advancements in data mining over the past 10 years?
Technically, the greatest accuracy is coming from ensembles of models. The Netflix Prize was a great example of this; the two top teams were actually both ensembles of people building ensembles of models. The idea was invented by non-statisticians (including me), but once it was seen to work so well, real statisticians (especially Jerry Friedman) broke it down and rebuilt it stronger than before. I helped Giovanni Seni write a clear, short book translating that work to where most scientists can use it.
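As a rough illustration of why combining predictions helps, here is a minimal Python sketch. It is not taken from the interview or from the book; the "models" are hypothetical stand-ins (the true signal plus each model's own error), and all names and numbers are assumptions chosen for the example. It simply averages the predictions of ten imperfect models and compares the squared error of the average with the typical error of a single model.

```python
import random


def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def make_noisy_model(rng, noise):
    """Hypothetical stand-in for a trained model: the true signal 2*x,
    plus a fixed per-model bias and fresh noise on every prediction."""
    bias = rng.gauss(0, noise)

    def predict(x):
        return 2.0 * x + bias + rng.gauss(0, noise)

    return predict


rng = random.Random(0)
xs = [i / 10 for i in range(100)]
targets = [2.0 * x for x in xs]

# Ten imperfect models, each wrong in its own way.
models = [make_noisy_model(rng, noise=1.0) for _ in range(10)]
all_preds = [[model(x) for x in xs] for model in models]

# Error of each model on its own, and of the simple average of all ten.
individual_mse = [mse(preds, targets) for preds in all_preds]
mean_individual_mse = sum(individual_mse) / len(individual_mse)
ensemble_preds = [sum(column) / len(column) for column in zip(*all_preds)]
ensemble_mse = mse(ensemble_preds, targets)

print(f"average single-model MSE: {mean_individual_mse:.3f}")
print(f"ensemble (averaged) MSE:  {ensemble_mse:.3f}")
```

For squared error, the averaged prediction can never be worse than the average of the individual errors, and it improves most when the models make different mistakes. Real ensemble methods such as bagging and boosting are considerably more refined than this plain average.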
On the practice side, it’s the continued improvement of software, like SAS Enterprise Miner, which automates more and more of the necessary but complex steps so analysts can focus on the hard part of the problem, including translating the squishy business task into crisp technical steps.
2. What predictions do you have for how data mining techniques and technologies will evolve in the next 10 years?
The big new thing is the emergence of two exciting new sources of data: text and links. Six of us just wrote a book about Practical Text Mining (a 1,000+ page monster). Elder Research has had a lot of success working with text, even with early tools, partly because there is “low-hanging fruit” whenever new data is explored. And the link information in social and other complex networks is fascinating. My colleague Andy Fast got written up in ESPN magazine after predicting which football teams would make the playoffs using just the coaching network (who worked for whom). In fact, he did better than all four professional ESPN analysts (who used little facts like players, record, etc.)! The power of looking at information in a way that others aren’t yet was also depicted well in the recent film “Moneyball.”
3. You've been a past keynote speaker at SAS' data mining conference (now titled The Analytics Conference Series) and soon you'll be teaching a new SAS Business Knowledge Series course titled Data Mining: Principles and Best Practices. What's your favorite part of presenting and teaching?
At SAS keynotes, it’s the rock music blaring as you bound on stage! For a few seconds, geeks are cool.
Seriously, the greatest thing about teaching is seeing the lightbulbs go off over folks’ heads. This technology can really help mankind. And I get to tell people about how to do it well!
4. What is your favorite industry to work with? Give us an example of a problem you've helped to solve using data mining in that industry.
We do a lot of work to help analysts catch bad actors – such as fraudsters, insider threats, even terrorists. My team finds it very satisfying to help our country stay safe, and help firms and government agencies save tens of millions of dollars (and sometimes, even lives).
5. Can you share a tip from your new course?
It uses SAS Enterprise Miner and our book on practical data mining, winner of the top award for a mathematics book in 2009 (maybe because it’s huge, color illustrated, and has the fewest equations of any analytic book!). We’ve decided to focus on teaching how to know the true quality of your model. It’s very hard, without training, to avoid overfit, and the great majority of work we see in the wild is not done as well as it could be. Most analysts think their models are much better than they really are, which can be devastating when the models are put to use and under-perform. Our course will show how to get the science right and have realizable gains. (We include code to use in those cases when the software streams don’t have all the functionality needed.) We conclude by reviewing the Top 10 Data Mining mistakes, so folks can recognize when they’re going down the wrong path and correct course efficiently.
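To make the point about overfit concrete, here is a minimal Python sketch. It is not the course's code and does not use SAS Enterprise Miner; the data are synthetic and every name and parameter is an assumption made for illustration. A model that simply memorizes its training data (a one-nearest-neighbor lookup) looks perfect when scored on that same data, and only a held-out sample reveals how well it really performs.

```python
import random

rng = random.Random(42)

# Synthetic data (an assumption for this sketch): the true rule is
# "label is 1 when x > 0.5", but 20% of the labels are flipped at random.
xs = [i / 400 for i in range(400)]  # 400 distinct inputs
rng.shuffle(xs)


def noisy_label(x):
    label = 1 if x > 0.5 else 0
    return 1 - label if rng.random() < 0.2 else label


data = [(x, noisy_label(x)) for x in xs]
train, holdout = data[:200], data[200:]


def predict_1nn(x, memory):
    """Memorizing model: copy the label of the closest training point."""
    return min(memory, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(rows, memory):
    hits = sum(predict_1nn(x, memory) == y for x, y in rows)
    return hits / len(rows)


# Scored on its own training data, every point is its own nearest
# neighbor, so the model appears flawless, noise and all.
train_acc = accuracy(train, train)

# Scored on data it has never seen, the memorized noise stops helping.
holdout_acc = accuracy(holdout, train)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {holdout_acc:.2f}")
```

The gap between the two numbers is the over-optimism described above. Holding data out (or using cross-validation or resampling) is one common way to estimate the quality a model will actually deliver in use.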
6. Final advice?
This is a great field. Learn to do it well, and your work will generate a huge return on the investment. This tends to make folks like to see you!