I attended the Predictive Analytics World conference in NYC this week and found Kaggle, a platform that hosts data prediction competitions, fascinating. It accepts the broadest range of data mining, forecasting and bioinformatics problems and conducts worldwide competitions that invite not only true data scientists but also electrical engineers, statisticians, really ANYONE who can solve the problem to participate. Certainly a productive use of crowdsourcing, this is a game for the smartest, most determined problem solvers in the world – set on a formalized, global stage.
Data prediction challenges have covered a variety of topics, such as improving HIV research, forecasting tourism, predicting beer sales, estimating which Wikipedia editors are most likely to resign and even answering “dark matter” questions that NASA has been pondering for decades. The prizes for solving the world’s complex problems range from as little as $500 to $3 million, the latter offered in the biggest data mining competition ever, sponsored by the Heritage Provider Network to help reduce the number of hospital visits in a given year.
“The award amount usually isn’t that large,” says Kaggle CEO Anthony Goldbloom, “yet we consistently receive hundreds of entries for each project.” When it comes to predictive analytics problems, passionate statistical modelers just want to come up with the best answer. The prize, it seems, is icing on the cake.
Who exactly wins the competitions? Who comes up with the best predictive score? Surprisingly, electrical engineers and physicists are the most successful. “Computer scientists and statisticians don’t do as well, perhaps because they are so tied to specific algorithms in the field,” Goldbloom says.
You might also be surprised to learn that a PhD student in glaciology won the NASA dark matter challenge. “The contest allows people to look at a problem that they wouldn’t otherwise know existed,” says Goldbloom. So it’s not only a win for NASA, which now has the “magic” formula for its dark matter work, but also a mega opportunity for the student: visibility, job prospects, not to mention the satisfaction of beating his peers. This particular contest, according to Goldbloom, was fierce. And that’s not unusual.
Participants have visibility into levels of scores from other contenders. So when one outperforms another, the former leader is motivated to keep working on the model to jump ahead. Then the other participant works harder to regain the lead, and so on and so forth. Goldbloom sees this leapfrogging effect quite often. A competitive dynamic kicks in and the drive to win, to reach the best possible solution, is sky high.
The solutions come in quickly, too. The dark matter problem was solved within a week. Once Kaggle made data available for the tourism forecasting competition, forecasting errors dropped dramatically in just two weeks. Within a month, however, the results had leveled off. “The results start to level out after a point because participants have squeezed the most out of the given data for the best models,” explains Goldbloom.
He talked about how data scientists, be they students or practitioners, are hungry for real-world datasets like this and enjoy the challenge. Meanwhile, according to a McKinsey report published earlier this year, companies face a shortage of analytical talent. It doesn’t add up. But Kaggle is providing a forum to unite the two groups, making data science a sport.