It is said that everything is big in Texas, and that includes big data. During my recent trip to Austin I had the privilege of being a judge in the final round of the Texata Big Data World Championship, a fantastic example of big data competitions.
It felt fitting that I arrived on the day of a much-anticipated University of Texas Longhorns game and witnessed the city awash with college students proudly wearing burnt-orange shirts. Their enthusiasm notwithstanding, my personal sport of choice is not really football but rather big data competitions! And I saw plenty of competitiveness in this particular venue.
Texata is quite new, this being only its second year, but it has already generated significant buzz among big data competitions. Competitors come from all over the world and tend to be seasoned professionals or graduate students from prestigious university programs. By the time they reach the finals, they have already undergone two intense rounds of question answering, data exploration and coding. In other competitions, teams have months to play with a dataset (often preprocessed and curated) and are ranked on very specific quantitative criteria. In the Texata finals, by contrast, each individual participant is given a real dataset on the spot and only four hours to work with it and extract some sort of meaningful value. This year’s dataset was a collection of customer support tickets and chat logs from Cisco, in whose facility the finals took place. This closely resembles the real world of a data scientist: messy unstructured data, open problem definitions, and a running clock.
Not having a leaderboard, as some other big data competitions do, means that the judges must evaluate the candidates and pick the winner. This was a tough choice, given twelve very talented people, many of whom had traveled from other continents, and all of whom put in a great deal of effort and gave their very best. All of us on the judging panel took the responsibility very seriously. At the same time, it was sheer fun to see how each candidate took their own approach, reaching into a large toolbox that included latent semantic analysis, clustering, multi-dimensional scaling, and graph algorithms. Some contestants focused on categorizing, others on visualizing, and yet others on inferring causal relations. Every single solution yielded some unique and valuable insight into the data.
At the end of the day, the winner was Kristin Nguyen from Vietnam. Her analysis struck the best balance of technically sound code, variety of techniques and clarity of presentation. Plus, she had already placed second in 2014, so this was no fluke. Well deserved, Kristin!
As an added treat, on the following day I got to speak at the companion Texata Summit event. That gave me the chance to show off some exciting examples of using SAS in sports analytics, such as season ticket customer retention in football (both the American and European versions of it!). Also baseball – remember the 2011 movie Moneyball? Scouting continues to be a major application of analytics, allowing small teams to punch well above their weight. Many other sports use analytics, from basketball to Olympic rowing.
Perhaps most exciting of all, there are novel frontier areas identified in the comprehensive report “Analytics in Sports: The New Science of Winning” by Thomas Davenport. For instance, image and video data can be used for crowd management in a stadium, or to track players on the field. In other cases, athlete performance monitoring is of interest. This allowed me to slightly lift the veil on new R&D work related to images, video and wearable sensors.
Thanks to SAS’s ongoing collaboration with Prof. Edgar Lobaton and Prof. Alper Bozkurt of North Carolina State University, involving multiple groups within the Advanced Analytics division of R&D at SAS, I am now aware that golf is actually a rather stressful sport! EKG activity shows that a golfer's heart rate climbs sharply in the moments before a swing. Also, while wearables and the Internet of Things are hot topics right now, we should all keep an eye on Profs. Lobaton and Bozkurt’s other work - I like to call it the Internet of Bugs.
As featured in New Scientist and Popular Science, these cyborg-augmented hissing cockroaches can be instrumental in search and rescue operations. Responders can steer them by applying electrical impulses to their antennae and locate potential survivors in rubble via directional microphones and positioning sensors. SAS has very strong tools that are uniquely suited for processing and analyzing this type of streaming data. For example, SAS® Event Stream Processing can acquire real-time sensor signals, while SAS® Forecast Server and SAS® Enterprise Miner™ can perform signal filtering, detect cycles and spikes, and analyze the aggregate position coordinates of the insects, say to map a structure and find locations of interest.
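To make "signal filtering" and "spike detection" a little more concrete, here is a minimal sketch in plain Python. It is not SAS code and does not reflect how the SAS products above work internally. It assumes a simple approach (a rolling median as the filter, and residuals scaled by their median absolute deviation to flag spikes). The function names, window size, threshold and sample readings are all hypothetical, chosen only for illustration.

```python
from statistics import median


def rolling_median(signal, window=5):
    """Smooth a signal with a centered rolling median (window shrinks at the edges)."""
    half = window // 2
    return [
        median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def find_spikes(signal, window=5, threshold=5.0):
    """Return indices whose distance from the rolling median exceeds
    `threshold` times the median absolute residual."""
    smooth = rolling_median(signal, window)
    residuals = [abs(x - s) for x, s in zip(signal, smooth)]
    # Fall back to a tiny scale so a perfectly flat signal does not divide the
    # world into "zero" and "spike"; a real pipeline would pick this with care.
    scale = median(residuals) or 1e-9
    return [i for i, r in enumerate(residuals) if r > threshold * scale]


# Hypothetical sensor readings: a slow upward drift with two injected spikes.
readings = [10.0, 10.1, 10.2, 10.1, 10.3, 25.0, 10.4,
            10.5, 10.4, 10.6, 10.5, 2.0, 10.7, 10.8]
spikes = find_spikes(readings)
print(spikes)
```

On this made-up series the two injected outliers (the 25.0 and the 2.0) are the only readings flagged. A median-based filter is used here because a single extreme reading barely moves it, whereas a rolling mean would be dragged toward the spike it is supposed to expose.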
After the presentation, #Texata and @SASSoftware Twitter traffic contained multiple variations of the words “weird” and “awesome” - which is a fitting description of data science itself. Truly, you never know where your data will come from!