With more and more data available these days, and computers that can analyze that data, it's becoming feasible to look for fraud in events such as the Boston Marathon. So put on your detective hat, and follow along as I show you how to use SAS to be a data sleuth!
But before we get started, I wanted to share a picture of my old college buddy Jenny - she actually ran in the Boston Marathon yesterday. She's quite the runner, and I'm really proud of her (and a bit jealous that she's in so much better shape than I am!)
With the Boston Marathon in the news, I couldn't help but look around for some examples showing what people had done with the data. I found a *very* interesting article about detecting people who might have cheated when qualifying for the Boston Marathon. Derek Murphy is one of the data sleuths who is passionately analyzing the data, and one of the metrics he uses flags the runners who ran the marathon at least 20 minutes slower than their qualifying time. Here's a graph he created:
I think it's a pretty neat graph, and I really like it ... except that the bottom axis is a bit busy/confusing. So, of course, I decided to create my own version!
I did a bit of searching, and found that Bill Mill (llimllib) had set up a GitHub page with some past Boston Marathon data he had scraped from the Boston Athletic Association website. His data collection didn't contain the latest data (it only went up to 2014), but I decided it would be close enough for my purposes. I downloaded the data, imported it into SAS, and created the following plot. Note that the bib numbers in the Boston Marathon indicate runners’ qualifying times - lower numbers mean lower qualifying times, and faster runners.
I simplified the axes a bit, used transparent circular markers rather than solid dots, and included all the data, rather than limiting it to just the competitive runners (I think the last ~1/4 of the runners are more of the fundraisers, rather than competitive runners?):
Murphy's graph highlighted the runners who ran the marathon at least 20 minutes slower than their qualifying time, but the data I was using didn't include the qualifying times, so I had to find a different metric to compare the times against. After a bit of head-scratching, I decided to divide the runners into groups (or packs) of 200, and calculate the average finish time of each group. I then plotted that average as a red line on the graph (I might have gotten a smoother line by using a "moving average", but I decided to stick with simple for now).
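For readers who want to try the pack idea outside of SAS, here is a minimal sketch in Python. It is not the code behind my graph. It assumes the runners are sorted by bib number and cut into consecutive packs, and the `(bib, finish_seconds)` layout and the function name `pack_averages` are made up for illustration (they are not the columns of the scraped data).

```python
from statistics import mean

PACK_SIZE = 200  # pack size mentioned in the post


def pack_averages(runners, pack_size=PACK_SIZE):
    """Return one (first_bib, last_bib, mean_seconds) tuple per pack.

    `runners` is a list of (bib, finish_seconds) pairs. This layout is an
    assumption for illustration, not the schema of the scraped data.
    """
    ordered = sorted(runners)  # sort by bib number (assumed grouping order)
    packs = []
    for start in range(0, len(ordered), pack_size):
        pack = ordered[start:start + pack_size]  # last pack may be smaller
        avg = mean(seconds for _, seconds in pack)
        packs.append((pack[0][0], pack[-1][0], avg))
    return packs


# Tiny made-up example, using packs of 2 instead of 200
sample = [(1, 7800), (2, 8000), (3, 9000), (4, 9400), (5, 10000)]
for first_bib, last_bib, avg in pack_averages(sample, pack_size=2):
    print(first_bib, last_bib, avg)
```

Plotting each pack's average against its bib range gives a stepped line like the red one; a moving average would smooth it, at the cost of a little more code.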
I then identified all the times that were 20% above (or below) the red average line, and put a red 'x' through those circular blue markers. I also added HTML hover-text so you can see the name & time for those runners, and if you click on them it will launch a Google search. (You have to first click the image below, to see the interactive graph.)
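The flagging rule can be sketched the same way. Again, this is a hedged Python illustration rather than my SAS code: the function name `flag_outliers`, the `(bib, finish_seconds)` layout, and the choice to treat "20% above or below" as "more than 20% away from the pack average" are all assumptions.

```python
from statistics import mean


def flag_outliers(runners, pack_size=200, tolerance=0.20):
    """Return the bibs whose finish time differs from their pack's mean
    finish time by more than `tolerance` (expressed as a fraction).

    `runners` is a list of (bib, finish_seconds) pairs, an assumed layout.
    """
    ordered = sorted(runners)  # consecutive packs by bib number (assumption)
    flagged = []
    for start in range(0, len(ordered), pack_size):
        pack = ordered[start:start + pack_size]
        avg = mean(seconds for _, seconds in pack)
        for bib, seconds in pack:
            # Flag both directions: much slower or much faster than the pack
            if abs(seconds - avg) > tolerance * avg:
                flagged.append(bib)
    return flagged


# Made-up times: bib 4 is far slower than the rest of its pack
sample = [(1, 9000), (2, 9100), (3, 8900), (4, 13000), (5, 9000)]
print(flag_outliers(sample, pack_size=5))
```

One thing to keep in mind with this simple approach is that an extreme time also pulls its own pack's average toward itself, which matters more in small packs than in packs of 200.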
Note that just because the markers are red doesn't mean these people necessarily cheated! If they ran the marathon 20% slower than the average for their pack (my stand-in for qualifying time), they might have been dealing with sickness, injury, or lack of sleep. Or if they ran it 20% faster than that average, perhaps they had improved that much by hard practice. But it does perhaps warrant a little extra scrutiny, just to make sure everything is copacetic.
Now it's your turn - what other kinds of fraud analytics would you like to run against marathon data? Or what other kinds of data could these marathon analytics be applied to? Feel free to leave your ideas and suggestions in the comments section!