Looking for cheaters in the Boston Marathon data

With more and more data available these days, and computers that can analyze that data, it's becoming feasible to look for fraud in events such as the Boston Marathon. So put on your detective hat, and follow along as I show you how to use SAS to be a data sleuth!

But before we get started, I wanted to share a picture of my old college buddy Jenny - she actually ran in the Boston Marathon yesterday. She's quite the runner, and I'm really proud of her (and a bit jealous that she's in so much better shape than I am!)


With the Boston Marathon in the news, I couldn't help but look around for some examples showing what people had done with the data. I found a *very* interesting article about detecting people who might have cheated when qualifying for the Boston Marathon. Derek Murphy is one of the data sleuths who is passionately analyzing the data, and one of the metrics he uses is to identify the runners who ran the marathon at least 20 minutes slower than their qualifying time. Here's a graph he created:


I think it's a pretty neat graph, and I really like it ... except that the bottom axis is a bit busy/confusing. So, of course, I decided to create my own version!

I did a bit of searching, and found that Bill Mill (llimllib) had set up a GitHub page with some past Boston Marathon data he had scraped from the Boston Athletic Association website. His data collection didn't contain the latest data (it only went up to 2014), but I decided it would be close enough for my purposes. I downloaded the data, imported it into SAS, and created the following plot. Note that the bib numbers in the Boston Marathon indicate runners’ qualifying times - lower numbers mean lower qualifying times, and faster runners.

I simplified the axes a bit, used transparent circular markers rather than solid dots, and included all the data, rather than limiting it to just the competitive runners (I think the last ~1/4 of the runners are more of the fundraisers, rather than competitive runners?):


Murphy's graph highlighted the runners who ran the marathon at least 20 minutes slower than their qualifying time -- but the data I was using didn't include the qualifying times, so I had to find a different metric to compare the times against. After a bit of head-scratching, I decided to divide the runners (in bib-number order) into groups (or packs) of 200, and calculate the average speed of each group. I then plotted that average speed as a red line on the graph (I might have gotten a smoother line by using a "moving average", but I decided to stick with simple for now).
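
For readers who want to try the pack idea outside of SAS, here is a minimal Python sketch (not the SAS code used for the graph). It assumes the data is a list of (bib number, finish time in minutes) pairs and that packs are formed in bib order; the function name and the use of finish time as the "speed" measure are my own assumptions.

```python
from statistics import mean


def pack_averages(times_by_bib, pack_size=200):
    """Return the average finish time of each consecutive pack of runners.

    times_by_bib: iterable of (bib_number, finish_minutes) pairs.
    Runners are sorted by bib number, then split into packs of pack_size
    (the last pack may be smaller).
    """
    ordered = sorted(times_by_bib)
    averages = []
    for start in range(0, len(ordered), pack_size):
        pack = ordered[start:start + pack_size]
        averages.append(mean(minutes for _, minutes in pack))
    return averages


# Tiny made-up example with packs of 2 instead of 200
sample = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0)]
print(pack_averages(sample, pack_size=2))
```

Each value in the returned list would be one step of the red line; a moving average would smooth those steps out, at the cost of a little more code.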


I then identified all the times that were 20% above (or below) the red average line, and put a red 'x' through those circular blue markers. I also added html hover-text so you can see the name & time for those runners, and if you click on them it will launch a Google search. (You have to first click the image below, to see the interactive graph.)
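
The flagging rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions (data as (bib number, finish minutes) pairs, packs of 200 in bib order, and "20% above or below" meaning more than 20% away from the pack average); it is not the SAS code behind the graph.

```python
def flag_outliers(times_by_bib, pack_size=200, tolerance=0.20):
    """Return the (bib, minutes) pairs that are more than `tolerance`
    (as a fraction) above or below the average time of their pack."""
    ordered = sorted(times_by_bib)
    flagged = []
    for start in range(0, len(ordered), pack_size):
        pack = ordered[start:start + pack_size]
        average = sum(minutes for _, minutes in pack) / len(pack)
        for bib, minutes in pack:
            if abs(minutes - average) > tolerance * average:
                flagged.append((bib, minutes))
    return flagged


# Made-up pack of 5 runners: four finish in 100 minutes, one in 200
sample = [(1, 100.0), (2, 100.0), (3, 100.0), (4, 100.0), (5, 200.0)]
print(flag_outliers(sample, pack_size=5))
```

Note that an extreme time pulls its own pack's average toward itself, so with very small packs the rule gets less sensitive; with packs of 200 the effect is minor.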


Note that just because the markers are red doesn't mean these people necessarily cheated! If they ran the marathon 20% slower than the runners around them (my stand-in for their qualifying time), they might have been dealing with sickness, injury, or lack of sleep. Or if they ran it 20% faster, perhaps they had improved that much through hard training. But it does perhaps warrant a little extra scrutiny, just to make sure everything is copacetic.

Now it's your turn - what other kinds of fraud analytics would you like to run against marathon data?  Or what other kinds of data could these marathon analytics be applied to? Feel free to leave your ideas and suggestions in the comments section!



Maximize your conference experience by getting SAS certified

SAS users are always looking for ways to optimize, maximize, and prioritize just about everything.  And that includes the precious commodity of time away from the office, even for users at a premier event like SAS Global Forum.  Sure, attendees get to learn from and share with the best and brightest minds around, and investigate new techniques and tools that can directly improve how they work and how their company can help customers.  But to make even better use of their time, dozens of attendees also took advantage of the opportunity to take a SAS Certification exam right at the conference site.

At major SAS events such as SAS Global Forum and the Analytics Experience series, the SAS Global Certification program offers multiple SAS exam sessions for attendees, usually at a 50% discount.  Here in Las Vegas, two exam sessions were offered on Monday, April 18, the day before the forum.  More than 80 attendees took a SAS exam while they were here.  As you can imagine, SAS users attending the forum are highly motivated individuals, which resulted in a significant number of them earning a SAS credential.  What a great way for someone to get a jump start on their SAS Global Forum experience.


Kriss Harris shares his SAS certification story on camera at SAS Global Forum

Next up?  The Analytics Experience 2016 to be held at the Bellagio hotel in Las Vegas September 12-14.  This time the SAS Global Certification program will be offering three exam sessions – two on Sunday, September 11 and one on Monday morning, September 12.  If you are planning on attending and you have been wanting to take a certain exam, why not maximize your time away from the office and do both?  Maybe you will leave the Analytics Experience 2016 not just smarter, but SAS certified. 

Here's a short video with more information about the certification program, including interviews from test takers at SAS Global Forum 2016. 



How much will that taxi cost me?

Are you afraid that if you take a ride in a taxi, you might get "taken for a ride"? If figuring out a reasonable price for a taxi is a bit voodoo/black-box to you, here is a SAS data analysis of over 12 million NYC Yellow Cab rides that will hopefully get you in the right ballpark!

Before we get started, here's a picture of a taxi I saw on my trip to Cuba last fall - I've never been to NYC, but I imagine taxi rides are a little different there! :)


I recently came across an interesting graph posted by reddit user 'badgraphs' that analyzed ~1,000,000 NYC Yellow Cab rides from January 2015. His goal was to estimate the "effective rate for an average yellow cab trip in NYC" ($/mile). Below is a copy of his graph:


I found his graph interesting (mainly because I had no idea that this detailed data from all the NYC Yellow Cabs was available!), and the combination of his graph and his write-up answered many questions about the data. But I wondered if I could create a better graph, that was a little more self-explanatory, and didn't need an accompanying article to help users know what was going on in the graph.

I located the data on the nyc.gov website, downloaded the csv file, and imported it into SAS. There were actually over 12 million rides in the data for January 2015 (whereas the graph above only plots ~1 million rides), and of course I included all 12 million in my graph, since SAS can handle that. I decided to let the data speak for itself rather than using regression lines and such, and I found it useful to color the data by the RateCodeID. The coloring helps explain several of the visual features in the graph.


Showing the cab fare -vs- distance was interesting, but I had a more direct question ... how much do people generally pay for a cab ride? Therefore, I rounded all cab fares to the nearest dollar, and created a histogram. Looks like the typical ride in an NYC Yellow Cab is around $9 (good to know, eh!?!)
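
Here is a small Python sketch of the "round to the nearest dollar and count" idea, in case you want to try it without SAS. The function names and the sample fares are made up for illustration; the rounding uses half-up so that $8.50 becomes $9 (Python's built-in round() would round halves to the nearest even number instead).

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP


def fare_histogram(fares):
    """Count how many rides fall on each whole-dollar amount."""
    counts = Counter()
    for fare in fares:
        dollars = int(Decimal(str(fare)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
        counts[dollars] += 1
    return counts


def typical_fare(fares):
    """Return the most common whole-dollar fare (the tallest histogram bar)."""
    return fare_histogram(fares).most_common(1)[0][0]


# Made-up fares, not values from the NYC data
sample_fares = [8.5, 9.2, 9.49, 12.0, 8.6]
print(fare_histogram(sample_fares))
print(typical_fare(sample_fares))
```

The tallest bar (the mode) is one reasonable reading of "typical"; a median would be another.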


Getting down and dirty with lake water quality data

Just like my hero Mike Rowe on the Dirty Jobs TV show, I'm finally diving into an area involving water quality ... and poop! Let's take a graphical look at just how clean (or dirty) the water is, at the lake where my Raleigh Dragon Boat Club practices...

Before we get into the data analysis, here's a picture of our dragon boat team. The boat is about 43 feet long and there are 20 paddlers, a drummer, and a steersperson. Several boats line up at the starting line, and it's a straight-line race to the finish line (generally 300 to 500 meters). The paddlers usually get a little wet from water splashing off of the paddles, and there's always the chance of a boat getting swamped if the water is choppy, so you kinda want that water to be clean. See how I sneaked my way back into the data analysis topic!?! ;)


Our team's boat is housed at Lake Wheeler - a small lake on the south side of Raleigh, with not too much motorboat traffic. Several years ago, the lake got a somewhat bad reputation for being dirty because it was frequently closed due to high levels of bacteria found in poop. Although it had this reputation, I had never personally witnessed it being closed while attending dragon boat practice there, once or twice a week for the past ~3 years. I wondered if maybe the lake's water quality had improved ... and of course a good way to find out would be through some data visualization!

After a bit of web searching, I found a page on the Wake County website that had data tables for 2009-2015. The tables were color-coded to help identify the bad days, but I felt like I could get a much better grasp of the information using a graph. So I copy-n-pasted the data from the PDF documents into text files that SAS could read, and imported the values into a SAS dataset. I plotted the enterococci level for the most recent year, and the levels didn't look too bad ... the readings were only above the EPA limit on one day (July 14).


Broadcast your SAS credentials with digital badging

If you are a SAS credential holder then, like most candidates, you have invested a lot of time and effort in earning your SAS certification.  Wouldn’t it be great if you could broadcast and manage verifiable proof of your achievement where you want and when you want? That’s where digital badging comes in.

The SAS Global Certification program has partnered with Acclaim, a part of Pearson, to provide SAS users who pass SAS certification exams with a digital version of their SAS credentials.  This digital badge can be used in email signatures, digital resumes, and on social media sites such as LinkedIn, Facebook, and Twitter.  This new functionality is available to all SAS users earning certifications at no cost.

The Acclaim digital badging platform provides:

  • A web-enabled version of your credential that can be shared online
  • Labor market insights that relate your skills to jobs
  • A trusted method for real-time credential verification
  • Complete user control of if/when/where your digital badge is displayed



PROC REPORT versus other Base SAS procedures

When you’re making a report, how do you choose which procedure to use?  The answer is – it depends.

It depends on:

  • whether you are doing an ad hoc analysis or creating a final report that many people will see
  • whether you will run statistical tests with your data or if you just want to see it
  • your level of comfort and if you already have working code

Most of my work involves creating final reports that will be distributed to many users.  I am a PROC REPORT specialist, so 90% of the time, I use PROC REPORT to create these reports.

Here is why:


PROC MEANS is a wonderful, powerful procedure in its own right.  It is an excellent tool for creating data sets of statistics that get fed into DATA steps or other procedures.  I use PROC MEANS extensively, but not as my final reporting procedure.


PROC MEANS does not:

  • give me percentages
  • do traffic lighting
  • add, subtract, multiply, or divide two variables
  • create new variables

PROC MEANS can give you an overall total and a total for CLASS and BY variables, but so can PROC REPORT.


PROC FREQ is really good at creating a small data set of counts that I can use later.  It is a very quick way of calculating percentages, which might be easier than trying to calculate them in PROC REPORT.  However, starting with PROC REPORT is better for me, just in case I am asked to make changes to the final output and the changes require something that PROC FREQ can’t do.

PROC FREQ does not:

  • support the STYLE option, so I can’t apply traffic lighting
  • calculate simple statistics, like median or mean
  • add informative text



Things to do in Vegas during SAS Global Forum

While you're at SAS Global Forum (or any conference) in Las Vegas, you might have a bit of spare time to do some sightseeing. Therefore I've put together a list & map of interesting things you might want to experience!

Below is a snapshot of my map - click it to see the full size interactive map, with hover-text and drill-downs for each attraction. Notice that the marker for the main conference hotel is yellow/red, and the other attractions are cyan/blue. You can click the markers to launch a Google search for that attraction, and the hover-text for each marker shows the distance from the conference hotel (I tried to pick interesting attractions within a reasonably short distance). The interactive map is followed by a tabular list of the attractions, if you're more of a text-based person.  




The road to SAS Global Forum: A training Q&A with the SAS Jedi

The journey continues as we hear from the instructors for each of the courses being offered on Thursday and Friday, April 21 and 22 after SAS Global Forum.

Next up is Mark Jordan, who developed and will teach the Introduction to DS2 and Hadoop course.

  1. Why should people get excited about this course?

Are you good with SAS but new to Hadoop and wondering how they fit together? Learn enough Hadoop to get you started and enough about DS2 to supercharge your data preparation programs – and get a little hands-on experience along the way! DS2 is a new SAS programming language that integrates the power and control of the DATA step with the ease, flexibility and rich data type palette of SQL. Like so many things in SAS, DS2 plays well with many data sources, but it is exceptionally well suited for manipulating data on massively parallel platforms (MPPs) like Hadoop and Teradata.

Learn how DS2 gives you the power to:

  • Process traditional SAS data sets and data tables containing full-precision ANSI data types in the same DS2 DATA program
  • Easily create and share reusable code modules using DS2 packages
  • Safely and easily parallel process multiple rows of data simultaneously using DS2 DATA and THREAD programs on the Base SAS platform. You’ll need nothing but Base SAS!

  2. Who would get the most out of attending this course?



Speak up! The Analytics Experience is opening its call for content

The Analytics conference series is getting a modern makeover. And it’s not just a new name.

The Analytics Experience, Sept. 12-14 at the Bellagio in Las Vegas, is bringing thought leaders and analytics gurus together for one big event. So whether you’re a geek or a suit, you can create a custom conference experience that delivers thought leadership, analytics strategies, learning and connecting.

Another big change happening this year is that we’re opening a call for content. Are you interested in presenting, but want to know what’s in it for you? Here are the top 3 reasons why you should consider presenting at the Analytics Experience.

  1. Show off your work

Why wouldn’t you want to show hundreds of your peers the awesome advancements you’re making in the analytics space? Speaking boosts not only your credibility but also your company’s credibility.


Drug overdose deaths are on the rise in the US

Lately I've seen several articles about drug overdose deaths being on the increase. But I didn't really like the graphs in those articles, so I tried to create some better ones using SAS ...

For example, here's a map from the National Center for Health Statistics website (see the 3rd dashboard/tab above the images). I've seen it used in many articles (such as here and here), but I really don't think it's a great map. The map projection is odd (the western states look squished), and so are the placement and size of Alaska and Hawaii - none of these physical aspects of the map are what I'm accustomed to seeing. I also don't like that they used a diverging color scheme (red to blue) - this might be appropriate for quintile color binning (where a different color is assigned to each 1/5 of the land areas), but in this case sequential/linear binning was used, with each color representing an additional 2 deaths per 100,000. Also, 11 colors were used in the legend - this is really too many colors for someone to easily discern, and relate from the map to the legend. And there are no state outlines, so it is difficult to determine which state a specific county is in.

I located the raw data in csv format, and was happy to find that it imported easily & cleanly into SAS using Proc Import. The data was pre-summarized into the 11 bins used in the NCHS map (above) - but 11 is too many colors to easily discern, and therefore I combined bins such that there were only 6 colors. I then plotted the data on a map using Proc Gmap (using a standard/familiar projection), and used a color gradient with shades of red (rather than using shades of 2 diverging colors like the map above). For a finishing touch, I overlaid the state outlines on the map. You can click the image below to see the interactive version with html hover-text, so you can see the names & values of each county:
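
As a side note on combining the bins, here is one simple way it could be done, sketched in Python rather than SAS. It assumes the 11 original bins are numbered 1 through 11 and that neighboring bins are merged in pairs; the exact grouping used for the map may well have been different, and the function name is just for illustration.

```python
def combine_bins(bin_index, group_size=2):
    """Map an original 1-based bin number to a coarser 1-based bin number
    by merging each run of `group_size` neighboring bins."""
    if bin_index < 1:
        raise ValueError("bin_index must be 1 or greater")
    return (bin_index - 1) // group_size + 1


# The 11 original bins collapse into 6 color bins
print([combine_bins(i) for i in range(1, 12)])
```

With pairs, each combined color covers roughly 4 deaths per 100,000 instead of 2, and the top bin stays on its own.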

While looking for the data, I happened to come across another visualization that let you see the trend over time, by showing 12 small maps on 1 page (small multiples). Below is a partial screen-capture (the whole grid is a bit wide, and would require too much shrinking to fit into the blog format - but you can click the image to see the full-size example). This map also used a diverging color scheme (blue-to-red), was a bit too small to really see the data at the county level, and lacked state outlines.

