Eliud Kipchoge recently ran a marathon in under 2 hours. It was a special marathon where they had set up the best possible conditions to help him achieve this goal (such as swapping in pace-setting runners to block the wind for him), so it won't count as the world record - but nonetheless it was quite an accomplishment. And just how fast is a 2 hour marathon, compared to the time it usually takes athletes to run a marathon? ... I'm glad you asked. Let's graph some data and find out!
But first, to help get you in the mood for analyzing some marathon data, here's a picture of my friend Leah's son running a race. It will be interesting to see what this next generation of runners will accomplish, eh!?!
Data to Compare
I just happened to have some data from the Boston Marathon that I had used in a previous blog post, so I decided to compare Eliud's time to that data. Note that the Boston Marathon times are fairly fast, as you have to meet high standards to be allowed to run in that race. Below is my previous graph. If we added Eliud's time, it would be a little below the bottom/left blue marker ... but I don't think that's a good way to show how much faster he is than everyone else, so I decided to start from scratch.
A much better way to visualize the distribution of race-times is to use a histogram. And here's an example of a pretty good histogram that chartr posted on Reddit.
I like the way it's laid out, and it has a good look for a magazine graph. But I would prefer to de-clutter it a bit, and try to make it easier to read.
How can I make a similar graph in SAS? Well, one way is to use a Proc SGplot histogram. With the following minimal code, I can produce a pretty decent histogram of the data, laid out similarly to the chartr graph above.
proc sgplot data=my_data;
label official='Time to finish marathon (minutes)';
histogram official;
run;
The above histogram is nice, but for this particular data I prefer the way each bar represents an individual minute in the original (chartr) graph. That's a lot of minutes to represent as individual bars in a traditional histogram ... I only have so many pixels to work with! So I decided to switch to a needle plot instead. I rounded the race times to the nearest minute, and then used SQL to count the number of runners finishing in each minute. I then plotted the summarized data using a needle plot (I like thinking of needles as micro-bars).
data my_data; set my_data;
official_rounded=round(official); /* nearest whole minute */
run;
proc sql noprint;
create table plot_data as
select unique official_rounded, count(*) as count
from my_data
group by official_rounded;
quit;
proc sgplot data=plot_data;
label count='Number of runners finishing in that time';
label official_rounded='Time to finish marathon (rounded to the nearest minute)';
needle x=official_rounded y=count / lineattrs=(thickness=3px color=cx5994c8);
xaxis values=(120 to 360 by 60) reverse;
yaxis display=(noline) values=(0 to 350 by 50);
run;
Now that I've got the basic graph, it's just a matter of adding a few final touches. Below is a summary list of those final touches (click here to see the complete SAS code):
- I convert the minutes to hours, so I can use the hhmm. time format along the bottom axis.
- I add reference lines at each hour.
- I annotate an arrow pointing to Eliud's time.
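Here is a rough sketch of how those three touches might look in code. It is not the complete code from the post, just one way to do it under a few assumptions: the hhmm. format expects a SAS time value (seconds), so I multiply the rounded minutes by 60; the data set names plot_data2 and anno_arrow, the variable official_time, and the arrow's height and color are made up for illustration; and I place the arrow at 7180 seconds (1:59:40), which you should swap for whatever time you want to highlight.

/* Sketch only: assumes plot_data (official_rounded, count) already exists */
data plot_data2; set plot_data;
official_time=official_rounded*60; /* minutes -> seconds, so hhmm. prints h:mm */
format official_time hhmm.;
run;

/* Annotation data set with one arrow, positioned in data coordinates */
data anno_arrow;
length function x1space y1space x2space y2space $20 linecolor $12;
function='arrow';
x1space='datavalue'; y1space='datavalue';
x2space='datavalue'; y2space='datavalue';
x1=7180; y1=100; /* tail of the arrow (height picked arbitrarily) */
x2=7180; y2=5;   /* head of the arrow, just above the axis */
linecolor='red'; linethickness=2;
run;

proc sgplot data=plot_data2 sganno=anno_arrow;
label count='Number of runners finishing in that time';
label official_time='Time to finish marathon (h:mm)';
needle x=official_time y=count / lineattrs=(thickness=3px color=cx5994c8);
/* reference lines at each hour: 2:00 through 6:00, expressed in seconds */
refline 7200 10800 14400 18000 21600 / axis=x lineattrs=(color=gray);
xaxis values=(7200 to 21600 by 3600) reverse;
yaxis display=(noline) values=(0 to 350 by 50);
run;

Since the arrow sits just past the 2:00 tick mark, you might need to widen the axis range a bit if it ends up clipped in your output.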
A histogram is a good way to see how much faster Eliud's time was than typical marathon times, and a needle plot is a good way to create a "high density histogram". A fancy histogram might be good to grab attention in a magazine article, but a nice clean/minimal graph is easier to read and comprehend quickly.