Math and statistics are everywhere, and I always rejoice when I spot a rather sophisticated statistical idea "in the wild." For example, I am always pleased when I see a graph that shows the distribution of race times in a typical race (such as a 5K), as shown to the right. The finishing times are plotted against the order in which runners crossed the finish line. This is a great visualization because you can see the times for each participant, the range of times between the top finishers and the laggards, and how close each runner's time was to the time of the person who placed ahead of her.
What amazes me is that this graph essentially shows the cumulative distribution of race times. The cumulative distribution is not generally used outside of scientific publications. Yet here it is, easily understood and with no accompanying explanation! Graphs like this are often used to visualize race times for triathlons, marathons, 5Ks, and more.
The distribution of race times
The graph is noteworthy for what it is and also for what it isn't. It isn't a histogram. If you give a statistician a set of measurements and ask for the distribution, you are likely to get a histogram. For small data sets, you might also get a fringe plot below the histogram, as shown below:
The histogram shows the distribution of race times for the same race, with the times binned into one-minute intervals. You can easily see that a small percentage of runners (about 2%) finished the race in under 19 minutes and that about 40% of the runners finished between 20 and 22 minutes. You can also see that only a few runners took longer than 26 minutes.
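To make the binning concrete, here is a minimal Python sketch of how one-minute bins could be counted. The function name `bin_by_minute` and the finishing times are made up for illustration; they are not the data from the race shown above.

```python
from collections import Counter
import math

def bin_by_minute(times_min):
    """Count finishers in each one-minute interval [m, m+1)."""
    counts = Counter(math.floor(t) for t in times_min)
    return dict(sorted(counts.items()))

# Hypothetical finishing times in minutes (illustrative only).
times = [18.6, 19.4, 20.1, 20.7, 21.2, 21.3, 21.9, 23.5, 26.8]

hist = bin_by_minute(times)
for minute, count in hist.items():
    pct = 100 * count / len(times)
    print(f"{minute}-{minute + 1} min: {count} runner(s), {pct:.0f}%")
```

Each bar of the histogram reports only a count, so the individual times inside a bin are no longer visible.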
Since the histogram is such a standard plot, you would think it would be the "best" graph to use, but the time-versus-rank graph has several advantages:
- The time-versus-rank graph connects the times to the place. What was the time for the 20th runner? It's easy to determine. How many runners finished under 22 minutes? Also easy to find.
- You can see every runner's time. In a race, there might be a tenth of a second between times. The fringe plot suffers from overplotting. The histogram bins many times into a single bar. The time-versus-rank graph displays one marker per runner and the markers do not overlap unless there are hundreds of runners.
- The time-versus-rank graph shows packs of runners. In long-distance races, there is often a "lead pack," a "trailing pack," and other clumps of runners of similar ability. On the time-versus-rank graph, these packs show up as groups of markers that lie along nearly horizontal lines. In the fringe plot, the vertical lines overlap and the packs are harder to see.
- Leaders and laggards stand out in the time-versus-rank graph because the markers are isolated. If someone wins a race by 20 seconds (a huge lead!), you can see that clearly in the first graph. In contrast, the histogram lumps together all the leaders into one bar.
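The lookups described in the list above can be sketched in a few lines of Python. This is only an illustration under assumed data: the helper names and the finishing times are hypothetical, not taken from the race in the graphs.

```python
from bisect import bisect_left

def time_vs_rank(times_min):
    """Return (rank, time) pairs, where rank 1 is the first finisher."""
    return list(enumerate(sorted(times_min), start=1))

def time_of_place(times_min, place):
    """Finishing time of the runner who placed `place`-th (1-based)."""
    return sorted(times_min)[place - 1]

def count_under(times_min, cutoff):
    """Number of runners who finished strictly under `cutoff` minutes."""
    return bisect_left(sorted(times_min), cutoff)

def gaps_to_runner_ahead(times_min):
    """Gap between each runner and the runner who placed just ahead."""
    s = sorted(times_min)
    return [b - a for a, b in zip(s, s[1:])]

# Hypothetical finishing times in minutes (illustrative only).
times = [21.3, 18.6, 20.1, 19.4, 20.15, 23.5]

for rank, t in time_vs_rank(times):
    print(rank, t)
print("2nd place time:", time_of_place(times, 2))
print("finished under 21 minutes:", count_under(times, 21))
```

Small gaps correspond to packs of runners, and a large gap marks an isolated leader or laggard.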
The cumulative distribution of race times
The time-versus-rank graph is not identical to the standard graph of the empirical cumulative distribution function (ECDF), but it's close. In SAS, you can use PROC UNIVARIATE to create the following graph of the cumulative distribution of race times. The graph is known as a CDF plot.
The CDF plot has the same shape as the time-versus-rank graph, but you need to flip the axes. (Geometrically, flip the CDF plot across its diagonal.) The CDF plot differs in three minor ways:
- The ECDF is plotted as a step function. The time-versus-rank graph is plotted as a scatter plot.
- The ECDF has its axes reversed: the times are on the horizontal axis and the order is plotted vertically.
- The ECDF rescales the rank to a cumulative percentage rather than plotting the rank itself. The runner who finishes 20th out of 66 runners is plotted at 20/66, or about 30.3%.
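The relationship between the two graphs can be sketched in Python. This is a minimal illustration, not the computation that PROC UNIVARIATE performs internally; the function names are hypothetical, and the sketch assumes the usual ECDF definition (the percentage of observations less than or equal to a given time).

```python
from bisect import bisect_right

def ecdf_percent(rank, n):
    """Convert a finishing place (1-based) into a cumulative percentage."""
    return 100.0 * rank / n

def ecdf_at(times_min, t):
    """Empirical CDF at t: percentage of runners with time <= t."""
    s = sorted(times_min)
    return 100.0 * bisect_right(s, t) / len(s)

def ecdf_points(times_min):
    """(time, percent) pairs at the top of each ECDF step.

    Swapping the coordinates and replacing percent with rank
    recovers the time-versus-rank scatter plot.
    """
    s = sorted(times_min)
    n = len(s)
    return [(t, 100.0 * k / n) for k, t in enumerate(s, start=1)]

# The example from the text: 20th place out of 66 runners.
print(round(ecdf_percent(20, 66), 1))  # 30.3
```

If two runners share exactly the same time, `ecdf_points` returns two points at that time, and the ECDF jumps by two steps at once.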
So, yes, there are minor differences. But I still smile whenever I see the time-versus-rank graph. Although the racers might not know or care, the plot contains the same information as a plot of the cumulative distribution of the race times.