For those who follow such things, professional cycling proves to be fascinating. And July is the best month of all, since it features the three-week-long Tour de France, a wonderful extravaganza of athleticism, endurance and marketing hype.
For the first time ever, pundits are touting a rider from Great Britain -- Bradley Wiggins -- as a possible, even a probable winner. So can data analysis offer Bradley any advice to maximise his chances of donning the fabled Yellow Jersey in Paris?
As with many sports, data relating to past performances are available if you care to look. In this case, though, some relevant data were easy to find, since an analysis of the 2005 Tour is featured in the excellent book by Theus and Urbanek titled Interactive Graphics for Data Analysis. You can download the data via the website for the book, or alternatively, you can get the data already prepared as a JMP table from the JMP File Exchange (download requires a free SAS profile).
Figure 1 shows the teams and the ages of the riders who started the race. Note that the rider who won the race to Paris is shown as a yellow dot, and his teammates as red dots (this coloring is also used in subsequent graphs).
Figure 2 shows the overall ranking of each rider after every stage, in the form of a Parallel Plot. For those unfamiliar with this display, it was developed by Alfred Inselberg, and it is a great way to show structure and univariate and multivariate outliers when you have lots of variables. Although one can use colors, sizes and shapes to extend the traditional scatterplot representation of data, this idea soon runs out of steam.
The Parallel Plot solves this problem neatly by drawing the coordinate axes vertically and placing them side by side. Figure 2 shows that the 2005 Tour consisted of 21 stages. With this representation, you can have a large number of variables before the plot becomes unwieldy, and observations are represented as lines rather than points. This means that the Parallel Plot has some interesting features, but also some shortcomings, so it illustrates the more general idea that no single visualisation will be able to reveal everything about your data.
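To make the "observations become lines" idea concrete, here is a minimal Python sketch of how a parallel plot can turn each row of a table into a polyline. It is only an illustration under stated assumptions: the function name `parallel_coordinates` is hypothetical, the ranks are made up, and scaling every axis to its own 0-1 range is just one common choice, not necessarily what JMP does by default.

```python
def parallel_coordinates(rows):
    """Return one polyline per observation as a list of (axis_index, scaled_value).

    Assumption: each variable is rescaled to [0, 1] using its own min and max,
    so that all axes share the same vertical extent.
    """
    n_vars = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_vars)]
    highs = [max(row[j] for row in rows) for j in range(n_vars)]

    lines = []
    for row in rows:
        line = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # A constant variable has no spread, so place it mid-axis.
            scaled = (value - lows[j]) / span if span else 0.5
            line.append((j, scaled))
        lines.append(line)
    return lines


# Made-up overall ranks of three riders after three stages (not Tour data).
ranks = [
    [1, 3, 2],
    [2, 1, 1],
    [3, 2, 3],
]
lines = parallel_coordinates(ranks)
```

Each entry of `lines` is the sequence of vertices for one rider; joining consecutive vertices with straight segments gives that rider's line across the axes, and lines that cross between two axes correspond to riders swapping places.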
The labels on the horizontal axis in Figure 2 show what cycling fans know already, namely that the stages of the Tour fall into certain types, for which certain types of riders are more suited physiologically. In this case, a (TT) indicates a time trial stage (in which riders compete individually and can’t cooperate with team members or allies), and an (M) indicates a mountain stage (often involving vertical height gains of 3,000 meters or more). Note that, as might be expected, such stages often involve a lot of "mixing" of the riders (as shown by the lines criss-crossing).
Of course, the Yellow Jersey is awarded to the rider in Paris who has the quickest overall race time, so ranks tell only part of the story. With the possible exception of Year of Birth in Figure 1, remember that "smaller is better" when looking at the vertical scales in the figures. Figure 3 shows each rider’s time for each stage and gives an overall feel for how the route was made up. Note that in the Parallel Plots, we have used Transparency less than 1 (right-click in the Graph Box to set this value), and we have also selected all riders in the team of the winner. Using Transparency and selecting rows (by the Data Filter or some other view) can be very informative when you have lots of variables.
Figure 4 shows the cumulative time of each rider after each stage. Note that neither Figure 4 nor Figure 3 gives much insight into individual performances due to the vertical scaling. Although Parallel Plot has some inbuilt scaling options, sometimes more informative displays can be made by defining auxiliary formula columns and plotting these. In this case, it’s helpful to take the cumulative times at a given stage and subtract the median cumulative time at that stage from all the values. This gives rise to Figure 5, which gives a nice visual representation of the entire race history, showing how individual riders fared relative to the rest of the field at each stage.
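For readers who want to try the median adjustment outside JMP, it can be sketched in a few lines of Python. This is an illustration only, not the formula columns used for the figures: the rider names and stage times are made up, and the helper names `cumulative_times` and `median_centred` are hypothetical.

```python
from itertools import accumulate
from statistics import median

# Made-up stage times in seconds for three hypothetical riders (not Tour data).
stage_times = {
    "rider_a": [100, 200, 150],
    "rider_b": [110, 190, 160],
    "rider_c": [105, 210, 140],
}


def cumulative_times(times_by_rider):
    """Running total of each rider's stage times."""
    return {rider: list(accumulate(times)) for rider, times in times_by_rider.items()}


def median_centred(cumulative_by_rider):
    """Subtract, stage by stage, the median cumulative time across all riders."""
    riders = list(cumulative_by_rider)
    n_stages = len(cumulative_by_rider[riders[0]])
    stage_medians = [
        median(cumulative_by_rider[rider][s] for rider in riders)
        for s in range(n_stages)
    ]
    return {
        rider: [cumulative_by_rider[rider][s] - stage_medians[s] for s in range(n_stages)]
        for rider in riders
    }


cumulative = cumulative_times(stage_times)
centred = median_centred(cumulative)
```

After centring, a negative value means the rider is ahead of the median rider at that stage and a positive value means behind, so the vertical scale is no longer dominated by the steadily growing total race time.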
Theus and Urbanek have some further analysis of the data, so what general guidance can we offer Bradley as a result? The simple and obvious answer is "get to Paris quickest"! This involves a number of aspects, including not being forced to abandon due to a crash (as in 2011), and having team members who can help and finish close by whatever the type of stage (time trials excluded, of course). Whatever happens, it will certainly be interesting to see how things unfold. At the first rest day (after Stage 9), Bradley was in yellow. Come on, Brad!
Finally, as those who follow cycling will know already, the Tour winner in 2005 was Lance Armstrong, riding for the Discovery Channel team. Figures 2 and 5 clearly show a commanding, even superhuman performance. Those who are interested might care to read about the latest charges against Armstrong.