In a voting contest, is it possible for a huge population to get behind a ridiculous candidate with such force that no other contestant can possibly catch up? The answer is: Yes.
Just ask the folks at NERC, the environmental research organization in the UK. They are commissioning a new vessel for polar research, and they decided to crowdsource the naming process. Anyone in the world is welcome to visit their NameOurShip web site and suggest a name or vote on an existing name submission.
As of today, the leading name is "RRS Boaty McBoatface." ("RRS" is standard prefix for a Royal Research Ship.) This wonderfully creative name is winning the race by more than just a little bit: it has 10 times the number of votes as the next highest vote getter, "RRS Henry Worsley".
I wondered whether the raw data for this poll might be available, and I was pleased to find it embedded in the web page that shows the current entries. The raw data is in JSON format, embedded in the source of the HTML page. I saved the web page source to my local machine, copied out just the JSON line with the submissions data, then used SAS to parse the results. Here's my code:
filename records "c:\projects\votedata.txt"; data votes (keep=title likes); length likes 8; format likes comma20.; label likes="Votes"; length len 8; infile records; if _n_ = 1 then do; retain likes_regex title_regex; likes_regex = prxparse("/\'likes\'\:\s?([0-9]*)/"); title_regex = prxparse("/\'title\':\s?\""([a-zA-Z0-9\'\s]+)/"); end; input; position = prxmatch(likes_regex,_infile_); if (position ^= 0) then do; call prxposn(likes_regex, 1, start, len); likes = substr(_infile_,start,len); end; start=0; len=0; position = prxmatch(title_regex,_infile_); if (position ^= 0) then do; call prxposn(title_regex, 1, start, len); title = substr(_infile_,start,len); end; run;
With the data in SAS, I used PROC FREQ to show the current tally:
title "Vote tally for NERC's Name Our Ship campaign"; proc freq data=votes order=freq; table title; weight likes; run;
The numbers are compelling: good ol' Boaty Mac has over 42% of the nearly 200,000 votes. The arguably more-respectable "Henry Worsley" entry is tracking at just 4%. I'm not an expert on polling and sample sizes, but even I can tell that Boaty McBoatface is going to be tough to beat.
To drive the point home a bit more, let's look at a box plot of the votes distribution.
title "Distribution of votes for ALL submissions"; proc sgplot data=votes; hbox likes; xaxis valueattrs=(size=12pt); run;
In this output, we have a clear outlier:
If we exclude Boaty, then it shows a slightly closer race among the other runners up (which include some good serious entries, plus some whimsical entries, such as "Boatimus Prime"):
title "Distribution of votes for ALL submissions except Boaty McBoatface"; proc sgplot data=votes(where=(title^="Boaty McBoatface")); hbox likes; xaxis valueattrs=(size=12pt); run;
Digging further, I wondered whether there were some recurring themes in the entries. I decided to calculate word frequencies using a technique I found on our SAS Support Communities (thanks to Cynthia Zender for sharing):
/* Tally the words across all submissions */ data wdcount(keep=word); set votes; i = 1; origword = scan(title,i); word = compress(lowcase(origword),'?'); wordord = i; do until (origword = ' '); /* exclude the most common words */ if word not in ('a','the','of','and') then output; i + 1; wordord = i; origword = scan(title,i); word = compress(lowcase(origword),'?'); end; run; proc sql; create table work.wordcounts as select t1.word, /* count_of_word */ (count(t1.word)) as word_count from work.wdcount t1 group by t1.word order by word_count desc; quit; title "Frequently occurring words in boat name submissions"; proc print data=wordcounts(obs=25); run;
The top words evoke the northern, cold nature of the boat's mission. Here are the top 25 words and their counts:
1 polar 352 2 ice 193 3 explorer 110 4 arctic 86 5 red 69 6 sir 55 7 john 54 8 lady 46 9 sea 42 10 ocean 42 11 scott 41 12 bear 39 13 aurora 38 14 artic 37 15 queen 37 16 captain 36 17 james 36 18 endeavour 35 19 william 35 20 star 34 21 spirit 34 22 new 26 23 antarctic 26 24 boat 25 25 cold 25
I don't know when voting closes, so maybe whimsy will yet be outvoted by a more serious entry. Or maybe NERC will exercise their right to "take this under advisement" and set a certain standard for the finalist names. Whatever the outcome, I'm sure we haven't heard the last of Boaty...