Some people search the Internet for a set of topics and then use the number of search results ("hits") for each topic to rank the relative popularity of the topics. At the 2011 Joint Statistical Meetings (JSM), I had the opportunity to attend several talks by statisticians from Google and other large Internet companies. When I chatted with some of these statisticians after their talks, they confirmed what I had suspected: it's a bad idea to estimate the popularity of a person or product based on the results of an Internet search.
A case study: Hot dogs versus hamburgers
If I search for "hot dogs," a search engine tells me there are "about 26,700,000 results." If I search for "hamburgers," I find that there are "about 20,900,000 results." Both the number of results and the number of Internet searches favor "hot dogs" over "hamburgers." Is it valid to conclude that hot dogs are more popular than hamburgers? You can find out by examining statistics that are related to consumption.
The National Hot Dog & Sausage Council estimates that US retail sales of hot dogs are more than $1.68 billion, which doesn't include the 21.4 million hot dogs consumed each year just at major league baseball games. Add in carnivals, fairs, and cafeterias, and the facts are clear: hot dogs are popular.
On the other hand, hamburgers are popular, too. McDonald's, Burger King, White Castle, Five Guys Burgers, In-N-Out Burger, and many other chains make hundreds of billions of dollars selling burgers and related items. McDonald's does not publish sales information for individual items, but its own literature claims that the chain sells "more than 75 hamburgers per second, of every minute, of every hour, of every day of the year," which would amount to about 2.4 billion hamburgers sold annually. That's ten times the volume of retail hot dog sales, just from one fast food chain. (However, these are world-wide sales figures, whereas the hot dog statistics are for the US only.) Men's Health magazine estimates that "each year Americans eat about 40 billion burgers."
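The annual figure follows from simple arithmetic on the quoted rate. Here is a quick check in Python. It assumes a 365-day year and a rate of exactly 75 per second around the clock, although the quote says "more than 75," so the result is a lower bound.

```python
# Convert the quoted rate (75 hamburgers per second) to an annual total.
# Assumptions: a 365-day year and a constant rate during every second of it.
BURGERS_PER_SECOND = 75
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000 seconds

annual_burgers = BURGERS_PER_SECOND * SECONDS_PER_YEAR

print(f"{annual_burgers:,} hamburgers per year")    # 2,365,200,000
print(f"about {annual_burgers / 1e9:.1f} billion")  # about 2.4 billion
```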
Is it valid to claim that hot dogs are more popular, based only on results from an Internet search engine? I asked a statistician from Google about using search engine results to measure popularity. He sadly shook his head. "I know some people do that," he sighed, "but I would never do it, and I don't know any statistician at Google who would, either."
Variance: There's no such thing as THE Google search
Okay, the number of results from an Internet search might not be a good estimate of popularity, but some people still use it. For any estimate, a statistician wants to examine at least two properties: bias and variance.
One fact I discovered at JSM is that there is no such thing as the Google search for a topic. Google is always changing its algorithms and even runs experiments with its search results. If you search for "Barack Obama" one morning, you might get 264 million hits. If you run the exact same search a few minutes later, you might get 261 or even 248 million hits. No, the Web is not shrinking. Rather, the algorithm that returns the results is not static.
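To get a feel for the size of that run-to-run variation, you can summarize the three illustrative counts above. This is only a sketch: the values are the hypothetical "Barack Obama" counts from the text, not measurements, and three observations are far too few for a serious variance estimate.

```python
from statistics import mean, stdev

# Illustrative hit counts (in millions) for repeated runs of one query.
# These are the example values from the text, not real measurements.
hits_millions = [264, 261, 248]

avg = mean(hits_millions)
sd = stdev(hits_millions)  # sample standard deviation (divides by n - 1)
spread = max(hits_millions) - min(hits_millions)

print(f"mean  = {avg:.1f} million")
print(f"sd    = {sd:.1f} million")
print(f"range = {spread} million ({spread / avg:.1%} of the mean)")
```

Even in this tiny example, the counts differ by several percent of the mean within a few minutes. Any popularity ranking that rests on a smaller gap than that cannot be distinguished from noise.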
Furthermore, the search results that you get might depend on your geographic location (try searching for "McDonalds") and on the status of your browser cache.
I heard a very interesting talk at JSM about how Google is trying to use topics that you previously searched for in order to predict what you might search for next. The day of "personalized searches" appears to be drawing closer. One day (possibly soon) the search results that I get when I search for "hot dogs" might be different from the results that you get, because our search histories are different.
Variance is not the only problem. Internet search results are also biased in many ways. Here are a few of the more obvious biases:
- Age and social media bias: Young people create much more Web content than older people. Whether on Facebook and Twitter or through blogs and comments, the under-30 generation produces much of the searchable content on the Web. The net effect is that topics that appeal to young people are discussed more often. If you judge popularity based on Internet searches alone, you get what might be called "The Justin Bieber Effect": a small but vocal minority dominates the online discussion of a topic.
- Sentiment bias: And speaking of Justin Bieber, a simple Internet search does not include sentiment analysis. A Google search for the exact phrase "I hate Justin Bieber" returns more than 10 million hits. Yet, when sentiment is ignored, these 10 million hits are counted towards Bieber's "popularity."
- Keywords and confounding topics: Internet searches are tricky beasts. A search for "Beatles" returns articles of interest to entomologists and gardeners as well as references to the Fab Four. A search for "hot dogs" misses most of the references to Oscar Mayer wieners and their famous jingle. Furthermore, the search for "hot dogs" includes 1.2 million references to the Nathan's Famous Hot Dog Eating Contest (62 hot dogs in 10 minutes, in case you're wondering). If I search for "burgers" instead of "hamburgers," I am rewarded with 76 million hits, almost four times as many!
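The keyword problem is easy to see by comparing the hit counts quoted in this article. The short Python sketch below uses only those three numbers. The helper `ratio` is a name introduced here for illustration.

```python
# Hit counts (in millions) quoted in this article for three search terms.
hits_millions = {"hot dogs": 26.7, "hamburgers": 20.9, "burgers": 76.0}


def ratio(term_a, term_b):
    """Return how many times more hits term_a gets than term_b."""
    return hits_millions[term_a] / hits_millions[term_b]


# With "hamburgers" as the keyword, hot dogs appear to win.
print(f"hot dogs / hamburgers = {ratio('hot dogs', 'hamburgers'):.2f}")

# With "burgers" as the keyword, the ranking reverses.
print(f"hot dogs / burgers    = {ratio('hot dogs', 'burgers'):.2f}")

# The choice of keyword alone changes the burger count by this factor.
print(f"burgers / hamburgers  = {ratio('burgers', 'hamburgers'):.2f}")
```

The "winner" depends entirely on which synonym you type, which is a property of the query and not of anyone's eating habits.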
If it's on the Internet, it must be true?
To paraphrase an old saying, "Google never lies, but liars sure can google." Internet searches are useful in many ways. However, using the number of search results as a proxy for popularity is of dubious value and is fraught with statistical perils. The statisticians at Google don't do it, and neither should you.