Some people search the Internet for a set of topics and then use the number of search results ("hits") for each topic to rank the relative popularity of the topics. At the 2011 Joint Statistical Meetings (JSM), I had the opportunity to attend several talks by statisticians from Google and other large Internet companies. When I chatted with some of these statisticians after talks, they confirmed what I had suspected: it's a bad idea to estimate the popularity of a person or product based on the results of an Internet search.
A case study: Hot dogs versus hamburgers
If I search for "hot dogs," a search engine tells me there are "about 26,700,000 results." If I search for "hamburgers," I find that there are "about 20,900,000 results." Not only the number of results, but also the number of Internet searches favors "hot dogs" over "hamburgers." Is it valid to conclude that hot dogs are more popular than hamburgers? You can find out by examining statistics that are related to consumption.
The National Hot Dog & Sausage Council estimates that US retail sales of hot dogs are more than $1.68 billion, which doesn't include the 21.4 million hot dogs consumed each year just at major league baseball games. Add in carnivals, fairs, and cafeterias, and the facts are clear: hot dogs are popular.
On the other hand, hamburgers are popular, too. McDonald's, Burger King, White Castle, Five Guys Burgers, In-N-Out Burger, and many other chains make hundreds of billions of dollars selling burgers and related items. McDonald's does not publish sales information for individual items, but its own literature claims that the company sells "more than 75 hamburgers per second, of every minute, of every hour, of every day of the year," which would amount to about 2.4 billion hamburgers sold annually. That's ten times the volume of retail hot dog sales, just from one fast food chain. (However, these are world-wide sales figures, whereas the hot dog statistics are for the US only.) Men's Health magazine estimates that "each year Americans eat about 40 billion burgers."
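The annual figure follows from simple arithmetic. Here is a minimal sketch in Python, assuming a constant rate of exactly 75 hamburgers per second and a 365-day year. The quoted claim says "more than 75," so the result is a lower bound. The variable names are only illustrative.

```python
# Convert the quoted rate of 75 hamburgers per second into an annual total.
# Assumptions: a constant rate and a 365-day year.
BURGERS_PER_SECOND = 75
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

burgers_per_year = BURGERS_PER_SECOND * SECONDS_PER_YEAR
print(f"{burgers_per_year:,} hamburgers per year")
print(f"about {burgers_per_year / 1e9:.1f} billion")
```

The product is 2,365,200,000, which rounds to the 2.4 billion quoted above.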
Is it valid to claim that hot dogs are more popular, based only on results from an Internet search engine? I asked a statistician from Google about using search engine results to measure popularity. He sadly shook his head. "I know some people do that," he sighed, "but I would never do it, and I don't know any statistician at Google who would, either."
Variance: There's no such thing as THE Google search
Okay, the number of results from an Internet search might not be a good estimate of popularity, but some people still use it. For any estimate, a statistician wants to examine at least two properties of the estimate: bias and variance.
One fact I discovered at JSM is that there is no such thing as the Google search for a topic. Google is always changing its algorithms and even runs experiments with its search results. If you search for "Barack Obama" one morning, you might get 264 million hits. If you run the exact same search a few minutes later, you might get 261 or even 248 million hits. No, the Web is not shrinking. Rather, the algorithm that returns the results is not static.
Furthermore, the search results that you get might depend on your geographic location (try searching for "McDonalds") and on the status of your browser cache.
I heard a very interesting talk at JSM about how Google is trying to use topics that you previously searched for in order to predict what you might search for next. The day of "personalized searches" appears to be drawing closer. One day (possibly soon) the search results that I get when I search for "hot dogs" might be different from the results that you get, because our search histories are different.
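To put a rough number on that variability, you can treat the three hit counts quoted above for the same search as repeated measurements. The following sketch uses Python's standard `statistics` module. Three values are far too few for a serious estimate, and the counts are only the illustrative figures from this article, not new data.

```python
import statistics

# Hit counts (in millions) quoted in the text for the same search,
# run a few minutes apart.
hits_millions = [264, 261, 248]

mean_hits = statistics.mean(hits_millions)
sd_hits = statistics.stdev(hits_millions)  # sample standard deviation (n - 1)
spread = max(hits_millions) - min(hits_millions)

print(f"mean = {mean_hits:.1f} million")
print(f"sample SD = {sd_hits:.1f} million")
print(f"range = {spread} million ({100 * spread / mean_hits:.1f}% of the mean)")
```

Even in this tiny sample, the counts swing by 16 million, or about 6% of the mean, within minutes. Any ranking that rests on a smaller difference than that is within the noise.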
Search bias
Internet search results are biased in many ways. Here are a few of the more obvious biases:
- Age and social media bias: Young people create much more Web content than older people. Whether on Facebook and Twitter or through blogs and comments, the under-30 generation produces much of the searchable content on the Web. The net effect is that topics that appeal to young people are discussed more often. If you judge popularity based on Internet searches alone, you get what might be called "The Justin Bieber Effect": a small but vocal minority dominates the online discussion of a topic.
- Sentiment bias: And speaking of Justin Bieber, a simple Internet search does not include sentiment analysis. A Google search for the exact phrase "I hate Justin Bieber" returns more than 10 million hits. Yet, when sentiment is ignored, these 10 million hits are counted towards Bieber's "popularity."
- Keywords and confounding topics: Internet searches are tricky beasts. A search for "Beatles" returns articles of interest to entomologists and gardeners as well as references to the Fab Four. A search for "hot dogs" misses most of the references to Oscar Mayer wieners and their famous jingle. Furthermore, the search for "hot dogs" includes 1.2 million references to the Nathan's Famous Hot Dog Eating Contest (62 hot dogs in 10 minutes, in case you're wondering). If I search for "burgers" instead of "hamburgers," I am rewarded with 76 million hits—almost four times as many hits!
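The keyword problem alone is enough to flip a ranking. Here is a small Python sketch that uses only the hit counts quoted in this article. The dictionary and the function name are illustrative, not part of any search API.

```python
# Hit counts (in millions) quoted in the article.
hits_millions = {
    "hot dogs": 26.7,
    "hamburgers": 20.9,
    "burgers": 76.0,
}

def hit_ratio(term_a: str, term_b: str) -> float:
    """Ratio of the hit count for term_a to the hit count for term_b."""
    return hits_millions[term_a] / hits_millions[term_b]

# Compare "hot dogs" against each spelling of its rival.
for rival in ("hamburgers", "burgers"):
    ratio = hit_ratio("hot dogs", rival)
    leader = "hot dogs" if ratio > 1 else rival
    print(f'"hot dogs" vs "{rival}": ratio = {ratio:.2f}, "{leader}" leads')
```

Against "hamburgers" the ratio is above 1, so hot dogs appear to win. Against "burgers" it drops well below 1, so they appear to lose. Swapping one synonym for another reverses the apparent winner, which suggests that the counts measure word choice more than popularity.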
If it's on the Internet, it must be true?
To paraphrase an old saying, "Google never lies, but liars sure can google." Internet searches are useful in many ways. However, using the number of search results as a proxy for popularity is of dubious value and is fraught with statistical perils. The statisticians at Google don't do it, and neither should you.
16 Comments
Nice article. Just one thing: the "day" that your search results for a topic are different to mine lies well in the past: http://googleblog.blogspot.com/2009/12/personalized-search-for-everyone.html
br, Thomas
I'm not an expert, but I thought that you and I get basically the same NUMBER of results, but in a different ORDER. (There will be some "local" variation if we both search for restaurants or retail stores.) Can someone confirm or debunk my conjecture? Does my search history affect the number of results, or only the order in which the results are presented?
I see this all the time now. It’s ridiculous that people have gotten so lazy they use Google (=easy) instead of surveys or data (=hard). Even news sites like Huffington Post publish “news articles” that tell the “Most Popular [Playmates/Women/Celebrities/…] According to Google.” How is this news? It’s certainly not science. In fact, if you search for the two phrases “most popular” AND “according to google,” you get 3 million hits! Maybe soon Nielsen Ratings and American Top 40 will fire their research departments and just use Google to predict the most popular TV shows/songs? Maybe the US will save money and not hold elections anymore but just award government office to the candidates that have the most Google hits? Google used in this way is nonsense. Thanks for writing this all down.
Hi Rick,
Well said! As you know, I've examined all kinds of ways to measure the popularity of SAS, R, SPSS, etc. in my paper, http://r4stats.com/popularity. Each of the approaches I tried had various problems, but at least they were fairly consistent from year to year. One approach I tried was similar to what you discuss. Rather than count Google search hits, I counted Google searches themselves by using Google Insights. However, since I was after a popularity measure, what the people were expecting when they entered those searches faces the same problem as the search results that you discuss.
The method I used lately counted searches for "SAS code for" or "SAS graph", and so on for each software. To make sure valid hits would result from the searches I ran those searches in Google. The "long tail" showed how hard it was. I recall searching for "sas manual" thinking that if someone entered that search, surely they were using SAS software. For the first 50 or so hits, that was true. Then I started seeing manuals on how to be a Special Air Service (S.A.S.) sniper! After many attempts, those two strings were the only ones I found that consistently returned only SAS software results. But still, Google Insights is not about returned hits, but rather about what people were thinking when they did their search. How many people were thinking about encrypting secret messages and entered "SAS code for"? Or perhaps snipers have their own sorts of graphs that didn't show up in an actual search.
The next time I update that article, I plan on briefly describing this issue and advising people against it.
Cheers,
Bob
Pingback: A Google Fight is not a fair fight - The SAS Dummy
Interesting, I thought of the whole SAS Vs R question, too, when I read this blog. Web content can be biased for software or operating systems as well as entertainers. People write about something because it is new, they hate it, they are confused by it and many other reasons besides popularity. I would be very cautious about considering Google searches a representative, random sample.
Right on, AnnMaria. One could argue that Neil Diamond or Barry Manilow are more popular than Justin Bieber (who can sell out more shows at higher prices?), but Justin's fans live on the Web and on smart phones and they generate tweets and YouTube videos. Neil Diamond's fans live in golf course communities and bridge clubs, and they rarely leave an Internet-enabled trail, except to visit StubHub to buy tickets to Brother Love's Traveling Salvation Show.
On Chris's blog, he mentions the Google Fight Web site and sets up a competition between him and me. You can also run a Google Fight between "hot dogs" and "hamburgers". What is interesting is that you get DRASTICALLY DIFFERENT results than if you do the same search directly from the Google home page!
Did you know that when you search for "R Programming" you get thousands of hits for "SAS(R) Programming"? That's "R" as in "registered trademark." Maybe Google has a sense of humor?
Great post. I'll add that the number of Google hits doesn't reflect the quality of a product/performer/person. For example, Justin Bieber has many more Google hits than Mozart and Beethoven, but this doesn't mean that he makes better music or that his songs will still be played and enjoyed in 10, 25, or 100 years.
This news story reminds me that just because something appears millions of times on Google doesn't make it true:
http://news.yahoo.com/einsteins-laws-prove-ghosts-exist-144407090.html
A quote from the story: "A recent Google search turned up nearly 8 million results suggesting a link between ghosts and Einstein's work covering the conservation of energy." Apparently, ghost hunters claim that since the human body has electrical activity when alive, that electrical energy must stay around when the human dies. (Not sure how/when they think the energy is created in the body...)
Good post. I liked it. - Suraj Purohit
Great post! You opened up my mind on this particular issue! Thanks.