Recently a colleague told me Google had published new, interesting data sets at BigQuery. I found a lot of Reddit data there as well, so I quickly tried running BigQuery queries against this text data to see what I could produce. After getting some pretty interesting results, I wanted to see whether I could implement the same analysis with SAS, and whether SAS Text Mining would yield deeper insights than simple queries. So I tried SAS on Reddit comments data, and I’d like to share my analyses and findings with you.
Analysis 1: Significant Words
To get started with BigQuery, I googled what others were sharing about BigQuery and Reddit, and I found USING BIGQUERY WITH REDDIT DATA. In this article the author posted a query that extracts significant words from the Politics subreddit. I then wrote a SAS program to mimic this query and got the following data from the July 2016 Reddit comments. The result is not exactly the same as the one from BigQuery, since I downloaded the Reddit data from another web site and used the SAS Text Parsing action to parse the comments into tokens rather than just splitting on white space.
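The tokenization difference mentioned above can be illustrated with a small Python sketch. This is not the SAS Text Parsing action or the BigQuery query: the sample comments are made up, and the regex tokenizer is only a crude stand-in for a real text parser. It shows how whitespace splitting leaves punctuation attached, so one word is counted under several spellings.

```python
import re
from collections import Counter

# Hypothetical sample comments, invented for illustration only.
comments = [
    "Trump, Trump and more Trump!",
    "Hillary? Hillary.",
]

def whitespace_tokens(text):
    # Naive approach: split on whitespace; punctuation stays attached.
    return text.lower().split()

def regex_tokens(text):
    # Crude stand-in for a text parser: keep runs of letters,
    # optionally followed by an apostrophe suffix (e.g. "hillary's").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_words(texts, tokenizer):
    # Count every token produced by the given tokenizer.
    counts = Counter()
    for text in texts:
        counts.update(tokenizer(text))
    return counts

print(count_words(comments, whitespace_tokens).most_common(3))
print(count_words(comments, regex_tokens).most_common(3))
```

With whitespace splitting, "trump," "trump" and "trump!" are three different tokens, so any word ranking built on those counts will differ from one built on parsed tokens.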
Analysis 2: Daily Submissions
The words Trump and Hillary in the list caught my interest and begged for further analysis. So I did a daily analysis to understand how hot Trump and Hillary were during the month. I filtered all comments mentioning Trump or Hillary in the Politics subreddit and counted total submissions per day. The resulting time series plot is shown below.
I found several spikes in the plot, on 2016/7/5, 2016/7/12, 2016/7/21, and 2016/7/26.
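The filter-and-count step behind this plot can be sketched in Python as follows. This is an illustration rather than the SAS program I ran: the records are hypothetical, and the field names `subreddit`, `created_utc` (epoch seconds), and `body` are assumptions about how the downloaded comments are laid out.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Match either name as a whole word, ignoring case.
PATTERN = re.compile(r"\b(trump|hillary)\b", re.IGNORECASE)

def ts(year, month, day, hour=12):
    # Helper to build an epoch timestamp (UTC) for the sample data.
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())

# Hypothetical records, invented for illustration only.
comments = [
    {"subreddit": "politics", "created_utc": ts(2016, 7, 5), "body": "Hillary and the FBI"},
    {"subreddit": "politics", "created_utc": ts(2016, 7, 5), "body": "Trump reacts"},
    {"subreddit": "politics", "created_utc": ts(2016, 7, 6), "body": "Nothing relevant here"},
    {"subreddit": "news", "created_utc": ts(2016, 7, 5), "body": "Trump again"},
    {"subreddit": "politics", "created_utc": ts(2016, 7, 12), "body": "Sanders endorses Hillary"},
]

def daily_mentions(records, subreddit="politics"):
    # Count matching comments per UTC calendar day within one subreddit.
    counts = Counter()
    for record in records:
        if record["subreddit"].lower() != subreddit:
            continue
        if not PATTERN.search(record["body"]):
            continue
        day = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc).date()
        counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))

print(daily_mentions(comments))
```

Plotting the resulting day-to-count mapping gives a time series like the one above; days with no matching comments simply do not appear in the mapping.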
Analysis 3: Topics Time Line
I wondered what Reddit users were concerned about on these specific days, so I extracted the top 10 topics from all comments submitted to the Politics subreddit in July 2016 and got the following data. These topics clearly focus on a few themes, such as voting, presidential candidates, parties, and hot news such as Hillary’s email probe.
The topics showed what people were concerned about over the whole month, but I needed further investigation to explain which topic contributed most to each of the four spikes. The topics’ time series plot helped me find the answer.
Some topics’ time series trends are very close, so it is hard to tell which topic contributed most. I therefore picked the top contributing topic by daily percentage growth. The top-growth topic on July 05 is “emails, dnc, +server, hillary, +classify”, with 256.23-fold growth.
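As a rough Python sketch of ranking topics by daily growth: the topic labels and counts below are hypothetical, and the growth formula, (today - yesterday) / yesterday, is an assumption for illustration, not necessarily the exact calculation in my SAS program.

```python
def top_growth_topic(daily_counts, day, prev_day):
    # daily_counts maps topic -> {date string -> comment count}.
    # Returns (topic, growth) for the topic with the largest
    # day-over-day relative growth, or (None, None) if none qualifies.
    best_topic, best_growth = None, None
    for topic, series in daily_counts.items():
        prev = series.get(prev_day, 0)
        curr = series.get(day, 0)
        if prev == 0:
            continue  # growth is undefined without a previous-day count
        growth = (curr - prev) / prev
        if best_growth is None or growth > best_growth:
            best_topic, best_growth = topic, growth
    return best_topic, best_growth

# Hypothetical counts, invented for illustration only.
daily_counts = {
    "emails": {"2016-07-04": 10, "2016-07-05": 60},
    "vote": {"2016-07-04": 100, "2016-07-05": 150},
}

print(top_growth_topic(daily_counts, "2016-07-05", "2016-07-04"))
```

Ranking by relative growth rather than raw counts lets a small topic that suddenly takes off stand out from large topics that are busy every day.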
The time series plot for this topic also shows a sharp spike on July 05. I then googled “July 5, 2016 emails dnc server hillary classify” and got the following news.
There is no doubt the spike on July 05 is related to the FBI’s decision on Clinton’s email probe. To confirm this, I extracted the top 20 Reddit comments submitted on July 05, ranked by Reddit score. I quote part of the top comment below; the link in that comment also appeared in Google’s search results.
"...under normal circumstances, security clearances would be revoked. " This is your FBI. EDIT: I took paraphrased quote, this is the actual quote as per https://www.fbi.gov/news/pressrel/press-releases/statement-by-fbi-director-james-b.-comey-on-the-investigation-of-secretary-hillary-clintons-use-of-a-personal-e-mail-system - "
I ran a similar analysis for the other three days; the hot topics are as follows.
Interestingly, someone did a sentiment analysis with Twitter data, and the tweet submission trend for July looks the same as Reddit’s.
In that blog post, the author listed several important events that happened in July.
- July 5th: the FBI says it’s going to end Clinton’s email probe and will not recommend prosecution.
- July 12th: Bernie Sanders endorses Hillary Clinton for president.
- July 21st: Donald Trump accepts the Republican nomination.
- July 25-28: Clinton accepts the nomination at the DNC.
This suggests that different social media platforms show similar response trends to the same events.
Now I know why these spikes happened. However, more questions came to my mind.
- Who started posting this news?
- Were there cyber armies?
- Who were opinion leaders in the politics community?
I believe all these questions can be answered by analyzing the data with SAS.