Great news. If you’ve been struggling to import a Twitter stream as a data source, SAS Visual Analytics 6.4 has greatly simplified that task as part of this release’s expanded data import functionality. The first time you import tweets, you are directed to the Twitter website to log on to your account and authorize SAS Visual Analytics. After the initial logon, SAS Visual Analytics uses authorization tokens for accessing Twitter instead of requiring you to log on each time.
The product documentation provides high-level instructions for importing tweets from Twitter, but I found that additional detail makes the process much simpler to follow. In this post, I’ll walk you through the process from beginning to end, with screenshots and helpful hints along the way.
Capturing the Twitter stream
Start by logging into SAS Visual Analytics. When you select Prepare Data from the main menu, you’ll get the SAS Visual Data Builder dialog. Select Import Data. You’ll see the list of possible sources, including Twitter.
When you select Twitter, the next dialog informs you that you’ll need to connect to your Twitter account to authorize SAS Visual Analytics to use it. You may want to create a Twitter account specifically for reporting and analysis purposes rather than using an existing business or personal account.
Log into Twitter and authorize SAS Visual Analytics to use the account. Once authorized, you are redirected to the browser tab you used for Twitter.
Returning to SAS Visual Analytics, you’ll now see a different dialog requesting search strings. You may include hashtags or Twitter handles. Once you’ve entered information, you can run the query. In this example, my search term is #MarchMadness and I’ve specified a limit of 2000 tweets to be returned. Depending on your needs, you can also specify information such as metadata location, LASR server library and proxy server account information.
Understanding the results
With the table successfully created, you can now choose Create Exploration from the main menu to see what’s there. In this example, my March Madness query brought back fewer than the maximum of 2000 results that I requested. I’m not sure what the upper limit is for the number of results in SAS Visual Analytics, but I’ve been able to run a query asking for 50,000 rows with little trouble.
In a typical Twitter stream query, there is quite a bit of data to explore, and quite a number of variables associated with each tweet. Unfortunately, finding documentation for these variables was quite difficult. Some variables are more intuitive than others. By doing a little exploration, I was able to pull together my interpretation of what some of the fields in this report represent.
Let me know if you find the following lists useful or if you have additional information to share about these.
Category Variables
- author - Mandatory and unique Twitter handle of the person who sent the Tweet.
- authordescription - Optional user description.
- authorimageurl - URL for user's profile photo.
- authorlang - The language the user specifies in their Twitter profile.
- authorlocation - Optional location in user's profile. Not always an actual location.
- authorname - The displayed nickname for the user's account. Neither unique nor necessarily the same as the author value.
- authortimezone - Time zone specified in the user's Twitter profile. Value is sometimes empty.
- authorurl - Optional URL for user's homepage.
- body - The actual tweet itself, max of 140 characters.
- deviceinfo - Information about the device or platform the user sent the Tweet from. Not as clean a field as you might expect.
- listoflinks - Any URLs that appear in the body of the tweet itself, separated by a semicolon if there are multiple values.
- mentionedusernames - The displayed nickname of any users mentioned in the tweet, separated by a semicolon if there are multiple values.
- mentionedusers - The "author" value of any users mentioned in the tweet, separated by a semicolon if there are multiple values.
- publisheddatetimestr - The date and time the tweet was sent. Note: this value isn't imported as a SAS date, it's a character string.
- referenceauthor - Unclear how this differs from mentionedusers. It consistently appears to be the first value there.
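If you take these category variables outside SAS Visual Analytics, the semicolon-separated fields and the character timestamp are the two that usually need some handling first. Here is a minimal Python sketch of that cleanup. The helper names and the sample row are hypothetical, and the timestamp layout is an assumption on my part (it follows Twitter's classic created_at style), so check it against your own table before relying on it.

```python
from datetime import datetime

# Assumption: publisheddatetimestr looks like "Wed Mar 19 20:15:02 +0000 2014".
# Change this format string if the values in your table are laid out differently.
ASSUMED_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def split_multi(value):
    """Split a semicolon-separated field into a list of trimmed, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(";") if item.strip()]


def parse_published(value, fmt=ASSUMED_FORMAT):
    """Turn the character timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, fmt)


# Hypothetical row, using the variable names from the list above.
row = {
    "mentionedusers": "user_a; user_b",
    "listoflinks": "",
    "publisheddatetimestr": "Wed Mar 19 20:15:02 +0000 2014",
}

mentions = split_multi(row["mentionedusers"])
links = split_multi(row["listoflinks"])
published = parse_published(row["publisheddatetimestr"])
```

Splitting on the semicolon and dropping empty pieces keeps a trailing separator or an empty field from producing blank entries.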
Measure Variables
- authorfavouritecount - Judging by the name, most likely the number of tweets the author has favorited rather than the number of times this tweet was favorited.
- authorfollowercount - How many followers the author has.
- authorfriendcount - Somewhat misleading label. This is actually the number of accounts the user follows.
- authorid - Possibly a proprietary Twitter unique User ID.
- docid - Another proprietary Twitter unique identifier. Noninteger value.
- doclatitude - Presumably the latitude value if user has geo-location enabled. Missing for most records.
- doclongitude - Presumably the longitude value if user has geo-location enabled. Missing for most records.
- isretweet - Is this a retweet? 1 if yes, 0 if no.
- publisheddatetime - Long numeric value.
- referenceauthorid - Another Twitter identifier.
- retweetcount - The number of times the tweet was retweeted.
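To show how a couple of these measures might be combined, here is a small Python sketch that computes the share of retweets in a result set and ranks the original tweets by retweetcount. The rows and function names are made up for illustration, and the sketch assumes isretweet is coded 1 or 0 as described above.

```python
# Hypothetical rows using the variable names from the lists above.
rows = [
    {"author": "fan_one", "isretweet": 0, "retweetcount": 12},
    {"author": "fan_two", "isretweet": 1, "retweetcount": 40},
    {"author": "fan_three", "isretweet": 0, "retweetcount": 3},
    {"author": "fan_four", "isretweet": 1, "retweetcount": 40},
]


def retweet_share(rows):
    """Fraction of rows flagged as retweets (isretweet == 1)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["isretweet"] == 1) / len(rows)


def top_originals(rows, n=3):
    """Original (non-retweet) rows sorted by retweetcount, highest first."""
    originals = [r for r in rows if r["isretweet"] == 0]
    return sorted(originals, key=lambda r: r["retweetcount"], reverse=True)[:n]


share = retweet_share(rows)
leaders = top_originals(rows, n=2)
```

Filtering out retweets before ranking avoids counting the same popular tweet once for every person who passed it along.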