Like millions of Americans, I watched the vice presidential debate on television last night. I also watched it on Twitter -- which is to say that while watching the debate, I watched the real-time responses of hundreds of Twitter users.
Twitter users, as you may know, have up to 140 characters to tell the world of Twitter followers what they are doing right now. (Do you care what I'm doing right now?) As a Twitter user, you can sacrifice part of your precious 140 to lend structure to your data through the use of "hash" (as in #) terms. Today, a day after the debate, Twitter users are "voting" for the perceived winner of the debate by tagging their entries with "#vpwinner".
Today I wrote a SAS program that gathers the most recent of these responses, live from Twitter search, and "analyzes" them to tally the votes. Here is the basic approach:
- Use the SAS XML Mapper to define an XML mapping of the RSS feed from Twitter to a SAS data set structure.
- Use the FILENAME URL method to retrieve the RSS feed from the Twitter search site.
- Use the XML libname engine to define the RSS file as a data source.
- Repeat the previous two steps to gather at least the last 80 or so "tweets".
- Use regular expressions as a crude technique to decide, for each tweet, who the "vote" falls to.
- Create a simple bar chart with the result.
I'm in no danger of replacing Gallup polls as an authority on the political pulse of the nation, but the example shows that there is data out there, and SAS can get to it and analyze it.
Here is the output from my quick-and-dirty program, followed by the program itself.
filename twsearch temp;

/** this is the XML map that will convert the RSS search feed into a SAS data set **/
data _null_;
  infile datalines truncover;
  file twsearch;
  input line $1000.;
  put line;
datalines4;
<?xml version="1.0" encoding="windows-1252"?>
<!-- ############################################################ -->
<!-- 2008-10-03T11:35:31 -->
<!-- SAS XML Libname Engine Map -->
<!-- Generated by XML Mapper, 902000.2.1.20080911191346_v920 -->
<!-- ############################################################ -->
<SXLEMAP name="SXLEMAP" version="1.2">
<!-- ############################################################ -->
<TABLE name="entry">
  <TABLE-PATH syntax="XPath">/feed/entry</TABLE-PATH>
  <COLUMN name="id">
    <PATH syntax="XPath">/feed/entry/id</PATH>
    <TYPE>character</TYPE>
    <DATATYPE>string</DATATYPE>
    <LENGTH>37</LENGTH>
  </COLUMN>
  <COLUMN name="published">
    <PATH syntax="XPath">/feed/entry/published</PATH>
    <TYPE>numeric</TYPE>
    <DATATYPE>datetime</DATATYPE>
    <FORMAT width="19">IS8601DT</FORMAT>
    <INFORMAT width="19">IS8601DT</INFORMAT>
  </COLUMN>
  <COLUMN name="link">
    <PATH syntax="XPath">/feed/entry/link</PATH>
    <TYPE>character</TYPE>
    <DATATYPE>string</DATATYPE>
    <LENGTH>32</LENGTH>
  </COLUMN>
  <COLUMN name="title">
    <PATH syntax="XPath">/feed/entry/title</PATH>
    <TYPE>character</TYPE>
    <DATATYPE>string</DATATYPE>
    <LENGTH>140</LENGTH>
  </COLUMN>
  <COLUMN name="content">
    <PATH syntax="XPath">/feed/entry/content</PATH>
    <TYPE>character</TYPE>
    <DATATYPE>string</DATATYPE>
    <LENGTH>261</LENGTH>
  </COLUMN>
  <COLUMN name="updated">
    <PATH syntax="XPath">/feed/entry/updated</PATH>
    <TYPE>numeric</TYPE>
    <DATATYPE>datetime</DATATYPE>
    <FORMAT width="19">IS8601DT</FORMAT>
    <INFORMAT width="19">IS8601DT</INFORMAT>
  </COLUMN>
  <COLUMN name="author">
    <PATH syntax="XPath">/feed/entry/author</PATH>
    <TYPE>character</TYPE>
    <DATATYPE>string</DATATYPE>
    <LENGTH>32</LENGTH>
  </COLUMN>
</TABLE>
</SXLEMAP>
;;;;

/** this is the data set that will hold the "tweet" content **/
data work.feed;
  length winner $ 10;
run;

/** this macro makes it simple to get several "pages" worth of tweets **/
%macro getpage(num);
  %let feed="http://search.twitter.com/search.atom?lang=en&q=%23vpwinner&page=&num";
  filename twit URL &feed
    /** if you need to specify a proxy server to get to the internet **/
    /** proxy="http://myproxy.com" **/
  ;
  /** use the XML library engine **/
  libname tf XML xmlfileref=twit xmlmap=twsearch;
  data work.feed;
    set work.feed tf.entry;
  run;
%mend;

%getpage(1);
%getpage(2);
%getpage(3);
%getpage(4);
%getpage(5);
%getpage(6);
%getpage(7);

/** the regular expressions in this data step are crude, but they work **/
/** if the tweet contains #vpwinner followed by "Biden", he gets a tick **/
/** if the tweet contains #vpwinner followed by "Palin", she gets a tick **/
data work.feed;
  set work.feed;
  if prxmatch('/(#vpwinner)([\s\-\:]+)(biden)/',lowcase(title)) >0 then winner="Biden";
  if prxmatch('/(#vpwinner)([\s\-\:]+)(palin)/',lowcase(title)) >0 then winner="Palin";
  if prxmatch('/(#vpwinner)([\s\-\:]+)(tie)/',lowcase(title)) >0 then winner="Tie";
run;

title "VP Debate winner according to Twitter users";
ods graphics / width=400 height=250;
proc sgplot data=work.feed;
  hbar winner;
  xaxis label="#vpwinner 'votes'";
run;
quit;
8 Comments
Now that's really cool! Glad to hear you are connecting SAS technology with the latest in both technology and political trends. Keep up the great work!
James
Very nice, Chris!
KUTGW, Audi
I think you need to replace the last line from run; to quit; else sgplot keeps running. I tested it on 92p.
Paul, good catch. I neglected to mention that I was using SAS Enterprise Guide to develop/run this program, and it adds the QUIT; for you. I've updated the source in the example.
Chris
The atom feed is so simple that you don't even need xml mapper. Combined with the sophisticated search parameters on twitter.com, it can be as simple as below:
/* get up to 100 "tweets" posted on Oct. 3, 2008,
   with the hash tag vpwinner and either "biden"
   or "palin" in the title */
%let home = http://search.twitter.com/;
%let search = search.atom?q=+%23vpwinner;
%let date = +since:2008-10-03+until:2008-10-03;
%let ors = %nrstr(&)ors=biden+palin;
%let lang = %nrstr(&)lang=en;
%let rpp = %nrstr(&)rpp=100;
filename page url "&home&search&date&ors&lang&rpp";
data winners;
  infile page lrecl=32767;
  input;
  prxTitle = '/<title>/i';
  prxBiden = '/#vpwinner[\s\-:]+biden/i';
  prxPalin = '/#vpwinner[\s\-:]+palin/i';
  if prxmatch(prxTitle, _infile_) then do;
    if prxmatch(prxBiden, _infile_) then winner="Biden";
    else if prxmatch(prxPalin, _infile_) then winner="Palin";
    output;
  end;
run;
filename page clear;

proc freq data=winners;
  tables winner;
run;
/* on lst
The FREQ Procedure

                                   Cumulative    Cumulative
winner    Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
Biden           31       75.61            31        75.61
Palin           10       24.39            41       100.00

Frequency Missing = 27
*/
Chang,
Thanks for the comments and the simplified SAS program! The advantage of the XML Mapper approach is that you then have all of the tweets in a SAS data set, ready for other analysis if you want.
Chris