Twitter is becoming an important data source for organizations that want to better understand customer feedback on social channels. Before you can analyze Twitter data, however, you'll need to import it into SAS. This post explains one way to do just that.
Twitter supports several authentication methods, including OAuth signed requests and application-only authentication. Both methods use access tokens and token secrets to authenticate with the Twitter API.
In this blog I will focus on the application-only authentication method. Twitter supports authenticating on behalf of the application itself (as opposed to on behalf of a specific user). This means there won't be a user context and you won't have access to endpoints such as POST statuses/update. On the other hand, this authentication method is simpler and sufficient for popular tasks such as tweet searches.
Twitter recently removed all XML-based output for REST API calls, so SAS needs to import the data in JSON format. SAS can't natively read JavaScript Object Notation (JSON) files, but we can make use of a dynamic language such as Groovy. Groovy statements can be executed using PROC GROOVY.
Communication with the Twitter REST API is done via HTTP, and SAS provides the PROC HTTP procedure for exactly that.
Steps required to import Twitter tweets into a SAS data set:
- Twitter API Registration
- Obtain an OAuth Bearer access token
- Execute a Twitter search request
- Parse the JSON result content
- Import the JSON into SAS data set
Registration
In order to communicate with Twitter on the API level you will need to register at https://dev.twitter.com. Once registered you can create one or more applications and receive your consumer key and secret.
Obtain an OAuth2 Bearer Token
The following SAS code snippet sets the required macro variables, including the consumer key and secret. You may also adjust the query term, which is set to return all tweets tagged with either #SASGF13, #SASGF14 or #SASUSERS.
In order to parse the JSON result content we are going to use PROC GROOVY. Before using Groovy you will need to download the latest stable jar file from the Groovy website and reference the file groovy-all-2.x.x.jar in the proc groovy statement below.
%let CONSUMER_KEY=<your_consumer_key>;
%let CONSUMER_SECRET=<your_consumer_secret>;
%let JSON_TWEET_FILE=C:\Twitter\ResponseContent.txt;
%let CSV_TWEET_FILE=C:\Twitter\ResponseContent.csv;
%let TWEET_QUERY=%23SASGF13+OR+%23SASUSERS+OR+%23SASGF14;

/* Macro to create a temporary fileref */
%macro tempFile( fname );
   %if %superq(SYSSCPL) eq %str(z/OS) or %superq(SYSSCPL) eq %str(OS/390) %then
      %let recfm=recfm=vb;
   %else
      %let recfm=;
   filename &fname TEMP lrecl=2048 &recfm;
%mend;

/* create temp files for the content and header input streams */
%tempFile(in);
%tempFile(hdrin);

/* keep the response permanently */
filename out "&JSON_TWEET_FILE.";

/* post request content is the grant_type */
data _null_;
   file in;
   put "grant_type=client_credentials&";
run;

/* request the bearer token by providing consumer key and secret */
data _null_;
   file hdrin;
   consumerKey = urlencode("&CONSUMER_KEY.");
   consumerSecret = urlencode("&CONSUMER_SECRET.");
   encodedAccessToken = put( compress(consumerKey || ":" || consumerSecret), $base64x32767.);
   put "Authorization: Basic " encodedAccessToken;
run;

proc http method="post"
   in=in out=out
   headerin=hdrin
   url="https://api.twitter.com/oauth2/token"
   ct="application/x-www-form-urlencoded;charset=UTF-8";
run;

/* Store bearer token in macro variable 'BEARER_TOKEN' */
proc groovy classpath="groovy-all-2.2.1.jar";
   submit "&JSON_TWEET_FILE.";
      import groovy.json.JsonSlurper
      def input = new File(args[0]).text
      def result = new JsonSlurper().parseText(input)
      println "Received bearer token: ${result.access_token}"
      exports.putAt('BEARER_TOKEN', result.access_token)
   endsubmit;
quit;
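For readers who want to sanity-check the credential encoding outside of SAS, here is the same transformation sketched in Python. This is an illustrative sketch, not part of the original SAS program; the function name is my own, and the commented-out POST assumes the third-party requests package.

```python
import base64
from urllib.parse import quote

def bearer_token_credentials(consumer_key, consumer_secret):
    """URL-encode each credential, join them with ':' and base64-encode the
    result -- the same transformation the data step above performs with
    urlencode() and the $base64x32767. format."""
    pair = quote(consumer_key) + ":" + quote(consumer_secret)
    return base64.b64encode(pair.encode("utf-8")).decode("ascii")

# The POST itself (the equivalent of the PROC HTTP step) would then be:
#   import requests
#   resp = requests.post("https://api.twitter.com/oauth2/token",
#                        headers={"Authorization": "Basic " + bearer_token_credentials(key, secret)},
#                        data={"grant_type": "client_credentials"})
#   bearer_token = resp.json()["access_token"]
```

The returned string is what goes after "Authorization: Basic " in the request header.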
Use the Bearer Token to issue the Twitter search
Let's create a new header request file containing the newly received Bearer token and use the header when searching for tweets via PROC HTTP. The parameter 'count' limits the number of returned tweets.
/* Use the bearer token as authorization */
data _null_;
   file hdrin;
   put "Authorization: Bearer &BEARER_TOKEN.";
run;

/* Search for tweets; limit output to 100 tweets */
proc http method="get"
   out=out headerin=hdrin
   url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
   ct="application/x-www-form-urlencoded;charset=UTF-8";
run;
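To make the query encoding explicit (the %23 in the TWEET_QUERY macro variable is simply a URL-encoded '#'), here is a small Python helper that builds the same search URL. The helper name is hypothetical and only meant to illustrate the encoding:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_url(query, count=100):
    # urlencode turns '#' into %23 and spaces into '+', producing the
    # same q=...&count=... query string that PROC HTTP sends above
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "count": count})
```

For example, build_search_url("#SASGF14 OR #SASUSERS") reproduces the URL used in the PROC HTTP step.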
Parse JSON results
We are going to use proc groovy in order to parse the JSON results. Once parsed, we are going to create a temporary CSV file using the CSVWriter class from opencsv. Make sure to download the corresponding jar file into the current folder.
/* parse JSON and generate temporary CSV file */
proc groovy classpath="groovy-all-2.2.1.jar;opencsv-2.3.jar";
   submit "&JSON_TWEET_FILE." "&CSV_TWEET_FILE.";
      import groovy.json.*
      import au.com.bytecode.opencsv.CSVWriter

      def input = new File(args[0]).text
      def output = new JsonSlurper().parseText(input)
      def csvoutput = new FileWriter(args[1])
      CSVWriter writer = new CSVWriter(csvoutput);

      String[] header = new String[15];
      header[0] = "id";
      header[1] = "text";
      header[2] = "truncated";
      header[3] = "created_at";
      header[4] = "user.id";
      header[5] = "user.name";
      header[6] = "user.screen_name";
      header[7] = "user.location";
      header[8] = "user.description";
      header[9] = "user.url";
      header[10] = "user.followers_count";
      header[11] = "user.friends_count";
      header[12] = "user.listed_count";
      header[13] = "retweet_count";
      header[14] = "favorite_count";
      writer.writeNext(header);

      output.statuses.each {
         String[] content = new String[15];
         content[0] = it.id.toString();
         content[1] = it.text.toString();
         content[2] = it.truncated.toString();
         content[3] = it.created_at.toString();
         content[4] = it.user.id.toString();
         content[5] = it.user.name.toString();
         content[6] = it.user.screen_name.toString();
         content[7] = it.user.location.toString();
         content[8] = it.user.description.toString();
         content[9] = it.user.url.toString();
         content[10] = it.user.followers_count.toString();
         content[11] = it.user.friends_count.toString();
         content[12] = it.user.listed_count.toString();
         content[13] = it.retweet_count.toString();
         content[14] = it.favorite_count.toString();
         writer.writeNext(content)
      }
      writer.close();
   endsubmit;
quit;
Import into SAS
The final import of the CSV file is just a matter of executing a PROC IMPORT statement.
/* Import the final tweets data set */
proc import out=work.tweets
   datafile="&CSV_TWEET_FILE."
   dbms=csv replace;
   getnames=yes;
   datarow=2;
run;
Note, for everyone who doesn't want to touch code: SAS Visual Analytics also provides the capability to import Twitter streams using the data builder interface. Finally, remember that use of the Twitter API is subject to the Twitter terms of service.
Happy Tweeting!
41 Comments
Exactly what I was looking for. Thanks!
These are the complications I faced:
I had to update my Groovy version (http://groovy.codehaus.org/Download)
Then update my -jreoptions, following the instructions provided here: http://www.jiangtanghu.com/blog/2012/10/28/hello-groovy-in-sas-9-3/
and, something useful for those using Windows: I had to replace the "\" (backslash) in the file name paths with the "/" (slash) symbol for the path to be understood by
proc groovy
. I also didn't have the opencsv class installed on my PC, therefore I decided to just output the file as a regular txt.
In the end the piece of code that changed the most looked like this:
data _null_;
NJSON_TWEET_FILE=tranwrd("&JSON_TWEET_FILE.",'\','/');
NCSV_TWEET_FILE=tranwrd("&CSV_TWEET_FILE.",'\','/');
CALL SYMPUT('NJSON_TWEET_FILE',TRIM(NJSON_TWEET_FILE));
CALL SYMPUT('NCSV_TWEET_FILE',TRIM(NCSV_TWEET_FILE));
run;
proc groovy;
submit "&NJSON_TWEET_FILE." "&NCSV_TWEET_FILE.";
import groovy.json.*
def input = new File(args[0]).text
def output = new JsonSlurper().parseText(input)
PrintWriter out = new PrintWriter(args[1]);
output.statuses.each {
String[] content = new String[15];
content[0] = it.id.toString();
content[1] = it.text.toString();
content[2] = it.truncated.toString();
content[3] = it.created_at.toString();
content[4] = it.user.id.toString();
content[5] = it.user.name.toString();
content[6] = it.user.screen_name.toString();
content[7] = it.user.location.toString();
content[8] = it.user.description.toString();
content[9] = it.user.url.toString();
content[10] = it.user.followers_count.toString();
content[11] = it.user.friends_count.toString();
content[12] = it.user.listed_count.toString();
content[13] = it.retweet_count.toString();
content[14] = it.favorite_count.toString();
out.println(content);
}
out.flush();
endsubmit;
quit;
Once again, AMAZING POST!
Thank you very much.
Thanks Julio. There are two ways of getting additional .jar files registered for proc groovy. One way is setting the -jreoptions in sasv9.cfg as you explained or alternatively setting the classpath parameter in proc groovy. I like the second option as it doesn't require altering the SAS configuration files. You may have noticed that I haven't used any folder references in my example. That's because I usually keep all project files within one folder and make sure SAS is launched with the sasinitialfolder option to set the default folder, e.g.
-SASINITIALFOLDER "C:\Twitter"
But either way - I'm glad you find this post useful. Cheers, Falko
This is wonderful!
Is it possible to do the same with linkedin?
Thank you!
Fred.
Thanks Fred. While LinkedIn supports the same OAuth2 authentication mechanism you aren't allowed to extract data using the API with the purpose of storing the data locally. See paragraph III/B in the LinkedIn terms of service. Hope this helps. Cheers, Falko
Hello Falko, this is a superb post thank you. I am increasing the size of the count parameter in the http request from 100 to 200 but I am still hitting a limit in the tweets output to the dataset (I seem to get back 108 tweets either way). Any idea why this might be happening?
The number of tweets returned depends on the query term you used and obviously on the number of tweets available. Note that Twitter doesn't expose the full tweet history (it just goes a few days back), so there may just not be more than 108 tweets in your case. Have a look at the Twitter Search API for some details on how to fine-tune your search term.
Great post!
Just wondering if the proc groovy would work with json from sites like data.medicare.gov that use the soda api?
There is great data out there but I have not been successful in trying to use the api to automate the process without the manual export readin steps.
Thanks again for the great post!
Eric
Hi Eric, PROC GROOVY simply submits any Groovy code you write. In my example above I'm using a JSON parser called JsonSlurper which is part of the Groovy API. JSON is an open standard, so the parser should be able to parse any valid JSON file regardless of its source. If something doesn't work, it will usually tell you about syntax errors in the file. Hope this helps. Falko
I am new to SAS. How do I refresh the tokens inside SAS?
Best
Ganesh
The token doesn't need to be refreshed as such. The section "Obtain an OAuth2 Bearer Token" explains how to use the oauth2/token service to obtain a token. Once a token has been received it can be used to execute API requests on an application's own behalf. In the example above the token is stored in the macro variable &BEARER_TOKEN so it can be re-used in subsequent authorization headers. Hope this helps. Cheers, Falko
Hi Falko,
I am trying to use this to parse arcgis geocoding rest service responses. Problem I have is Michael has set -noxcmd so proc groovy is disabled. Do you know of a temp way to work around this (I have access to do anything) rather than overriding the config we have in our Prod Demo server?
Cheers,
David.
Well, I don't think you can overwrite the noxcmd option via SAS code. The only alternative would be to use a standard SAS data step to read and parse your JSON file. If your JSON file structure is not too complicated and in a regular format, SAS functions such as scan or scanover may be sufficient. Have a look at the SAS Global Forum paper Efficient extraction of JSON information in SAS® using the SCANOVER function, which has some more details. Hope this helps. Cheers, Falko
Wonderful post. Thanks for the code.
Can I use the same code if I wanted to search for users on twitter instead of tweets? Is it just a matter of changing the url in the proc http step and tweaking the fields in the proc groovy part?
Thanks,
Fatima
Yes, this approach can be used for probably all REST API calls returning JSON. Yes, you would use Twitter users/search API and replace the search URL to something like
https://api.twitter.com/1.1/users/search.json?q=&USER_QUERY
where &USER_QUERY would be your search term. You would also need to adjust the following proc groovy part to import the relevant fields as the JSON structure would be slightly different. Hope this helps. Cheers, Falko
I've tried various versions of Java and permutations of the steps included (editing the SAS config file to point to Groovy, environment variables)
and still can't get SAS to work, because every time I run proc groovy it gives me "ERROR: The SUBMIT command failed."
I can't get Groovy to work in SAS... please help!
Hi, just out of curiosity, where do these jars go - just in the Java classpath structure?
Rob, I believe that you can put the groovy jar file anywhere on your system where SAS is running, and then reference the path/jar file with the CLASSPATH option on the PROC GROOVY statement.
Hey cool, that seems to be working
OK, so I am a little confused. I still don't know how to create the %let JSON_TWEET_FILE=C:\Twitter\ResponseContent.txt; file?
I am using an API key and secret and it seems to generate a big long bearer token, but when it is posted it is 403ed. Has something changed since this article was printed?
29 proc http method="post"
30 in=in out=out
31 headerin=hdrin
32 url="https://api.twitter.com/oauth2/token"
33 ct="application/x-www-form-urlencoded;charset=UTF-8";
34 run;
ERROR: java.io.IOException: Server returned HTTP response code: 403 for URL:
https://api.twitter.com/oauth2/token
NOTE: PROCEDURE HTTP used (Total process time):
real time 1.40 seconds
cpu time 0.00 seconds
NOTE: The SAS System stopped processing this step because of errors.
Rob,
The author of this post is currently on vacation, but I'll forward your question to a few others who might be able to help you. Another option is to try to ask your question in the SAS communities, where you can discuss Data Step questions or Visual Analytics questions, among other topics.
-Alison
For those interested, here is a link to a discussion related to this blog post on the community forums:
https://communities.sas.com/message/219515#219515
Thanks for the cross link and for the additional code examples. Very worthwhile reading everyone.
Hi Falko!
Great post!
I got a problem when I tried to submit the last segment of the first part :
proc groovy classpath="groovy-all-2.2.1.jar";
submit "&JSON_TWEET_FILE.";
import groovy.json.JsonSlurper
def input = new File(args[0]).text
def result = new JsonSlurper().parseText(input)
println "Received bearer token: ${result.access_token}"
exports.putAt('BEARER_TOKEN',result.access_token)
endsubmit;
quit;
I got an error message "The submit command failed"
Where do I more likely get wrong?
Do I have to download groovy?
Here is the log:
123 proc
123! groovy classpath="groovy-all-2.2.1.jar";
NOTE: The ADD CLASSPATH command completed.
124 submit "&JSON_TWEET_FILE.";
125 import groovy.json.JsonSlurper
126 def input = new File(args[0]).text
127 def result = new JsonSlurper().parseText(input)
128 println "Received bearer token: ${result.access_token}"
129 exports.putAt('BEARER_TOKEN',result.access_token)
130 endsubmit;
ERROR: The SUBMIT command failed.
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script3.groovy: 1: unable to resolve class groovy.json.JsonSlurper
@ line 1, column 7.
import groovy.json.JsonSlurper
^
1 error
at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:296)
at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:860)
at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:521)
at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:497)
at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:474)
at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:292)
at groovy.lang.GroovyShell.parseClass(GroovyShell.java:727)
at groovy.lang.GroovyShell.parse(GroovyShell.java:739)
at groovy.lang.GroovyShell.parse(GroovyShell.java:766)
at groovy.lang.GroovyShell.parse(GroovyShell.java:757)
131 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE GROOVY used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Hi Markko,
The message "unable to resolve class groovy.json.JsonSlurper" indicates that it couldn't find the required Groovy jar file. Have you downloaded the jar file from http://groovy.codehaus.org/Download and referenced the file correctly in the 'classpath' parameter (if it's not in your work directory - specify the full path here). You may also check out a related discussion at https://communities.sas.com/thread/59061?start=15&tstart=0 which covered a similar issue.
Hope this helps. Regards, Falko
Wow very quick answer!! You rock, I'm gonna try it!
Thanks a lot! Pretty awesome! It seems you're very good - do you have an email address? Do you mind? Bye!
If you have technical issues with this or any other SAS area really - it's probably good to use the SAS Support Communities (https://communities.sas.com) and start a discussion. This way you can reach out to other SAS experts and may get help even quicker. Hope you get it working. Cheers, Falko
Ok Falko never mind! Thanks a lot for sharing your code, that works!! Awesome!!
Thanks Falko Schulz, and thanks Julio for the alternative to CSVWriter, as opencsv 3.0 is too new to be supported or something - the error was "unsupported major.minor version".
I see, I could study Groovy code more.
I used this search and it worked fine: %let TWEET_QUERY=%23SASGF15+OR+%23SASGF14; I am ready for my own Internet research data gathering application now 🙂
One could also use R. I found it a lot easier installing the TwitterR package and authorising your Consumer Key and Consumer Secret to extract Twitter feeds than tinkering with jar files. I simply cannot get Groovy to work with SAS, nor can I harvest any tweets from Twitter using SAS.
Hi Ryno, there are certainly many ways to consume Twitter tweets and SAS is just one way to do it. If you have difficulties getting groovy to work - you may want to share more details on our community web site (https://communities.sas.com/community/support-communities ). There have been similar discussions such as https://communities.sas.com/thread/59061 which may be a good start. Hope this helps. Cheers, Falko
Hi Falko
Thanks for the big effort you have put into producing this code.
We've had some problems figuring out how to get the macro part working, but it seems fixed.
I have one question I have not been able to solve: how do I order SAS to sample all available tweets and not just the first 100? We want to create a dataframe with tweets over time, so we will get the server to sample every night on some specific keywords. We have thus built a process that only takes the newest tweets, but I must be sure that we get a nearly complete corpus.
I suspect it is in:
/* Search for tweets; Limit output to 100 tweets */
proc http method="get"
out=out headerin=hdrin
url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
ct="application/x-www-form-urlencoded;charset=UTF-8";
run;
but I have not been able to figure out how to specify a higher maximum.
I have done some sampling in R, and there I have been able to specify limits as high as 10,000. So it is not the API from Twitter that limits the sampling size.
Best,
Sander Ehmsen.
Hi Sander,
The maximum number of tweets for https://api.twitter.com/1.1/search/tweets.json is 100 (API details are here). This blog entry was just a basic example to explain the general concept behind the authentication required as well as dealing with JSON results in SAS.
In order to get more than a hundred tweets you would need to understand working with timelines (there is a guide with details). As a short summary, you would introduce a max_id parameter, which specifies the maximum tweet ID you want to receive - in other words you will get tweets older than this ID (there is also a since_id parameter which would return tweets younger than the one specified). In the past I was simply calling this macro x number of times in order to get multiple pages (each a hundred tweets). On each call I'm appending the newly received tweets to a common tweet data set. I believe Twitter keeps about a one-week history - so there is a natural limit to how far back you can go. Initially you may have to run it a number of times - but in a daily or hourly batch job you may just have to deal with a few hundred, dependent on your search term.
Hope this helps. Cheers, Falko
Hi Falko
Thanks for your quick response. It was so quick that I didn't even notice it in my e-mail :-).
I have tried looking at the guide for timelines. I think I understand how the structure must be created in order to get the right amount of tweets.
But to be honest I am not too worried about that part. I might simply just oversample (take 100 tweets but every time double-sample some tweets and later on remove them).
What I don't understand is what I need to specify in order to sample from page number 2. My best guess would be just to change the URL from
"https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
to
"https://api.twitter.com/2.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
and so on.
Do you know how to do that part?
You won't change the actual URL - just adjust the parameters. The only way of paging through results is by specifying either the max_id or since_id parameter.
So you can either go backwards in time using the max_id as follow:
1) run the query with parameter count=100 which gives you the last 100 tweets
2) remember the tweet ID of the oldest tweet of these 100 tweets
3) run your next query and specify max_id using that tweet ID. This will give you the 100 tweets before the given tweet ID
4) start over with 2) until you either run out of tweets (you get less than 100) or you simply have enough tweets gathered
Example URL would be something like: https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100&max_id=1234567890
..or alternatively you can use the since_id parameter which is great for a continuous batch run every hour or so. You simply remember the last tweet ID from your last run and specify this as since_id parameter. On every run you will get tweets 'since' this tweet ID.
Example URL would be something like: https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100&since_id=1234567890
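The max_id loop in steps 1-4 can be sketched as follows (Python used for brevity; collect_tweets and fetch_page are hypothetical names, and fetch_page stands in for whatever actually calls the search endpoint - the PROC HTTP request in this post's case):

```python
def collect_tweets(fetch_page, page_size=100, max_pages=10):
    """Page backwards through search results using max_id as described above.
    fetch_page(max_id) must return a list of tweet dicts, each with an 'id';
    max_id=None means 'start with the newest tweets'."""
    tweets = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break
        tweets.extend(page)
        # next request: everything strictly older than the oldest tweet so far
        max_id = min(t["id"] for t in page) - 1
        if len(page) < page_size:
            break  # a short page means no older tweets are left
    return tweets
```

The since_id variant is the mirror image: remember the newest ID from the previous batch run and pass it as since_id on the next run.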
The Twitter guide I sent before - has more examples if you need.
Hope this helps. Cheers, Falko
Thanks a ton!
I will try to make it work with my developer tomorrow.
Now I have a clear image on how to proceed.
Thanks for bearing with me.
Best,
Sander.
HI Folks,
I'm an MTech student doing my thesis on analysis of Twitter data through Hive, in which I show that the analysis takes less time compared to an existing tool. But now my guide wants me to show a comparison with an existing tool, so can anyone tell me which tool I could use to analyse sample Twitter data that takes more time than Hive? Please help me out with that!
Hi Falko
With assistance from my splendid developers we have managed to get the tweets out on a daily basis.
This has revealed a problem for us. We search on hashtags, and as the hashtags are not tied solely to a certain event, we get a lot of rubbish.
In R - where I have done the search "manually" - there is an option for specifying the language, so that I only take out lan="da", that is, Danish.
Just for the record: the R script is like this: TB<-searchTwitter("tinderbox", lan="da", n=10000)
And as R only forwards the request to the Twitter API, I think it should be possible to adjust your fine SAS code.
Do you know where in the code to put in this filter?
Best, Sander Ehmsen.
Hi,
The search is done in the line with
https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=10
Based on the Twitter API doc you can add the lang parameter here, which restricts tweets to the given language. So you could try something like &lang=da
Hope this helps. Cheers, Falko
Falko,
Do you have any sample code for posting a status update to a Twitter account? I've seen code for getting tweets, and searching, but not how to automate tweeting status updates.
Thanks!
Hi Jason,
Once you have done the authentication bit you can execute any API action, including status updates. The Twitter API has more details, including the parameters supported. Just make sure to use the Bearer token in the header when sending the POST request, as shown in the section "Use the Bearer Token to issue the Twitter search" above. Being a SAS program, you can obviously go ahead and automate this further using the SAS Macro Language if required.
Hope this helps. Cheers, Falko