How to import Twitter tweets into SAS using OAuth2 authentication

Twitter is becoming an important data source for organizations that want to better understand customer feedback on social channels. Before you can analyze Twitter data, however, you'll need to import it into SAS. This post explains one way to do just that.

Twitter supports authentication methods such as OAuth signed requests and application-only authentication. Both methods use access tokens and token secrets to authenticate with the Twitter API.
 
In this blog I will focus on the application-only authentication method. Twitter supports authenticating on behalf of the application itself (as opposed to on behalf of a specific user). This means there won't be a user context and you won't have access to endpoints such as POST statuses/update. On the other hand, this authentication method is simpler and sufficient for popular tasks such as tweet searches.
 
Twitter recently removed all XML-based output for REST API calls, so SAS needs to import the data in JSON format. SAS can't natively read JavaScript Object Notation (JSON) files, but we can make use of a dynamic language such as Groovy, whose statements can be executed using PROC GROOVY.
 

Communication with the Twitter REST API is done via HTTP, and SAS provides the PROC HTTP procedure for exactly this purpose.
 
Steps required to import Twitter tweets into a SAS data set:

  • Twitter API Registration
  • Obtain an OAuth Bearer access token
  • Execute a Twitter search request
  • Parse the JSON result content
  • Import the JSON into SAS data set

Registration

In order to communicate with Twitter at the API level you will need to register at https://dev.twitter.com. Once registered you can create one or more applications and receive your consumer key and secret.

Obtain an OAuth2 Bearer Token

The following SAS code snippet sets the required macro variables, including the consumer key and secret. You may also adjust the query term, which is set to return all tweets tagged with #SASGF13, #SASGF14 or #SASUSERS.

In order to parse the JSON result content we are going to use PROC GROOVY. Before using Groovy you will need to download the latest stable jar file from the Groovy website and reference groovy-all-2.x.x.jar in the proc groovy statement below.

 %let CONSUMER_KEY=<your_consumer_key>;
 %let CONSUME_SECRET=<your_consumer_secret>;
 %let JSON_TWEET_FILE=C:\Twitter\ResponseContent.txt;
 %let CSV_TWEET_FILE=C:\Twitter\ResponseContent.csv;
 %let TWEET_QUERY=%23SASGF13+OR+%23SASUSERS+OR+%23SASGF14;
 
 /* Macro to create a temporary fileref for the request input streams */
 %macro tempFile( fname );
  %if %superq(SYSSCPL) eq %str(z/OS) or %superq(SYSSCPL) eq %str(OS/390) %then 
    %let recfm=recfm=vb;
  %else
    %let recfm=;
  filename &fname TEMP lrecl=2048 &recfm;
 %mend;
 
 /* create temp files for the content and header input streams */
 %tempFile(in);
 %tempFile(hdrin);
 
 /* keep the response permanently */
 filename out "&JSON_TWEET_FILE.";
 
 /* post request content is the grant_type */
 data _null_;
   file in;
   put "grant_type=client_credentials&";
 run;
 
 /* request the bearer token by providing consumer key and secret */
 data _null_;
   file hdrin;
   consumerKey = urlencode("&CONSUMER_KEY.");
   consumerSecret = urlencode("&CONSUME_SECRET.");
   encodedCredentials = put( compress(consumerKey || ":" || consumerSecret),$base64x32767.);
   put "Authorization: Basic " encodedCredentials;
 run;
 
 proc http method="post"
    in=in out=out
    headerin=hdrin
    url="https://api.twitter.com/oauth2/token"
    ct="application/x-www-form-urlencoded;charset=UTF-8";
 run;
 
 /* Store bearer token in macro variable 'BEARER_TOKEN' */
 proc groovy classpath="groovy-all-2.2.1.jar";
    submit "&JSON_TWEET_FILE.";
      import groovy.json.JsonSlurper
      def input = new File(args[0]).text
      def result = new JsonSlurper().parseText(input)
      println "Received bearer token: ${result.access_token}"
      exports.putAt('BEARER_TOKEN',result.access_token)
    endsubmit;
 quit;
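For readers who want to see the moving parts outside SAS, here is the same credential handling sketched in Python. The function name is mine; the endpoint, the grant type and the Base64-over-URL-encoded-credentials scheme are exactly what the data steps and PROC HTTP call above implement:

```python
import base64
from urllib.parse import quote

def bearer_request_headers(consumer_key, consumer_secret):
    """Build the headers for POSTing grant_type=client_credentials to
    https://api.twitter.com/oauth2/token: URL-encode key and secret,
    join them with ':', then Base64-encode the pair (HTTP Basic auth)."""
    credentials = quote(consumer_key) + ":" + quote(consumer_secret)
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {
        "Authorization": "Basic " + encoded,
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    }
```

The JSON response to that POST contains an access_token field, which is what the PROC GROOVY step above extracts into the macro variable &BEARER_TOKEN.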

Use the Bearer Token to issue the Twitter search

Let's create a new header request file containing the newly received Bearer token and use the header when searching for tweets via PROC HTTP. The 'count' parameter limits the number of returned tweets.

 /* Use the bearer token as authorization */
 data _null_;
    file hdrin;
    put "Authorization: Bearer &BEARER_TOKEN.";
 run;
 
 /* Search for tweets; Limit output to 100 tweets */
 proc http method="get"
    out=out headerin=hdrin
    url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.%nrstr(&count)=100"
    ct="application/x-www-form-urlencoded;charset=UTF-8";
 run;
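As a sketch of what this request looks like outside SAS, the search URL can be assembled like this in Python (the function name is mine; the endpoint and parameters are the ones used above, and urlencode produces the same %23/+ escaping that &TWEET_QUERY. carries pre-encoded):

```python
from urllib.parse import urlencode

def search_url(query, count=100, **extra):
    """Build the v1.1 search URL; `query` is the raw, unescaped search
    term. Extra keyword arguments (lang, max_id, since_id, ...) are
    appended as additional query parameters."""
    params = {"q": query, "count": count, **extra}
    return "https://api.twitter.com/1.1/search/tweets.json?" + urlencode(params)
```

A GET against this URL with the header "Authorization: Bearer &lt;token&gt;" returns the JSON payload that PROC HTTP writes to &JSON_TWEET_FILE. above.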

Parse JSON results

We are going to use proc groovy in order to parse the JSON results. Once parsed, we create a temporary CSV file using the CSVWriter class from opencsv. Make sure to download the corresponding jar file into the current folder location.

 /* parse JSON and generate temporary CSV file */
 proc groovy classpath="groovy-all-2.2.1.jar;opencsv-2.3.jar";
   submit "&JSON_TWEET_FILE." "&CSV_TWEET_FILE."; 
      import groovy.json.*
      import au.com.bytecode.opencsv.CSVWriter
 
      def input = new File(args[0]).text
      def output = new JsonSlurper().parseText(input)
      def csvoutput = new FileWriter(args[1])
 
      CSVWriter writer = new CSVWriter(csvoutput);
 
      String[] header = new String[15];
      header[0] = "id";
      header[1] = "text";
      header[2] = "truncated";
      header[3] = "created_at";
      header[4] = "user.id";
      header[5] = "user.name";
      header[6] = "user.screen_name";
      header[7] = "user.location";
      header[8] = "user.description";
      header[9] = "user.url";
      header[10] = "user.followers_count";
      header[11] = "user.friends_count";
      header[12] = "user.listed_count";
      header[13] = "retweet_count";
      header[14] = "favorite_count";
      writer.writeNext(header);
 
      output.statuses.each {
         String[] content = new String[15];
         content[0] = it.id.toString();
         content[1] = it.text.toString();
         content[2] = it.truncated.toString();
         content[3] = it.created_at.toString();
         content[4] = it.user.id.toString();
         content[5] = it.user.name.toString();
         content[6] = it.user.screen_name.toString();
         content[7] = it.user.location.toString();
         content[8] = it.user.description.toString();
         content[9] = it.user.url.toString();
         content[10] = it.user.followers_count.toString();
         content[11] = it.user.friends_count.toString();
         content[12] = it.user.listed_count.toString();
         content[13] = it.retweet_count.toString();
         content[14] = it.favorite_count.toString();
 
         writer.writeNext(content)
      }         
 
      writer.close();
 
    endsubmit; 
 quit;

Import into SAS

The final import of the CSV file is just a matter of executing a PROC IMPORT statement.

 /* Import the final tweets data set */
 proc import out=work.tweets datafile="&CSV_TWEET_FILE." dbms=csv replace;
    getnames=yes;
    datarow=2;
 run;

Note: for everyone who doesn't want to touch code, SAS Visual Analytics also provides the capability to import Twitter streams using the data builder interface. Finally, remember that use of the Twitter API is subject to the Twitter terms of service.

Happy Tweeting!


About Author

Falko Schulz

Distinguished Software Developer, Business Intelligence R&D

Falko Schulz is a Distinguished Software Developer in the SAS Business Intelligence Research and Development division. He works actively on products such as SAS® Visual Analytics, further enhancing user experience and analytical capabilities. Falko has a strong background in delivering business analytical applications in a wide range of industries. Prior to joining the R&D division, Falko worked in customer-facing roles responsible for critical technical implementations and BI architectural designs. During his 18 years at SAS, Falko has worked out of various countries including Germany, Australia and the US.


41 Comments

  1. Exactly what I was looking for. Thanks!

    These are the complications I faced:

    I had to update my Groovy version (http://groovy.codehaus.org/Download)

    Then update my -jreoptions, following the instructions provided here: http://www.jiangtanghu.com/blog/2012/10/28/hello-groovy-in-sas-9-3/

    and, something useful for those using Windows, had to replace the "\" [backslash] in the file name paths with the "/" [slash] symbol for the path to be understood by proc groovy.

    I also didn't have the opencsv class installed on my PC, therefore I decided to just output the file as a regular txt.

    In the end the piece of code that changed the most looked like this:

    data _null_;
    NJSON_TWEET_FILE=tranwrd("&JSON_TWEET_FILE.",'\','/');
    NCSV_TWEET_FILE=tranwrd("&CSV_TWEET_FILE.",'\','/');

    CALL SYMPUT('NJSON_TWEET_FILE',TRIM(NJSON_TWEET_FILE));
    CALL SYMPUT('NCSV_TWEET_FILE',TRIM(NCSV_TWEET_FILE));
    run;

    proc groovy;
    submit "&NJSON_TWEET_FILE." "&NCSV_TWEET_FILE.";

    import groovy.json.*

    def input = new File(args[0]).text

    def output = new JsonSlurper().parseText(input)

    PrintWriter out = new PrintWriter(args[1]);

    output.statuses.each {
    String[] content = new String[15];
    content[0] = it.id.toString();
    content[1] = it.text.toString();
    content[2] = it.truncated.toString();
    content[3] = it.created_at.toString();
    content[4] = it.user.id.toString();
    content[5] = it.user.name.toString();
    content[6] = it.user.screen_name.toString();
    content[7] = it.user.location.toString();
    content[8] = it.user.description.toString();
    content[9] = it.user.url.toString();
    content[10] = it.user.followers_count.toString();
    content[11] = it.user.friends_count.toString();
    content[12] = it.user.listed_count.toString();
    content[13] = it.retweet_count.toString();
    content[14] = it.favorite_count.toString();

    out.println(content);
    }
    out.flush();
    endsubmit;

    quit;

    Once again, AMAZING POST!

    Thank you very much.

    • Falko Schulz

      Thanks Julio. There are two ways of getting additional .jar files registered for proc groovy. One way is setting the -jreoptions in sasv9.cfg as you explained or alternatively setting the classpath parameter in proc groovy. I like the second option as it doesn't require altering the SAS configuration files. You may have noticed that I haven't used any folder references in my example. That's because I usually keep all project files within one folder and make sure SAS is launched with the sasinitialfolder option to set the default folder, e.g. -SASINITIALFOLDER "C:\Twitter".
      But either way - I'm glad you find this post useful. Cheers, Falko

  2. Hello Falko, this is a superb post thank you. I am increasing the size of the count parameter in the http request from 100 to 200 but I am still hitting a limit in the tweets output to the dataset (I seem to get back 108 tweets either way). Any idea why this might be happening?

    • Falko Schulz

      The number of tweets returned depends on the query term you used and obviously on the number of tweets available. Note that Twitter doesn't expose the full tweet history (it only goes a few days back), so there may just not be more than 108 tweets in your case. Have a look at the Twitter Search API for details on how to fine-tune your search term.

  3. Great post!

    Just wondering if proc groovy would work with JSON from sites like data.medicare.gov that use the SODA API?

    There is great data out there, but I have not been successful in trying to use the API to automate the process without the manual export and read-in steps.

    Thanks again for the great post!

    Eric

    • Falko Schulz

      Hi Eric, PROC GROOVY simply submits any Groovy code you write. In my example above I'm using a JSON parser called JsonSlurper, which is part of the Groovy API. JSON is an open standard, so the parser should be able to parse any valid JSON file regardless of its source. If something doesn't work, it will usually tell you about syntax errors in the file. Hope this helps. Falko

    • Falko Schulz

      The token doesn't need to be refreshed as such. The section "Obtain an OAuth2 Bearer Token" explains how to use the oauth2/token service to obtain a token. Once a token has been received it can be used to execute API requests on an application's own behalf. In the example above the token is stored in the macro variable &BEARER_TOKEN so it can be re-used in subsequent authorization headers. Hope this helps. Cheers, Falko

  4. Hi Falko,

    I am trying to use this to parse ArcGIS geocoding REST service responses. The problem I have is that Michael has set -noxcmd, so proc groovy is disabled. Do you know of a temporary way to work around this (I have access to do anything) rather than overriding the config we have on our Prod Demo server?

    Cheers,

    David.

    • Falko Schulz

      Well, I don't think you can overwrite the noxcmd option via SAS code. The only alternative would be to use a standard SAS data step to read and parse your JSON file. If your JSON file structure is not too complicated and in a regular format, SAS functions such as SCAN or SCANOVER may be sufficient. Have a look at the SAS Global Forum paper "Efficient extraction of JSON information in SAS® using the SCANOVER function", which has some more details. Hope this helps. Cheers, Falko

  5. Wonderful post. Thanks for the code.
    Can I use the same code if I wanted to search for users on twitter instead of tweets? Is it just a matter of changing the url in the proc http step and tweaking the fields in the proc groovy part?

    Thanks,
    Fatima

    • Falko Schulz

      Yes, this approach can be used for probably all REST API calls returning JSON. You would use the Twitter users/search API and replace the search URL with something like https://api.twitter.com/1.1/users/search.json?q=&USER_QUERY where &USER_QUERY would be your search term. You would also need to adjust the following proc groovy part to import the relevant fields, as the JSON structure would be slightly different. Hope this helps. Cheers, Falko

  6. I've tried various versions of Java and permutations of the steps included (editing the SAS config file to point to Groovy, environment variables)

    and still can't get SAS to work, because every time I run proc groovy it gives me "ERROR: The SUBMIT command failed."

    I can't get Groovy to work in SAS... please help!

    • Chris Hemedinger

      Rob, I believe that you can put the groovy jar file anywhere on your system where SAS is running, and then reference the path/jar file with the CLASSPATH option on the PROC GROOVY statement.

  7. OK, so I am a little confused. I still don't know how to create the %let JSON_TWEET_FILE=C:\Twitter\ResponseContent.txt; file?

  8. I am using an API key and secret and it seems to generate a big long bearer token, but when it is posted it gets a 403. Has something changed since this article was published?

    29 proc http method="post"
    30 in=in out=out
    31 headerin=hdrin
    32 url="https://api.twitter.com/oauth2/token"
    33 ct="application/x-www-form-urlencoded;charset=UTF-8";
    34 run;

    ERROR: java.io.IOException: Server returned HTTP response code: 403 for URL:
    https://api.twitter.com/oauth2/token
    NOTE: PROCEDURE HTTP used (Total process time):
    real time 1.40 seconds
    cpu time 0.00 seconds

    NOTE: The SAS System stopped processing this step because of errors.

  9. Hi Falko!
    Great post!

    I got a problem when I tried to submit the last segment of the first part :
    proc groovy classpath="groovy-all-2.2.1.jar";
    submit "&JSON_TWEET_FILE.";
    import groovy.json.JsonSlurper
    def input = new File(args[0]).text
    def result = new JsonSlurper().parseText(input)
    println "Recieved bearer token: ${result.access_token}"
    exports.putAt('BEARER_TOKEN',result.access_token)
    endsubmit;
    quit;

    I got an error message "The SUBMIT command failed."
    Where did I most likely go wrong?
    Do I have to download Groovy?

    Here is the log:

    123 proc
    123! groovy classpath="groovy-all-2.2.1.jar";
    NOTE: The ADD CLASSPATH command completed.
    124 submit "&JSON_TWEET_FILE.";
    125 import groovy.json.JsonSlurper
    126 def input = new File(args[0]).text
    127 def result = new JsonSlurper().parseText(input)
    128 println "Recieved bearer token: ${result.access_token}"
    129 exports.putAt('BEARER_TOKEN',result.access_token)
    130 endsubmit;
    ERROR: The SUBMIT command failed.
    org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
    Script3.groovy: 1: unable to resolve class groovy.json.JsonSlurper
    @ line 1, column 7.
    import groovy.json.JsonSlurper
    ^

    1 error

    at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:296)
    at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:860)
    at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:521)
    at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:497)
    at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:474)
    at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:292)
    at groovy.lang.GroovyShell.parseClass(GroovyShell.java:727)
    at groovy.lang.GroovyShell.parse(GroovyShell.java:739)
    at groovy.lang.GroovyShell.parse(GroovyShell.java:766)
    at groovy.lang.GroovyShell.parse(GroovyShell.java:757)
    131 quit;

    NOTE: The SAS System stopped processing this step because of errors.
    NOTE: PROCEDURE GROOVY used (Total process time):
    real time 0.02 seconds
    cpu time 0.01 seconds

  10. Thanks Falko Schulz, and thanks Julio for the alternative to CSVWriter, as opencsv 3.0 is too new to be supported; the error was "unsupported major.minor version".

    I see, I could study Groovy code more.

    I used this search and it worked fine %let TWEET_QUERY=%23SASGF15+OR+%23SASGF14; I am ready for my own Internet research data gathering application now 🙂

  11. One could also use R. I found it a lot easier installing the twitteR package and authorising your consumer key and consumer secret to extract Twitter feeds than tinkering with jar files. I simply cannot get Groovy to work with SAS, nor can I harvest any tweets from Twitter using SAS.

  12. Sander Ehmsen

    Hi Falko

    Thanks for the big effort you have put into producing this code.

    We've had some problems figuring out how to get the macro part working, but that seems fixed.

    I have one question I have not been able to solve: how do I tell SAS to sample all available tweets and not just the first 100? We want to create a dataframe with tweets over time, so we will get the server to sample every night on some specific keywords. We have thus built a process that only takes the newest tweets, but I must be sure that we get a nearly complete corpus.

    I suspect it is in:

    /* Search for tweets; Limit output to 100 tweets */
    proc http method="get"
    out=out headerin=hdrin
    url="https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
    ct="application/x-www-form-urlencoded;charset=UTF-8";
    run;

    but I have not been able to figure out how to specify a higher maximum.
    I have done some sampling in R, and there I have been able to specify limits as high as 10,000. So it is not the Twitter API that limits the sample size.

    Best,
    Sander Ehmsen.

    • Falko Schulz

      Hi Sander,

      The maximum number of tweets per request to https://api.twitter.com/1.1/search/tweets.json is 100 (API details are here). This blog entry was just a basic example to explain the general concept behind the required authentication, as well as dealing with JSON results in SAS.

      In order to get more than a hundred tweets you would need to understand working with timelines (there is a guide with details). As a short summary: you would introduce either a max_id parameter, which specifies the maximum tweet ID you want to receive; in other words, you will get tweets older than this ID (there is also a since_id parameter, which returns tweets newer than the one specified). In the past I was simply calling this macro a number of times in order to get multiple pages (each a hundred tweets). On each call I append the newly received tweets to a common tweet data set. I believe Twitter keeps about a one-week history, so there is a natural limit to how far back you can go. Initially you may have to run it a number of times, but in a daily or hourly batch job you may just have to deal with a few hundred, depending on your search term.

      Hope this helps. Cheers, Falko

      • Sander Ehmsen

        Hi Falko

        Thanks for your quick response. It was so quick that I didn't even notice it in my e-mail :-).

        I have tried looking at the guide for timelines. I think I understand how the structure must be created in order to get the right number of tweets.
        But to be honest I am not too worried about that part. I might simply oversample (take 100 tweets, but every time double-sample some tweets and later on remove them).

        What I don't understand is what I need to specify in order to sample from page number 2. My best guess would be just to change the URL from
        "https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
        to
        "https://api.twitter.com/2.1/search/tweets.json?q=&TWEET_QUERY.&count=100"
        and so on.

        Do you know how to do that part?

        • Falko Schulz

          You won't change the actual URL, just adjust the parameters. The only way of paging through results is by specifying either the max_id or since_id parameter.

          So you can either go backwards in time using max_id as follows:
          1) run the query with parameter count=100 which gives you the last 100 tweets
          2) remember the tweet ID of the oldest tweet of these 100 tweets
          3) run your next query and specify max_id using that tweet ID. This will give you the 100 tweets before the given tweet ID
          4) start over with 2) until you either run out of tweets (you get less than 100) or you simply have enough tweets gathered

          Example URL would be something like: https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100&max_id=1234567890

          Or alternatively you can use the since_id parameter, which is great for a continuous batch run every hour or so. You simply remember the last tweet ID from your previous run and specify it as the since_id parameter. On every run you will get tweets 'since' this tweet ID.

          Example URL would be something like: https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=100&since_id=1234567890

          The Twitter guide I mentioned before has more examples if you need them.

          Hope this helps. Cheers, Falko
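Steps 1) to 4) can be sketched as a small loop (Python here for brevity; `fetch` stands in for the PROC HTTP call and the names are illustrative):

```python
def harvest(fetch, count=100, max_pages=10):
    """Page backwards through search results with max_id: each request
    asks for tweets strictly older than the oldest ID seen so far.
    `fetch(count, max_id)` must return a list of tweets (dicts with an
    'id' key), newest first; max_id=None means 'start from the newest'.
    Stops on a short page (history exhausted) or after max_pages calls."""
    tweets, max_id = [], None
    for _ in range(max_pages):
        page = fetch(count=count, max_id=max_id)
        tweets.extend(page)
        if len(page) < count:
            break
        max_id = min(t["id"] for t in page) - 1   # next page: strictly older
    return tweets
```

The since_id variant is the mirror image: keep the largest ID from the previous run and pass it as since_id on the next.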

          • Sander Ehmsen

            Thanks a ton!
            I will try to make it work with my developer tomorrow.

            Now I have a clear picture of how to proceed.
            Thanks for bearing with me.

            Best,
            Sander.

  13. Hi folks,
    I'm an MTech student doing my thesis on analysis of Twitter data through Hive, in which I show that the analysis takes less time compared to an existing tool. But now my guide wants me to show a comparison with an existing tool, so can anyone tell me which tool I can use to analyse sample Twitter data that takes longer than Hive? Please help me out with that!

  14. Sander Ehmsen

    Hi Falko

    With assistance from my splendid developers we have managed to get the tweets out on a daily basis.
    This has revealed a problem for us: we search on hashtags, and as the hashtags are not solely tied to a certain event, we get a lot of rubbish.

    In R, where I have done the search "manually", there is an option for specifying the language, so that I only take out lan="da", that is, Danish.
    Just for the record, the R script is like this: TB<-searchTwitter("tinderbox", lan="da", n=10000)
    And as R only forwards the request to the Twitter API, I think it should be possible to adjust your fine SAS code.

    Do you know where in the code to put in this filter?

    Best, Sander Ehmsen.

    • Falko Schulz

      Hi,

      The search is done in the line with https://api.twitter.com/1.1/search/tweets.json?q=&TWEET_QUERY.&count=10. Based on the Twitter API doc you can add the lang parameter here, which restricts tweets to the given language. So you could try appending something like &lang=da

      Hope this helps. Cheers, Falko

  15. Jason DiVirgilio

    Falko,
    Do you have any sample code for posting a status update to a Twitter account? I've seen code for getting tweets, and searching, but not how to automate tweeting status updates.

    Thanks!

    • Falko Schulz

      Hi Jason,
      Once you have done the authentication bit you can execute any API action, including status updates. The Twitter API has more details, including the parameters supported. Just make sure to use the Bearer token in the header when sending the POST request, as shown in the section "Use the Bearer Token to issue the Twitter search" above. Being a SAS program, you can obviously go ahead and automate this further using the SAS Macro Language if required.
      Hope this helps. Cheers, Falko
