Five reasons to use the SAS DATA Step or PROC SQL

When I joined SAS Institute I had no knowledge of SAS. I studied, researched and practiced. The questions I asked would make even the most seasoned programmer blush a beetroot red! Those were an intense nine months—I’ll tell you more in a later post about the learning system at SAS Institute which is simply world class.

Going to my manager Stephen Keelan (who happens to be one of the most enabling managers I’ve ever worked with) with my SAS questions, brought the inevitable response of “It depends.”

It puzzled me, as I just couldn’t fathom why things weren’t black or white. Have you ever wondered why sometimes things appear grey?

Welcome to the programmer’s world where everything depends on your need. I’m sure you can now relate to Stephen’s answer. In this blog post I will answer a question posed by a student when I taught the SAS SQL 1: Essentials course.

What is the difference between the DATA Step and PROC SQL?

Consider 2 ways of grocery shopping:

Method 1: Go directly to the aisles for your products, you may have to ask around to get to the right shelf, but you know exactly where you are going and how you are going to get there.

This is how the DATA Step behaves. It is procedural, meaning that you write a procedure for what you need. Programming languages like FORTRAN, BASIC, C, JAVA, COBOL and SAS have files with records stored sequentially—so processing happens one record at a time. These languages were designed to mimic human behavior in processing paper forms.
In SAS, what you ask for is what you get (and how you get it).

Method 2: Give the clerk your shopping list. You have no control over the clerk’s route. They might go to the stock room directly since they know your product is yet to be stocked on shelves. Regardless of the route, the one guarantee you have is that you never know how they did your shopping.

This is how PROC SQL behaves. It is non-procedural, meaning that you state what you want. SQL retrieves data in what it considers to be the most efficient way. SQL was designed to query relational database management systems (DBMS).
From my earlier post you’ll recall that you can tell SQL what to do but not how to do it.

Five reasons to use the DATA Step or PROC SQL

DATA Step

1. Multiple datasets. The DATA Step is a true workhorse. It allows you to create multiple datasets in one swift step. PROC SQL requires several SELECT clauses to create multiple datasets.

2. Reading text. The ability of SAS to read virtually any data (text files, etc.) is outstanding. Use the DATA Step when you want to read text files. PROC SQL cannot read text files.

3. Control. The SQL optimizer has a mind of its own. For example if you want to join three tables, small, medium and large, it might consider that the large table is indexed so it easily joins it with the small table and then takes the intermediate table and joins with the mid-sized table. If you like control over the how the join is done, then go to the DATA Step.

PROC SQL

4. Joining tables. PROC SQL is flexible while joining multiple tables that don’t have key variables in common. The data step requires you to have common named key columns before you attempt a merge.

5. Querying dictionary tables. When you query a dictionary table, SAS launches a discovery process which can end up searching libraries, open tables and execute views. It is often more efficient to use PROC SQL which optimizes the query before the discovery process is launched.

I’ve listed just a few reasons that are important to consider. I’ll end with a question for you, dear reader. Tell me if you can find more ways where the DATA Step and PROC SQL outshine each other. In return I might tell you the story of the M&Ms and Dr. Jim Goodnight. Didn’t I start out by saying “It Depends!” on your comments?

tags: data step, proc sql

17 Comments

  1. Alex Ling
    Posted December 20, 2010 at 8:10 am | Permalink

    I have to be honest with you Charu, I have been taught using SQL when I was in school and SQL will always come first. Yes I do lose some controls using SQL but most of the time, for the works that I do, for example when I just want to join two tables, I just want to join two tables. I don't really care about the inner workings behind it, therefore sort the tables for me if it is required. I am not sure if I want to do it myself every time I want to join two tables.
    Having said, to answer your question, I found that doing GROUP BY and HAVING in DATA step is much easier than in SQL, where I have to specify all the column names that I used in SELECT in GROUP BY statement. The statements that are used in DATA step seems to be less cryptic (more human-readable) than in SQL.

  2. charu
    Posted December 20, 2010 at 12:04 pm | Permalink

    that's right, I would add comfort to my list. SQL comes naturally for those who've learned it first & that's why we have PROC SQL within SAS.
    Your point about joining is valid.. its just that often we are unaware that SQL actually does a sort behind the scenes which we do explicitly with SAS. thanks so much for your valuable comments Alex!

  3. Lori
    Posted January 6, 2011 at 1:27 pm | Permalink

    IMHO, the choice of SQL versus Data Step is largely dependent on the size of the dataset(s) and the host system resources. Data Step processing minimizes memory space requirements, while SQL makes the most of the memory allotted. In the early stage of an SQL join, every record of one dataset is matched with every record of another dataset before WHERE or ON conditions are applied. Clearly, this method is contraindicated for large datasets.
    For smaller datasets, the choice may be only preferential. However, I think that Data Step processing is always the answer for very large datasets in a shared resource environment if you want to maintain popularity with your colleagues.

  4. charu
    Posted January 10, 2011 at 9:32 pm | Permalink

    definitely SQL's default cartesian product without a WHERE or ON is quite cumbersome & extremely inefficient Appreciate your insight about the memory allocation in the data step vs. SQL.I'm guessing the data step is your primary tool for joins..
    thanks so much Lori, for taking the time to read & comment!

  5. Graham Rice
    Posted January 12, 2011 at 3:46 am | Permalink

    I have recently found the best way to merge is using data step Hash merges as it loads the data into memory and runs really fast especially with datasets over 1 GB. Still learning new techniques all the time. SAS should promote these types of merges rather than focussing on PROC SQL or a DATA Merge step...

  6. charu
    Posted January 12, 2011 at 11:53 pm | Permalink

    Thanks for pointing out hash merges. We teach it in the SAS programming 3 class.. a 2-hr segment. Datastep/SQL/hash merges (as you pointed out) all qualify for potential join techniques. Benchmarking for each site would provide the best solution. watch out for Hash merges maybe in a future post. Thanks for the tip Graham.

  7. Sunil Gupta
    Posted January 25, 2011 at 8:42 am | Permalink

    Thanks Charu for your five reasons. Along with them, I have rediscovered PROC SQL to be more modular type code that is more concise and very flexible. Using the power of PROC SQL has saved me time in type extra code. PROC SQL is now my first choice and use DATA Step only if needed. Because of the many tasks that can easily be performed using PROC SQL, I have started to keep a list of my top 10 SAS papers on PROC SQL.

  8. charu
    Posted January 25, 2011 at 6:29 pm | Permalink

    thanks Sunil for your thoughts on PROC SQL usage. I found your SAS paper on WHERE statement examples very useful!
    Do you think you could share the list of SAS papers on PROC SQL. others may also find your reference extremely valuable.

  9. charu
    Posted January 31, 2011 at 11:44 am | Permalink

    Thanks a lot! To read Sunil’s top 10 PROC SQL papers, please go to this page http://www.sascommunity.org/wiki/Sunil_Gupta
    #Sunil.27s_Top_10_PROC_SQL_Papers

  10. Dianne
    Posted February 8, 2011 at 2:05 pm | Permalink

    I would also recommend Christianna Williams SQL for Data step diehards, as well as Howard Schreirer's recent SAS pub on SQL.

  11. charu
    Posted February 10, 2011 at 4:59 pm | Permalink

    thanks for the tip Dianne, Michele Reister, blog manager had suggested a followup post with everyone's links so watch out for that!

  12. Ryan
    Posted February 22, 2011 at 8:38 am | Permalink

    I frankly agree with Charu. SQL is really only practical for the people who learned it as their first querying language. For most clinical trial data where the data is not immensely large, the amount of processing time and code saved by using SQL is rather minimal. It's alot more transparent to read DATA step coding and a MERGE will tell you if there's going to be a many to many merge (cartesian product).

  13. charu
    Posted March 4, 2011 at 7:44 am | Permalink

    Thanks Ryan for your comments. I guess it really boils down to "It depends" which was how I started this blog post. As you aptly pointed out if transparency is what the user wants, then SAS datastep is the way to go...... otherwise leave it to SQL where you might not know what's under the hood.

  14. Sini
    Posted July 8, 2011 at 2:43 pm | Permalink

    Charu, your article on Data Step Vs SQL is very insightful!
    I like both data step and sql..There are couple of instances where I think one outsmarts the other..
    For selecting the max or min from given records..I cant think of any other way than first. and last.
    similiarly, I think to count unique occurance of one variable, (I will have to borrow clinical trial jargons at this point..) Eg.If one has to count unique patients per System Organ Class and Preferred term, SQL solves the problem in one step, while Data Step has to first subset the data then do a no dup on it before counting

  15. Sherry Wang
    Posted February 3, 2012 at 3:35 pm | Permalink

    Hi Charu,

    I enjoyed your SAS programming 2 course. Thank you.
    Would you please send me the link on joining tables SAS vs SQL?

    Thanks,
    Sherry

  16. Charu Shankar Charu Shankar
    Posted March 5, 2012 at 5:08 pm | Permalink

    Thanks Sini for writing about your SQL experience. Have you maybe tried PROC MEANS to get max & min values or PROC FREQ with the nlevels option to get distinct values. These powerful PROCS can easily bypass the datastep. Here are some examples:
    * to get max & min invoice values try this code
    proc means data=sashelp.cars max min;
    var invoice;
    run;
    * to get unique counts you could try the amazing nlevels option
    proc freq data=prg1.nonsales nlevels ;
    tables employee_id;
    run;
    * or if you want to isolate the duplicates you might like to try this:
    proc freq data=prg1.nonsales ;
    tables employee_id/out=test(where=(count > 1));
    run;
    these are just some alternatives to PROC SQL.

  17. Neeraj
    Posted February 23, 2014 at 5:02 pm | Permalink

    There is a concept of cursors in SQL. I had some task in which i had to use SQL cursors. I am wondering if that is possible with PROC SQL.

One Trackback

  1. By Homepage on December 31, 2012 at 7:59 am

    ... [Trackback]...

    [...] There you will find 99669 more Infos: blogs.sas.com/content/sastraining/2010/12/17/five-reasons-to-use-the-sas-data-step-or-proc-sql/ [...]...

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>