When I joined SAS Institute I had no knowledge of SAS. I studied, researched and practiced. The questions I asked would make even the most seasoned programmer blush a beetroot red! Those were an intense nine months—I’ll tell you more in a later post about the learning system at SAS Institute which is simply world class.
Going to my manager Stephen Keelan (who happens to be one of the most enabling managers I’ve ever worked with) with my SAS questions, brought the inevitable response of “It depends.”
It puzzled me, as I just couldn’t fathom why things weren’t black or white. Have you ever wondered why sometimes things appear grey?
Welcome to the programmer’s world where everything depends on your need. I’m sure you can now relate to Stephen’s answer. In this blog post I will answer a question posed by a student when I taught the SAS SQL 1: Essentials course.
What is the difference between the DATA Step and PROC SQL?
Consider 2 ways of grocery shopping:
Method 1: Go directly to the aisles for your products, you may have to ask around to get to the right shelf, but you know exactly where you are going and how you are going to get there.
This is how the DATA Step behaves. It is procedural, meaning that you write a procedure for what you need. Programming languages like FORTRAN, BASIC, C, JAVA, COBOL and SAS have files with records stored sequentially—so processing happens one record at a time. These languages were designed to mimic human behavior in processing paper forms.
In SAS, what you ask for is what you get (and how you get it).
Method 2: Give the clerk your shopping list. You have no control over the clerk’s route. They might go to the stock room directly since they know your product is yet to be stocked on shelves. Regardless of the route, the one guarantee you have is that you never know how they did your shopping.
This is how PROC SQL behaves. It is non-procedural, meaning that you state what you want. SQL retrieves data in what it considers to be the most efficient way. SQL was designed to query relational database management systems (DBMS).
From my earlier post you’ll recall that you can tell SQL what to do but not how to do it.
Five reasons to use the DATA Step or PROC SQL
DATA Step
1. Multiple datasets. The DATA Step is a true workhorse. It allows you to create multiple datasets in one swift step. PROC SQL requires several SELECT clauses to create multiple datasets.
2. Reading text. The ability of SAS to read virtually any data (text files, etc.) is outstanding. Use the DATA Step when you want to read text files. PROC SQL cannot read text files.
3. Control. The SQL optimizer has a mind of its own. For example if you want to join three tables, small, medium and large, it might consider that the large table is indexed so it easily joins it with the small table and then takes the intermediate table and joins with the mid-sized table. If you like control over the how the join is done, then go to the DATA Step.
PROC SQL
4. Joining tables. PROC SQL is flexible while joining multiple tables that don’t have key variables in common. The data step requires you to have common named key columns before you attempt a merge.
5. Querying dictionary tables. When you query a dictionary table, SAS launches a discovery process which can end up searching libraries, open tables and execute views. It is often more efficient to use PROC SQL which optimizes the query before the discovery process is launched.
I’ve listed just a few reasons that are important to consider. I’ll end with a question for you, dear reader. Tell me if you can find more ways where the DATA Step and PROC SQL outshine each other. In return I might tell you the story of the M&Ms and Dr. Jim Goodnight. Didn’t I start out by saying “It Depends!” on your comments?
We also have a video on this topic, check it out:
21 Comments
Greeting!
I had a query regarding the use of DATA Step & Proc Step against the Proc SQL when we are accessing a very large data.
Which would be a better function to handle and call in a very large data.
Hi Charu, I just have came across your blog and the discussion is really impressive. I am an experienced SAS programmer but just lost touch as I went into management side for a year. Now back to the technical front again when a question strikes again while playing with codes.
I found Proc SQL is almost 3 times faster than SAS Procs(eg : Proc freq). I know SQL use smart query optimizer but any detailed explanation including in and out of both SAS Proc and PROC Freq in terms of running at back end?
Regards,
Deepak Malhotra
There is a concept of cursors in SQL. I had some task in which i had to use SQL cursors. I am wondering if that is possible with PROC SQL.
Pingback: Homepage
Thanks Sini for writing about your SQL experience. Have you maybe tried PROC MEANS to get max & min values or PROC FREQ with the nlevels option to get distinct values. These powerful PROCS can easily bypass the datastep. Here are some examples:
* to get max & min invoice values try this code
proc means data=sashelp.cars max min;
var invoice;
run;
* to get unique counts you could try the amazing nlevels option
proc freq data=prg1.nonsales nlevels ;
tables employee_id;
run;
* or if you want to isolate the duplicates you might like to try this:
proc freq data=prg1.nonsales ;
tables employee_id/out=test(where=(count > 1));
run;
these are just some alternatives to PROC SQL.
Hi Charu,
I enjoyed your SAS programming 2 course. Thank you.
Would you please send me the link on joining tables SAS vs SQL?
Thanks,
Sherry
Charu, your article on Data Step Vs SQL is very insightful!
I like both data step and sql..There are couple of instances where I think one outsmarts the other..
For selecting the max or min from given records..I cant think of any other way than first. and last.
similiarly, I think to count unique occurance of one variable, (I will have to borrow clinical trial jargons at this point..) Eg.If one has to count unique patients per System Organ Class and Preferred term, SQL solves the problem in one step, while Data Step has to first subset the data then do a no dup on it before counting
Thanks Ryan for your comments. I guess it really boils down to "It depends" which was how I started this blog post. As you aptly pointed out if transparency is what the user wants, then SAS datastep is the way to go...... otherwise leave it to SQL where you might not know what's under the hood.
I frankly agree with Charu. SQL is really only practical for the people who learned it as their first querying language. For most clinical trial data where the data is not immensely large, the amount of processing time and code saved by using SQL is rather minimal. It's alot more transparent to read DATA step coding and a MERGE will tell you if there's going to be a many to many merge (cartesian product).
thanks for the tip Dianne, Michele Reister, blog manager had suggested a followup post with everyone's links so watch out for that!
I would also recommend Christianna Williams SQL for Data step diehards, as well as Howard Schreirer's recent SAS pub on SQL.
Thanks a lot! To read Sunil’s top 10 PROC SQL papers, please go to this page http://www.sascommunity.org/wiki/Sunil_Gupta
#Sunil.27s_Top_10_PROC_SQL_Papers
thanks Sunil for your thoughts on PROC SQL usage. I found your SAS paper on WHERE statement examples very useful!
Do you think you could share the list of SAS papers on PROC SQL. others may also find your reference extremely valuable.
Thanks Charu for your five reasons. Along with them, I have rediscovered PROC SQL to be more modular type code that is more concise and very flexible. Using the power of PROC SQL has saved me time in type extra code. PROC SQL is now my first choice and use DATA Step only if needed. Because of the many tasks that can easily be performed using PROC SQL, I have started to keep a list of my top 10 SAS papers on PROC SQL.
Thanks for pointing out hash merges. We teach it in the SAS programming 3 class.. a 2-hr segment. Datastep/SQL/hash merges (as you pointed out) all qualify for potential join techniques. Benchmarking for each site would provide the best solution. watch out for Hash merges maybe in a future post. Thanks for the tip Graham.
A fourth option could be user-defined formats. I personally prefer using these over hash joins.
I have recently found the best way to merge is using data step Hash merges as it loads the data into memory and runs really fast especially with datasets over 1 GB. Still learning new techniques all the time. SAS should promote these types of merges rather than focussing on PROC SQL or a DATA Merge step...
definitely SQL's default cartesian product without a WHERE or ON is quite cumbersome & extremely inefficient Appreciate your insight about the memory allocation in the data step vs. SQL.I'm guessing the data step is your primary tool for joins..
thanks so much Lori, for taking the time to read & comment!
IMHO, the choice of SQL versus Data Step is largely dependent on the size of the dataset(s) and the host system resources. Data Step processing minimizes memory space requirements, while SQL makes the most of the memory allotted. In the early stage of an SQL join, every record of one dataset is matched with every record of another dataset before WHERE or ON conditions are applied. Clearly, this method is contraindicated for large datasets.
For smaller datasets, the choice may be only preferential. However, I think that Data Step processing is always the answer for very large datasets in a shared resource environment if you want to maintain popularity with your colleagues.
that's right, I would add comfort to my list. SQL comes naturally for those who've learned it first & that's why we have PROC SQL within SAS.
Your point about joining is valid.. its just that often we are unaware that SQL actually does a sort behind the scenes which we do explicitly with SAS. thanks so much for your valuable comments Alex!
I have to be honest with you Charu, I have been taught using SQL when I was in school and SQL will always come first. Yes I do lose some controls using SQL but most of the time, for the works that I do, for example when I just want to join two tables, I just want to join two tables. I don't really care about the inner workings behind it, therefore sort the tables for me if it is required. I am not sure if I want to do it myself every time I want to join two tables.
Having said, to answer your question, I found that doing GROUP BY and HAVING in DATA step is much easier than in SQL, where I have to specify all the column names that I used in SELECT in GROUP BY statement. The statements that are used in DATA step seems to be less cryptic (more human-readable) than in SQL.