When I joined SAS Institute I had no knowledge of SAS. I studied, researched and practiced. The questions I asked would make even the most seasoned programmer blush a beetroot red! Those were an intense nine months—I’ll tell you more in a later post about the learning system at SAS Institute which is simply world class.
Going to my manager Stephen Keelan (who happens to be one of the most enabling managers I’ve ever worked with) with my SAS questions, brought the inevitable response of “It depends.”
It puzzled me, as I just couldn’t fathom why things weren’t black or white. Have you ever wondered why sometimes things appear grey?
Welcome to the programmer’s world where everything depends on your need. I’m sure you can now relate to Stephen’s answer. In this blog post I will answer a question posed by a student when I taught the SAS SQL 1: Essentials course.
What is the difference between the DATA Step and PROC SQL?
Consider 2 ways of grocery shopping:
Method 1: Go directly to the aisles for your products, you may have to ask around to get to the right shelf, but you know exactly where you are going and how you are going to get there.
This is how the DATA Step behaves. It is procedural, meaning that you write a procedure for what you need. Programming languages like FORTRAN, BASIC, C, JAVA, COBOL and SAS have files with records stored sequentially—so processing happens one record at a time. These languages were designed to mimic human behavior in processing paper forms.
In SAS, what you ask for is what you get (and how you get it).
Method 2: Give the clerk your shopping list. You have no control over the clerk’s route. They might go to the stock room directly since they know your product is yet to be stocked on shelves. Regardless of the route, the one guarantee you have is that you never know how they did your shopping.
This is how PROC SQL behaves. It is non-procedural, meaning that you state what you want. SQL retrieves data in what it considers to be the most efficient way. SQL was designed to query relational database management systems (DBMS).
From my earlier post you’ll recall that you can tell SQL what to do but not how to do it.
Five reasons to use the DATA Step or PROC SQL
1. Multiple datasets. The DATA Step is a true workhorse. It allows you to create multiple datasets in one swift step. PROC SQL requires several SELECT clauses to create multiple datasets.
2. Reading text. The ability of SAS to read virtually any data (text files, etc.) is outstanding. Use the DATA Step when you want to read text files. PROC SQL cannot read text files.
3. Control. The SQL optimizer has a mind of its own. For example if you want to join three tables, small, medium and large, it might consider that the large table is indexed so it easily joins it with the small table and then takes the intermediate table and joins with the mid-sized table. If you like control over the how the join is done, then go to the DATA Step.
4. Joining tables. PROC SQL is flexible while joining multiple tables that don’t have key variables in common. The data step requires you to have common named key columns before you attempt a merge.
5. Querying dictionary tables. When you query a dictionary table, SAS launches a discovery process which can end up searching libraries, open tables and execute views. It is often more efficient to use PROC SQL which optimizes the query before the discovery process is launched.
I’ve listed just a few reasons that are important to consider. I’ll end with a question for you, dear reader. Tell me if you can find more ways where the DATA Step and PROC SQL outshine each other. In return I might tell you the story of the M&Ms and Dr. Jim Goodnight. Didn’t I start out by saying “It Depends!” on your comments?
We also have a video on this topic, check it out: