To optimize a Structured Query Language (SQL), the database professional must befriend this order and perhaps even embrace it.
Who is your best friend? I’m talking about the order in which SQL processes your statements. Simply put, in what order does SQL do your work? (From my previous post you know another SQL rule, the order in which YOU submit statements to SQL.) I just want us to be clear that these are two very different orders.
First up, what makes SQL different from other programming languages? It’s the way SQL processes code. Most programming languages including SAS start at the top of your code and make their way to the bottom. However, SQL processes them in a unique order, known as Logical Query Processing Phase. This phase (also referred to as a clause) generates a series of virtual tables with each virtual table feeding into the next clause or phase. And no, we can’t view these virtual tables, sorry about that. That’s just the way SQL works!
When I shared the processing order with my recent SQL1 class, Lindsay Jacks of Cancer Care Ontario asked if I would blog about it. I hope this sharing will help breakdown any surprises you may expect to see when you submit SAS code. And in turn help in understanding how to optimize SQL queries.
Here then is the logical query processing phase order.
The 1st clause SQL processes is the FROM. It tells SQL where to grab your tables from. Given that this clause is the first to execute, it’s also our first chance to boil down your table sizes. From an efficiency perspective, this is why we may need to think about putting as many ON clauses as possible on JOINs versus putting them on the WHERE clause.
The 2nd clause SQL processes is the WHERE. The WHERE clause pre-processes data. It selects just the rows that meet the WHERE criteria. The results of WHERE clause processing are stored in an intermediate table.
SQL coders who have sometimes been surprised to see the WHERE clause fail may now have their answer. In the example that follows, the WHERE has no idea how to filter Bonus and doesn’t know what Bonus means. Just because we constructed Bonus on the SELECT doesn’t mean anything to the WHERE as the SELECT is one of the last clauses in the processing order & yet to be looked at by SQL. Consider the following example where the Log rightly complains:
43 /* Try to subset by referencing a new column with its alias */ 44 proc sql; 45 select Employee_ID, Employee_Gender, Salary, 46 Salary * .10 as Bonus 47 from orion.Employee_payroll 48 where Bonus < 3000; ERROR: The following columns were not found in the contributing tables: Bonus. 49 quit; NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE SQL used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
3. GROUP BY
The 3rd clause SQL processes is the GROUP BY. The role of the GROUP BY is to get your data into groups. To do this grouping, sometimes SQL may need to order your data, but there’s no guarantee that data will be in ascending or descending order. Explicitly placing an ORDER BY clause is our only guarantee that data will be in order.
The 4th clause SQL processes is the HAVING. This clause subsets the groups created by the GROUP BY clause based on the HAVING clause predicates, and, like the WHERE clause, builds an intermediate table.
The 5th clause SQL processes is the SELECT. This is what tells SQL what columns to pull into the query, existing or calculated. Note: Columns are calculated on the SELECT only one time regardless of how often they are referenced, whether on the WHERE (with the calculated keyword) or the ORDER BY (without the calculated keyword). The SELECT statement is able to use the results of anything it calculated in other instances using the same calculation.
6. ORDER BY
The 6th clause SQL processes is the ORDER BY. By now it should come as no surprise that the ORDER BY has access to columns created on the SELECT but the WHERE doesn’t (see 2 above to as a reminder).
Any earlier confusion about the calculated SELECT columns (see 2 again) not being available in the WHERE clause, but available in the ORDER BY clause would now make perfect sense. Because the SELECT is executed before the ORDER BY, all columns in the SELECT will be available at the time of ORDER BY execution. So the following code works perfectly:
68 proc sql; 69 select Employee_ID, Employee_Gender, salary, 70 Salary * .10 as Bonus 71 from orion.Employee_payroll 72 order by 4; 73 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.03 seconds cpu time 0.03 seconds
Itzik Ben-Gan makes a good point with a poster that captures the processing order.
Why do I call this processing order your best friend in the database world? Because we have likely at some time come across never ending queries and not known how to optimize. Here then is a starting point for us to try to keep in mind next time we submit a query, so we can modify the way we optimize SQL queries.
As I wrap up this post, I realized I sort of exhausted my creativity in writing this actual post J and didn’t have the energy to think up a mnemonic like the earlier one of So Few Workers Go Home On Time. Can you think up one for the processing order? I’d love to hear your thoughts.