HAVING (clause) fun with SAS Enterprise Guide

Last week I attended a meeting of the Toronto Area SAS Society. (Okay, I didn't just attend; I was a presenter as well.) This user group meeting contained a feature that I had never seen before: "Solutions to the Posed Problem".

Weeks before the meeting, an "open problem" was posted to the TASS community group site. The problem included 134 records of time series data and an objective: create a data set that contains just the most recent record (according to the date and time values) for each unique identifier in the set.

Here's a snapshot of a data sample, with the desired output records highlighted.

It's a simple little puzzle similar to those that SAS programmers must deal with every day. But the instructions for this particular problem were to "use an interface product" to solve it. The implication was "use SAS Enterprise Guide", but the entrants got creative with that requirement. Some used SAS Enterprise Guide to import the data (which was provided as an Excel file), but then wrote a program to do the rest. Some simply opened a program node and wrote code to do the whole shebang.

Art Tabachneck, one of the TASS leaders and a frequent SAS-L contributor, tried to use just "point-and-click" to solve the problem without resorting to a program node, but he didn't manage it. He knew that the Query Builder was the best chance for an SQL solution, and that he would need a HAVING clause to make it work. He was on the right track! But in the limited time that he had to devote to the problem, he couldn't bend the Query Builder to his will. In the end, Art wound up working within the program node, just like most other participants.

When I returned home, I made it my mission to devise a "pure" task-based solution. And here it is:

Step 1: Import the Excel file
That's easy. Even as Art observed, you simply select File->Import Data, point to the XLS file on your local machine, and then click Finish when the wizard appears. The default behavior produces a perfect SAS data set.

Step 2: Start with the query
With the output from the import step, select the Query Builder task. We need the all of the columns represented, so drag all columns over to the Select Data tab.

Step 3: Create a computed column for a combined date-time value
All of the successful solutions did this step somehow. I borrowed from the most elegant of these and created an "Advanced expression"-based column named "Test_DateTime" as:

DHMS(t1.Test_Date,0,0,t1.test_time)

Step 4: Create a MAX aggregation version of that computed column
Create another column based on Test_DateTime column, and this time apply the MAX summarization. Name the new column MAX_of_Test_DateTime.

This is the trick! As soon as you have an aggregated measure as part of the query, the Filter tab will divide into two sections, revealing a "Filter the summarized data" section. That's the piece that allows you to generate a "Having" structure.

Step 5: Create a Having filter based on the summarized column
The filter is effectively:

   CALCULATED MAX_of_Test_DateTime = CALCULATED Test_DateTime

The Query Builder generates more verbose SQL than the above, but you get the idea. Here's a screen shot that shows the Having filter in place.

Step 6: Change the Grouping to include just SAMPLE_ID
When you have a summarized measure in the query the Query Builder provides control over the grouping behavior. By default, the query is "grouped by" all non-summarized columns. But for this query, we want to group only by each value of the Sample_ID column. On the Select Data tab, uncheck "Automatically select groups". Then click the Edit Groups button and exclude all except for t1.Sample_ID.

Here's the Select Data tab with the proper grouping in place:

When you run the query, you should get the desired result. Here's the SQL that was generated:

PROC SQL;
 CREATE TABLE WORK.LATEST_TIME_PER_ID AS 
 SELECT DISTINCT t1.Sample_id, 
   t1.Test_Date, 
   t1.test_time, 
   t1.test_value
 FROM WORK.OPEN_QUESTIONS t1
 GROUP BY t1.Sample_id
 HAVING (MAX(DHMS(t1.Test_Date, 0, 0, t1.test_time))) 
         = (DHMS(t1.Test_Date, 0, 0, t1.test_time));
QUIT;

I want to thank the members of TASS for this little puzzle. I actually learned some new things by reviewing the various solutions that were entered -- and I think that's the point of the exercise. If you're curious, here's a link to the complete "report" of submitted solutions.

tags: SAS Enterprise Guide, sas users, sql, TASS

15 Comments

  1. Posted September 19, 2012 at 5:29 am | Permalink

    And mission accomplished!!! Great solution.

    Also thanks for sharing the TASS "Solutions to the Posed Problem" feature... something to consider for our local SAS user groups in Australia.

  2. Matt Malczewski
    Posted September 19, 2012 at 8:41 am | Permalink

    This is great, Chris. I'll make sure we post this on all the TASS sites we manage. Thanks again for coming up and giving your talks, they were outstanding!

  3. Art Tabachnek Arthur Tabachneck
    Posted September 19, 2012 at 9:21 am | Permalink

    Chris, very nicely done. In fact, you win the contest's grand prize: a chance to present your solution at one of our upcoming meetings! It was really great having you here and I'm sure that all of our members learned a number of things from your presentations. I know I did!

  4. Chris
    Posted September 19, 2012 at 10:28 pm | Permalink

    Why wouldn't you be able to use the import node, followed with a sort based on SampleID and Date and Time (both descending), followed by a no duplicates sort based soly on sample id?

    • Chris Hemedinger Chris Hemedinger
      Posted September 20, 2012 at 9:38 am | Permalink

      Chris, that works too! It's two passes through the data instead of just one (in addition to the import step), but it produces the same output. In the Sort Data task, the option for "no duplicates" is "Keep only the first record for each 'Sort By' group" on the Options page.

  5. David Birch
    Posted September 20, 2012 at 12:32 am | Permalink

    The "Filter the summarized data" section of the Filter Tab is new to EGuide 5.1 - but doesn't get much attention in the "What's New" documentation.

    I've often created a "pure" task-based solution even in EGuide 4.1 - but it takes three Query nodes instead of one. i.e. separate steps (nodes) for generating the MAX per ID value, then re-joining and filtering. So, it's often been easier and 'cleaner' to use a Code node and write my own SQL with a HAVING clause. However, I would often show newbies the three-step solution when helping them learn SAS.

    Thanks for highlighting this very desired feature.

    • Chris Hemedinger Chris Hemedinger
      Posted September 20, 2012 at 9:34 am | Permalink

      David, actually the "Filter the summarized data" tab (HAVING clause support) has been in place since at least EG 4.1! The requirement is that you have a summarized column and at least one GROUP column.

      • David Birch
        Posted September 20, 2012 at 10:45 pm | Permalink

        Chris, do you have a screen shot of the "Filter the summarized data" tab from EG 4.1 ? I don't have access to EG at the moment, but I've used it extensively and never noticed it, even though filtering on summarized values was something that wanted to do a lot.

      • Mark Jones
        Posted June 14, 2013 at 6:42 am | Permalink

        Hi Chris,

        Thanks for this, the Query Builder is a very good task, I like it a lot and often use it to write queries for me, especially when joining multiple tables at the same time!

        However, I have often wondered (and been asked) why you must GROUP BY at least one column and have a summarised column in the selection in the Query Builder before you are able to specify a HAVING filter. This implies a HAVING clause requires a GROUP BY clause and to output a summarised column which is not the case in SQL. For example it is difficult to re-create the following query in the Query Builder to select the tallest person from SASHELP.CLASS:

        proc sql;
          select * 
          from sashelp.class
          having height = max(height);
        quit;
        

        As no GROUP BY clause is present and no summarised column is present on the SELECT clause the Query Builder does not let you specify a "Filter the Summarized Data" filter. Conversely I can easily pull out the tallest person of each (for example) sex using the Query Builder, although even then I am forced to keep MAX_of_height as a column in the output even though I might not want to output it.

        The best I have managed to do so far is to create a new computed column which just contains a constant and then to GROUP BY that. This is a bit of a fudge and you have to keep the new computed column in the output too (removing it from the selected variables turns the GROUP BY clause into "GROUP BY 1" which is grouping by the first selected column rather than the constant "1". So the underlying SQL generated by the query builder is:

        proc sql;
          select *, max(height) as MAX_of_height, 
           1 as const  /* Do not really want the last two columns in the output */
          from sashelp.class
          group by const
          having height = max(height);
        quit;
        

        Is there a particular reason why the Query Builder requires a GROUP BY clause and a summarised column on the SELECT clause before a HAVING clause can be added?

        Cheers,
        Mark

        • Chris Hemedinger Chris Hemedinger
          Posted June 14, 2013 at 9:30 am | Permalink

          Hi Mark,

          SQL is very powerful -- as you know -- and the query builder does not support every permutation of a query, although it does aim to support the most common operations that even advanced users might need.

          There is a way to achieve your result by DROPping the columns that you don't want, but not within the SQL. Instead, you can go to the "Options for this query" window (click Options toolbar in the query window) and add:

          drop=const max_of_height

          The code that results:

          PROC SQL;
             CREATE TABLE WORK.FOR_MARK(drop=const max_of_height) AS 
             SELECT t1.Name, 
                    t1.Sex, 
                    t1.Age, 
                    t1.Height, 
                    t1.Weight, 
                    /* MAX_of_Height */
                      (MAX(t1.Height)) AS MAX_of_Height, 
                    /* const */
                      (1) AS const
                FROM SASHELP.CLASS t1
                GROUP BY (CALCULATED const)
                HAVING t1.height = max(t1.height);
          QUIT;

          You can also use the new subquery feature within EG 5.1 to achieve a similar result, and it might be more useful. You design the inner query first (calculating the MAX of Height), then save as a template. You can use that template as a filter in a different query. Here's the resulting code:

          PROC SQL;
           CREATE TABLE WORK.outer_query(label="with subquery") AS 
           SELECT t1.Name, 
               t1.Sex, 
               t1.Age, 
               t1.Height, 
               t1.Weight
           FROM SASHELP.CLASS t1
           WHERE t1.Height = (
              SELECT /* MAX_of_Height */
                 (MAX(t1.Height)) AS MAX_of_Height
                 FROM SASHELP.CLASS t1
              );
          QUIT;
          • Posted June 24, 2013 at 8:22 am | Permalink

            Hi Chris. Thanks for getting back to me on this. The sub-query approach is a nice one for EG 5.1+ users. I think I'll still be heading over to the SAS Ballot to request those little tweaks to the Query Builder though :o)

            Cheers,
            Mark

  6. Bob
    Posted September 20, 2012 at 3:29 am | Permalink

    Thanks Chris - very elegant. I'd have used a similar approach, but would have created a dataset of maximum date/times and joined it to the source data

    • Chris Hemedinger Chris Hemedinger
      Posted September 20, 2012 at 9:48 am | Permalink

      Bob, there are twenty ways to do everything in SAS. That's what makes it fun (and sometimes a bit challenging)! If you look at some of the submitted solutions, they use DATA step and the hash object. To me, that's overkill. But some folks use hash as their "hammer" to fix all problems :)

  7. Doc Muhlbaier
    Posted September 25, 2012 at 10:01 pm | Permalink

    I've used the approach a few times. In general, both in teaching and working with fellow SAS programmers, by they time they understand enough SQL to use the HAVING clause, they also prefer to "hand code" those sorts of queries.

One Trackback

  1. [...] are many other ways to accomplish this simple example in SAS. For example, you may be able to use the HAVING clause feature in the Query Builder, or use SQL joins and special join conditions to filter a table based on another result set. But [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>