HAVING (clause) fun with SAS Enterprise Guide

Last week I attended a meeting of the Toronto Area SAS Society. (Okay, I didn't just attend; I was a presenter as well.) This user group meeting contained a feature that I had never seen before: "Solutions to the Posed Problem".

Weeks before the meeting, an "open problem" was posted to the TASS community group site. The problem included 134 records of time series data and an objective: create a data set that contains just the most recent record (according to the date and time values) for each unique identifier in the set.

Here's a snapshot of a data sample, with the desired output records highlighted.

It's a simple little puzzle similar to those that SAS programmers must deal with every day. But the instructions for this particular problem were to "use an interface product" to solve it. The implication was "use SAS Enterprise Guide", but the entrants got creative with that requirement. Some used SAS Enterprise Guide to import the data (which was provided as an Excel file), but then wrote a program to do the rest. Some simply opened a program node and wrote code to do the whole shebang.

Art Tabachneck, one of the TASS leaders and a frequent SAS-L contributor, tried to use just "point-and-click" to solve the problem without resorting to a program node, but he didn't manage it. He knew that the Query Builder was the best chance for an SQL solution, and that he would need a HAVING clause to make it work. He was on the right track! But in the limited time that he had to devote to the problem, he couldn't bend the Query Builder to his will. In the end, Art wound up working within the program node, just like most other participants.

When I returned home, I made it my mission to devise a "pure" task-based solution. And here it is:

Step 1: Import the Excel file
That's easy. Even as Art observed, you simply select File->Import Data, point to the XLS file on your local machine, and then click Finish when the wizard appears. The default behavior produces a perfect SAS data set.

Step 2: Start with the query
With the output from the import step, select the Query Builder task. We need the all of the columns represented, so drag all columns over to the Select Data tab.

Step 3: Create a computed column for a combined date-time value
All of the successful solutions did this step somehow. I borrowed from the most elegant of these and created an "Advanced expression"-based column named "Test_DateTime" as:

DHMS(t1.Test_Date,0,0,t1.test_time)

Step 4: Create a MAX aggregation version of that computed column
Create another column based on Test_DateTime column, and this time apply the MAX summarization. Name the new column MAX_of_Test_DateTime.

This is the trick! As soon as you have an aggregated measure as part of the query, the Filter tab will divide into two sections, revealing a "Filter the summarized data" section. That's the piece that allows you to generate a "Having" structure.

Step 5: Create a Having filter based on the summarized column
The filter is effectively:

   CALCULATED MAX_of_Test_DateTime = CALCULATED Test_DateTime

The Query Builder generates more verbose SQL than the above, but you get the idea. Here's a screen shot that shows the Having filter in place.

Step 6: Change the Grouping to include just SAMPLE_ID
When you have a summarized measure in the query the Query Builder provides control over the grouping behavior. By default, the query is "grouped by" all non-summarized columns. But for this query, we want to group only by each value of the Sample_ID column. On the Select Data tab, uncheck "Automatically select groups". Then click the Edit Groups button and exclude all except for t1.Sample_ID.

Here's the Select Data tab with the proper grouping in place:

When you run the query, you should get the desired result. Here's the SQL that was generated:

PROC SQL;
 CREATE TABLE WORK.LATEST_TIME_PER_ID AS 
 SELECT DISTINCT t1.Sample_id, 
   t1.Test_Date, 
   t1.test_time, 
   t1.test_value
 FROM WORK.OPEN_QUESTIONS t1
 GROUP BY t1.Sample_id
 HAVING (MAX(DHMS(t1.Test_Date, 0, 0, t1.test_time))) 
         = (DHMS(t1.Test_Date, 0, 0, t1.test_time));
QUIT;

I want to thank the members of TASS for this little puzzle. I actually learned some new things by reviewing the various solutions that were entered -- and I think that's the point of the exercise. If you're curious, here's a link to the complete "report" of submitted solutions.

tags: SAS Enterprise Guide, sas users, sql, TASS

12 Comments

  1. Posted September 19, 2012 at 5:29 am | Permalink

    And mission accomplished!!! Great solution.

    Also thanks for sharing the TASS "Solutions to the Posed Problem" feature... something to consider for our local SAS user groups in Australia.

  2. Matt Malczewski
    Posted September 19, 2012 at 8:41 am | Permalink

    This is great, Chris. I'll make sure we post this on all the TASS sites we manage. Thanks again for coming up and giving your talks, they were outstanding!

  3. Art Tabachnek Arthur Tabachneck
    Posted September 19, 2012 at 9:21 am | Permalink

    Chris, very nicely done. In fact, you win the contest's grand prize: a chance to present your solution at one of our upcoming meetings! It was really great having you here and I'm sure that all of our members learned a number of things from your presentations. I know I did!

  4. Chris
    Posted September 19, 2012 at 10:28 pm | Permalink

    Why wouldn't you be able to use the import node, followed with a sort based on SampleID and Date and Time (both descending), followed by a no duplicates sort based soly on sample id?

    • Chris Hemedinger Chris Hemedinger
      Posted September 20, 2012 at 9:38 am | Permalink

      Chris, that works too! It's two passes through the data instead of just one (in addition to the import step), but it produces the same output. In the Sort Data task, the option for "no duplicates" is "Keep only the first record for each 'Sort By' group" on the Options page.

  5. David Birch
    Posted September 20, 2012 at 12:32 am | Permalink

    The "Filter the summarized data" section of the Filter Tab is new to EGuide 5.1 - but doesn't get much attention in the "What's New" documentation.

    I've often created a "pure" task-based solution even in EGuide 4.1 - but it takes three Query nodes instead of one. i.e. separate steps (nodes) for generating the MAX per ID value, then re-joining and filtering. So, it's often been easier and 'cleaner' to use a Code node and write my own SQL with a HAVING clause. However, I would often show newbies the three-step solution when helping them learn SAS.

    Thanks for highlighting this very desired feature.

  6. Bob
    Posted September 20, 2012 at 3:29 am | Permalink

    Thanks Chris - very elegant. I'd have used a similar approach, but would have created a dataset of maximum date/times and joined it to the source data

    • Chris Hemedinger Chris Hemedinger
      Posted September 20, 2012 at 9:48 am | Permalink

      Bob, there are twenty ways to do everything in SAS. That's what makes it fun (and sometimes a bit challenging)! If you look at some of the submitted solutions, they use DATA step and the hash object. To me, that's overkill. But some folks use hash as their "hammer" to fix all problems :)

  7. Doc Muhlbaier
    Posted September 25, 2012 at 10:01 pm | Permalink

    I've used the approach a few times. In general, both in teaching and working with fellow SAS programmers, by they time they understand enough SQL to use the HAVING clause, they also prefer to "hand code" those sorts of queries.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>