Selecting all observations when any observation is of interest

Now, we all know by now that I'm not a programmer (that makes me very sad sometimes and may frustrate some of you at times), but I know a good paper and presentation when I see one. Christopher Bost knows how to teach a topic. I went to his Tuesday afternoon SAS Global Forum presentation because his paper sounded so interesting. (The room was packed, and no one left, so there must have been others who agreed with me.)

Bost's paper, Selecting All Observations When Any Observation Is of Interest, was written for those who work with medical claims data, fund balances, insurance claims data (you get it - any large data set that might contain multiple observations per person). His paper and presentation were about trading efficiencies - Can you sacrifice a little time processing the data, if the trade is a gain in a simpler way of coding?

Bost proposed three methods: the DATA step approach and then two different ways to produce the same results with PROC SQL.

DATA Step

According to Bost, a common method for selecting all observations when any observation is of interest is to use a match-merge. This requires three steps:

Subset the observations of interest
Keep one observation per person
Match-merge the data

"By the way, I want to go on record as saying that I did this the most inefficient way possible because I wanted to show how efficient PROC SQL is," Bost told the audience with a grin. "You can do it more efficiently." (See the code in his paper.)

To put proof to Bost's pudding, he provided a list of pros and cons. First, he says that DATA step works fine and everybody should know how to do it, but what if the files were huge? It took three DATA steps and two PROC SORT steps on his tiny example data set.

PROC SQL

Here's the order when writing a PROC SQL clause (I tried to take a picture of this, so Bost sent me his slide):

Bost chose two methods in PROC SQL, but he says there are many PROC SQL methods that he could have chosen. He chose to show the audience a SUBQUERY and the GROUP BY and HAVING clauses. (Again, I won't include the code here. Bost has done a better job than I could at explaining it in his paper.)

Pros of PROC SQL methods used in his paper:

Subtle difference in sorting data - DATA step processing is done - almost always - sequentially. "In SQL, processing is not done from first to last. In fact, believe it or not, not doing things sequentially can be much more efficient - depending upon what you are doing," said Bost.
Single step instead of five steps
Flexibility

Cons:

Two passes through the data set - one for the subquery and one for the outer query. "As Kirk Lafler explained it to me," said Bost. "It is possible that the subquery gets executed for each observation." Possible.

Check out Bost's paper, and if you get a chance to hear him present (teach), don't pass it up!!

These code examples are provided as is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

Blogs