Selecting the Second Duplicate Observation

 

Dear Miss SAS Answers,

I have read through a few Dear Miss SAS Answers blog posts, but I cannot seem to find what I am looking for. I need to extract only the second duplicate from a whole list of duplicates per account number. Is there a way to do that?

Seconds please,

Olivia

Hi, Olivia;

This is a good application for the NODUPKEY and DUPOUT= options in PROC SORT followed by FIRST. processing in the DATA step! Let’s say you have this data:

   Original Data

Account     Another

 Number    Variable

               1                    2

               1                    2

               1                    2

               3                   4

               4                   5

               4                   5

               4                   5

               4                   5

               6                   7

               6                  7

You can see that there are three observations for account number 1, one observation for account number 3, four observations for account number 4, and two observations for account number 6.  The observations you want are highlighted above.

Sorting the original data set by AccountNumber and AnotherVariable with the NODUPKEY option will keep only the first original observation for the AccountNumber||AnotherVariable  combination in the OUT= data set.  By including the DUPOUT= option, you specify where to write the duplicate observations, rather than just tossing them into the ether.* If you need to determine matches based on more than one variable’s value, be sure to include all those variables in your BY statement.**

If you then read through the DUPOUT= data set and only output the first observation containing each value of AccountNumber, you will have the second duplicate records for each AccountNumber with duplicates in your output data set. 

Here’s the SORT step.  You can write the first observation for an account number to the single data set and all other observations for that account number to the dups data set based on the BY variable values. 

proc sort data=original out=single dupout=dups nodupkey;

  by AccountNumber AnotherVariable;

run;

Once you have the dups data set, you can use it as input to a DATA step with a BY statement to output the observation with the first occurrence of a value of AccountNumber  (or whatever combination of variables you desire) to the seconds data set.

data seconds;

  set dups;

  by AccountNumber;

  if first.AccountNumber then output;

run;

Here is the seconds data set:

 Seconds Data Set

Account     Another

 Number    Variable

               1                    2

               4                    5

               6                    7

 

If you want to add back the originals where there is no duplicate, you could use the IN= data set options on the data sets read with DATA step MERGE to keep everything from seconds and only the non-matches from single.  The unique row for AccountNumber 3 is the new observation added.

data final;

   merge single(in=s1)

                 seconds(in=s2);

   by AccountNumber;

   if s2 or (s1 and not s2);

run;

Remerged Final Data

 Account     Another

 Number    Variable

                1                   2

                3                   4

                4                   5

                6                   7

I hope this solution helps with your “second duplicate” problem. 

Miss SAS Answers

p.s. Many thanks to Olivia in South Africa for this interesting SAS coding challenge!

* In-database processing does not occur when the DUPOUT= option is specified.

**The option for the PROC SORT statement known as NODUPLICATES|NODUPS has been deprecated because it did not function as most people thought it should based on its name.

tags: miss sas answers, programming

13 Comments

  1. Clint Rickards
    Posted September 20, 2012 at 11:41 am | Permalink

    Interesting... Here is an alternate approaches that doesn't involve using the dupout= option, just one data step. it looks a bit more complicated but it avoids multiple passes of data:

    proc sort data=original;
    by acct;
    run;

    data seconds_only;
    set original;
    by acct;
    drop cnt;
    if first.acct then
    do;
    cnt = 0;
    end;

    cnt + 1;
    if cnt = 2 then
    output;
    run;

    data singles_and_seconds;
    set original;
    by acct;
    drop cnt;
    if first.acct then
    do;
    cnt = 0;
    if last.acct then
    do;
    output;
    return;
    end;
    end;

    cnt + 1;
    if cnt = 2 then
    output;
    run;

    • Miss SAS Answers Miss SAS Answers
      Posted September 20, 2012 at 3:32 pm | Permalink

      Thanks, Clint! For clarity's sake, you would run either the first or the second DATA step, depending on what you wanted in the final output data set.

  2. TC
    Posted September 20, 2012 at 2:14 pm | Permalink

    As the original question asks for the 'second' duplicate this implies there is some inherent meaning to that record. Therefore either you need to maintain the original order of the dataset before processing duplicate account numbers or you need to assume there is an additional key on which to sort.

    Having said that, I favour Clint's approach as there is no guarantee that the dupout option will honour the original order.

    • Miss SAS Answers Miss SAS Answers
      Posted September 20, 2012 at 3:43 pm | Permalink

      Thanks, TC! Actually the setting of the EQUALS|NOEQUALS option for the PROC SORT determines whether or not the original relative order of the observations in the input data set is maintained in the output data set within BY groups. The default is EQUALS, which maintains the order set by the SORTEQUALS|SORTNOEQUALS SAS system option.

      • TC
        Posted September 21, 2012 at 3:27 am | Permalink

        You are of course correct. I am just showing my age by still distrusting PROC SORT from the days of V6 ;-)

        • Miss SAS Answers Miss SAS Answers
          Posted September 24, 2012 at 1:19 pm | Permalink

          :-)

  3. AJ
    Posted September 20, 2012 at 3:24 pm | Permalink

    Using Miss SAS method, couldn't you use the NODUPKEY option on the dups data set to get the second observation. I assume AnotherVariable is a date/time variable.

    proc sort data=dups out=second nodupkey;
    by AccountNumber AnotherVariable;

    run;

    That being said, I prefer using the first.variable and counter; then the program could easily be used for the nth account.

    • Miss SAS Answers Miss SAS Answers
      Posted September 24, 2012 at 1:22 pm | Permalink

      Yes, that is an advantage of the first.variable and counter method. Of course, it is totally up to you which method your choose. As is the case in a lot of places in SAS, there's more than one way to obtain the result you want. My advice to beginning SAS programmers is to pick one method and stick with it. Once you become more proficient in SAS programming skills, then you can explore the other options and decide which is the most efficient for the data and program with which you are working.

  4. sanjeev
    Posted October 1, 2012 at 9:25 am | Permalink

    Hi..

    This is the easiest way to get second duplicate obs from the dataset.

    data want (drop=x);
    set y;
    by accountnumber;
    x=ifn(first.accountanumber,0,lag1(first.accountanumber));
    if x;
    run;

    Regards..
    Sanjeev.K

    • Miss SAS Answers Miss SAS Answers
      Posted October 1, 2012 at 1:25 pm | Permalink

      Nice use of the IFN function! Thanks!

  5. sanjeev
    Posted October 1, 2012 at 9:27 am | Permalink

    Hi..
    This is the easiest way to get second duplicate obs from the dataset.
    data want (drop=x);
    set Dups;
    by accountnumber;
    x=ifn(first.accountanumber,0,lag1(first.accountanumber));
    if x;
    run;
    Regards..
    Sanjeev.K

  6. Deborah Hurley
    Posted October 2, 2012 at 1:38 pm | Permalink

    I think this may be more versatile & intuitive for average users:
    proc sort data=OriginalData; by AccountNumber AnotherVariable;
    data all one two; *you can choose just the "two" if needed, but this allows you to choose all or any #event;
    set OriginalData;
    by AccountNumber AnotherVariable;
    retain event ;
    if first.AccountNumber then event=0;
    event=event+1;
    if event ge 1 then output all;
    if event=1 then output one;
    if event=2 then output two;
    * etc *;
    run;

    • Miss SAS Answers Miss SAS Answers
      Posted October 2, 2012 at 4:14 pm | Permalink

      It seems the DATA step with first. processing is the most popular. I wanted to introduce the DUPOUT= option for the PROC SORT statement for those who are not familiar with it. :-)

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>