Dear Miss SAS Answers,
I have read through a few Dear Miss SAS Answers blog posts, but I cannot seem to find what I am looking for. I need to extract only the second duplicate from a whole list of duplicates per account number. Is there a way to do that?
Seconds please,
Olivia
Hi, Olivia;
This is a good application for the NODUPKEY and DUPOUT= options in PROC SORT followed by FIRST. processing in the DATA step! Let’s say you have this data:
Original Data
Account Another
Number Variable
1 2
1 2
1 2
3 4
4 5
4 5
4 5
4 5
6 7
6 7
You can see that there are three observations for account number 1, one observation for account number 3, four observations for account number 4, and two observations for account number 6. The observations you want are highlighted above.
Sorting the original data set by AccountNumber and AnotherVariable with the NODUPKEY option will keep only the first original observation for the AccountNumber||AnotherVariable combination in the OUT= data set. By including the DUPOUT= option, you specify where to write the duplicate observations, rather than just tossing them into the ether.* If you need to determine matches based on more than one variable’s value, be sure to include all those variables in your BY statement.**
If you then read through the DUPOUT= data set and only output the first observation containing each value of AccountNumber, you will have the second duplicate records for each AccountNumber with duplicates in your output data set.
Here’s the SORT step. You can write the first observation for an account number to the single data set and all other observations for that account number to the dups data set based on the BY variable values.
proc sort data=original out=single dupout=dups nodupkey;
by AccountNumber AnotherVariable;
run;
Once you have the dups data set, you can use it as input to a DATA step with a BY statement to output the observation with the first occurrence of a value of AccountNumber (or whatever combination of variables you desire) to the seconds data set.
data seconds;
set dups;
by AccountNumber;
if first.AccountNumber then output;
run;
Here is the seconds data set:
Seconds Data Set
Account Another
Number Variable
1 2
4 5
6 7
If you want to add back the originals where there is no duplicate, you could use the IN= data set options on the data sets read with DATA step MERGE to keep everything from seconds and only the non-matches from single. The unique row for AccountNumber 3 is the new observation added.
data final;
merge single(in=s1)
seconds(in=s2);
by AccountNumber;
if s2 or (s1 and not s2);
run;
Remerged Final Data
Account Another
Number Variable
1 2
3 4
4 5
6 7
I hope this solution helps with your “second duplicate” problem.
Miss SAS Answers
p.s. Many thanks to Olivia in South Africa for this interesting SAS coding challenge!
* In-database processing does not occur when the DUPOUT= option is specified.
**The option for the PROC SORT statement known as NODUPLICATES|NODUPS has been deprecated because it did not function as most people thought it should based on its name.
14 Comments
Can we get by Proc SQL ?
I think this may be more versatile & intuitive for average users:
proc sort data=OriginalData; by AccountNumber AnotherVariable;
data all one two; *you can choose just the "two" if needed, but this allows you to choose all or any #event;
set OriginalData;
by AccountNumber AnotherVariable;
retain event ;
if first.AccountNumber then event=0;
event=event+1;
if event ge 1 then output all;
if event=1 then output one;
if event=2 then output two;
* etc *;
run;
It seems the DATA step with first. processing is the most popular. I wanted to introduce the DUPOUT= option for the PROC SORT statement for those who are not familiar with it. :-)
Hi..
This is the easiest way to get second duplicate obs from the dataset.
data want (drop=x);
set Dups;
by accountnumber;
x=ifn(first.accountanumber,0,lag1(first.accountanumber));
if x;
run;
Regards..
Sanjeev.K
Hi..
This is the easiest way to get second duplicate obs from the dataset.
data want (drop=x);
set y;
by accountnumber;
x=ifn(first.accountanumber,0,lag1(first.accountanumber));
if x;
run;
Regards..
Sanjeev.K
Nice use of the IFN function! Thanks!
Using Miss SAS method, couldn't you use the NODUPKEY option on the dups data set to get the second observation. I assume AnotherVariable is a date/time variable.
proc sort data=dups out=second nodupkey;
by AccountNumber AnotherVariable;
run;
That being said, I prefer using the first.variable and counter; then the program could easily be used for the nth account.
Yes, that is an advantage of the first.variable and counter method. Of course, it is totally up to you which method your choose. As is the case in a lot of places in SAS, there's more than one way to obtain the result you want. My advice to beginning SAS programmers is to pick one method and stick with it. Once you become more proficient in SAS programming skills, then you can explore the other options and decide which is the most efficient for the data and program with which you are working.
As the original question asks for the 'second' duplicate this implies there is some inherent meaning to that record. Therefore either you need to maintain the original order of the dataset before processing duplicate account numbers or you need to assume there is an additional key on which to sort.
Having said that, I favour Clint's approach as there is no guarantee that the dupout option will honour the original order.
Thanks, TC! Actually the setting of the EQUALS|NOEQUALS option for the PROC SORT determines whether or not the original relative order of the observations in the input data set is maintained in the output data set within BY groups. The default is EQUALS, which maintains the order set by the SORTEQUALS|SORTNOEQUALS SAS system option.
You are of course correct. I am just showing my age by still distrusting PROC SORT from the days of V6 ;-)
:-)
Interesting... Here is an alternate approaches that doesn't involve using the dupout= option, just one data step. it looks a bit more complicated but it avoids multiple passes of data:
proc sort data=original;
by acct;
run;
data seconds_only;
set original;
by acct;
drop cnt;
if first.acct then
do;
cnt = 0;
end;
cnt + 1;
if cnt = 2 then
output;
run;
data singles_and_seconds;
set original;
by acct;
drop cnt;
if first.acct then
do;
cnt = 0;
if last.acct then
do;
output;
return;
end;
end;
cnt + 1;
if cnt = 2 then
output;
run;
Thanks, Clint! For clarity's sake, you would run either the first or the second DATA step, depending on what you wanted in the final output data set.