Selecting the top n% and bottom n% of observations from a data set

0

n% of observations from a data setSAS® offers several ways that you can find the top n% and bottom n% of data values based on a numeric variable. The RANK procedure with the GROUPS= option is one method. Another method is The UNIVARIATE procedure with the PCTLPTS= option. Because there are several ways to perform this task, you can choose the procedure that you are most familiar with. In this blog post, I use the SUMMARY procedure to generate the percentile values and macro logic to dynamically choose the desired percentile statistics. After the percentiles are generated, I subset the data set based on those values. This blog post provides two detailed examples: one calculates percentiles for a single variable and one calculates percentiles within a grouping variable.

Calculate Percentiles of a Single Variable

Calculating percentiles of a single variable includes the following steps. Within the macro, a PROC SUMMARY step calculates the percentiles. The subsequent DATA step uses CALL SYMPUTX to create macro variables for the percentile values, and the final DATA step uses those macro variables to subset the data. Here is the code, which is explained in detail below:

/* Create sample data */
data test;                   
   do i=1 to 10000;                                                     
      x=ranuni(i)*12345;                                         
      output;                                                         
   end; 
   drop i; 
run;     
 
proc sort data=test;
   by x;
run; 
 
%macro generate_percentiles(ptile1,ptile2); 
/* Output desired percentile values */                         
proc summary data=test;                                               
   var x;                                                       
   output out=test1 &ptile1= &ptile2= / autoname;                               
run;                                                                 
 
/* Create macro variables for the percentile values */     
data _null_;                                                         
   set test1;                                                         
   call symputx("&ptile1", x_&ptile1);                                     
   call symputx("&ptile2", x_&ptile2);                                     
run;    
%put &&&ptile1;
%put &&&ptile2; 
 
data test2;                                                             
   set test;                                                           
/* Use a WHERE statement to subset the data */                         
   where x le &&&ptile1 or x ge &&&ptile2;                                       
run;  
 
proc print;
run; 
 
%mend;
 
options mprint mlogic symbolgen;
%generate_percentiles(p1,p99)
%generate_percentiles(p25,p75)

After creating and sorting the sample data, I begin my macro definition with two parameters that enable me to substitute the desired percentiles in my macro invocation:

%macro generate_percentiles(ptile1,ptile2);

The PROC SUMMARY step writes the desired percentiles for variable X to the Test1 data set. The AUTONAME option names the percentile statistics in the following format, <varname>_<percentile> (for example, x_p25).

proc summary data=test;                                               
   var x;                                                       
   output out=test1 &ptile1= &ptile2= / autoname;                               
run;

Next, I want to store the values of the percentile statistics in macro variables so that I can use them in later processing. I use CALL SYMPUTX to do this, which gives the macro variables the same name as the statistic. To see the resulting values in the log, I use indirect macro referencing (notice the 3 &s) in the %PUT statement:

data _null_;                                                         
   set test1;                                                         
   call symputx("&ptile1", x_&ptile1);                                     
   call symputx("&ptile2", x_&ptile2);                                     
run;    
%put &&&ptile1;
%put &&&ptile2;

The SAS log shows the following:

MLOGIC(GENERATE_PERCENTILES):  %PUT &&&ptile1
SYMBOLGEN:  && resolves to &.
SYMBOLGEN:  Macro variable PTILE1 resolves to p1
SYMBOLGEN:  Macro variable P1 resolves to 123.22158288
123.22158288
MLOGIC(GENERATE_PERCENTILES):  %PUT &&&ptile2
SYMBOLGEN:  && resolves to &.
SYMBOLGEN:  Macro variable PTILE2 resolves to p99
SYMBOLGEN:  Macro variable P99 resolves to 12232.136483
12232.136483

I use these macro variables in a WHERE statement within a DATA step to subset the data set based on the percentile values:

data test2;                                                             
   set test;                                                           
/* Use a WHERE statement to subset the data */                         
   where x le &&&ptile1 or x ge &&&ptile2;                                       
run;

Finally, the macro invocations below pass in the desired percentile statistics:

%generate_percentiles(p1,p99)
%generate_percentiles(p25,p75)

The percentile statistics that are available with PROC SUMMARY are included in the documentation for the Output statement.

Calculate Percentiles within Groups

If you want to calculate percentiles based on BY groups, then you can modify the code as shown below. The logic is similar to the steps for calculating percentiles of a single variable, with the addition of macro variables that represent each BY group. For further explanation, see the comments included in the code.

/* Create sample data */
data test; 
 do group='a','b'; 
   do i=1 to 10000;                                                     
      x=ranuni(i)*12345;                                         
      output;                                                         
   end; 
 end;
   drop i; 
run;     
 
proc sort data=test;
   by group x;
run; 
 
%macro generate_percentiles(ptile1,ptile2); 
/* Output desired percentile values by group */                         
proc summary data=test; 
   by group; 
   var x;                                                       
   output out=test1 &ptile1= &ptile2= / autoname;                               
run;                                                                 
 
/* Create macro variables for each value of the BY variable */
/* Create macro variables for the percentile values for each BY group */ 
/* Create a macro variable that is the count of the unique
values of the BY variable */ 
data _null_;   
  retain count 0; 
   set test1;   
   by group;
   if first.group then do;
    count+1;
    call symputx('val'||left(count),group); 
    call symputx("&ptile1"||'_'||left(count), x_&ptile1);                                     
    call symputx("&ptile2"||'_'||left(count), x_&ptile2);  
   end; 
  call symput('last',left(count));
run;    
%put _user_;
 
/* Loops through each value of the BY variable */ 
%do i=1 %to &last;
 
data test&i;                                                             
   set test;  
   where group="&&val&i"; 
/* Use an IF statement to subset the data */
   if x le &&&ptile1._&i or x ge &&&ptile2._&i;                                       
run;  
 
proc print;
run; 
 
%end;
%mend;
 
options mprint mlogic symbolgen;
%generate_percentiles(p1,p99)

Calculating percentiles has many applications, including ranking data, finding outliers, and subsetting data. Using a procedure in Base SAS® that enables you to request percentile statistics along with the power of the macro language, you can dynamically generate desired values that can be used for further processing and analysis.

Share

About Author

Kathryn McLawhorn

Principal Technical Support Analyst

Kathryn McLawhorn has worked in Technical Support at SAS since 1997. She started as a consultant in the Base Procedures and ODS group, and she is currently a consultant in the CAS/Open Source Languages and SAS Programming group. She primarily supports report writing, ODS, and Base summary procedures. Kathryn has her SAS Certification in both Base Programming for SAS 9 and Advanced Programming for SAS 9.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top