Something for nothing? Adding group descriptive statistics

8

Can you actually get something for nothing?  With PROC SQL’s subquery and remerging features, yes, you can.

Often there is a need to add group descriptive statistics such as group counts, minimum and maximum values for further by-group processing.  Instead of first creating the group count, minimum or maximum values and then merging the summarized dataset to the original dataset, why not take advantage of PROC SQL to complete two steps in one?  With PROC SQL’s subquery and summary functions by the group variable, you can easily remerge the new group descriptive statistics back to the original dataset. 

By selecting all variables from the original data set and applying a LEFT JOIN, we are preserving the original dataset.  In the subquery, an internal temporary dataset is created based on selecting both the linking variable and the new group descriptive statistics variable.  The linking variable uniquely identifies records.  A summary function is applied to the linking variable with the GROUP BY clause.  Conditions can be applied as needed, for example, to exclude missing values.

The advantages of applying subqueries within remerge techniques is that the original dataset does not change in data content or record sort order.  Without subqueries, however, the GROUP BY clause is applied with the wildcard ‘*’ which selects all variables.  In general, this approach should be avoided since it may incorrectly cause values to be summarized or records to be rearranged.  This method may be more appropriate for selected limited number of variables in the original dataset.

proc sql;
create table class2 as
   select a.*, b.sexn                    /* add group count variable */
   from sashelp.class as a
   left join                             /* keep all records */
   (select name, count(name) as sexn     /* create group count variable */
   from sashelp.class  where sex > ' '   /* apply condition as needed */
   group by sex                          /* group by variable */
   ) as b on a.name=b.name;              /* link by detail variable */
quit;

As can be seen, this technique can be duplicated for multiple group descriptive statistics. In the second example, several variables are added - SEX, AGEN and AGEMX. The first subquery is grouped by SEX and the second subquery is grouped by AGE. Without the need for a second DATA step or creating a second dataset, you can now quickly add group descriptive statistics and move forward to the next by-group processing step.
proc sql;
  create table class3 as
  select  a.*, b.sexn, c.agen, c.agemx   /* add group count variables */
  from sashelp.class as a
  left join
  (select name, count(name) as sexn      /* first subquery */
  from sashelp.class  where sex > ' '
  group by sex
  ) as b on a.name=b.name
  left join                              /* second subquery */
  (select name, count(age) as agen, max(age) as agemx /* Count & Max */
  from sashelp.class  where age > .
  group by age
  ) as c on a.name=c.name
 ;
quit;


These code examples are provided as is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.


 

Share

About Author

Sunil Gupta

Associate Director, Statistical Programming, Cytel

Sunil Gupta is a presenter at SAS Global Forum and a contributor to SAS Users blog. You may reach him at SAS Savvy—Smarter SAS Searches.

Related Posts

8 Comments

  1. hi ... nice examples, but this seems useless ...

    (select name, count(age) as agen, max(age) as agemx /* Count & Max */
    from sashelp.class where age > .
    group by age
    ) as c on a.name=c.name

    since if you group by AGE, the MAX(AGE) will just be the AGE that's already in each observation of the original data set ... so AGEMX always is the same value as AGE

    • Thanks Mike for your comment. Yes, that is true. Instead of using AGE as a grouping vairable, it would be better to use another variable that better groups records instead of one that is generaly unique. SAS programmers have the option of getting descriptive stats on different variables within the same subquery if the grouping variable is the same or create a second subquery if the grouping variable is different.

  2. Jaydip Thakar on

    In first and second example of the code, the query should conatain "Not equal to " in place of ">" sign..please correct me if i am wrong..

  3. To make the SAS tip complete, below is a better example of using two different group variables, SEX and RACE, to produce several types of group descriptive statistics - COUNT, MIN and MAX. Note that for the SEX grouping variable, while COUNT is based on SEX, MIN and MAX are based on WEIGHT. I created RACE as the second group variable.

    As I indicated, the alternative method to directly apply GROUP BY without the subquery will rearrange your dataset, which may not be wanted.

    data class_r;
    set sashelp.class;

    if _n_ < 10 then race='White';
    else if _n_ < 16 then race='Black';
    else race='Other';
    run;

    proc sql;
    create table class4 as
    select a.*, b.sexn, b.minwgt, b.maxwgt, c.racen
    from sashelp.class as a
    left join
    (select name, count(sex) as sexn, min(weight) as minwgt, max(weight) as maxwgt
    from sashelp.class where sex > ' ' group by sex
    ) as b on a.name=b.name
    left join
    (select name, count(race) as racen
    from class_r where race > ' ' group by race
    ) as c on a.name=c.name
    ;
    quit;

  4. How about getting more from nothing? Now you can easily create a flag variable PEAK that flags the a record that, for example, contains the maximum pulse for each patient. Notice how the intermediate group descriptive variable MPULSE is useful to create the PEAK variable but is not required to be saved in the final dataset. Using Proc SQL this way makes it so easy and yet so powerful! Let me know what you think of this Proc SQL example.

    proc sql;
    create table vitals2 as
    select a.*,
    case
    when b.mpulse = pulse then 'Y'
    else ''
    end as peak length=1 format=$1.
    from vital as a
    left join
    (select usubjid, max(pulse) as mpulse
    from vital where pulse > . group by usubjid) as b on a.usubjid=b.usubjid;
    quit;

  5. Thank you Sunil! the kind of examples you provided makes it easy to understand the sub queries better, proc sql will definitely reduce the number of steps required to derive variable based on summary statistics.

Leave A Reply

Back to Top