Split data into groups that have the same mean and variance

15
Split data into groups that have the same mean and variance

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program.

A more complex problem is when the researcher wants the distribution of a covariate to be approximately equal for each group. For example, a medical researcher might want each group of patients to have approximately the same mean and variance for their cholesterol levels, as shown in the plot to the right. This second problem is much harder because it involves distributing the units (non-randomly) so that the groups satisfy some objective. Conceptually, you want an assignment that minimizes the difference between moments (mean, variance, skewness,...) of the subgroups.

The OPTEX procedure for optimal design

Creating an experimental design that optimizes some criterion is one of the many uses of the OPTEX procedure in SAS/QC software In fact, this problem is one of the examples in the PROC OPTEX documentation. The example in this blog post follows the documentation example; see the documentation for details.

To solve this assignment problem with PROC OPTEX, you need to specify three things:

  • A data set (call it UNITS) that contains the individual units (often patients) that you want to assign to the treatments. For this example, the units are the living patients in the Sashelp.Heart data set.
  • A data set (call it TREATMENTS) that contains the names of the treatment groups. This example uses five groups with values 1, 2, 3, 4, and 5.
  • The covariate in the problem. The OPTEX procedure will assign units to groups so that the first k moments are approximately equal across treatment groups. This example uses the Cholesterol variable and k=2, which means that the mean and variance of the cholesterol levels for patients in each treatment group will be approximately equal.

The following statements create the two data sets and define a macro variable that contains the name of the Cholesterol variable:

/* Split the living Sashelp.heart patients into five groups so that 
   mean and variance of cholesterol is the same across groups. */
data Units;                   /* each row is a unit to be assigned */
set Sashelp.Heart(where=(status = "Alive"));
keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;
 
%let NumGroups =5;           /* number of treatment groups */
data Treatments;
do Trt = 1 to &NumGroups;    /* Trt is variable that assigns patients to groups */
   output;
end;
run;
 
%let Var = Cholesterol;      /* name of covariate */

As discussed in the PROC OPTEX documentation, the following call creates an output data set (named GROUPS) that assigns each patient to a group (1–5). A call to PROC MEANS displays the mean and variance of each group.

proc optex data=Treatments seed=97531 coding=orthcan;
   class Trt;
   model Trt;              /* specify treatment model */
   blocks design=Units;    /* specify units */
   model &Var &Var*&Var;   /* fixed covariates: &Var--> mean, &Var*&Var--> variance */
   output out=Groups;      /* merged data: units assigned to groups */
run;
 
proc means data=Groups mean std;
  class Trt;
  var &Var;
run;

Success! The table shows that each group has approximately the same size and approximately the same mean and variance of the Cholesterol variable. Remember, though, that this assignment scheme is not random, so be careful not to make unjustified inferences from estimates based on these group assignments.

I am not an expert on experimental design, but a knowledgeable colleague tells me that optimal design theory states that the design that minimizes the variance of the treatment effects (adjusting for the first two moments of the covariate) is the design in which treatment means and variances of the covariate are as equal as possible. This is the design that PROC OPTEX has produced.

The following call to PROC SGPLOT uses box plots to visualize the distribution of the Cholesterol variable across treatment groups. The graph is shown at the top of this article. Clearly the distribution of cholesterol is very similar for each group.

proc sgplot data=Groups;
   vbox &Var / category=Trt;
run;

In summary, the OPTEX procedure in SAS/QC software enables you to assign units to groups, where each group has approximately the same distribution of a specified covariate. In this article, the covariate measured cholesterol levels in patients, but you can also group individuals according to income, scholastic aptitude, and so on. For large data sets, the assignment problem is a challenging optimization problem, but PROC OPTEX provides a concise syntax and solves this problem efficiently.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

15 Comments

  1. Very interesting question. I would try Genetic Algorithm .
    And I think SAS/OR also could solve such kind of question.
    @Rob might try it .

  2. Rick,
    Yeah. But GA can not guarantee you to get the optimal solution.
    That is the reason why I call @RobPratt in the first place.
    I also want know whether OR could solve this problem
    when the dataset has one million obs .

  3. Rick,
    What happen if using more than one variable other than one to split the data given the sample size is large enough?

    • Rick Wicklin

      Great question! PROC OPTEX supports multiple variables. For example, if you want the groups to have patients that have the same average diastolic and systolic blood pressures, you can write:

      %let Var = systolic diastolic;      /* names of multiple covariates */
      proc optex data=Treatments seed=97531 coding=orthcan;
         class Trt;
         model Trt;              /* specify treatment model */
         blocks design=Units;    /* specify units */
         model &Var;             /* multiple covariate means will be approx same */
         output out=Groups;      /* merged data: units assigned to groups */
      run;
  4. Rick,
    The following code should look right and get the similar result as yours.

    Analysis Variable : Cholesterol
    COL_MEM N Obs Mean Std Dev
    1 630 221.9544715 43.0901775
    2 662 221.9627907 43.1227683
    3 629 221.9457237 43.0978582
    4 648 221.9667722 43.0939030
    5 649 221.9685039 43.0944852

    data Units;            
    set Sashelp.Heart(where=(status = "Alive"));
    keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
    run;
     
    proc iml;
    use Units nobs nobs;
    read all var {Cholesterol};
    close;
     
    start function(x) global(Cholesterol,nobs,group,mean_std);
    if countunique(x)=group then do;
    do i=1 to group;
     idx=loc(x=i);
     temp=Cholesterol[idx]; 
     mean_std[i,1]=mean(temp);
     mean_std[i,2]=std(temp);
    end;
    sse=sum(mean_std[,]-mean_std[><,]);
    end;
    else sse=999999;
    return (sse);
    finish;
     
    group=5;  /* <--Change it(divide into 5 groups)*/
    mean_std=j(group,2,.);
     
    encoding=j(2,nobs,1);
    encoding[2,]=group;    
     
    id=gasetup(2,nobs,123456789);
    call gasetobj(id,0,"function");
    call gasetsel(id,10,1,1);
    call gainit(id,1000,encoding);
     
    niter = 100;
    do i = 1 to niter;
     call garegen(id);
     call gagetval(value, id);
    end;
    call gagetmem(mem, value, id, 1);
     
    col_mem=t(mem);
    create group var {col_mem};
    append;
    close;
     
    print value[l = "Min Value:"] ;
    call gaend(id);
    quit;
     
    data want;
     merge group Units(keep=Cholesterol);
    run;
    proc means data=want mean std;
      class col_mem;
      var Cholesterol;
    run;
    • Nice.
      I've managed to run KSHARP's GA code with 2 variables. The code is attached below.

      I think it's cleaner to apply the ## operator (elementwise power operator) to the objective function to be minimized (SSE in KSHARP's code).
      I wonder which command in the GA setup forces the group split to be relatively equal in size of the number of observations. Thanks in advance for the answer.

      proc iml;
      use units nobs nobs;
      read all var {Cholesterol AgeAtStart};
      close;

      start function(x) global(Cholesterol, AgeAtStart, nobs, group, mean_std);
      if countunique(x)=group then do;
      do i=1 to group;
      idx=loc(x=i);
      temp=AgeAtStart[idx];
      mean_std[i,1]=mean(temp);
      mean_std[i,2]=std(temp);

      temp1=Cholesterol[idx];
      mean_std[i,3]=mean(temp1);
      mean_std[i,4]=std(temp1);

      end;
      sse=sum((mean_std[,]-mean_std[><,])##2) ;
      end;
      else sse=999999;
      return (sse);
      finish;

      group=5; /* <--Change it(divide into 5 groups)*/
      mean_std=repeat(j(group,2,.),1,2);
      /*print mean_std;*/

      encoding=j(2,nobs,1);

      encoding[2,]=group;

      id=gasetup(2,nobs,123456789);
      call gasetobj(id,0,"function");
      call gasetsel(id,10,1,1);
      call gainit(id,100,encoding);

      niter = 50;
      do i = 1 to niter;
      call garegen(id);
      call gagetval(value, id);
      end;
      call gagetmem(mem, value, id, 1);

      col_mem=t(mem);
      create group var {col_mem};
      append;
      close;

      print value[l = "Min Value:"] ;
      call gaend(id);
      quit;

      data want;
      merge group units(keep=AgeAtStart Cholesterol);
      run;
      proc means data=want mean std;
      class col_mem;
      var AgeAtStart Cholesterol;
      run;

      • Arne ,
        You can use special encoding (fixed-length sequence) and SHAPE() function.

        I noticed there are some missing value in Cholesterol ,so I fixed it again and looks like better than your result.

        OUTPUT:
        The MEANS Procedure
        Analysis Variable : Cholesterol
        group N Obs Mean Std Dev
        1 627 221.9649123 43.1137738
        2 627 221.9665072 43.1220553
        3 627 221.9505582 43.0823468
        4 627 221.9569378 43.0712286
        5 627 221.9601276 43.1105645

        data Units;            
        set Sashelp.Heart(where=(status = "Alive"));
        n+1;
        keep n Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
        run;
         
        proc iml;
        use Units nobs nobs;
        read all var {Cholesterol};
        close;
         
        start function(x) global(Cholesterol,group,mean_std);
        xx=shape(x,group,0,.);
        do i=1 to group;
         idx=setdif(xx[i,],{.});
         temp=Cholesterol[idx]; 
         mean_std[i,1]=mean(temp);
         mean_std[i,2]=std(temp);
        end;
        sse=sum(mean_std[,]-mean_std[><,]);
        return (sse);
        finish;
         
        group=5;  /* <--Change it(divide into 5 groups)*/
        mean_std=j(group,2,.);
         
        id=gasetup(3,nobs,123456789);
        call gasetobj(id,0,"function");
        call gasetsel(id,10,1,1);
        call gainit(id,1000);
         
        niter = 100;
        do i = 1 to niter;
         call garegen(id);
         call gagetval(value, id);
        end;
        call gagetmem(mem, value, id, 1);
         
        col_mem=shape(mem,group,0,.);
         
        create group from col_mem;
        append from col_mem;
        close;
         
        print value[l = "Min Value:"] ;
        call gaend(id);
        quit;
         
        data group;
         set group;
         group+1;
        run;
        proc transpose data=group out=member(drop=_: index=(col1));
        by group;
        var col:;
        run;
         
        data want;
         merge member(rename=(col1=n) where=(n is not missing)) Units(keep=n Cholesterol);
         by n;
        run;
        proc means data=want mean std;
          class group;
          var Cholesterol;
        run;
  5. Rick,
    It seems blog eat " " .
    original code
    sum(mean_std[ ,]-mean_std[><,]);
    should be
    sum(mean_std[ ,]-mean_std[><,]);

Leave A Reply

Back to Top