Summarizing data

0

Because it is near the end of the year, I thought a blog about "Summarizing" data might be in order.

For these examples, I am going to use a simulated data set called Drug_Study, containing some categorical and numerical variables. For those interested readers, the SAS code that I used to generate the data set Drug_Study is shown below. (Please refer to two earlier blogs that describe how to create simulated data sets.)

*Program to create data set Drug_Study';
 
proc format;
   value $Gender 'M' = 'Male'
                 'F' = 'Female'; 
   run;
 
   data Drug_Study;
   call streaminit(13579);
   length Chol_Group $ 6;
   do i = 1 to 1000;
      do Drug = 'Placebo','A','B';
         Subject + 1;
         if rand('Bernoulli',.5) = 1 then Gender = 'F';
         else Gender = 'M';
 
         HR = rand('Normal',80,10) - 10*(Drug = 'A') + 10*(Drug = 'B') 
             - 5*(Gender = 'F');
         HR = round(HR);
 
         Cholesterol = rand('Normal',200,20) - 20*(Drug = 'A') -10*(Drug = 'B') 
                       - 10*(Gender = 'F');
         Cholesterol = round(Cholesterol);
 
         if Cholesterol lt 180 then Chol_Group = 'Low';
         else if Cholesterol lt 200 then Chol_Group = 'Normal';
         else Chol_Group = 'High';
 
         output;
      end;
   end;
   drop i;
   format Gender $Gender.;
run;
 
title "Listing of Drug_Study - first 9 observations";
proc print data=Drug_Study(obs=9);
   id Subject;
   var Drug Gender Cholesterol Chol_Group;
run;

The first nine observations from this data set are shown below.

Let's start out­ with the most basic summarization—computing statistics for all numeric variables for the entire data set. You can write a program as simple as:

proc means data=Drug_Study;
run;

However, this program will compute default statistics for every numeric variable in the data set (including Subject). You will probably want to include PROC MEANS options to select what statistics you want and a VAR statement to list the variables that you want to summarize. Here is an example:

title "Summarizing the Entire Drug_Study Data Set";
proc means data=Drug_Study n nmiss median mean std clm maxdec=3;
   var HR Cholesterol;
run;

The procedure options selected here are some of my favorites. This is what they represent:

n The number of nonmissing observations
nmiss The number of observations with missing values
median The median value (the 50th percentile)
mean The mean
std The standard deviation
clm The 95% confidence limits for the mean. (You are 95% confident that the mean from which this sample was taken is between these two limits.)
maxdec The maximum number of places to print following the decimal point

Here is the output:

You can tell this is simulated (fake) data because there are no missing values. This would be very unusual in a real-life situation.

Most researchers would like to break down the two variables of interest (HR and Cholesterol) by some of the classification variables, such as Gender and Drug. You use a CLASS statement to do this. Let's start out by requesting statistics on HR and Cholesterol, broken down by Drug.

title "Variables Broken Down by Drug";
proc means data=Drug_Study n nmiss median mean std clm maxdec=3;
   class Drug;
   var HR Cholesterol;
run;

 

Here is the output:

You could repeat this program, substituting Gender instead of Drug as the CLASS variable, yielding the following:

Finally, you can include a CLASS statement, listing those two class variables. The program, modified to show statistics broken down by Drug and Gender is shown next, followed by the output.

title "Variables Broken Down by Drug and Gender";
proc means data=Drug_Study n nmiss median mean std clm maxdec=3;
   class Drug Gender;
   var HR Cholesterol;
run;


It is tedious having to run PROC MEANS four times to produce all of the outputs shown thus far. Instead, you can use the two CLASS variables and add the PROC MEANS option PRINTALLTYPES when you run the program. It looks like this:

title "Variables Broken Down by Drug and Gender";
footnote "Including the PROC MEANS option PRINTALLTYPES";
proc means data=Drug_Study n nmiss median mean std clm maxdec=3
   printalltypes;
   class Drug Gender;
   var HR Cholesterol;
run;

In one run, you now see the two variables, HR and Cholesterol, for all combinations of the class variables as shown below.

All of the programs used in this blog were run using SAS OnDemand for Academics using the SAS Studio editor. You could have also produced all of the outputs using built-in SAS Studio tasks. You can read more about how to do this in my book A Gentle Introduction to Statistics Using SAS Studio in the Cloud.

As always, comments and/or corrections are welcome.

Share

About Author

Ron Cody

Private Consultant

Dr. Ron Cody was a Professor of Biostatistics at the Rutgers Robert Wood Johnson Medical School in New Jersey for 26 years. During his tenure at the medical school, he taught biostatistics to medical students as well as students in the Rutgers School of Public Health. While on the faculty, he authored or co-authored over a hundred papers in scientific journals. His first book, Applied Statistics and the SAS Programming Language, was first published by Prentice Hall in 1985 and is now in its fifth edition. Since then, he has published over a dozen books on SAS programming and statistical analysis using SAS. His latest book, A Gentle Introduction to Statistics Using SAS Studio was published this year. Ron has presented numerous papers at SAS Global forums, regional conferences, as well as local user groups. He is presently a contract instructor for SAS Institute and continues to write books on SAS and statistical topics.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top