Create a bar chart with an "Others" category

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular:

  • Display only the Top 10 or Top 20 categories. As I showed last week, to do this in SAS you can use PROC FREQ to tabulate and order the data, and then use the OBS= data set option to truncate the data. See the article "Create a bar chart with only a few categories" for details and SAS code.
  • Display the Top 10 or Top 20 categories, but also aggregate the counts of the remaining categories to create a new "Others" category. This merging of small categories is done by default in some software, such as in SAS/IML Studio bar charts, and is very useful at the exploratory and model-building phases of data analysis. This bar chart was also available in the now-defunct SAS/INSIGHT product.
This article describes how to create an "Others" category in PROC SGPLOT. It assumes that the categorical variable is a character variable. If your variable is numeric, a short DATA step that uses the PUT function will create a character variable.

Preliminary step: Count and order the categories

As described in the previous article, you can use PROC FREQ to tabulate the counts of observations. The following statements count the number of different vehicles for car manufactures in the Sashelp.Cars data set. The FreqOut data set is sorted in descending order by the Count variable.

/* Combine small categories into an "Other" category; sum the counts */
%let VName = Make;                 /* character categorical variable */
 
proc freq data=sashelp.cars ORDER=FREQ noprint;
tables &VName / out=FreqOut;
run;

Main step: Create the "Others" category by aggregating small categories

In order to create the new category, just sum the counts of the small categories. For example, if you intend to show the 20 largest categories, sum the counts of categories 21, 22, 23, and so forth. There are several ways to do this. For me, the SAS/IML language is to most compact:

/* aggregate counts; append new level "Others" to truncated categorical variable */
%let NumCats = 20; /* number of categories in bar chart */
proc iml;
varNames = {"&VName" "Count"};
/* read categories and counts */
use FreqOut; read all var varNames; close FreqOut;
 
/* sum up the counts for the smaller categories */
OtherCount = sum(Count[&NumCats+1:nrow(Count)]);
&VName = &VName[1:&NumCats] // "Others";  /* truncate and append "Others" */
Count  = Count[1:&NumCats] // OtherCount; /* truncate and append sum of small counts */
 
/* write new data set with the Top categories and "Others" */
create Other var varNames; append; close Other;

The vertical concatenation operator (//) was used to append the "Others" category and counts to the Top 20. You can now create the bar chart by calling the SGPLOT procedure, as follows:

/* create bar chart that includes the "Others" category */
proc sgplot data=Other;
  title "Top &NumCats Vehicle Manufacturers and 'Others' Category";
  hbar &VName / freq=Count;
  yaxis discreteorder=data;
  xaxis label = "Count" grid;
run;

The graph shows that Toyota, Chevrolet, and other companies produce many different models, but that collectively the companies that manufacture 10 or fewer models produce more than 100 models. Advantages of the "Others" category include:

  • You can see the relative size ("market share") of the Top 20 categories.
  • You can plot percentages instead of counts, and the percentages will be correctly interpretable as a percentage of the whole.
  • The graph shows all the data, so it is more honest than a graph that selectively shows only a portion of the data.

Other ways to compute the "Others" category

You could argue that using SAS/IML to add a series of numbers is overkill. However, when I tried to use the DATA step, I realized that I potentially need to change the length of the categorical variable. If the length of the categorical variable is less than six, the word "Others" gets truncated. I am not aware of any way to change the length of a character variable "dynamically" in the DATA step, whereas in PROC IML the concatenation operator ensures that c = c // "Others" always has space for at least six characters.

In the following code, I follow a suggestion made by Chris Hemedinger and create a macro variable that is always greater than or equal to six. That macro variable is then used in a LENGTH statement to ensure that the categorical variable can hold the string "Others":

/* DATA step approach */
/* 1. Check length of the category variable. Create macro variable . */
data _null_;
   dsid = open('sashelp.cars');
   cur_len = varlen(dsid,varnum(dsid,"&VName")); /* length of categorical var */
   rc = close(dsid);
   call symputx('len',max(cur_len, 6));          /* new length is always >= 6 */
run;
 
/* 2. Keep first few obs; accumulate count of Others */
data Other2(keep=&VName Count);
length &VName $ &len;                     /* make sure length >= 6 */
retain OtherCount;
set FreqOut END=DONE;
if _N_>&NumCats & ^DONE then do;          /* sum up the counts for the smaller categories */ 
   OtherCount + Count;
   delete;                                /* get rid of this observation */
end;
if DONE then do;
   OtherCount + Count; 
   &VName = "Others"; Count = OtherCount; /* append last obs */
end;
run;

The Others2 data set is identical to the Others data set that was created by PROC IML. However, my Base SAS implementation is not as compact as the SAS/IML version. Perhaps you can do better? Let me know in the comments.

tags: Data Analysis, Statistical Graphics

3 Comments

  1. Bruno
    Posted January 15, 2013 at 11:32 am | Permalink

    Rick thanks for the example, just using DATA Step to assign the OTHERS for all the observations greater than &numCats, Proc SGPLOT will take care of summation

    * * append new level "Others" *;
    data other;
      set work.freqout;
      topCat = &vname;  
      if _n_ > &numCats then do;
        topCat = "OTHERS";
      end;
    run;
    • Posted January 15, 2013 at 11:35 am | Permalink

      Yes, you are right! I guess I was thinking of ALSO making a report, but even in that case I can use your trick and use PROC FREQ to summarize. Great observation, thanks!

  2. Posted January 22, 2013 at 1:40 pm | Permalink

    A colleague pointed out to me that the PARETO procedure in SAS/QC software can do the computations and make the chart without any pre-processing. Here's the example from this article:

    title;
    %let numcats = 20;
    /* Set MAXNCAT= option to numcats+1 to specify the total number of bars */
    /* The LAST= option adds the Others category                            */
    /* Alas, there is no way to move the axis to the bottom. */
    proc pareto data=sashelp.cars;
       hbar make / maxncat=%eval(&numcats+1) nocurve scale=count other="Others" last="Others"
       odstitle="Top 20 Vehicle Manufacturers and 'Others' Category";
    run;

    There are other options that you can specify on the HBAR statement that control how the "Others" category is created. For details, see the documentation for the PARETO procedure

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>