Tabulate counts when there are unobserved categories


Suppose that you are tabulating the eye colors of students in a small class (following Friendly, 1992). Depending upon the ethnic groups of these students, you might not observe any green-eyed students. How do you put a 0 into the table that summarizes the number of students who have each eye color?

If you are using PROC FREQ to tabulate the counts, I have previously shown how to use the DATA step to create the set of all possible categories and run PROC FREQ on the resulting data. The trick is to use the ZEROS option on the WEIGHT statement to include categories with zero counts.
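The PROC FREQ approach can be sketched as follows. This is a minimal illustration, not the code from the earlier article: the data set names (Class, AllColors, Combined), the variable names (Eye, Count), and the eye-color values are all hypothetical. The idea is to give every possible category a row with weight 0, append the observed data with weight 1, and let the ZEROS option keep the zero-weight rows.

```sas
/* hypothetical data: eye colors in a small class; no green-eyed students */
data Class;
   length Eye $5;
   input Eye $ @@;
   Count = 1;                 /* each observed student counts once */
   datalines;
brown blue brown hazel brown blue brown
;

/* hypothetical reference set: every possible category, each with weight 0 */
data AllColors;
   length Eye $5;
   input Eye $ @@;
   Count = 0;
   datalines;
blue brown green hazel
;

data Combined;
   set AllColors Class;       /* zero-weight rows followed by the observed rows */
run;

proc freq data=Combined;
   weight Count / zeros;      /* ZEROS includes observations that have zero weight */
   tables Eye / nocum;
run;
```

Without the ZEROS option, PROC FREQ ignores the zero-weight rows and the "green" category does not appear in the table.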

In the SAS/IML language you can do something similar to enable the TABULATE subroutine to report 0 counts for unobserved categories. However, I don't like the previous way that I implemented the idea, so I want to try again. The following SAS/IML module uses the LOC-ELEMENT technique to accomplish the task, but wraps the functionality with a simpler syntax that is similar to the syntax for the TABULATE subroutine. The only difference is a fourth argument that specifies the "reference set", which is the set of all possible categories.

proc iml;
/* output levels and frequencies for categories in x, including all 
   levels in the reference set */
start TabulateLevels(OutLevels, OutFreq, x, refSet);
   call tabulate(levels, freq, x);        /* compute the observed frequencies */
   OutLevels = union(refSet, levels);     /* union of data and reference set */ 
   idx = loc(element(OutLevels, levels)); /* find the observed categories */
   OutFreq = j(1, ncol(OutLevels), 0);    /* set all frequencies to 0 */
   OutFreq[idx] = freq;                   /* overwrite observed frequencies */
finish;

/* count faces for 10 rolls of six-sided die, using the reference set 1:6 */
rolls = {4 6 4 6 5 2 4 3 3 6};           /* only 5 faces observed */
call TabulateLevels(Levels, Freq, rolls, 1:6);
print Freq[c=(char(Levels))];

The output shows the counts for a set of 10 rolls of a six-sided die: 0, 1, 2, 3, 1, and 3 for the faces 1 through 6. The "1" face was not observed during the experiment, but the table still reports a count of 0 for that category.
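To see what the LOC-ELEMENT technique does inside the module, you can run the intermediate steps on their own. The following is a minimal sketch that hard-codes the observed levels and frequencies for the rolls above. The names obsLevels, obsFreq, isObs, and counts are illustrative and are not part of the module.

```sas
proc iml;
refSet    = 1:6;                     /* reference set: all faces of the die */
obsLevels = {2 3 4 5 6};             /* levels that TABULATE finds in the rolls */
obsFreq   = {1 2 3 1 3};             /* corresponding observed frequencies */

isObs  = element(refSet, obsLevels); /* 0/1 indicator: which faces were observed */
idx    = loc(isObs);                 /* positions of the observed faces in refSet */
counts = j(1, ncol(refSet), 0);      /* start with a zero count for every face */
counts[idx] = obsFreq;               /* fill in the observed counts */
print isObs, idx, counts;
quit;
```

The indicator vector is 0 only for the "1" face, so the observed frequencies are copied into positions 2 through 6 and the first position keeps its count of 0. The assignment lines up because the TABULATE subroutine and the UNION function both return their values in sorted order.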

I like this module better because its syntax is similar to the TABULATE subroutine. The name even includes "Tabulate", which will make the module easier to find when I need to use it.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
