Visualize the Gini-Simpson diversity index

A previous article discusses the Gini-Simpson diversity index and how to compute it in SAS. Suppose you have a sample that contains R classes. (Classes are also called groups or categories.) Intuitively, the sample exhibits "high diversity" if the class sizes are approximately equal. The sample shows "low diversity" if the class sizes are very different, and especially if there is one large group and R-1 small groups. This article visualizes the Gini-Simpson diversity index by plotting the index versus the group sizes. The visualization gives insight into questions like "how high (or low) can the diversity index be when my data contains R classes?"

Definitions and notation

In this article, N is the number of observations in the data and the sizes of the groups are n₁, n₂, ..., n_R. The Gini-Simpson diversity index (G-S index, for short) is computed as 1 – λ where λ is the Simpson homogeneity index, defined by
$\lambda = \sum\limits_{i=1}^R \frac{n_i}{N} \frac{n_i-1}{N-1}$
The case where R=1 is not interesting because λ=1 (the sample is homogeneous), so henceforth assume that R > 1.
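
As a small worked example (the counts are hypothetical and chosen only to keep the arithmetic short), suppose a sample of N=5 observations has R=2 groups of sizes n₁=3 and n₂=2. Then
$\lambda = \frac{3}{5} \cdot \frac{2}{4} + \frac{2}{5} \cdot \frac{1}{4} = \frac{8}{20} = 0.4$
so the G-S index is 1 – λ = 0.6. Each term of the sum is the probability that two items drawn without replacement both come from group i, which is why λ measures homogeneity.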

What is the diversity index for the most diverse sample?

Let's consider a sample that has N observations divided into R groups of equal size. If R divides N evenly, then n_i = N/R for all i, and the formula for the Simpson homogeneity index simplifies to
$\lambda = \frac{1}{R} \frac{N-R}{N-1}$ [for R equal groups]
From this formula, you can see the following:

If the R groups are equal in size, the Simpson homogeneity index λ is always smaller than 1/R. For a large sample that has a small number of groups (R ≪ N), λ is only slightly smaller than 1/R, so 1/R is a tight upper bound on the homogeneity index of evenly divided data. For example, if there are R=3 evenly sized groups, the Simpson homogeneity index is always smaller than 1/3.
Similarly, if the R groups are equal in size, the Gini-Simpson diversity index, which is 1 – λ, is always greater than 1 – 1/R = (R – 1)/R. For example, if there are R=3 evenly sized groups, the G-S index is greater than 2/3.
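
The simplification can be checked in a couple of steps. Substituting n_i = N/R makes all R terms of the sum identical, so
$\lambda = R \cdot \frac{1}{R} \cdot \frac{N/R - 1}{N-1} = \frac{1}{R} \frac{N-R}{N-1}$
Because N-R < N-1 whenever R > 1, the second factor is less than 1, which gives λ < 1/R. As a quick numerical check, N=30 and R=3 give
$\lambda = \frac{1}{3} \cdot \frac{27}{29} = \frac{9}{29} \approx 0.310$
which is slightly below 1/3.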

You can use the SAS IML function from the previous article to compute the Gini-Simpson diversity index for data that are evenly divided between R classes. For example, the following program uses N=30 and R=3 groups where each group contains 10 observations.

proc iml;
/* Input: Vector of counts (n1, n2, ..., nR) 
   Output: a 1x3 vector whose elements are:
         { N = sum of the counts,
           Simpson index of homogeneity,
           Gini-Simpson index of diversity   }
*/
start GiniSimpsonIndex(count);
   Sum = sum(count);
   SimpsonIndex = sum( (count/Sum) # ((count-1)/(Sum-1)) );
   GiniSimpsonIndex = 1 - SimpsonIndex;
   return Sum || SimpsonIndex || GiniSimpsonIndex;
finish;
 
count = {10 10 10};            /* vector of group sizes */
gs = GiniSimpsonIndex(count);  /* return N, Simpson, Gini-Simpson */
print gs[L="R=3 Even Groups (n_i=10)" c={"N" "Simpson" "Gini-Simpson"}];

For R=3 equally sized groups, the Simpson index must be less than 1/3 and the Gini-Simpson index must be greater than 2/3. Indeed, the computation shows that the indices are 0.31 and 0.69, respectively.

The analysis shows that the diversity index depends on the number of groups, R. You can repeat the computation of the indices for N=30 and for various sizes of the subgroups:

N = 30;
Rs = {2, 3, 5, 6, 10};           /* number of groups: use numbers that evenly divide N */
labls = {"N" "R" "Group Size (n_i)" "Gini-Simpson Index" "Bound"};
result = j(nrow(Rs), 5, N);  
do i = 1 to nrow(Rs);
   R = Rs[i];                    /* number of groups */
   n_i = N/R;                    /* size of each group */
   LB = (R-1)/R;                 /* a lower bound on the diversity index */
   count = repeat(n_i, 1, R);    /* vector of group sizes, such as {10 10 10} */
   gs = GiniSimpsonIndex(count); /* return N, Simpson, Gini-Simpson */
   result[i,2:5] = R || n_i || gs[,3] || LB;
end;
print result[L="" c=labls F=BEST7.];

Notice that the G-S index is always greater than the last column ("Bound"). Notice also that the "Bound" depends only on R whereas the G-S index depends on both N and R.
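
One way to see why the index always exceeds the bound is to rewrite the equal-groups formula λ = (1/R)(N-R)/(N-1) in terms of the G-S index:
$1 - \lambda = \frac{R-1}{R} \cdot \frac{N}{N-1}$
So the G-S index for equal groups is the bound multiplied by N/(N-1). For N=30 that factor is 30/29 ≈ 1.034. For example, R=5 gives (4/5)(30/29) = 24/29 ≈ 0.828, compared with a bound of 0.8. The gap between the index and the bound is
$\frac{R-1}{R} \cdot \frac{1}{N-1}$
which shrinks toward zero as N grows.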

The distribution of the diversity index for R=2 classes

The formula for the Gini-Simpson diversity index is challenging to interpret when the sizes of the subgroups are not equal. Let's visualize the G-S index for a fixed sample size (N) and a fixed number of groups (R) as the sizes of the subgroups vary. Without loss of generality, let's enumerate the groups according to their relative sizes. Thus, we assume that n₁ ≥ n₂ ≥ ... ≥ n_R > 0. In addition, there are only R-1 free parameters because Σ n_i = N.

The simplest case is R=2. The largest group must have at least N/2 elements and can have at most N-1 elements. The size of the second group is determined by the equation n₂ = N - n₁. Consequently, the following SAS IML statements compute the Gini-Simpson index for all possible sizes of two groups. You can then plot the value of the index as a function of the size of the larger subgroup:

/* run the analysis for all possible (n1,n2) sizes where n1 >= n2 > 0 and n1+n2=30 */
N = 30;
n1 = T( N/2 : (N-1) );         /* sizes for first group */
n2 = N - n1;                   /* sizes for second group */
count = n1 || n2;
*print count[L="Group Sizes" c={'n1' 'n2'}];
 
r = j(nrow(count), 3, .);
do i = 1 to nrow(count);
   r[i,] = GiniSimpsonIndex( count[i,] );  
end;
 
/* write the group sizes and indices to a data set for plotting */
result = count || r;
create Diversity2Group from result[c={'n1' 'n2' 'N' 'Simpson' 'GiniSimpson'}];
   append from result;
close;
QUIT;
 
title "Diversity Index for Sample with 2 Groups and N=30";
proc sgplot data=Diversity2Group noautolegend;
   needle x=n1 y=GiniSimpson;
   scatter x=n1 y=GiniSimpson / datalabel=GiniSimpson datalabelpos=Top markerattrs=(symbol=CircleFilled);
   xaxis integer values=(15 to 30);
   yaxis offsetmin=0 offsetmax=0.1 grid;
   format GiniSimpson 3.2;
   label GiniSimpson="Gini-Simpson Diversity Index" n1="n1 = Size of Group 1";
run;

The graph visualizes the Gini-Simpson diversity statistics for various sizes of two subgroups in a sample of size N=30. The data are "most diverse" when each subgroup has 15 elements (G-S index = 0.52). The index does not change much for small deviations from this case. For example, the index is 0.5 when n₁=18 and n₂=12. The sample is least diverse when the larger subgroup has 29 elements and the smaller has one element (G-S index = 0.07).

Recall that the G-S index is a probability. When the subgroups are equal, there is a 52% probability that two randomly selected items are in different groups. In the extreme case, there is only a 7% probability that two randomly selected items are in different groups.
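
For two groups, the index has a simple closed form that helps explain the shape of the graph. When n₁ + n₂ = N, the quantity N(N-1) - n₁(n₁-1) - n₂(n₂-1) reduces to 2 n₁ n₂, so
$1 - \lambda = \frac{2 n_1 n_2}{N(N-1)} = \frac{2 n_1 (N - n_1)}{N(N-1)}$
This is a downward-opening parabola in n₁ with its vertex at n₁ = N/2, so the index is flat near equal group sizes and falls most steeply as n₁ approaches N-1. For N=30, the formula gives 450/870 ≈ 0.52 at n₁=15 and 58/870 ≈ 0.07 at n₁=29, in agreement with the graph.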

The distribution of the diversity index for R=3 classes

The next simplest case is when there are R=3 classes in the data. For this case, the largest subgroup has a size (n₁) that can vary between N/3 and N-2 observations. The sizes of the other subgroups must be positive integers that are no larger than n₁. The following program generates all possible triplets (n₁, n₂, n₃) for which n₁ ≥ n₂ ≥ n₃ > 0 and that satisfy n₁ + n₂ + n₃ = N. Again, we'll choose N=30 for the visualization.

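proc iml;
/* QUIT ended the previous PROC IML session, so define the module again */
start GiniSimpsonIndex(count);
   Sum = sum(count);
   SimpsonIndex = sum( (count/Sum) # ((count-1)/(Sum-1)) );
   GiniSimpsonIndex = 1 - SimpsonIndex;
   return Sum || SimpsonIndex || GiniSimpsonIndex;
finish;

/* run the analysis for R=3 groups and N=30 obs where n1 >= n2 >= n3 > 0 and 
   n1 + n2 + n3 = 30 */
N = 30;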
n1 = T( N/3 :(N-2) );           /* range of n1 */
n2 = T( 1 : N/2 );              /* size of n2 */
g = expandgrid(n1, n2);         /* possible pairs of (n1,n2) */
c = g || (N - g[,1] - g[,2]);   /* candidate triplets (n1,n2,n3) */
isValid = loc(c[,1]>=c[,2] & c[,2]>=c[,3] & c[,3]>0); /* enforce constraints */
count = c[isValid, ];
gs = j(nrow(count), 3, .);
do i = 1 to nrow(count);
   gs[i,] = GiniSimpsonIndex( count[i,] );
end;
 
result = count || gs;
create Diversity3Group from result[c={'n1' 'n2' 'n3' 'N' 'Simpson' 'GiniSimpson'}];
append from result;
close;
QUIT;

The Diversity3Group data set contains the G-S statistic for all possible sizes of the three subgroups. We can use a heat map to visualize the G-S statistics as a function of (n₁, n₂). To ensure that the color scale encompasses the whole range [0,1] for the G-S index, you can define a range attribute map. However, for brevity, I will merely add two fake observations to the data: one with G-S index equal to 0 and one with G-S index equal to 1.

data Fake;  /* ensure color model is on [0,1] */
Simpson = 0; GiniSimpson = 0; output;
Simpson = 1; GiniSimpson = 1; output;
run;
 
data Diversity3Group;
set Fake Diversity3Group;
GS = put(GiniSimpson, 3.2);     /* label for the cells */
run;
 
%let BrBgRamp = CX8C510A CXD8B365 CXF6E8C3 CXF5F5F5 CXC7EAE5 CX5AB4AC CX01665E ;
title "Diversity Index for Sample with 3 Groups and N=30";
title2 "n1 + n2 + n3 = 30"; 
proc sgplot data=Diversity3Group noautolegend;
   heatmapparm x=n1 y=n2 colorresponse=GiniSimpson / name="heat" outline colormodel=(&BrBgRamp);
   gradlegend "heat";
   scatter x=n1 y=n2 / datalabel=GS datalabelpos=center  markerattrs=(size=0);
   xaxis grid integer values=(9 to 29);
   yaxis grid integer values=(0 to 15);
   label GiniSimpson = "Gini-Simpson Diversity Index";
run;

This graph visualizes the range of the G-S diversity index over all possible sizes of three classes. You can make several observations:

The largest value of the diversity index is 0.69 and it occurs when n₁ = n₂ = n₃ = 10.
The diversity index is flat near the maximum, as shown by the large number of blue-colored cells. For example, when the group sizes are (16, 7, 7), the diversity index is 0.63, which is relatively large.
The diversity index is the probability that two items selected at random belong to different groups. Therefore, it is useful to ask when the probability is 0.5. The pale-white cells are all close to 0.5. For example, the diversity index is 0.5 when the class sizes are (20, 8, 2) or (19, 10, 1).
The diversity index is small when there is one large group and two smaller groups. The least value is 0.13, which occurs when the class sizes are (28, 1, 1). The index increases rapidly when the sizes deviate from that extreme case.
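
The extreme values in this list can be checked by hand. Because Σ n_i(n_i-1) = Σ n_i² - N, the index can be rewritten as
$1 - \lambda = \frac{N^2 - \sum_{i=1}^R n_i^2}{N(N-1)}$
For (10, 10, 10) this is (900 - 300)/870 ≈ 0.69, and for (28, 1, 1) it is (900 - 786)/870 ≈ 0.13. In this form, the index is small exactly when the sum of the squared group sizes is large, which happens when one group dominates the sample.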

Summary

The Gini-Simpson diversity index is a measure of the diversity of a sample that has R classes. It is largest (greater than 1 – 1/R) when all groups are the same size. It is smallest when there is one large group, and the other groups all have one observation. This article shows two visualizations of the G-S index, one with R=2 groups and one with R=3 groups. When the class sizes are approximately equal (the sample is diverse), the G-S index is not very sensitive to variations in class sizes. However, when the class sizes are very different (the sample is homogeneous), the G-S index is sensitive to changes in the class sizes.
