I've seen analyses of Fisher's iris data so often that sometimes I feel like I can smell the flowers' scent. However, yesterday I stumbled upon an analysis that I hadn't seen before.
The typical analysis is shown in the documentation for the CANDISC procedure in the SAS/STAT documentation. A (canonical) linear discriminant analysis (LDA) that uses four variables (SepalLength, SepalWidth, PetalLength, and PetalWidth) constructs hyperplanes that can almost separate the species. With an LDA, only three flowers are misclassified: one virginica is misclassified as a versicolor and two versicolor are misclassified as virginica. The following plot is from the SAS/STAT documentation and shows how the first canonical coordinate does a fantastic job of discriminating between the species.
I've always been impressed by this analysis. I consider it a remarkable example of the power of statistics to solve classification problems.
Yesterday I saw a Web page from StatLab Heidelberg at the Institute for Applied Mathematics, Universität Heidelberg. The author points out that the area of the petals and sepals do a much better job of discriminating between the various species. In fact, they point out that you do not even need to use discriminant analysis to obtain a result that is almost as good as LDA!
The following DATA step constructs the area of the sepals and petals, assuming that they are nearly rectangular. You can then plot the new variables:
data iris; set Sashelp.Iris; SepalArea = SepalLength * SepalWidth; PetalArea = PetalLength * PetalWidth; run; proc sgplot data=iris; title "Discriminant Analysis of Iris Data Using Area"; scatter x=SepalArea y=PetalArea / group=Species; refline 150 750 / axis=y; /* add "dividing lines" */ run;
The single variable, PetalArea, does nearly as good a job at classifying the iris species as linear discriminant analysis. In the scatter plot, you can draw horizontal lines that nearly separate the species. The lines that are drawn misclassify only four versicolor as virginica. That's an amazing result for using a single, easily constructed, variable, which has the additional advantage of having physical significance.
I have seen a similar idea used in the discriminant and cluster analysis of the Sashelp.Fish data set. For the fish, it is common to transform weight by a cube-root transformation in order to obtain a variable that is comparable to the heights, widths, and lengths of the fish. However, until now I had never thought about using the area of the iris flower parts.
The iris quadratic transformation is impressive because of its simplicity. I wonder when this was first noticed. Can anyone provide a reference that precedes the 2005 date on the StatLab web page? An internet search located one instance of this transformation in the book Data Clustering and Pattern Recognition, which was printed in Chinese, but I cannot determine who originated the idea.
Update (Nov 2015): Giacomo Patrizi is very insistent that I reference Nieddu and Patrizi (2000), which appeared before the StatLab result; see his comment below.