# Discriminating Fisher's iris data by using the petal areas

I've seen analyses of Fisher's iris data so often that sometimes I feel like I can smell the flowers' scent. However, yesterday I stumbled upon an analysis that I hadn't seen before.

The typical analysis is shown in the documentation for the CANDISC procedure in the SAS/STAT documentation. A (canonical) linear discriminant analysis (LDA) that uses four variables (SepalLength, SepalWidth, PetalLength, and PetalWidth) constructs hyperplanes that can almost separate the species. With an LDA, only three flowers are misclassified: one virginica is misclassified as a versicolor and two versicolor are misclassified as virginica. The following plot is from the SAS/STAT documentation and shows how the first canonical coordinate does a fantastic job of discriminating between the species.

I've always been impressed by this analysis. I consider it a remarkable example of the power of statistics to solve classification problems.

Yesterday I saw a Web page from StatLab Heidelberg at the Institute for Applied Mathematics, Universität Heidelberg. The author points out that the area of the petals and sepals do a much better job of discriminating between the various species. In fact, they point out that you do not even need to use discriminant analysis to obtain a result that is almost as good as LDA!

The following DATA step constructs the area of the sepals and petals, assuming that they are nearly rectangular. You can then plot the new variables:

```data iris; set Sashelp.Iris; SepalArea = SepalLength * SepalWidth; PetalArea = PetalLength * PetalWidth; run;   proc sgplot data=iris; title "Discriminant Analysis of Iris Data Using Area"; scatter x=SepalArea y=PetalArea / group=Species; refline 150 750 / axis=y; /* add "dividing lines" */ run;```

The single variable, PetalArea, does nearly as good a job at classifying the iris species as linear discriminant analysis. In the scatter plot, you can draw horizontal lines that nearly separate the species. The lines that are drawn misclassify only four versicolor as virginica. That's an amazing result for using a single, easily constructed, variable, which has the additional advantage of having physical significance.

I have seen a similar idea used in the discriminant and cluster analysis of the Sashelp.Fish data set. For the fish, it is common to transform weight by a cube-root transformation in order to obtain a variable that is comparable to the heights, widths, and lengths of the fish. However, until now I had never thought about using the area of the iris flower parts.

The iris quadratic transformation is impressive because of its simplicity. I wonder when this was first noticed. Can anyone provide a reference that precedes the 2005 date on the StatLab web page? An internet search located one instance of this transformation in the book Data Clustering and Pattern Recognition, which was printed in Chinese, but I cannot determine who originated the idea.

1. Peter Borysov
Posted August 9, 2012 at 9:55 am | Permalink

Hello Rick,
I think the idea that you described is called kernel embedding, where the feature space is enriched to increase the performance of linear classifier. One problem is that in general this could lead to overfitting.

2. Chris Hemedinger
Posted August 10, 2012 at 8:47 am | Permalink

As an aside, the SASHELP.IRIS data contains some errors: three are well-known, and one was "added" in SAS 9.2 but corrected in SAS 9.3. Read about the history here.

Rick Wicklin, PhD, is a senior researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio.  His areas of expertise include computational statistics, statistical graphics, statistical simulation, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.