The unlikely pedigree of sample data in SAS


We ship many sample data sets with SAS products. By using well-understood sample data sets, we can teach concepts or show off product features without distracting the audience or students with data collection or preparation.

At least, that's the way it's supposed to work. But occasionally the sample data can cause a distraction on its own, especially when its origin is called into question. Jiangtang Hu tells the colorful story of Fisher's Iris data, which can be found in SAS 9.2 as SASHELP.IRIS.

According to Mr. Hu, the SASHELP.IRIS data set contains some errors: deviations from the original Iris data published by R.A. Fisher. Three of the errors are well-known, propagated by many scientists over the years in several different repositories. A fourth error exists only in the version shipped with SAS 9.2.


I became familiar with the IRIS data set a couple of years ago when we introduced the Scatter Plot Matrix task into SAS Enterprise Guide 4.3. The IRIS data, with its one categorical column ("Species") and several measurement columns ("PetalWidth", "PetalLength", etc.), is a great way to show off the capabilities of PROC SGSCATTER, which is the SAS procedure that the new task uses to create its results (pictured here). SASHELP.IRIS was the primary data set used by our testers to verify the behavior of the Scatter Plot Matrix task.

Mr. Hu's blog post aroused my curiosity. Using our internal defect-tracking system, I went back in time to September 2007 and saw the activity that resulted in the inclusion of the IRIS data in SASHELP. The version that we included was meant to be a copy of the data that we had already been shipping with SAS/STAT examples for many years. That version already had the three common errors known to the community; I cannot yet explain how the fourth error might have been introduced (although I might continue the research for an episode of CSI: SASHELP Data).

I also learned that the errors did not go unnoticed by our own staff. SAS Press author Warren Kuhfeld noticed the deviations as he prepared content for his book about Statistical Graphics in SAS. Warren applied his high professional standards to the problem and took steps to make sure that with SAS 9.3, the SASHELP.IRIS data set was corrected. The data set now matches the UCI version of the Iris data (consistent with what has been supplied with SAS examples in the past), but with the observations sorted by Species (so observation order might differ, but the values are the same).
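If you want to check for yourself that two copies of a data set hold the same values even when the observations are in a different order, one approach is to compare the rows as unordered collections. The Python sketch below is only an illustration of that idea: the function name `diff_unordered` is hypothetical, and the rows are made-up placeholder values, not records from either shipped version of the Iris data.

```python
from collections import Counter


def diff_unordered(rows_a, rows_b):
    """Return (only_in_a, only_in_b) as Counters, ignoring row order.

    Rows are counted rather than put in a set, so duplicate
    observations are still compared correctly.
    """
    count_a = Counter(rows_a)
    count_b = Counter(rows_b)
    # Counter subtraction keeps only rows with a positive surplus.
    return count_a - count_b, count_b - count_a


# Placeholder rows (species, sepal length, sepal width, petal length,
# petal width). These values are illustrative assumptions only.
original = [
    ("Setosa", 50, 33, 14, 2),
    ("Versicolor", 65, 28, 46, 15),
    ("Virginica", 64, 28, 56, 22),
]

# Same values in a different order, as after sorting by another key.
resorted = sorted(original, reverse=True)

# A copy in which one measurement has been changed.
edited = [("Setosa", 50, 33, 14, 3)] + original[1:]

only_a, only_b = diff_unordered(original, resorted)
same_values = not only_a and not only_b

only_orig, only_edited = diff_unordered(original, edited)

print("Reordered copy matches:", same_values)
print("Rows only in original:", dict(only_orig))
print("Rows only in edited copy:", dict(only_edited))
```

The same idea carries over to SAS: sorting both data sets by all of their variables before running PROC COMPARE should make the comparison independent of the original observation order.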


About Author

Chris Hemedinger

Director, SAS User Engagement

Chris Hemedinger is the Director of SAS User Engagement, which includes our SAS Communities and SAS User Groups. Since 1993, Chris has worked for SAS as an author, a software developer, an R&D manager and a consultant. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies.
