Recently, a certification candidate asked me about indexes and how SAS decides whether to use one or not. I don't work with supper-large data sets and therefore haven't been overly concerned with efficient coding techniques and consequently had no idea what an index was. So off to SAS 9.2 online documentation, where a search led me to "Understanding SAS Indexes" . There I learned that an index is an optional file that increases performance of a SAS program by providing direct access to specific observations rather than reading the observations sequentially. It is like a lookup table for observations based on the values of a chosen variable. When you write a WHERE statement to subset of a very big data set, SAS can use the index to find the specific observations more efficiently than reading all of the observations sequentially.
Easy enough, and I thought I would write some code to personally experience the thrill of the index. That is when my education really began.
I wrote some code that created an index, and then tried some PROC PRINT steps that included a WHERE statement (with the MSGLEVEL option equal to 'i' so that the SAS log would indicate when SAS used the index). At first, my WHERE statements did not cause the use of the index. I tried a few different options, but nothing would work, so I did a little more investigation on the web and found a paper from SUGI 30 written by Michael A. Raithel that did a great job explaining how SAS chooses whether to use an index or not.
A critical factor in SAS's decision to use the index is the size of the subset that will be created by the WHERE statement. If the subset is less than 15% of the total data, then there is a very good chance that SAS will use the index. So I rewrote my data step to create data that include high & low frequency values of my indexed variable and tried again with much better results. Here is my code. Feel free to paste it into SAS and play with it yourself on a Saturday night if you are bored and need something to do.
options msglevel=i; data work.flights (index=(flightid) drop=i x); format origin $3. flightid $8.; * create observations with randomly generated, normally distributed flightid values; Do i=1 to 5000; x=1000*round(rand('beta',10,10),.1) flightid = CAT('IA11',x); origin='RDU'; output; end; run; * examine the frequency of the flight id values; proc freq data=work.flights; tables flightid; run; * example of the use of an index; proc print data=work.flights; where flightid gt 'IA11800'; run;
That should get you started. Anyone have other index tips and tricks that they have learned? Feel free to add them to the comments.