To INDEX or not to INDEX...

Recently, a certification candidate asked me about indexes and how SAS decides whether to use one or not. I don't work with supper-large data sets and therefore haven't been overly concerned with efficient coding techniques and consequently had no idea what an index was. So off to SAS 9.2 online documentation, where a search led me to "Understanding SAS Indexes" . There I learned that an index is an optional file that increases performance of a SAS program by providing direct access to specific observations rather than reading the observations sequentially. It is like a lookup table for observations based on the values of a chosen variable. When you write a WHERE statement to subset of a very big data set, SAS can use the index to find the specific observations more efficiently than reading all of the observations sequentially.

Easy enough, and I thought I would write some code to personally experience the thrill of the index. That is when my education really began.

I wrote some code that created an index, and then tried some PROC PRINT steps that included a WHERE statement (with the MSGLEVEL option equal to 'i' so that the SAS log would indicate when SAS used the index). At first, my WHERE statements did not cause the use of the index. I tried a few different options, but nothing would work, so I did a little more investigation on the web and found a paper from SUGI 30 written by Michael A. Raithel that did a great job explaining how SAS chooses whether to use an index or not.

A critical factor in SAS's decision to use the index is the size of the subset that will be created by the WHERE statement. If the subset is less than 15% of the total data, then there is a very good chance that SAS will use the index. So I rewrote my data step to create data that include high & low frequency values of my indexed variable and tried again with much better results. Here is my code. Feel free to paste it into SAS and play with it yourself on a Saturday night if you are bored and need something to do.

options msglevel=i;
data work.flights (index=(flightid) drop=i x);
   format origin $3. flightid $8.;
* create observations with randomly generated, 
  normally distributed flightid values;
     Do i=1 to 5000;
          x=1000*round(rand('beta',10,10),.1)  
          flightid = CAT('IA11',x);
          origin='RDU';
          output;
       end;
   run;

* examine the frequency of the flight id values;
proc freq data=work.flights;
         tables flightid;
   run;

* example of the use of an index;
proc print data=work.flights;
    where flightid gt 'IA11800';
   run;

That should get you started. Anyone have other index tips and tricks that they have learned? Feel free to add them to the comments.

4 Comments

Pramod.R on November 30, 2010 4:03 am

Hi Mark,
Thank you so much! This documentaion pretty much ansers most of my questions on Indexing
Tal Galili on November 27, 2010 4:09 pm

Hello there (I couldn't find your e-mail so I am writing this here)
I'd like to invite you to add your blog to the new:
http://sas-x.com/
Website.
It's a news website, all dedicated to SAS, based on bloggers content.
I am it's founder, and I created it after creating a similar (and somewhat successful) such website around the topic of R programming. That website can be seen at:
http://www.r-bloggers.com/
If you are interested in joining, you can simply add your site at:
http://sas-x.com/add-your-blog/
Best wishes,
Tal
Mark on November 27, 2010 10:18 am

Pramod,
Thanks for the comment. I'm not sure that I fully understand your question, as I'm not sure I know how hashing algorithms apply to indexing in SAS. I suggest reading the SAS online documentation "Understanding SAS indexes" It is long (9,000+ words) and has lots of examples of the different types of indexes and descriptions of how SAS creates the indexes. Maybe someone else reading this blog will be able to comment on your question.
Mark
Pramod.R on November 26, 2010 3:06 am

Hi Mark,
Neat article about indexing!
I had a quick question here.. Any idea what indexing mechanism does SAS usally employ? I know there is an option to set the type of indexing the SAS uses (like HASH,etc).. But what is the default type that is used?
Thanks in advance,
Pramod R

Blogs

Blogs

To INDEX or not to INDEX...

About Author

Related Posts

4 Comments

Blogs

About Author

Related Posts

QPSOLVE: A new SAS IML function for quadratic optimization

Lengths and formats in SAS: the long and short of it

How to use keyword-value pairs when calling SAS IML subroutines

4 Comments