Will indexing my SAS data sets help? This is one of the most frequent questions I hear in SAS Technical Support. The response is always the same: “Maybe. Tell me about your data, and what you are doing with it.” Here is a primer on effective indexing.
Indexing can improve performance in some situations, and in other situations, indexing harms performance. Several factors need to be considered:
- Is your data set large?
- Is the amount of data that you are extracting a small percentage of the total number of observations?
- Is the data refreshed frequently or seldom?
- Is the data frequently subset by the same variable or by more variables?
- Is the data sorted by those same variables?
The answers to these questions will guide you to your answer.
Size of the data set and extract
The size of the data set, the number of observations, and the percentage of the observations that will be extracted are the first questions to answer. The words large and small from the first two questions in the list above are relative terms. As a general rule, if you are extracting more than 35% of the data in a query, then an index does not provide much benefit, if any. Back to the point about large or small, if you get a 50% improvement in performance of a job that originally took 40 minutes, that’s impressive. If the job originally took 40 seconds, a 50% improvement does not seem compelling.
Data refresh, subset, and sort
Next you should consider how often the data set is refreshed or rebuilt: daily, weekly, or monthly? Creating an index is not without cost. It takes space and time to build the index. Index files can often take as much or more space than the data set itself. The amount of time required to build the index depends on the data itself. It can take from minutes to hours if the number of observations is measured by the tens or hundreds of millions. So if the data is refreshed each evening and you are going to query it once each day, it might not be worth the time and space to build index.
Likewise, if you are going to query the data multiple times using different variables each time, indexing the data set has very limited benefit, if any.
An example problem and solution
Let’s look at an example: a 30-million observation data set that’s refreshed each weekend. You need to extract data based on the state of residence of the individuals in the database. You create an index on the state, and query each state separately for additional processing. You see some performance benefit from having the index, but it does not seem to make a lot of difference, so you call SAS Technical Support to ask whether your SAS® software is broken.
As the conversation progresses, I ask whether the data set is sorted by the state variable and find out that it is not. This is important, and here’s why. When the index is built, SAS scans the data set and looks at the index variable(s) on each observation. If a new value is found, that value is added to the index table in ascending order along with the Record ID (RID) where it was found. If a duplicate value is found, only the RID is added. So the index table has ascending values of the index variable(s), and the RIDs are in ascending order.
If the data is originally organized by a date/time stamp or by an ID Number, then the entries for each state are scattered throughout the data set, which makes it take a very long time to build the index. Even when using the index to retrieve the data, SAS has to scan through most, if not all, of the data set.
The fix is simple: sort the data set first by state, and then create the index. Yes, it takes time to sort the data. However, the index will build in less time, and the speed of the query will amaze you. Without first sorting the data, you can start the query, have time to fill your coffee cup, have a brief conversation, and still get back to your computer before the extract is done. By sorting first and then creating the index, you might not have time to get out of your chair before the query is done.
One final point to note...
A subsetting IF statement does not use an index. Only a WHERE statement or a WHERE clause uses an index.