Indexing a SAS data set to improve processing categories in SAS/IML

0

I have blogged about three different SAS/IML techniques that iterate over categories and process the observations in each category. The three techniques are as follows:

  1. Use a WHERE clause on the READ statement to read only the observations in the ith category. This is described in the article "BY-group processing in SAS/IML."
  2. Use the UNIQUE-LOC technique, as described in the article "The UNIQUE-LOC trick." In this technique, you read all of the data, use the UNIQUE function to find the unique categories, loop over the categories, and use the LOC function to find the observations for each category.
  3. Use the UNIQUEBY technique, as described in the article "An efficient alternative to the UNIQUE-LOC technique." In this technique, you read all of the data, sort it by the categories, and use the UNIQUEBY function to index the observations for each category.

In general, the WHERE technique is the slowest because it requires reading the data multiple times. The UNIQUE-LOC and UNIQUEBY techniques are faster, but they require all of the data be read into memory, which might not be feasible for huge data sets.

There is simple variation of the WHERE technique that sometimes increases the speed of reading the data: use INDEX statement to index the categorical variable(s) prior to reading the data. Although the INDEX statement is not guaranteed to speed up reading the data, using the INDEX statement is easy and there is almost never a downside to using it.

Suppose that you have the following data, which are intentionally generated so that the Category variable is not in sorted order:

%let N = 200;
%let NumGroups = 1000;
data MyData(keep=x Category);
do i = 1 to &N;
   do Category = 1 to &NumGroups;
      x = Category + rand("Normal");
      output;
   end;
end;
run;

Suppose that you want to read the data into PROC IML and compute some statistics for each of the 1,000 categories. For this small example with 200,000 rows, the full data will fit into memory, but let's pretend that it doesn't fit or that there is some other reason why we want to use the WHERE clause to read each category into SAS/IML vectors.

The following SAS/IML program is explained in my previous article on BY-group processing. However, this time I've used the INDEX statement:

proc iml;
t0 = time();
use MyData;
index Category;              /* index by Category */
do i = 1 to &NumGroups;
   read all var {x} where (Category=i); /* read i_th category */
   mean = mean(x);           /* compute statistics for the    */
   var = var(x);             /* observations in this category */
end;
close MyData;
IndexTime = time()-t0;
print IndexTime;

For this example, reading indexed data is twice as fast as reading unindexed data. (Other examples I've tried are less impressive, with speed-up on the order of 10%–30%.) The program is slower than using the UNIQUE-LOC technique, but remember that this program reads the data 1,000 times, whereas the UNIQUE-LOC technique reads the data once.

If the data fit in memory, the UNIQUE-LOC or UNIQUEBY techniques are faster ways to process each category in the data. However, if you are forced to use a WHERE clause to read subsets of the data, you should try using the INDEX statement. It can't hurt, and it might help.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

Leave A Reply

Back to Top