Suppose you wish to select a random sample from a large SAS dataset. No problem. The PROC SURVEYSELECT step below randomly selects a 2 percent sample:

```sas
proc surveyselect data=large out=sample
   method=srs      /* simple random sample */
   n=1000000;      /* sample size */
run;
```

Do you have a SAS/STAT license? If not, or if you seek maximum efficiency, keep reading.

I benchmarked the above program against a SAS dataset with 50 million rows and 3 columns, running SAS 9.4 on my 64-bit Windows 7 machine.

Time: 2.8 seconds

Is the DATA step more efficient? It might be, if you know some tricks. Let's look at a few.

The DATA step below selects a random sample of approximately 1 million rows.

```sas
data sample;
   set large;
   if rand('uniform') < .02;
run;
```

```
NOTE: There were 50000000 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.SAMPLE has 1001125 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           3.31 seconds
      cpu time            3.29 seconds
```

The RAND function generates random numbers between 0 and 1. The UNIFORM argument means that every value between 0 and 1 has an equal chance of being generated. The subsetting IF statement selects a random sample of approximately 2 percent.
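For readers without a SAS session handy, the subsetting-IF logic can be sketched in Python (the function name and inputs are my own, for illustration). Each row is kept independently with probability .02, which is why the sample size is only approximately, not exactly, 2 percent:

```python
import random

def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p (sample size is approximate)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

# Roughly 2 percent of 1,000,000 rows -- close to 20,000, but rarely exactly 20,000.
sample = bernoulli_sample(range(1_000_000), 0.02, seed=1)
```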

Time: 3.3 seconds (Winner: PROC SURVEYSELECT, but keep reading)

The SET statement above reads all 50 million rows by default, which are then filtered by the subsetting IF statement. We can, however, change the SET statement to read only 1 million rows, bypassing the other 49 million, for a huge performance improvement, like this:

```sas
data sample;
   set large (obs=1000000);
run;
```

Time: .1 second

This, of course, is cheating, because the DATA step above reads the first million rows, which is not a random sample.

The DATA step below, like the one above, also reads only 1 million rows, bypassing the other 49 million, for very fast performance. This time, instead of reading the first million rows, the SET statement reads every 50th row.

```sas
data sample;
   do RowNum=50 to 50000000 by 50;
      set large point=RowNum;
      output;
   end;
   stop;
run;
```

The POINT= option designates a numeric variable whose value represents a row number. The SET statement jumps directly to each designated row, and reads that row, bypassing intervening rows.

The selected sample may be more representative than the first million rows, but it's still not random.

Time: .9 second
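The every-50th-row scheme above is systematic sampling. The row-number arithmetic can be sketched in Python (names are mine, for illustration):

```python
def systematic_rows(n_rows, step):
    """Row numbers step, 2*step, ..., n_rows -- the rows that SET point= jumps to."""
    return range(step, n_rows + 1, step)

rows = systematic_rows(50_000_000, 50)  # exactly 1,000,000 row numbers
```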

The DATA step below, like the two above, also reads only 1 million rows, bypassing the other 49 million. This time, however, the DATA step selects a true random sample, comparable to PROC SURVEYSELECT.

```sas
data sample;
   drop Need Pool Ratio Row;
   Need=1000000;      * Desired sample size;
   Pool=50000000;     * Number of rows in the dataset;
   do until (Need=0); * Continue until no additional sample rows are needed;
      Row+1;          * Row number. Every row in the pool is a candidate for the sample;
      Ratio=Need/Pool;   * Initial ratio is .02, or 2 percent. Continuously adjusted as needed;
      if rand('uniform') < Ratio then do; * True for the fraction of remaining rows still needed;
         set large point=Row;  * Jump directly to the desired row and read it;
         output;               * Output the selected row;
         Need=Need-1;          * A row was selected, so one fewer is needed;
      end;
      Pool=Pool-1;    * One row leaves the pool, whether selected or not;
   end;
   stop;
run;
```
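The selection logic of this DATA step can be sketched in Python (a sketch of the same Need/Pool bookkeeping; the function name and the smaller sizes are mine). The acceptance ratio rises or falls as the scan proceeds, so exactly Need distinct rows are selected:

```python
import random

def sequential_sample(pool, need, seed=None):
    """Scan rows 1..pool in order; accept each with probability need/pool,
    updating both counters, so exactly `need` distinct rows are selected."""
    rng = random.Random(seed)
    chosen = []
    row = 0
    while need > 0:
        row += 1
        if rng.random() < need / pool:
            chosen.append(row)
            need -= 1
        pool -= 1
    return chosen

rows = sequential_sample(pool=50_000, need=1_000, seed=1)  # exactly 1,000 rows
```

Note that once the remaining pool shrinks to the number of rows still needed, the ratio reaches 1 and every remaining row is taken, which guarantees the exact sample size.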

| Method | Time | Advantage |
| --- | --- | --- |
| PROC SURVEYSELECT | 2.8 seconds | ease of use |
| DATA step | 2.2 seconds | speed |

## 5 Comments

Does adding macro variables for the names of the input and sample datasets, as well as for the sample size, affect the performance of these programs?

No. Adding macro variables does not affect performance. What takes time is reading large amounts of data, row by row, as the program executes. Resolving macro variables is a one-time action when the program compiles.

Pingback: Stratified random sample: What's efficient? - SAS Learning Post

I like the approach where you update the ratio. That's a good way to make sure you get exactly the right number in the sample without any bias.

I probably would have approached the problem by generating a dataset with 50 million rows consisting of an index from 1 to 50,000,000 and a uniform pseudo-random number, then sorted on the random number to permute the numbers, then use the data step with "point" to access the first million (now in random order). I'm not sure whether I think it would go faster if I resorted the first million into numerical order so that the access to the database would be sequential -- I would think it would result in less disk transfers when you had several observations near each other, but depending on the size of the records and the blocksize on the storage device it might not matter.

In general, I'm wondering how the timings of what you suggest (and the method I describe above) vary as the dataset gets wider than just the 3 variables -- say with 100 variables or 1,000 or 10,000? The SASFILE LOAD probably would work for the permutation dataset or the 3-variable version but presumably would fail as the dataset got bigger.

This kind of sampling is also called "random sampling without replacement." The sample contains a set of distinct observations. Other ways to sample without replacement are described in this blog post.

Another popular sampling scheme is "random sampling with replacement," in which an observation might appear multiple times in the sample. This is useful for bootstrap methods. The SURVEYSELECT procedure supports this and many other sampling schemes through the METHOD= option.
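Sampling with replacement is easy to sketch as well (Python, for illustration; the function name is mine). Each draw independently picks any row, so a row may appear multiple times in the sample:

```python
import random

def sample_with_replacement(rows, size, seed=None):
    """Draw `size` rows, each chosen independently from the full list."""
    rng = random.Random(seed)
    return rng.choices(rows, k=size)

boot = sample_with_replacement(list(range(100)), 100, seed=1)
```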

Incidentally, you might be able to speed up the DATA step programs if you use the global statement

SASFILE large LOAD;

which attempts to load the data set into RAM.