Suppose you wish to select a random sample from a large SAS dataset. No problem. The PROC SURVEYSELECT step below randomly selects a 2 percent sample:
proc surveyselect data=large out=sample
   method=srs      /* simple random sample */
   n=1000000;      /* sample size */
run;
Do you have a SAS/STAT license? If not, or if you seek maximum efficiency, keep reading.
I benchmarked the above program against a SAS dataset with 50 million rows and 3 columns, running SAS 9.4 on my 64-bit Windows 7 machine.
Time: 2.8 seconds
Is the DATA step more efficient? It might be, if you know some tricks. Let's look at a few.
The DATA step below selects a random sample of approximately 1 million rows.
899  data sample;
900     set large;
901     if rand('uniform') < .02;
902  run;

NOTE: There were 50000000 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.SAMPLE has 1001125 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           3.31 seconds
      cpu time            3.29 seconds
The RAND function with the UNIFORM argument generates random numbers between 0 and 1, with every value in that range equally likely. The subsetting IF statement keeps a row only when its random number is less than .02, which happens about 2 percent of the time, so the result is a random sample of approximately 2 percent.
Time: 3.3 seconds (Winner: PROC SURVEYSELECT, but keep reading)
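One aside of my own, not part of the benchmark: by default RAND produces a different stream of numbers on every run. If you want a reproducible sample, seed the generator with CALL STREAMINIT before the first call to RAND:

data sample;
   call streaminit(12345);      /* any seed you like; makes the sample repeatable */
   set large;
   if rand('uniform') < .02;    /* keep roughly 2 percent of rows */
run;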
The SET statement above reads all 50 million rows by default, which are then filtered by the subsetting IF statement. We can, however, change the SET statement to read only 1 million rows, bypassing the other 49 million, for a huge performance improvement, like this:
data sample;
   set large (obs=1000000);
run;
Time: .1 second
This, of course, is cheating, because the DATA step above reads the first million rows, which is not a random sample.
The DATA step below, like the one above, also reads only 1 million rows, bypassing the other 49 million, for very fast performance. This time, instead of reading the first million rows, the SET statement reads every 50th row.
data sample;
   do RowNum=50 to 50000000 by 50;
      set large point=RowNum;
      output;
   end;
   stop;
run;
The POINT= option designates a numeric variable whose value represents a row number. The SET statement jumps directly to each designated row, and reads that row, bypassing intervening rows.
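As a small illustration (row 12345 is an arbitrary number of my choosing), POINT= can jump to any single row. Note the STOP statement: a DATA step that reads only with POINT= never detects end-of-file, so STOP is what ends the step.

data one_row;
   RowNum=12345;              /* hypothetical row number to fetch */
   set large point=RowNum;    /* jump directly to that row and read it */
   output;
   stop;                      /* required: POINT= never signals end-of-file */
run;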
A sample of every 50th row may be more representative than the first million rows, but it's still not random.
Time: .9 second
The DATA step below, like the OBS= and every-50th-row steps above, also reads only 1 million rows, bypassing the other 49 million. This time, however, the DATA step selects a random sample, comparable to PROC SURVEYSELECT.
data sample;
   drop Need Pool Ratio Row;
   Need=1000000;       * Desired sample size;
   Pool=50000000;      * Number of rows in the dataset;
   do until (Need=0);  * Continue until no additional sample rows are needed;
      Row+1;           * Row number. Every row in the pool is a candidate for inclusion in the sample;
      Ratio=Need/Pool; * Initial ratio is .02, or 2 percent. The ratio is continuously adjusted as rows are selected or skipped;
      if rand('uniform') < Ratio then do;  * True for exactly 2 percent of rows overall;
         set large point=Row;   * Jump directly to the desired row and read it;
         output;                * Output the selected row;
         Need=Need-1;           * A row was selected, so one fewer row is needed;
      end;
      Pool=Pool-1;     * Shrink the pool by 1 whether or not the row was selected;
   end;
   stop;
run;
Method              Time         Advantage
PROC SURVEYSELECT   2.8 seconds  ease of use
DATA step           2.2 seconds  speed
Comments
Does adding macro variables for the names of the input and sample datasets, as well as for the sample size, affect the performance of these programs?
No. Adding macro variables does not affect performance. What takes time is reading large amounts of data, row by row, as the program executes. Resolving macro variables is a one-time action that happens when the program compiles.
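For illustration, a parameterized version of the final DATA step might look like the sketch below (the macro variable names are my own, not the original poster's):

%let need=1000000;      /* desired sample size       */
%let pool=50000000;     /* rows in the input dataset */

data sample;
   drop Need Pool Ratio Row;
   Need=&need;          /* resolved once, when the step compiles */
   Pool=&pool;
   do until (Need=0);
      Row+1;
      Ratio=Need/Pool;
      if rand('uniform') < Ratio then do;
         set large point=Row;
         output;
         Need=Need-1;
      end;
      Pool=Pool-1;
   end;
   stop;
run;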
I like the approach where you update the ratio. That's a good way to make sure you get exactly the right number in the sample without any bias.
I probably would have approached the problem by generating a dataset with 50 million rows consisting of an index from 1 to 50,000,000 and a uniform pseudo-random number, then sorting on the random number to permute the indexes, then using the DATA step with POINT= to access the first million (now in random order). I'm not sure whether it would go faster if I re-sorted the first million back into numerical order so that access to the dataset would be sequential -- I would think that would result in fewer disk transfers when several observations are near each other, but depending on the size of the records and the block size on the storage device it might not matter.
In general, I'm wondering how the timings of what you suggest (and the method I describe above) vary as the dataset gets wider than just the 3 variables -- say with 100 variables or 1,000 or 10,000? The SASFILE LOAD probably would work for the permutation dataset or the 3-variable version but presumably would fail as the dataset got bigger.
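A minimal sketch of the permutation approach described above, assuming the same 50-million-row dataset and 1-million-row sample as the post (the code is mine, not the commenter's):

data index;
   do i=1 to 50000000;          /* one row per observation number */
      r=rand('uniform');        /* random sort key */
      output;
   end;
run;

proc sort data=index out=permuted;   /* sorting on the random key permutes the row numbers */
   by r;
run;

data sample;
   drop i r;
   set permuted (obs=1000000);  /* the first million row numbers, now in random order */
   set large point=i;           /* fetch the corresponding row from the large dataset */
run;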
This kind of sampling is also called "random sampling without replacement." The sample contains a set of distinct observations. Other ways to sample without replacement are described in this blog post.
Another popular sampling scheme is "random sampling with replacement," in which an observation might appear multiple times in the sample. This is useful for bootstrap methods. The SURVEYSELECT procedure supports this and many other sampling schemes through the METHOD= option.
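For example (a sketch that reuses the dataset and sample size from the post), METHOD=URS draws a with-replacement sample, and the OUTHITS option writes one output row per selection so that duplicates appear explicitly:

proc surveyselect data=large out=boot
   method=urs      /* unrestricted random sampling: with replacement */
   outhits         /* one output row per hit, so repeated selections are listed */
   n=1000000;      /* sample size */
run;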
Incidentally, you might be able to speed up the DATA step programs if you use the global statement
SASFILE large LOAD;
which attempts to load the data set into RAM.
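A minimal sketch of how that suggestion might be applied (my own framing):

sasfile large load;      /* load WORK.LARGE into memory, if it fits */

data sample;             /* any of the sampling DATA steps above */
   set large;
   if rand('uniform') < .02;
run;

sasfile large close;     /* release the memory when you are done */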