In a previous blog, Random Sampling: What's Efficient?, I discussed the efficiency of various techniques for selecting a simple random sample from a large SAS dataset. PROC SURVEYSELECT easily does the job:
proc surveyselect data=large out=sample method=srs /* simple random sample */ rate=.01; /* 1% sample rate */ run;
Note: If you have not read the above-linked blog, I recommend you do so before you continue reading this one.
Now suppose your data includes a STATE variable, and you want to guarantee that your random sample includes the correct proportion of observations from each U.S. state. You need a stratified random sample. Again, PROC SURVEYSELECT easily does the job:
proc surveyselect data=large out=sample1 method=srs /* simple random sample */ rate=.01; /* 1% sample rate */ strata state /* sorted data required */ / alloc=proportional; run;
When the STRATA statement is used, PROC SURVEYSELECT requires the input dataset to be sorted by the strata column (STATE).
If you have a SAS/STAT license, you can enjoy the ease of use and flexibility of PROC SURVEYSELECT. If not, or if you're an efficiency hound like me, and you don't mind writing a little code (also like me) you can select a stratified random sample with a super-efficient 2-step BASE SAS program.
Step 1 of 2:
The PROC FREQ step below reads the LARGE dataset and creates a tiny output dataset, named COUNTS, with one column, named COUNT, and 50 rows, one per US state, sorted by STATE.
proc freq data=large (keep=state); table state / noprint out=Counts (keep=Count); run;
The LARGE dataset has 100 million rows and the four columns below:
The STATE variable is distributed in approximately the same proportions as the actual population of each US state. Additionally, the LARGE dataset is sorted by STATE.
Refer to the end of this blog to see how I created the LARGE dataset.
Step 2 of 2:
The DATA step below selects a stratified random sample of exactly 1 million rows (1%) from the LARGE dataset, reading only the selected rows, bypassing the other 99 million rows (99%) for extremely fast performance. As with PROC SURVEYSELECT, the LARGE dataset must be sorted by the strata column (STATE). The results are comparable to PROC SURVEYSELECT results.
data sample2; drop Need Pool Ratio Row Jump; set Counts; * One row per strata (US state); Need=round(Count*.01); * 1% sample from each strata; Pool=Count; * Available observations per strata; Jump+Count; * Cumulative count per strata; do until (Need=0); * Continue until no additional rows needed; Row+1; * Row number. Every row is considered; Ratio=Need/Pool; * 1 percent, continuously adjusted; if rand('uniform') < ratio then do; * Random selection; set large point=Row; * Jump to and read selected row; output; * Output selected row; Need=Need-1; * Subtract 1 from needed rows; end; Pool=Pool-1; * Subtract 1 from pool of remaining rows; end; Row=Jump; * JUMP (created above) to the last ROW of the strata; run;
Times below represent averages, rounded to the nearest second:
*I ran PROC SURVEYSELECT a second time, adding the DROP= option below to the SAMPLE1 output dataset, to drop variables otherwise created automatically by PROC SURVEYSELECT:
(drop=AllocProportion SampleSize ActualProportion SelectionProb SamplingWeight)
The DATA step below creates the LARGE dataset from the STATES dataset.
data large; keep ID State Count Char; length Char $ 200; set states(keep=state percent); do ID=1 to 100000000*percent until(count=100000000); * 100M; Count+1; output; end; run;
The DATA step below creates the STATES dataset.
data states; input @8 State & : $20. Population : comma13. Percent : percent8.; format Population comma13. Percent percent9.2; datalines; Copy and paste the below data here. ;