Splitting a data set into smaller data sets

11

Splitting one into smaller pieces

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many – not by a subsetting variable values, but by a number of observations in order to produce smaller, better manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

For instance, we might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

Also, we might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

  1. Each smaller data set should have maximum of K observations.
  2. There should be S smaller data sets of approximately same size.

Ideally, we would like to split a data set into K observations each, but it is not always possible to do as the quotient of dividing the number of observations in the original dataset N by K is not always going to be a whole number. Therefore, we will split it into several smaller data sets of K observations each, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Similarly, with the scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Below is a SAS macro code that covers both these scenarios. Parameter’s SPLIT_DEF value defines the type of scenario (SPLIT_DEF=NOBS means that the source data set is split into several smaller data sets with the number of observations not to exceed SPLIT_NUM value; SPLIT_DEF=SETS means that the source data set is split into SPLIT_NUM=value of smaller data sets).

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
        end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
    /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

   /* generate random numbers, _R_ */
   data;
      set &SRC_DATASET;
      call streaminit(1234);
      _R_ = rand('uniform');
   run;
 
   /* sort data in _R_ order */
   proc sort;
      by _R_;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST (drop=_R_);
      set;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log), however, smaller data sets will have their observations randomly selected from the source data set.

DATAn naming convention

You may have noticed that in this random splitting code I have not specified data set names neither in the DATA statement of the first DATA step, nor in the PROC SORT and not even in the SET statement of the last DATA step. Not only these shortcuts possible due to SAS’ DATAn naming convention, but it is a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros as it guarantees that you do not accidentally overwrite a data set with the same name in SAS program that invokes your macro. Think about it: if you are a macro developer you need to make sure that whatever temporary data sets you create within your macro their names must be unique for a SAS session in order not to interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are defaults in SAS’ DATAn naming convention:

  • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
  • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Related blog post: How to split a raw file or a data set into many external raw files

You may also like: How to create and use SAS macro functions

Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

11 Comments

  1. Bartosz Jabłoński on

    Leonid, thanks for another cool article!

    Let me share two small examples I have. I named them "hashSplit" and "obsSplit", hope you like them. 🙂

    All the best
    Bart

     
    /* split dataset into small parts */
     
    data cars;
      set sashelp.cars;
      output;
      output;
    run;
     
    %macro hashSplit(
     ds = cars
    ,parts = 100 /* parts = ceil(nobs/5) */
    ,prefix = Hpart
    );
      data _null_;
        if _N_ = 1 then 
          do;
            declare hash H(dataset:"&ds.(obs=0)", multidata:"Y");
            H.defineKey(all: "YES");
            H.defineDone();
          end;
     
        do _N_ = 1 by 1 until(_N_ = &parts. or EOF);
          set &ds. nobs = nobs end = eof;
          H.add();
        end;
     
        _iorc_ + 1;
        H.output(dataset: cats("&prefix.",_iorc_));
        H.clear();
        if EOF then stop;
      run;
     
    %mend hashSplit;
     
    %hashSplit(ds = cars, parts = 100)
    %hashSplit(ds = cars, parts = ceil(nobs/5) )
     
     
    %macro obsSplit( 
     ds = cars
    ,parts = 100 /* parts = ceil(nobs/5) */
    ,prefix = Opart
    );
      data _null_;
        call symputX("step", &parts., "L");
        call symputX("loop", ceil(divide(nobs, &parts.)) , "L");
        stop;
        set &ds. nobs = nobs;
      run;
     
      %local i;
      %do i = 1 %to %eval(&loop.);
     
        data &prefix.&i.;
          set &ds.(
            firstobs = %eval((&i.-1)*&step. + 1) 
            obs      = %eval(&i.*&step.) 
          );
        run;
     
      %end;
     
    %mend obsSplit;
     
    %obsSplit(ds = cars, parts = 100)
    %obsSplit(ds = cars, parts = ceil(nobs/5) )
    • Leonid Batkhan

      Thank you, Bart, for your comment. Your implementation examples are a great addition to this blog post. Hope our readers will appreciate it too.
      Re your second macro, I was too considering using FIRST= and OBS=, but chose to use SELECT/WHEN for the sole reason that it uses just a single DATA step. Just imagine your SAS log if you are to split 10,000,000 obs data set into 1,000 parts 🙂

  2. Hi Leonid,

    I'm relatively new to SAS. I loved this method. I have never seen someone use a DO loop with SELECT...WHEN. very nice. I also like how your variable and dataset names don't interfere with whatever was in the original dataset.

    I had been thinking about randomly splitting datasets before I saw your post - specifically, the time required. If we sort, we need to use at least O(n log n) time. I believe that this can be improved to O(n). I modified your program below. There was a slight difference on datasets with a million observations. The difference became more pronounced at 10 million. (92 seconds vs. 169 seconds) I stopped there.

    The downside is that my method may interfere if they have any variables named R, J, I or obsLeft(n).

    John Berglund

    (I don't even know how to get the fancy box surrounding my code. 🙂
    __________________________________

    %macro randSplit2 (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
    	/* Parameters:
    	/*   SRC_DATASET - name of the source data set     */
    	/*   OUT_PREFIX - prefix of the output data sets   */
    	/*   SPLIT_NUM - split number                      */
    	/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
     
       %local I K S N TLIST;
     
    	/* number of observations &K, number of smaller datasets &S */
    	data _null_;
    		if 0 then
    			set &SRC_DATASET nobs=N;
     
    		if upcase("&SPLIT_DEF")='NOBS' then
    			do;
    				call symputx('N', N);
    				call symputx('K', &SPLIT_NUM);
    				call symputx('S', ceil(N/&SPLIT_NUM));
    				put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
    			end;
    		else if upcase("&SPLIT_DEF")='SETS' then
    			do;
    				call symputx('N', N);
    				call symputx('S', &SPLIT_NUM);
    				call symputx('K', ceil(N/&SPLIT_NUM));
    				put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
    			end;
    		else
    			put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
    		stop;
    	run;
     
    	/* terminate macro if nothing to split */ 
     
    	%if (&K le 0) or (&S le 0) %then
    		%return;
     
    	/* generate list of smaller dataset names */ 
     
    	%do I=1 %to &S;
    		%let TLIST = &TLIST &OUT_PREFIX._&I;
    	%end;
     
    	/* split source dataset into smaller datasets */
    	/* generate random numbers, R */
    	data;
    		set &SRC_DATASET;
    		call streaminit(1234);
    		R=rand('uniform');
    	run;
     
    	/* split source dataset into smaller datasets */
    	data &TLIST;
    		array obsLeft {&S};
     
    		if _n_=1 then
    			do;
     
    				do I=1 to &S;
    					obsLeft{I}=&K*I;
    				end;
    				obsLeft{&S}=&N;
    			end;
    		set;
    		R=R*obsLeft{&S};
     
    		select;
     
    			%do I=1 %to &S;
    				when (R<obsLeft{&I}) 
    					do;
     
    						do J=&I to &S;
    							obsLeft{J}=obsLeft{J}-1;
    						end;
    						output &OUT_PREFIX._&I;
    					end;
    			%end;
    		end;
    		retain obsLeft;
    		drop R J I;
     
    		%do I=1 %to &S;
    			drop obsLeft&I;
    		%end;
    	run;
     
    %mend randSplit2;
    • Leonid Batkhan

      Thank you, John, for your comment and sharing your implementation for the random split. Your SAS code is quite impressive for someone "relatively new to SAS". When it comes to code implementation there is always room for improvement, and you have done just that. Kudos to you! My goal for the blog post was mostly educational so I chose the simplest way to illustrate the idea. And, by the way, in my macro implementation the R variable may interfere with the original data set, so I went ahead and changed it to _R_ which may also interfere, but with less probability :). Ideally, we would need to construct a unique name based on what variables already exist on the source data set. But that can be a topic for my next post...

    • Leonid Batkhan

      Hi John,
      I took a closer look at your code and found that it does something very different from the random splitting that I have described in this blog post. While my solution randomizes observation selection across the whole source data set, your solution only randomizes observation selection within small chunks of the the source data set. In essence, your randSplit2 macro processes the source data set sequentially in chunks of S observations (where S is a number of the output data sets) and only randomizes observation selection within each chunk. I would've called your algorithm sequential splitting with partial randomization. Still, it might be useful for some scenarios, but comparing the two by run time is only valid with understanding their profound differences.

      Thanks again for providing your alternative solution.

  3. Jagdish Latke on

    Thanks for sharing the blog. It provided great information in splitting dataset. In clinical industry it's common thing to split the large data into smaller one when it exceeds specific storage capacity. It will help a lot. Thanks once again.

  4. Hi Leonid, I knew that you did not need to specify a data set name in a DATA step, but did not know that you could produce multiple data sets this way. That's great, and your example (splitting the data randomly) is a perfect example of where it makes sense to do this. I always learn something from your blogs. Thanks!

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top