Splitting a data set into smaller data sets

26

Splitting one into smaller pieces

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many – not by a subsetting variable values, but by a number of observations in order to produce smaller, better manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

For instance, we might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

Also, we might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

  1. Each smaller data set should have maximum of K observations.
  2. There should be S smaller data sets of approximately same size.

Ideally, we would like to split a data set into K observations each, but it is not always possible to do as the quotient of dividing the number of observations in the original dataset N by K is not always going to be a whole number. Therefore, we will split it into several smaller data sets of K observations each, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Similarly, with the scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have the number of observations equal to the remainder of the division N by K.

Below is a SAS macro code that covers both these scenarios. Parameter’s SPLIT_DEF value defines the type of scenario (SPLIT_DEF=NOBS means that the source data set is split into several smaller data sets with the number of observations not to exceed SPLIT_NUM value; SPLIT_DEF=SETS means that the source data set is split into SPLIT_NUM=value of smaller data sets).

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
        end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
    /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

   /* generate random numbers, _R_ */
   data;
      set &SRC_DATASET;
      call streaminit(1234);
      _R_ = rand('uniform');
   run;
 
   /* sort data in _R_ order */
   proc sort;
      by _R_;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST (drop=_R_);
      set;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log), however, smaller data sets will have their observations randomly selected from the source data set.

DATAn naming convention

You may have noticed that in this random splitting code I have not specified data set names neither in the DATA statement of the first DATA step, nor in the PROC SORT and not even in the SET statement of the last DATA step. Not only these shortcuts possible due to SAS’ DATAn naming convention, but it is a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros as it guarantees that you do not accidentally overwrite a data set with the same name in SAS program that invokes your macro. Think about it: if you are a macro developer you need to make sure that whatever temporary data sets you create within your macro their names must be unique for a SAS session in order not to interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are defaults in SAS’ DATAn naming convention:

  • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
  • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.

For more information on this useful SAS coding technique see special data set names and examples and warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Related blog post: How to split a raw file or a data set into many external raw files

You may also like: How to create and use SAS macro functions

Share

About Author

Leonid Batkhan

Leonid Batkhan is a long-time SAS consultant and blogger. Currently, he is a Lead Applications Developer at F.N.B. Corporation. He holds a Ph.D. in Computer Science and Automatic Control Systems and has been a SAS user for more than 25 years. From 1995 to 2021 he worked as a Data Management and Business Intelligence consultant at SAS Institute. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

26 Comments

  1. I created a SAS Macro that divides a dataset into 'n' equal parts for bootstrap validation, using the Firstobs and obs data set options. I haven't tested the program written in the SAS Blog, but after reading it, I believe you used more steps than necessary to achieve the same result as mine. You can find my SAS macro code on my blog, where I have used it successfully on many datasets. it took me 2 hrs to sucessfully get the logic behind the relationship between obs, firstobs and total number of observations in the dataset. If you have some time then Review SAS macro in my blog as well: https://srinisasmacro.blogspot.com/

    • Leonid Batkhan

      Thank you, Satya, for sharing your implementation. However, I must correct you regarding “you used more steps than necessary to achieve the same result as mine”. It is quite the opposite as your macro uses 'n' data steps (generated by the macro loop) while macro %split() described in this post does all the splitting in a single data step. Some extra logic before that single processing data step is necessary in order to make this macro multi-functional: depending on SPLIT_DEF (=SETS or =NOBS) it can divide a dataset into SPLIT_NUM datasets or it can divide a dataset into several datasets of SPLIT_NUM observations each.

  2. Blessing Sibanda on

    Hi Leonid, thank you for this. So If I already have 3 data sets (eg Scores from three different Universities) all mixed up in a single excel sheet and with different number of scores. Eg Oxford sample with 4000 scores, Cambridge with 3200 scores, and Portsmouth with 1800 scores and I want to trim them so that the sample sizes match the university with the lowest number of scores (in this case 1800), what would the macros look like?

    • Leonid Batkhan

      You can use the following 3-step approach to randomly select equal samples for each university. 1) First, you would need to split your single excel sheet into 3 data sets (OXFORD, CAMBRIDGE, PORTSMOUTH). 2) Then determine the sample size as the lowest number (1800) observations, and 3) then call this macro (with random selection) on each of the 3 data sets:

      %split(SRC_DATASET=OXFORD, OUT_PREFIX=WORK.OXFORD, SPLIT_NUM=1800, SPLIT_DEF=NOBS);
      %split(SRC_DATASET=CAMBRIDGE, OUT_PREFIX=WORK.CAMBRIDGE, SPLIT_NUM=1800, SPLIT_DEF=NOBS);
      %split(SRC_DATASET=PORTSMOUTH, OUT_PREFIX=WORK.PORTSMOUTH, SPLIT_NUM=1800, SPLIT_DEF=NOBS);
      

      Your resulting datasets WORK.OXFORD_1, WORK.CAMBRIDGE_1 and WORK.PORTSMOUTH_1 will have each 1800 observations (equal sample size) randomly selected from your original samples. You may disregard WORK.OXFORD_2, WORK.OXFORD_3, WORK.CAMBRIDGE_2 data sets.

      Hope this helps.

  3. Hi Leonid, I knew that you did not need to specify a data set name in a DATA step, but did not know that you could produce multiple data sets this way. That's great, and your example (splitting the data randomly) is a perfect example of where it makes sense to do this. I always learn something from your blogs. Thanks!

  4. Jagdish Latke on

    Thanks for sharing the blog. It provided great information in splitting dataset. In clinical industry it's common thing to split the large data into smaller one when it exceeds specific storage capacity. It will help a lot. Thanks once again.

  5. Hi Leonid,

    I'm relatively new to SAS. I loved this method. I have never seen someone use a DO loop with SELECT...WHEN. very nice. I also like how your variable and dataset names don't interfere with whatever was in the original dataset.

    I had been thinking about randomly splitting datasets before I saw your post - specifically, the time required. If we sort, we need to use at least O(n log n) time. I believe that this can be improved to O(n). I modified your program below. There was a slight difference on datasets with a million observations. The difference became more pronounced at 10 million. (92 seconds vs. 169 seconds) I stopped there.

    The downside is that my method may interfere if they have any variables named R, J, I or obsLeft(n).

    John Berglund

    (I don't even know how to get the fancy box surrounding my code. 🙂
    __________________________________

    %macro randSplit2 (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
    	/* Parameters:
    	/*   SRC_DATASET - name of the source data set     */
    	/*   OUT_PREFIX - prefix of the output data sets   */
    	/*   SPLIT_NUM - split number                      */
    	/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
     
       %local I K S N TLIST;
     
    	/* number of observations &K, number of smaller datasets &S */
    	data _null_;
    		if 0 then
    			set &SRC_DATASET nobs=N;
     
    		if upcase("&SPLIT_DEF")='NOBS' then
    			do;
    				call symputx('N', N);
    				call symputx('K', &SPLIT_NUM);
    				call symputx('S', ceil(N/&SPLIT_NUM));
    				put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
    			end;
    		else if upcase("&SPLIT_DEF")='SETS' then
    			do;
    				call symputx('N', N);
    				call symputx('S', &SPLIT_NUM);
    				call symputx('K', ceil(N/&SPLIT_NUM));
    				put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
    			end;
    		else
    			put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
    		stop;
    	run;
     
    	/* terminate macro if nothing to split */ 
     
    	%if (&K le 0) or (&S le 0) %then
    		%return;
     
    	/* generate list of smaller dataset names */ 
     
    	%do I=1 %to &S;
    		%let TLIST = &TLIST &OUT_PREFIX._&I;
    	%end;
     
    	/* split source dataset into smaller datasets */
    	/* generate random numbers, R */
    	data;
    		set &SRC_DATASET;
    		call streaminit(1234);
    		R=rand('uniform');
    	run;
     
    	/* split source dataset into smaller datasets */
    	data &TLIST;
    		array obsLeft {&S};
     
    		if _n_=1 then
    			do;
     
    				do I=1 to &S;
    					obsLeft{I}=&K*I;
    				end;
    				obsLeft{&S}=&N;
    			end;
    		set;
    		R=R*obsLeft{&S};
     
    		select;
     
    			%do I=1 %to &S;
    				when (R<obsLeft{&I}) 
    					do;
     
    						do J=&I to &S;
    							obsLeft{J}=obsLeft{J}-1;
    						end;
    						output &OUT_PREFIX._&I;
    					end;
    			%end;
    		end;
    		retain obsLeft;
    		drop R J I;
     
    		%do I=1 %to &S;
    			drop obsLeft&I;
    		%end;
    	run;
     
    %mend randSplit2;
    • Leonid Batkhan

      Thank you, John, for your comment and sharing your implementation for the random split. Your SAS code is quite impressive for someone "relatively new to SAS". When it comes to code implementation there is always room for improvement, and you have done just that. Kudos to you! My goal for the blog post was mostly educational so I chose the simplest way to illustrate the idea. And, by the way, in my macro implementation the R variable may interfere with the original data set, so I went ahead and changed it to _R_ which may also interfere, but with less probability :). Ideally, we would need to construct a unique name based on what variables already exist on the source data set. But that can be a topic for my next post...

    • Leonid Batkhan

      Hi John,
      I took a closer look at your code and found that it does something very different from the random splitting that I have described in this blog post. While my solution randomizes observation selection across the whole source data set, your solution only randomizes observation selection within small chunks of the the source data set. In essence, your randSplit2 macro processes the source data set sequentially in chunks of S observations (where S is a number of the output data sets) and only randomizes observation selection within each chunk. I would've called your algorithm sequential splitting with partial randomization. Still, it might be useful for some scenarios, but comparing the two by run time is only valid with understanding their profound differences.

      Thanks again for providing your alternative solution.

  6. Bartosz Jabłoński on

    Leonid, thanks for another cool article!

    Let me share two small examples I have. I named them "hashSplit" and "obsSplit", hope you like them. 🙂

    All the best
    Bart

     
    /* split dataset into small parts */
     
    data cars;
      set sashelp.cars;
      output;
      output;
    run;
     
    %macro hashSplit(
     ds = cars
    ,parts = 100 /* parts = ceil(nobs/5) */
    ,prefix = Hpart
    );
      data _null_;
        if _N_ = 1 then 
          do;
            declare hash H(dataset:"&ds.(obs=0)", multidata:"Y");
            H.defineKey(all: "YES");
            H.defineDone();
          end;
     
        do _N_ = 1 by 1 until(_N_ = &parts. or EOF);
          set &ds. nobs = nobs end = eof;
          H.add();
        end;
     
        _iorc_ + 1;
        H.output(dataset: cats("&prefix.",_iorc_));
        H.clear();
        if EOF then stop;
      run;
     
    %mend hashSplit;
     
    %hashSplit(ds = cars, parts = 100)
    %hashSplit(ds = cars, parts = ceil(nobs/5) )
     
     
    %macro obsSplit( 
     ds = cars
    ,parts = 100 /* parts = ceil(nobs/5) */
    ,prefix = Opart
    );
      data _null_;
        call symputX("step", &parts., "L");
        call symputX("loop", ceil(divide(nobs, &parts.)) , "L");
        stop;
        set &ds. nobs = nobs;
      run;
     
      %local i;
      %do i = 1 %to %eval(&loop.);
     
        data &prefix.&i.;
          set &ds.(
            firstobs = %eval((&i.-1)*&step. + 1) 
            obs      = %eval(&i.*&step.) 
          );
        run;
     
      %end;
     
    %mend obsSplit;
     
    %obsSplit(ds = cars, parts = 100)
    %obsSplit(ds = cars, parts = ceil(nobs/5) )
    • Leonid Batkhan

      Thank you, Bart, for your comment. Your implementation examples are a great addition to this blog post. Hope our readers will appreciate it too.
      Re your second macro, I was too considering using FIRST= and OBS=, but chose to use SELECT/WHEN for the sole reason that it uses just a single DATA step. Just imagine your SAS log if you are to split 10,000,000 obs data set into 1,000 parts 🙂

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top