How to split a raw file or a data set into many external raw files

14

Splitting external raw/text/flat files into multiple files

Recently, I worked on a cybersecurity project that entailed processing a staggering number of raw text files about web traffic. Millions of rows had to be read and parsed to extract variable values.

The problem was complicated by the varying records composition. Each external raw file was a collection of records of different structures that required different parsing programming logic. Besides, those heterogeneous records could not possibly belong to the same rectangular data tables with fixed sets of columns.

Solving the problem

To solve the problem, I decided to employ a "divide and conquer" strategy: to split the external file into many files, each with a homogeneous structure, then parse them separately to create as many output SAS data sets.

My plan was to use a SAS DATA Step for looping through the rows (records) of the external file, read each row, identify its type, and based on that, write it to a corresponding output file.

Like how we would split a data set into many:

 
data CARS_ASIA CARS_EUROPE CARS_USA;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   output CARS_ASIA;
      when('Europe') output CARS_EUROPE;
      when('USA')    output CARS_USA;
   end;   
run;

But how do you switch between the output files? The idea came from SAS' Chris Hemedinger, who suggested using multiple FILE statements to redirect output to different external files.

Splitting an external raw file into many

As you know, one can use PUT statement in a SAS DATA Step to output a character string or a combination of character strings and variable values into an external file. That external file (a destination) is defined by a FILE statement. While that destination can only be one at a time (we can’t write into multiple raw file destinations simultaneously), there is no restriction on the number of FILE statements in a DATA Step; however, only one FILE statement at a time is active.

Whenever you define a new destination with the FILE statement, it will overwrite its previous definition, but that is exactly what we need.
Let’s say we need to split an external raw file input_file.txt into several raw files based on the value of some field REC_TYPE indicative of the record type. Then we can achieve that as shown in the following code example:

 
filename inf  'c:\temp\input_file.txt';
filename out1 'c:\temp\traffic.txt';
filename out2 'c:\temp\system.txt';
filename out3 'c:\temp\threat.txt';
filename out4 'c:\temp\other.txt';
 
data _null_;
   infile inf;
   input REC_TYPE $10.;
   select(REC_TYPE);
      when('TRAFFIC') file out1;
      when('SYSTEM')  file out2;
      when('THREAT')  file out3;
      otherwise       file out4;
   end;
   put _infile_;
run;

In this code, the INPUT statement retrieves the value of REC_TYPE. It may not be used exactly as written, but the point is you need to capture the filed(s) of interest.

But the INPUT statement does something else behind the scene. First, SAS creates an input buffer and loads the entire raw data line into it. The Input Buffer's value gets assigned to the _INFILE_ automatic variable. Then REC_TYPE variable value is moved to the Program Data Vector (PDV).

After that INPUT statement, two variables are available in the DATA Step - REC_TYPE and _INFILE_ (automatic variable).

Depending on the value of the REC_TYPE variable, SELECT block toggles the FILE definition between one of the four filerefs, out1, out2, out3, or out4.

Then the PUT statement outputs the _infile_ automatic variable value to the output file defined in the SELECT block.

Splitting a data set into several external files

Similar technique can be used to split a data table into several external raw files. Let’s combine the above two code samples to demonstrate how you can split a data set into several external raw files:

 
filename outasi 'c:\temp\cars_asia.txt';
filename outeur 'c:\temp\cars_europe.txt';
filename outusa 'c:\temp\cars_usa.txt';
 
data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi;
      when('Europe') file outeur;
      when('USA')    file outusa;
   end;
   put (_all_)(=); 
run;

This code will read observations of the SASHELP.CARS data table, and depending on the value of ORIGIN variable, put (_all_)(=); statement will output all the variables in their physical order as named values (VARIABLE_NAME=VARIABLE_VALUE pairs) to one of the three external raw files specified by their respective file references (outasi, outeur, or outusa.)

NOTE: If you use put _all_; statement, then SAS will output all the variables including automatic variables _ERROR_ and _N_.

You can modify this code to produce delimited files with full control over which variables and in what order to output. For example, the following code sample produces 3 files with comma-separated values:

 
data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi dlm=',';
      when('Europe') file outeur dlm=',';
      when('USA')    file outusa dlm=',';
   end;
   put make model type origin msrp invoice; 
run;

You may use different delimiters for your output files. In addition, rather than using mutually exclusive SELECT, you may use different logic for re-directing your output to different external files.

Bonus: How to zip your output files as you create them

For those readers who are patient enough to read to this point, here is another tip. As described in this series of blog posts by Chris Hemedinger, in SAS you can read your external raw files directly from zipped files without unzipping them first, as well as write your output raw files directly into zipped files. You just need to specify that in your filename statement. For example:

UNIX/Linux

 
filename outusa ZIP '/sas/data/temp/cars_usa.txt.gz' GZIP;

Windows

 
filename outusa ZIP 'c:\temp\cars.zip' member='cars_usa.txt';

Your turn

What is your experience with creating multiple external raw files? Could you please share with the rest of us?

Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

14 Comments

  1. Hi Leonid,
    I'm what you call a lazy programmer. Maintenance is boring and much disliked. Have you ever used the FILEVAR= option on the FILE statement. In this case doing so will allow the REC_TYPE to be any value and you won't have to change your program as REC_TYPE values change. The MOD option says to append to the file if it exists, so you may have to remove the created text files if you want fresh ones for each run. Additionally, _INFILE_ is available immediately after any INPUT statement is executed so no need for @ to hang onto the record. Try this.

    %let Folder=C:\Temp;
    filename inf  "&Folder.\input_file.txt";
     
    data _null_;
       infile inf;
       input REC_TYPE $10. ;
       whichFile="&Folder.\"||strip(REC_TYPE)||".txt";
       file dummy filevar=whichFile MOD;
       put _infile_;
    run;
    

    • Leonid Batkhan

      Hi Jerry,
      Thank you for stopping by and providing an alternative data-driven way of doing it. The FILEVAR= usage is described in the SAS Support Sample 24599: Create multiple files from one data set. There is a couple of considerations. According to that Sample, "When creating multiple files with the FILEVAR= option, the files must have the same layout. When the value of the FILEVAR= variable changes, SAS closes the current file and opens a new file." That is why you need MOD. If you use multiple FILE statements as described in this blog post, the files remain open, and you don't need MOD; you just switch between them for writing, and you don't have to remove them to start fresh between runs. Also, in your code example you use REC_TYPE for both, record type identifier and the output file name. In more general case of them being different you would end up with SELECT-END block (or its alternative) for re-coding REC_TYPE into file name. So, all in all, "lazy" programming may turn out to be more "expensive".

      Regarding your second point that "_INFILE_ is available immediately after any INPUT statement is executed so no need for @ to hang onto the record", - it is somewhat counter-intuitive in a sense that even when INPUT statement specifies reading a single first character from the raw file SAS reads in the entire row regardless. Is that efficient? But it seems to be working that way nevertheless. That also means that the second input statement is redundant. I will double check on this, and if it indeed works without limitations, this "hack" alone is worth a fortune. Thank you for sharing!

      • Hi Leonard,
        I think Sample 24599 can mislead when it says "the files must have the same layout" - it really means little more than you can't change other file specification options (such as RECFM), it doesn't mean you can't arrange data elements in different orders, create repeating groups or anything like that. Using FILEVAR= has the advantage that you don't need to know the 'rec-types' or number of files being created before-hand, as you would with constructs like SELECT ... WHEN ...
        In any case, if you're doing little else than PUTting the_INFILE_ variable to the output files then they'd effectively have the same 'layout'.
        I've used this many times to implement data-driven ETL.

        • Leonid Batkhan

          Hi David,
          Thank for your comment. I agree with you that the FILEVAR= option has an advantage in those cases where we don’t know the number of files being created, and that is where a data-driven approach is totally appropriate. However, I dare to say that “data-driven” is not always the best method. For the use case I describe in this blog post, where an external file is split based on the record type and where output files are known in advance, I prefer not to use a data-driven implementation for better code clarity and readability. I have also not seen any written material on the usage of multiple FILE statements in a SAS data step and wanted to shed light on this method in this post.

    • Leonid Batkhan

      Hi Jerry, based on your input I revised my original code

      data _null_;
         infile inf;
         input REC_TYPE $10. @;
         input;
         select(REC_TYPE);
         . . .
      

      to eliminate the trailing @ and consequently the second input statement:

      data _null_;
         infile inf;
         input REC_TYPE $10.;
         select(REC_TYPE);
         . . .
      

      and provided corresponding explanation.

      Thank you again for your INPUT (pun is coincidentally intended 🙂 .)

  2. Peter Lancashire on

    I know this is a blog about SAS, but I would question whether SAS is the right tool for this. The ancient unix awk tool is designed to do exactly this. Something like this one-line script would do what you want:
    print >> $1
    .
    Almost any structured text file editing is hard in SAS. I often use awk. The program file is 152kB in Windows. The GNU version is a bit bigger.
    .
    I imagine there is a clean and efficient way to do this in python too, but at the cost of more code bloat.

    • Leonid Batkhan

      Hi Peter. Thank you for your comment (non-SAS views a welcome here too.) While this particular task (splitting text files into many) can be solved by a variety different tools, it is just a part of a bigger picture where the power of SAS shines.

      By the way, your one-line script example is a bit of exaggeration; it does not do what is needed here as it lacks criteria of splitting the source file as well as specifying the output files.

      In SAS we can run any OS command using x-statement (x ‘any-operating-system-command’; ) without leaving SAS environment, which is not more complicated than running that command directly in a command window. However, because some IT policies prohibit using OS command in SAS we also rely on other ways of doing the same tasks.

      SAS is a great integrator tool; it allows a transparent access to multiple heterogeneous data source, as well as combining different languages into one automated process. You can run R or Python from within SAS as well.

      For the project I describe in this blog post, SAS is definitely the right tool for an automated process that includes reading external files, parsing them, validating and producing SAS tables that are loaded into SAS Visual Analytics for further analysis and modeling. I am just illustrating how seemingly non-SAS task can be solved by SAS to ensure smooth flow of execution.

      I might surprise you, but we can use SAS to create UNIX scripts too 🙂 .

  3. Great post. re "_INFILE_ is available immediately after any INPUT statement is executed so no need for @ to hang onto the record" I would agree. There are examples of it in Howard Schreier's paper https://www.lexjansen.com/nesug/nesug01/cc/cc4018bw.pdf. I did a look through the docs as well. The closest I could find is this walk through of how the data step works. https://documentation.sas.com/?docsetId=lrcon&docsetTarget=p18vk5t9cwort1n18g7zg2no6tr4.htm&docsetVersion=9.4&locale=en. I think the key point is that any INPUT statement reads a record into the input buffer referenced by _INFILE_ (even a null input;) , and that the input buffer is not cleared automatically. It is retained even across iterations of the data step loop. The only time the content of the input buffer is changed is when new data is read into it by another execution of an INPUT statement (assuming there was not trailing @/@@ on the prior INPUT statement). As I see it.

    • Leonid Batkhan

      Hi Quentin,
      Thank you for your comment and the references.
      1) In his paper, Howard Schreier uses _INFILE_ automatic variable after the following statement:
      input @;
      That is not after “any” input statement. This INPUT statement does not specify any variables to read, and that is exactly how I understood it and used in this blog post as the second (input; ) statement (I did not need trailing @ as there were no consequent input statements).
      2) Your second reference - SAS documentation does explain when input buffer is created: “If the input file contains raw data (as in the example above), SAS creates an input buffer to hold the data before moving the data to the program data vector (PDV).” Then “SAS reads the first data line into the input buffer.” However, the doc refers to the example of multiple variables specified in the INPUT statement where the whole record indeed needs to be read.
      What is counterintuitive (at least to me) is that even when you read just a single (even first) character, for example using
      input @1 firstchar $1;
      SAS will still read the whole record of up to 32KB long, load it into the record input buffer, and subsequently assign that whole record value to _INFILE_.
      I believe that the confusion could be dispelled by explicitly stating that any INPUT statement not following another INPUT statement with trailing @ or @@, always loads the entire data row into the input buffer and assigns its value to the _INFILE_ automatic variable, regardless of whether any variables or how many of them are specified with that INPUT statement. If an INPUT statement follows another INPUT statement with trailing @ or @@, then it skips reading the data row into the input buffer and updating the _INFILE_ automatic variable, and proceeds right to moving the data from the input buffer to the PDV.

      • I like your proposed addition to the documentation. Actually, it got me wondering about the LRECL option. There isn't a lot of documentation on that either. : ) Maybe one part of it is that the LRECL option (whether system option or option on the infile statement) determines the length of the input buffer (I think, I'm not sure!) and that determines the amount of data read into the input buffer by the INPUT statement. So even if the INPUT statement reads no variables into the PDV, it will still read LRECL characters into the input buffer and assign that value to _INFILE_.

        • Leonid Batkhan

          Thank you, Quentin. It seems logical that LRECL= determines the size of the input buffer, although since it is not stated explicitly in the documentation it is open to interpretation. As to how that LRECL value affects performance I will do some benchmarking.

  4. Hi Leonid,

    I've been looking for something like this but I'm wondering how can I amend your code for my needs.

    Basically, I have a .json file with 6.7million records. Currently it is too large to use the sas json mapper, therefore I need to split it out.

    filename json3 '\\C\Rec_20190402\total_json.json';
    filename out1 '\\C\Rec_20190402\paye_1.json';
    filename out2 '\\C\Rec_20190402\paye_1.json';
    filename out3 '\\C\Rec_20190402\paye_1.json';

    data _null_;
    infile json3;
    if _n_ ge 500000 then output out1;
    ****and so on
    put _infile_;
    run;

    Basically I would like to ouput the large files into smaller ones based on when _N_ is a value.

    I'm having trouble trying to get working code though and would appreciate some advice.

    • Leonid Batkhan

      Hi Sean, I see several places where your code went wrong. First, you are missing INPUT statement that reads a record; second, you use OUTPUT statement which is used for datasets, for files you need to use FILE statement. Your output files are defined identically. You also don't provide your IF-THEN logic ("****and so on" is not enough). The devil is in details. Here is how I would arrange your code:

      filename json3 'C:\Rec_20190402\total_json.json';
      filename out1 'C:\Rec_20190402\paye_1.json';
      filename out2 'C:\Rec_20190402\paye_2.json';
      filename out3 'C:\Rec_20190402\paye_3.json';
      
      data _null_;
         infile json3;
         input;
         if _n_ le  500000 then file out1; else
         if _n_ le 1000000 then file out2; else file out3;
         put _infile_;
      run;
      

      Adjust it as needed. Please let me know if it solves your problem.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top