Recently, I worked on a cybersecurity project that entailed processing a staggering number of raw text files about web traffic. Millions of rows had to be read and parsed to extract variable values.
The problem was complicated by the varying records composition. Each external raw file was a collection of records of different structures that required different parsing programming logic. Besides, those heterogeneous records could not possibly belong to the same rectangular data tables with fixed sets of columns.
Solving the problem
To solve the problem, I decided to employ a "divide and conquer" strategy: to split the external file into many files, each with a homogeneous structure, then parse them separately to create as many output SAS data sets.
My plan was to use a SAS DATA Step for looping through the rows (records) of the external file, read each row, identify its type, and based on that, write it to a corresponding output file.
Like how we would split a data set into many:
data CARS_ASIA CARS_EUROPE CARS_USA; set SASHELP.CARS; select(origin); when('Asia') output CARS_ASIA; when('Europe') output CARS_EUROPE; when('USA') output CARS_USA; end; run;
But how do you switch between the output files? The idea came from SAS' Chris Hemedinger, who suggested using multiple FILE statements to redirect output to different external files.
Splitting an external raw file into many
As you know, one can use PUT statement in a SAS DATA Step to output a character string or a combination of character strings and variable values into an external file. That external file (a destination) is defined by a FILE statement. While that destination can only be one at a time (we can’t write into multiple raw file destinations simultaneously), there is no restriction on the number of FILE statements in a DATA Step; however, only one FILE statement at a time is active.
Whenever you define a new destination with the FILE statement, it will overwrite its previous definition, but that is exactly what we need.
Let’s say we need to split an external raw file input_file.txt into several raw files based on the value of some field REC_TYPE indicative of the record type. Then we can achieve that as shown in the following code example:
filename inf 'c:\temp\input_file.txt'; filename out1 'c:\temp\traffic.txt'; filename out2 'c:\temp\system.txt'; filename out3 'c:\temp\threat.txt'; filename out4 'c:\temp\other.txt'; data _null_; infile inf; input REC_TYPE $10.; select(REC_TYPE); when('TRAFFIC') file out1; when('SYSTEM') file out2; when('THREAT') file out3; otherwise file out4; end; put _infile_; run;
In this code, the INPUT statement retrieves the value of REC_TYPE. It may not be used exactly as written, but the point is you need to capture the filed(s) of interest.
But the INPUT statement does something else behind the scene. First, SAS creates an input buffer and loads the entire raw data line into it. The Input Buffer's value gets assigned to the _INFILE_ automatic variable. Then REC_TYPE variable value is moved to the Program Data Vector (PDV).
After that INPUT statement, two variables are available in the DATA Step - REC_TYPE and _INFILE_ (automatic variable).
Depending on the value of the REC_TYPE variable, SELECT block toggles the FILE definition between one of the four filerefs, out1, out2, out3, or out4.
Then the PUT statement outputs the _infile_ automatic variable value to the output file defined in the SELECT block.
Splitting a data set into several external files
Similar technique can be used to split a data table into several external raw files. Let’s combine the above two code samples to demonstrate how you can split a data set into several external raw files:
filename outasi 'c:\temp\cars_asia.txt'; filename outeur 'c:\temp\cars_europe.txt'; filename outusa 'c:\temp\cars_usa.txt'; data _null_; set SASHELP.CARS; select(origin); when('Asia') file outasi; when('Europe') file outeur; when('USA') file outusa; end; put _all_; run;
This code will read observations of the SASHELP.CARS data table, and depending on the value of ORIGIN variable, put _all_ will output all the variables (including automatic variables _ERROR_ and _N_) as named values (VARIABLE_NAME=VARIABLE_VALUE pairs) to one of the three external raw files specified by their respective file references (outasi, outeur, or outusa.)
You can modify this code to produce delimited files with full control over which variables and in what order to output. For example, the following code sample produces 3 files with comma-separated values:
data _null_; set SASHELP.CARS; select(origin); when('Asia') file outasi dlm=','; when('Europe') file outeur dlm=','; when('USA') file outusa dlm=','; end; put make model type origin msrp invoice; run;
You may use different delimiters for your output files. In addition, rather than using mutually exclusive SELECT, you may use different logic for re-directing your output to different external files.
Bonus: How to zip your output files as you create them
For those readers who are patient enough to read to this point, here is another tip. As described in this series of blog posts by Chris Hemedinger, in SAS you can read your external raw files directly from zipped files without unzipping them first, as well as write your output raw files directly into zipped files. You just need to specify that in your filename statement. For example:
filename outusa ZIP '/sas/data/temp/cars_usa.txt.gz' GZIP;
filename outusa ZIP 'c:\temp\cars.zip' member='cars_usa.txt';
What is your experience with creating multiple external raw files? Could you please share with the rest of us?