Reading and writing GZIP files with SAS

7

Remember when 100MB was large?

SAS 9.4 Maintenance 5 includes new support for reading and writing GZIP files directly. GZIP files, usually found with a .gz file extension, are a different format than ZIP files. Although both are forms of compressed files, a GZIP file is usually a compressed copy of a single file, whereas a ZIP file is an "archive" -- a collection of files in a compressed virtual folder. GZIP tools are built into Unix/Linux platforms and are commonly used to save space when storing large text-based files that you're not ready to part with: log files, csv files, and more. The algorithm used to compress GZIP files performs especially well with text files, although you can technically GZIP any file that you want.

I've written extensively about using FILENAME ZIP to read and write ZIP archives with SAS. The latest version of SAS adds support for GZIP by extending the FILENAME ZIP method. When working with GZIP files, simply add the GZIP keyword to the FILENAME statement. Example:

filename my_gz ZIP "path-to-file/compressedfile.txt.gz" GZIP;

Here's an example that creates a compressed version of a log file:

filename source "C:\Logs\SEGuide_log.10168.txt";
filename tozip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile source;
   file tozip ;
   input;
   put _infile_ ;
run;

In my test here, the result represents a significant size difference, with the compressed file occupying just 14% of the space.


To "re-inflate" the compressed file, we can perform the opposite operation. (I added the ENCODING option here because I know my log file was UTF-8 encoded.)

filename target "C:\LogsExpanded\SEGuide_log.10168.txt" encoding='utf-8';
filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile fromzip;
   file target ;
   input;
   put _infile_ ;
run;

You don't have to explicitly expand a compressed text file in order to read it with SAS. You can use the GZIP method to read and parse a .gz file directly, similar to the zcat command that you might be familiar with from the Unix shell:

filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
data logdata;   
   infile fromzip; /* read directly from compressed file */
   input  date : yymmdd10. time : anydttme. ;
   format date date9. time timeampm.;
run;

If your file is in a binary format such as a SAS data set (sas7bdat) or Excel (XLS or XLSX), you probably will need to expand the file completely before reading it as data. These files are read using special drivers that don't process the bytes sequentially, so you need the entire file available on disk.

Note: Because each GZIP file represents just one compressed file, the MEMBER= option doesn't apply. When dealing with ZIP file archives that contain multiple files, you could use the MEMBER= option on FILENAME ZIP to address a specific file that you want. My recent example about FINFO and file details relies heavily on that approach. However, the GZIP option and MEMBER= options are mutually exclusive. In that way, it's much simpler...just like its Unix shell equivalent.


* ZIP drive image By © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons), CC BY-SA 4.0, Link  

Share

About Author

Chris Hemedinger

Senior Manager, SAS Online Communities

+Chris Hemedinger is the manager of SAS Online Communities. He’s also co-author of the popular SAS for Dummies book, author of Custom Tasks for SAS Enterprise Guide using Microsoft .NET, and a frequent participant on the SAS Enterprise Guide discussion forum.

Related Posts

7 Comments

  1. Chris,

    Are other zip programs supported? Such as bz2 (.bz), xz (.xz), compress (.Z)? I could see how this could be useful in ODA since we have users that want to copy a large number of files from their home directory but SAS Studio only allows you to do one at a time. If they zip them first, that would let them grab their data in bulk...back up their EM projects, etc. hmmm, bet that would make a good blog post :-).

    • Chris Hemedinger
      Chris Hemedinger on

      Hey Galen, standard ZIP file archives are supported with FILENAME ZIP (read and write) and also ODS PACKAGE (write only) -- you can use either of those methods to create a single compressed archive from a collection of files. I've got plenty of blog posts on all of those methods.

    • Chris Hemedinger
      Chris Hemedinger on

      Portability (works the same on every system, regardless of operating system), and not every SAS environment is set up to allow X command. Most SAS Enterprise Guide and SAS Studio users don't have access to the operating system shell for commands.

Leave A Reply

Back to Top