Reading and writing GZIP files with SAS

Remember when 100MB was large?
SAS 9.4 Maintenance 5 includes new support for reading and writing GZIP files directly. GZIP files, usually found with a .gz file extension, are a different format than ZIP files. Although both are forms of compressed files, a GZIP file is usually a compressed copy of a single file, whereas a ZIP file is an "archive" -- a collection of files in a compressed virtual folder. GZIP tools are built into Unix/Linux platforms and are commonly used to save space when storing large text-based files that you're not ready to part with: log files, csv files, and more. The algorithm used to compress GZIP files performs especially well with text files, although you can technically GZIP any file that you want.

I've written extensively about using FILENAME ZIP to read and write ZIP archives with SAS. The latest version of SAS adds support for GZIP by extending the FILENAME ZIP method. When working with GZIP files, simply add the GZIP keyword to the FILENAME statement. Example:

filename my_gz ZIP "path-to-file/compressedfile.txt.gz" GZIP;

Here's an example that creates a compressed version of a log file:

filename source "C:\Logs\SEGuide_log.10168.txt";
filename tozip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile source;
   file tozip ;
   input;
   put _infile_ ;
run;

In my test here, the result represents a significant size difference, with the compressed file occupying just 14% of the space.


To "re-inflate" the compressed file, we can perform the opposite operation. (I added the ENCODING option here because I know my log file was UTF-8 encoded.)

filename target "C:\LogsExpanded\SEGuide_log.10168.txt" encoding='utf-8';
filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile fromzip;
   file target ;
   input;
   put _infile_ ;
run;

You don't have to explicitly expand a compressed text file in order to read it with SAS. You can use the GZIP method to read and parse a .gz file directly, similar to the zcat command that you might be familiar with from the Unix shell:

filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
data logdata;   
   infile fromzip; /* read directly from compressed file */
   input  date : yymmdd10. time : anydttme. ;
   format date date9. time timeampm.;
run;

If your file is in a binary format such as a SAS data set (sas7bdat) or Excel (XLS or XLSX), you probably will need to expand the file completely before reading it as data. These files are read using special drivers that don't process the bytes sequentially, so you need the entire file available on disk.

Note: Because each GZIP file represents just one compressed file, the MEMBER= option doesn't apply. When dealing with ZIP file archives that contain multiple files, you can use the MEMBER= option on FILENAME ZIP to address the specific file that you want. My recent example about FINFO and file details relies heavily on that approach. However, the GZIP and MEMBER= options are mutually exclusive. In that way, GZIP is much simpler, just like its Unix shell equivalent.
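
For contrast, here is a minimal sketch of the MEMBER= approach for a regular ZIP archive. The archive name and member name are hypothetical; substitute your own.

```sas
/* Hypothetical archive and member names */
filename inzip ZIP "C:\Logs\SEGuide_logs.zip" member="SEGuide_log.10168.txt";

data _null_;
   infile inzip;
   input;
   put _infile_;   /* echo each line of the selected member to the SAS log */
run;
```

With a .gz file there is no member to select, so the FILENAME statement takes the GZIP keyword instead of MEMBER=.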


* ZIP drive image by © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons)


About Author

Chris Hemedinger

Director, SAS User Engagement

Chris Hemedinger is the Director of SAS User Engagement, which includes our SAS Communities and SAS User Groups. Since 1993, Chris has worked for SAS as an author, a software developer, an R&D manager and a consultant. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies.

25 Comments

  1. Chris,

    Are other zip programs supported? Such as bz2 (.bz), xz (.xz), compress (.Z)? I could see how this could be useful in ODA since we have users that want to copy a large number of files from their home directory but SAS Studio only allows you to do one at a time. If they zip them first, that would let them grab their data in bulk...back up their EM projects, etc. hmmm, bet that would make a good blog post :-).

    • Chris Hemedinger on

      Hey Galen, standard ZIP file archives are supported with FILENAME ZIP (read and write) and also ODS PACKAGE (write only) -- you can use either of those methods to create a single compressed archive from a collection of files. I've got plenty of blog posts on all of those methods.
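
      A minimal sketch of the ODS PACKAGE approach for bundling several files into one ZIP archive. The package name, file names, and folder are hypothetical placeholders.

      ```sas
      /* Hypothetical file and folder names */
      ods package(bundle) open nopf;
      ods package(bundle) add file="/home/myuser/data/file1.csv";
      ods package(bundle) add file="/home/myuser/data/file2.csv";
      ods package(bundle) publish archive
          properties(archive_name="mydata.zip" archive_path="/home/myuser");
      ods package(bundle) close;
      ```

      The result is a single mydata.zip that can be downloaded in one step.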

    • Chris Hemedinger on

      Portability (works the same on every system, regardless of operating system), and not every SAS environment is set up to allow X command. Most SAS Enterprise Guide and SAS Studio users don't have access to the operating system shell for commands.

  2. Julia Cohen (jcinma) on

    How can I adapt the "filename source" line up above to write a temporary sas7bdat file in my Work library to a permanent sas7bdat.gz file? Thanks!

    • Chris Hemedinger on

      Here's an example that shows a "round trip" -- from data file, to GZ file, then expanded back to a sas7bdat.

      /* get a data set into a library */
      libname lib "c:\temp";
      data lib.cars;
        set sashelp.cars;
      run;
       
      filename raw "%sysfunc(pathname(lib))/cars.sas7bdat";
      filename zipped ZIP "c:\temp\cars.sas7bdat.gz" GZIP;
       
      /* Compress the data set into a GZ file */
      data _null_;
         infile raw 
             lrecl=256 recfm=F length=length eof=eof unbuf;
         file   zipped lrecl=256 recfm=N;
         input;
         put _infile_ $varying256. length;
         return;
       eof:
         stop;
      run;
       
      /* Delete the original uncompressed file */
      data _null_;
       rc=fdelete('raw');
      run;
       
      /* put the file back in the directory, expanded */
      data _null_;
         infile zipped
             lrecl=256 recfm=F length=length eof=eof unbuf;
         file raw lrecl=256 recfm=N;
         input;
         put _infile_ $varying256. length;
         return;
       eof:
         stop;
      run;
  3. Dear Chris, thank you very much for your code. It works fine. However, it failed when I replaced the sas7bdat.gz file with the file I need to work on.
    The URL of the data: http://scholar.rhsmith.umd.edu/sites/default/files/sbrown/files/pins_vdj_ann.sas7bdat.gz?m=1467366849
    My code:
    libname pinlib "E:\data\PIN_Stephen_Brown";
    filename target "%sysfunc(pathname(pinlib))/test.sas7bdat";
    filename fromzip ZIP "E:\data\PIN_Stephen_Brown\test.sas7bdat.gz";
    data _null_;
    infile fromzip
    lrecl=256 recfm=F length=length eof=eof unbuf;
    file target lrecl=256 recfm=N;
    input;
    put _infile_ $varying256. length;
    return;
    eof:
    stop;
    run;
    It reports this error message:
    ERROR: The file "E:\data\PIN_Stephen_Brown\test.sas7bdat.gz" exists and is not a zip
    file. The output file must be a zip file.
    I would very much appreciate your help with this case.

    • Chris Hemedinger on

      Here's a SAS program that breaks the problem down into the necessary steps. First, download the file. Next, assign a FILENAME ZIP (GZIP) fileref to the downloaded file. Then use a DATA step to expand its contents into a data set file in the WORK library.

      /* This gets the data in GZ format */
      filename webdata "%sysfunc(getoption(WORK))/pins_vdj_ann.sas7bdat.gz";
      proc http
       url="http://scholar.rhsmith.umd.edu/sites/default/files/sbrown/files/pins_vdj_ann.sas7bdat.gz?m=1467366849"
       out=webdata
       method="GET";
      run;
      filename webdata clear;
       
      /* This expands the GZ data to a WORK data set */
      filename zipdata ZIP "%sysfunc(getoption(WORK))/pins_vdj_ann.sas7bdat.gz" GZIP;
      filename unzip "%sysfunc(getoption(WORK))/pins_vdj_ann.sas7bdat";
       
      data _null_;
         infile zipdata
             lrecl=256 recfm=F length=length eof=eof unbuf;
         file unzip lrecl=256 recfm=N;
         input;
         put _infile_ $varying256. length;
         return;
       eof:
         stop;
      run;
       
      proc print data=work.pins_vdj_ann (obs=5);
      run;
  4. Chris,
    I have a variation of the gz zip that I am trying to resolve. My gz file contains many .tar files containing xml that I need to process individually. Using the solution shown above I get everything decompressed as a single file instead of the individual files structure that I need. Is there a solution for this?

    • Chris Hemedinger on

      My understanding is that a gz file can contain just one entry -- it's a single file, compressed. But that could be a tar file, which is a "tarball" collection of several files together. So uncompressing these is a two-step process. Use FILENAME ZIP with GZIP to get the tarball (.tar file). Then (and I haven't tried this) you might be able to use FILENAME ZIP (not GZIP) to get to the individual tarball members. If not, you might need to use FILENAME PIPE (if you can) to untar the members.

        • Chris Hemedinger on

          A tar file is a bundling of files, but not compressed. The gzip action compresses this single bundle. So in concept, tar plus gzip is like creating a zip file, which is an archive of compressed files.

          But zip is not the same as tar or tar.gz, so the FILENAME ZIP method can't uncompress these. I think you'll have to use the operating system untar command to do that. I'm not aware of any SAS function or method to perform that step.
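
          A rough sketch of the operating system route using FILENAME PIPE. It assumes a Unix/Linux SAS session that is allowed to run shell commands (the XCMD option), a tar command on the path, and an existing target folder; the paths are hypothetical.

          ```sas
          /* Hypothetical paths; requires XCMD and a Unix-style tar command */
          filename untar PIPE 'tar -xzvf /data/inbox/bundle.tar.gz -C /data/inbox/extracted 2>&1';

          data _null_;
             infile untar;
             input;
             put _infile_;   /* echo each extracted member name (or error text) to the log */
          run;

          filename untar clear;
          ```

          Once the members are extracted, each XML file can be read with an ordinary FILENAME statement.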

  5. Pingback: How to split a raw file or a data set into many external raw files - SAS Users

  6. Hi Chris, I wonder if you have a chance to help me with this problem? Thanks.

    I have hundreds of files that need to be unzipped. The main unzip code in SAS works for me; however, there are two major problems:
    1. Input: right now, a bare 'input;' runs but does not work properly. It only works when you specify the variables, lengths, and formats, which causes a lot of trouble because we have to redefine all of the variables and formats. When dealing with different tables, it's a disaster.

    I'm hoping for a simple statement that handles input universally.

    2. Is there any way to unzip a batch of files? I have hundreds, batch after batch, and I don't want to do them one by one.
    https://communities.sas.com/t5/SAS-Programming/unzip-sas-problems-input-and-unzip-files-by-batch/m-p/669603/highlight/false#M200898

    • Chris Hemedinger on

      Jianlong,

      Regarding the "Input" issue, I've seen other processes manage this by using PROC IMPORT, which does not need to know the data schema ahead of time. Here's an example with the Johns Hopkins COVID data, which has periodically changed schemas over time. (No ZIP files here, just daily files for all COVID cases.)

      SAS doesn't have a single statement that handles a batch of files. You would need to use a macro loop, or perhaps CALL EXECUTE, to iterate over a set of files, unzipping each in turn.
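
      A minimal sketch of the CALL EXECUTE idea, assuming text-based .gz files (this line-by-line copy is not suitable for binary files). The folder names and the gunzip_one macro are hypothetical, and file names that contain commas or parentheses would need extra quoting.

      ```sas
      /* Hypothetical folders; adjust to your environment */
      %let indir  = /data/incoming;
      %let outdir = /data/expanded;

      /* Expand one text-based .gz file, line by line */
      %macro gunzip_one(gzname);
        filename fromzip ZIP "&indir/&gzname" GZIP;
        /* drop the trailing .gz to name the output file */
        filename target "&outdir/%substr(&gzname,1,%eval(%length(&gzname)-3))";
        data _null_;
          infile fromzip;
          file target;
          input;
          put _infile_;
        run;
        filename fromzip clear;
        filename target clear;
      %mend gunzip_one;

      /* List the folder and queue one macro call per .gz file */
      data _null_;
        length fname $256;
        rc  = filename('dirref', "&indir");
        did = dopen('dirref');
        if did > 0 then do;
          do i = 1 to dnum(did);
            fname = dread(did, i);
            if lowcase(scan(fname, -1, '.')) = 'gz' then
              call execute(cats('%nrstr(%gunzip_one(', fname, '))'));
          end;
          rc = dclose(did);
        end;
      run;
      ```

      The %NRSTR wrapper delays each macro call until the listing step has finished, so the generated steps run one after another.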

  7. Chris-
    We are running into some I/O issues which I'm beginning to think could be related to the size of the .GZ files that we are using. We have about 25 of these files, each one is 600-800MB as is.

    When SAS is "opening" the .GZ file, does it create a file in a temporary space on our SAS server?

    We have a macro that is invoked 25 times that opens the GZ, reads in the raw flat file and creates a SAS dataset. It then moves on to the next GZ file. If we were to let this entire process run at one time, would it create 25 temporary files that would only be cleared out when we shut down SAS? Or, does it create one temp file, and then overwrite it when the next file is read in?

    Thanks

    • Chris Hemedinger on

      I suspect it might be creating temp files and not removing them until end of session. To mitigate, be sure to issue a FILENAME CLEAR statement at each iteration of the macro loop when it finishes with a file.
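
      A small sketch of that pattern. The macro name, fileref, paths, and the one-variable INPUT layout are placeholders for whatever your macro really does.

      ```sas
      %macro load_gz(gzpath, outds);
        filename gzin ZIP "&gzpath" GZIP;

        data &outds;
          infile gzin truncover;
          input line $char256.;   /* placeholder layout */
        run;

        /* release the fileref before moving on to the next file */
        filename gzin clear;
      %mend load_gz;

      %load_gz(/data/raw/file01.txt.gz, work.file01)
      ```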

        • Hello Chris,
          Is it possible to read just a single observation from a gzipped SAS dataset?
          In particular, I need to run proc contents on multiple large gzipped datasets and I would like to avoid unzipping them.
          Thanks!

          • Chris Hemedinger on

            A binary file like the SAS data set would need to be completely unzipped/extracted in order to read it. A text file is different; at least logically, you can use INFILE to read it as a stream (though it might "gunzip" behind the scenes).

  8. Hi Chris,

    We receive a gzipped tab-delimited txt file from a Windows server over SFTP onto the Linux server where SAS is installed. We use the SAS EG client on our PCs, which connects to the Linux server. I am able to unzip the file and use INFILE to read it into a SAS dataset. However, when I read it directly from the gzipped file, it gets all of the rows but the values come up blank. Below is the code.

    If I do INFILE from target, it reads the rows with data, but if I do INFILE from fromzip, it reads all of the rows but the values come up blank.

    Any help would be much appreciated!

    filename target "/mydata/inbox/unzipped/mytext.txt";
    filename fromzip ZIP "/mydata/inbox/ mytext.txt.gz" GZIP termstr=crlf ;

    infile fromzip
    LRECL=32767 DLM='09'x MISSOVER DSD TERMSTR=CRLF;

    • Chris Hemedinger on

      So it's reading all of the lines, but they are written as empty to a data set? I recommend using the DATA step debugger in EG to see what you're really reading in, step by step.
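
      If the debugger isn't convenient, another option (a rough sketch, not a diagnosis) is to dump the first few raw records to the log before applying any delimiter options. The path below is assumed from the question, without the embedded space.

      ```sas
      /* Path assumed from the question; adjust as needed */
      filename fromzip ZIP "/mydata/inbox/mytext.txt.gz" GZIP;

      data _null_;
         infile fromzip obs=5 lrecl=32767;
         input;
         reclen = length(_infile_);
         put _n_= reclen=;
         put _infile_ $hex80.;   /* first 40 bytes of each record, in hex */
      run;
      ```

      The record lengths and hex bytes can reveal unexpected content such as null bytes, a byte-order mark, or missing tab (09) delimiters.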
