SAS programmers often resort to using the X command to list the contents of file directories and to process the contents of ZIP files. In centralized SAS environments, the X command is unavailable to most programmers. NOXCMD is the default setting for these environments (disallowing shell commands), and SAS admins are reluctant to change it.
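If you aren't sure how your own session is configured, a quick check like the following can tell you. This is a minimal sketch: it relies on the GETOPTION function, which returns the current setting of a system option, and prints the result to the log.

```sas
/* Report whether shell commands are allowed in this session.   */
/* GETOPTION returns XCMD when allowed, NOXCMD when disallowed. */
%put NOTE: Shell command setting is %sysfunc(getoption(xcmd));
```

If the log shows NOXCMD, the X command and similar shell-based techniques will not work, and the approach in this article becomes the practical alternative.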
In this article, I'll share a SAS program that can retrieve the contents of a file directory (all of the file names), and then also report on the contents of every ZIP file within that directory -- without using any shell commands. The program uses two lesser-known tricks to retrieve the information:
- The FILENAME statement can be applied to a directory, and then the DOPEN, DNUM, DREAD, and DCLOSE functions can be used to retrieve information about that directory. (Check SAS Note 45805 for a better example of just this - click the Full Code tab.)
- The FILENAME ZIP method (added in SAS 9.4) can retrieve the names of the files within a compressed archive (ZIP files). For more information, see all of my previous articles about the FILENAME ZIP access method.
I wrote the program as a SAS macro so that it's easy to reuse. I also tried to be liberal with the comments, to provide a view into my thinking and maybe some opportunities for improvement.
%macro listzipcontents (targdir=, outlist=);
  filename targdir "&targdir";

  /* Gather all ZIP files in a given folder                */
  /* Searches just one folder, not subfolders              */
  /* for a fancier example see                             */
  /* http://support.sas.com/kb/45/805.html (Full Code tab) */
  data _zipfiles;
    length fid 8;
    fid=dopen('targdir');

    if fid=0 then
      stop;
    memcount=dnum(fid);

    /* Save just the names ending in ZIP*/
    do i=1 to memcount;
      memname=dread(fid,i);
      /* combo of reverse and =: to match ending string */
      /* Looking for *.zip files */
      if (reverse(lowcase(trim(memname))) =: 'piz.') then
        output;
    end;

    rc=dclose(fid);
  run;

  filename targdir clear;

  /* get the memnames into macro vars */
  proc sql noprint;
    select memname into: zname1- from _zipfiles;
    %let zipcount=&sqlobs;
  quit;

  /* for all ZIP files, gather the members */
  %do i = 1 %to &zipcount;
    %put &targdir/&&zname&i;
    filename targzip ZIP "&targdir/&&zname&i";

    data _contents&i.(keep=zip memname);
      length zip $200 memname $200;
      zip="&targdir/&&zname&i";
      fid=dopen("targzip");

      if fid=0 then
        stop;
      memcount=dnum(fid);

      do i=1 to memcount;
        memname=dread(fid,i);
        /* save only full file names, not directory names */
        if (first(reverse(trim(memname))) ^='/') then
          output;
      end;

      rc=dclose(fid);
    run;

    filename targzip clear;
  %end;

  /* Combine the member names into a single data set */
  /* the colon notation matches all files with "_contents" prefix */
  data &outlist.;
    set _contents:;
  run;

  /* cleanup temp files */
  proc datasets lib=work nodetails nolist;
    delete _contents:;
    delete _zipfiles;
  run;

%mend;
Use the macro like this:
%listzipcontents(targdir=c:\temp, outlist=work.allfiles);
Here's an example of the output.
Experience has taught me that savvy SAS programmers will scrutinize my example code and offer improvements. For example, they might notice my creative use of the REVERSE function and "=:" operator to simulate an "ends with" comparison function -- and then suggest something better. If I don't receive at least a few suggestions for improvements, I'll know that no one has read the post. I hope I'm not disappointed!
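To make the REVERSE and "=:" trick a little less mysterious, here is a small standalone sketch. The file names are made up for illustration. Reversing the name puts its ending at the front, and the "=:" operator then compares only as many characters as the shorter operand has, which amounts to a "starts with" test on the reversed string.

```sas
data _null_;
  length memname $40 reversed $40;
  do memname = 'report.zip', 'REPORT.ZIP', 'zipcodes.txt';
    /* TRIM first so trailing blanks do not end up at the front */
    reversed = reverse(lowcase(trim(memname)));
    /* =: compares only the first 4 characters against 'piz.' */
    ends_with_zip = (reversed =: 'piz.');
    put memname= reversed= ends_with_zip=;
  end;
run;
```

The first two names are flagged as ZIP files regardless of case; the third is not, even though "zip" appears elsewhere in its name.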
20 Comments
Just so that you know somebody read your post. Thank you for sharing this code.
Awww, thanks!
just to add a comment, consider the scan function - allowing negative values for 2nd parameter, easily extracts the file type when separated by a dot
Instead of
/* combo of reverse and =: to match ending string */
/* Looking for *.zip and *.gz files */
if (reverse(lowcase(trim(memname))) =: 'piz.') OR
(reverse(lowcase(trim(memname))) =: 'zg.') then
consider what follows - it almost doesn't need the clarifying comment
/* Looking for *.zip and *.gz files */
if lowcase( scan( memname,-1,'.')) in( 'zip', 'gz') then
That -1 directs scanning for the last "word" delimited by third parameter ('.')
brilliant! I knew you'd come through, Peter.
To extract the extension from a filename use the SCAN() function. But watch out for filenames without any periods or file names that start with period (hidden files on Unix).
if index(filename,'.') > 1 then extension=lowcase(scan(filename,-1,'.'));
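Here is a small sketch of how the SCAN approach behaves on those edge cases. The file names are hypothetical test values, not anything from the original post. Without the INDEX guard, SCAN would return the whole name for a file that has no period, and would treat a hidden file's name as its extension.

```sas
data _null_;
  length filename $40 extension $8;
  do filename = 'archive.zip', 'DATA.TAR.GZ', 'README', '.profile';
    extension = ' ';
    /* INDEX > 1 skips names with no period and names that begin with one */
    if index(filename, '.') > 1 then
      extension = lowcase(scan(filename, -1, '.'));
    put filename= extension=;
  end;
run;
```

With the guard in place, only the first two names get an extension (zip and gz); the last two are left blank. One side effect to be aware of: a hidden file that also has a real extension, such as a name like .backup.zip, is skipped by this guard as well.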
Hi, Chris
I would like to thank you first for sharing this blog, it's really wonderful. As you say, I have another more straightforward way to find the .gz and .zip files as shown below:
IF UPCASE(SUBSTR(memname, LENGTH(memname) - 2 )) = ".GZ"
OR UPCASE(SUBSTR(memname, LENGTH(memname) - 3)) = ".ZIP";
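One thing to watch with the SUBSTR approach: if a name is shorter than the suffix being tested, the computed start position is zero or negative and SAS writes an "invalid argument" note to the log. The sketch below adds a length check before each SUBSTR call. The sample names and the is_archive variable are my own, for illustration only.

```sas
data _null_;
  length memname $40;
  do memname = 'logs.gz', 'Archive.ZIP', 'notes.txt', 'gz';
    n = length(memname);
    is_archive = 0;
    /* only call SUBSTR when the start position is valid */
    if n > 3 then
      is_archive = (upcase(substr(memname, n - 2)) = '.GZ');
    if not is_archive and n > 4 then
      is_archive = (upcase(substr(memname, n - 3)) = '.ZIP');
    put memname= is_archive=;
  end;
run;
```

The checks are written as separate IF statements rather than one compound condition, so the result does not depend on the order in which SAS evaluates the parts of an AND expression.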
Hi, Chris
Is it relevant for Windows only? Will it work on Unix?
Works on UNIX too! And in SAS University Edition.
Thank you, thank you, thank you. We are moving to SAS Grid and I really needed the heads up that Admins might set NOXCMD. Anything else I should know?
This is something I do all the time and so I am likely to need this alternative code.
Glad that you found this helpful. Usually people moving to SAS Grid are actually experiencing a number of changes: moving off of PC SAS, perhaps moving SAS from Windows to Linux (resulting in file path differences), having data and files on a remote server instead of local. This comprehensive paper can help to prepare.
This works great with one possible exception. I've found that when my ZIP file is larger than 4GB it doesn't find any members. I get a note if using the DEBUG option in the FILENAME statement stating: "ERROR: The system could not find the environment option that was entered."
Does anyone else find this or have a workaround for reading large ZIP files?
In the initial release of 9.4 there was a limitation: when the archive used with FILENAME ZIP (the compressed ZIP file) exceeded 4GB, the file could not be read. That has been fixed in 9.4 Maint 2.
Ahhhhhhh...I'm running M1. Time to contact my IT group and get the updated release! Thanks, this has been bugging me for months now.
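If you're not sure which maintenance release you are running, the automatic macro variable SYSVLONG reports it. This one-liner simply writes the value to the log; the maintenance level appears in the value as an "M" followed by a number.

```sas
/* Write the full SAS release string, including maintenance level, to the log */
%put NOTE: Running SAS release &sysvlong;
```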
Chris,
Thanks for this post. I've been looking for a way to do this on gzipped datasets for a long time. The only thing I've found prior was this, but it requires writing out the compressed dataset from SAS. You can't do the gzip command from unix command line, which is common practice.
http://support.sas.com/kb/25/214.html
I've tried your macro above, and while it does list the compressed files in the directory, unfortunately, it does not list the contents in the _Contents1 dataset. There are 0 obs. I have seen your Filename ZIP blog post as well. Hopefully, I can get this working. I would love to leave my very large datasets compressed (350GB+ uncompressed, <70GB compressed).
I'm using 9.4 M4. Any thoughts?
Regards,
Cory Vandenberg
I assume you read this post about using FILENAME ZIP to read data from a zipped file. You might be running into trouble with limited space (unzipping a large file will require a large amount of temp space) or some other limitation with the FILENAME ZIP method. You might need to contact SAS Technical Support for some guidance if you can't get it working.
Chris,
Thank you for these very interesting and informative posts. I have been getting mileage out of them since I am working on LSAF. This caught my attention:
if (first(reverse(trim(memname))) ^='/') then
I used
isFolder = prxmatch( "/\/$/" , trim( memname )) > 0 ;
The issue is that zip archives in zip archives are "folders", no? At any rate, I think I need to handle them differently than files for my purposes.
HTH,
Kevin
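For anyone who wants to try the PRXMATCH test in isolation, here is a short sketch with made-up member names. The pattern matches a slash at the very end of the trimmed name. My understanding is that folder entries in a ZIP archive end with a slash while a nested archive is stored as an ordinary member, so a name like nested.zip would come back as a file here; treat that as an assumption and check it against your own archives.

```sas
data _null_;
  length memname $60;
  do memname = 'reports/', 'reports/q1.csv', 'nested.zip';
    /* 1 when the name ends with a slash, 0 otherwise */
    isFolder = prxmatch('/\/$/', trim(memname)) > 0;
    put memname= isFolder=;
  end;
run;
```

If nested archives need separate handling, they could be picked out with one of the extension tests from the earlier comments rather than with the folder test.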
Thanks for sharing!
Hi Chris, Love your work! From a usefulness perspective, you might consider splitting this into two macros. If you make the step that lists the contents of a single zip file its own macro, passing it the arguments targdir and zname, then that macro is usable on a standalone basis for a specific zip file within a specific directory. The second macro is then just a looper macro that follows the very common pattern of looping over a series of instances. It calls the first macro for each new zip file in the loop. One benefit of coding macros this way is that they are easier to develop and test. Develop and test the one that does the unit of work, then develop and test the controller macro that calls the unit-of-work macro. The controller can be tested by simply having it emit the arguments it must pass to the inner unit-of-work macro. If it does that successfully, then when you plug in the unit-of-work macro, they will work in tandem.
Regards
Mike
#EG4EVA
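A rough sketch of that two-macro split might look like the following. The macro names (zipmembers, loopzips) and the zipds= and out= parameters are hypothetical, chosen for illustration. The sketch also assumes that a data set of ZIP file names with a memname column, like the _zipfiles data set built in the original macro, already exists.

```sas
/* Unit of work: list the members of one ZIP file */
%macro zipmembers(targdir=, zname=, out=);
  filename targzip ZIP "&targdir/&zname";
  data &out(keep=zip memname);
    length zip $200 memname $200;
    zip = "&targdir/&zname";
    fid = dopen("targzip");
    if fid = 0 then stop;
    memcount = dnum(fid);
    do i = 1 to memcount;
      memname = dread(fid, i);
      /* keep files only; directory entries end with a slash */
      if first(reverse(trim(memname))) ^= '/' then output;
    end;
    rc = dclose(fid);
  run;
  filename targzip clear;
%mend zipmembers;

/* Controller: loop over the ZIP names and call the unit of work */
%macro loopzips(targdir=, zipds=_zipfiles, outlist=);
  %local i zipcount;
  proc sql noprint;
    select memname into :zname1- from &zipds;
  quit;
  %let zipcount = &sqlobs;
  %do i = 1 %to &zipcount;
    /* emit the arguments, so the loop can be tested on its own */
    %put NOTE: calling zipmembers with targdir=&targdir zname=&&zname&i;
    %zipmembers(targdir=&targdir, zname=&&zname&i, out=_contents&i)
  %end;
  /* combine results only when at least one ZIP file was found */
  %if &zipcount > 0 %then %do;
    data &outlist;
      set _contents:;
    run;
  %end;
%mend loopzips;
```

To test the controller alone, as suggested above, the call to zipmembers can be commented out so that only the %PUT line runs. Cleanup of the temporary _contents data sets is left out here to keep the sketch short.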
Good points! Thanks for the feedback.