When reading a text file (common extensions: TXT, DAT; or, for the adventurous: HTML) with the DATA STEP, you should always view several lines from the text file, and compare to the record layout, before completing the INPUT statement. There are many ways to view a text file. I use the DATA STEP as a file viewer, which is convenient, since I use the DATA STEP to read the text file anyway.
Advantages of using the DATA STEP as a text file viewer:
- Platform independence: Windows, Unix, or Z/os
- Hex code for unprintable characters
- Column ruler
It's super easy! Check out this SAS log:
95 data _null_; 96 infile 'c:\myfile1.txt' obs=10; 97 input; 98 list; 99 run; NOTE: The infile 'c:\myfile1.txt' is: Filename=c:\myfile1.txt, RECFM=V,LRECL=32767,File Size (bytes)=19, Last Modified=27May2015:14:30:25, Create Time=27May2015:11:18:52 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7 1 1,a 3 2 234,bcd 7 3 56,ef 5 NOTE: 3 records were read from the infile 'c:\myfile1.txt'. The minimum record length was 3. The maximum record length was 7. |
The above file has two fields, comma delimited. There are only 3 records in this silly file, but the OBS=10 INFILE statement option gives me a quick peek at any size file. The null DATA statement does not create a SAS dataset. The null INPUT statement does not create variables. The LIST statement copies each record into the log. The numbers 3, 7, and 5 following each record represent the record lengths.
Now let's try a similar, but different, file. Check out the new SAS log:
100 data _null_; 101 infile 'c:\myfile2.txt' obs=10; 102 input; 103 list; 104 run; NOTE: The infile 'c:\myfile2.txt' is: Filename=c:\myfile2.txt, RECFM=V,LRECL=32767,File Size (bytes)=19, Last Modified=27May2015:14:31:03, Create Time=27May2015:11:15:55 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7 1 CHAR 1.a 3 ZONE 306 NUMR 191 2 CHAR 234.bcd 7 ZONE 3330666 NUMR 2349234 3 CHAR 56.ef 5 ZONE 33066 NUMR 56956 NOTE: 3 records were read from the infile 'c:\myfile2.txt'. The minimum record length was 3. The maximum record length was 7. |
Each record from the text file is now represented by 3 lines in the log: CHAR, ZONE, NUMR. CHAR represents the actual characters, identical to the prior example, except for the delimiter. ZONE and NUMR represent the two halves of the ASCII hexadecimal code. Why did hex codes suddenly appear? And why should I care about hex codes? Before answering these questions, notice the different delimiter. It's a period, instead of a comma, right? Wrong! This is a tab-delimited file. The period is a placeholder for a tab character, which is unprintable. How can I tell? Notice ZONE and NUMR are 0 and 9.
Google "ascii table" or visit this page and you will find something like the partial table below.
Whenever an unprintable character appears in a file, the LIST statement automatically displays hex codes, as shown in the partial ASCII table above.
How do you read a tab-delimted file? After viewing the file, make these five changes:
- Change the DATA statement to specify a dataset name.
- Add DLM='09'x to the INFILE statement.
- Add variables to the INPUT statement.
- Remove the LIST statement.
- After testing with the above changes, remove OBS=10 from the INFILE statement.
Your final DATA STEP will look something like this:
data mySASfile2; infile 'c:\myfile2.txt' dlm='09'x; input Number Text: $3.; run; |
Watch the SAS Training Post blog for more tips and tricks!
3 Comments
Thanks for sharing some handy debugging tips... Hadn't seen the list statement before as I've used the put or putlog statement instead. I like the comparison table in the documentation of list and put statements:
http://support.sas.com/documentation/cdl/en/lestmtsref/67407/HTML/default/viewer.htm#n0drxaqb7f0w0tn18viaj8dtjwud.htm
Will keep this statement in the memory banks ;-)
Thanks, Michelle, for the helpful link. Additional details: The PUT statement writes from the PDV, while the LIST statement writes from the input buffer, placing a snapshot of the input buffer into the log. The LIST statement therefore writes pure text, before it is brought into the PDV, while the PUT statement writes data following interpretation by the informat and placement into the PDV. Both are useful!
Thanks for the differentiation on the PUT and LIST statement. I will certainly keep that in mind! I have used the hex method to "view" unprintable characters from mainframe flat files moved to my PC. In fact, I typically move my mainframe data using the binary FTP option and use the SAS PC capabilities to read EBCDIC. Or read on the mainframe and use Proc Export and Import.