Reading an external file to create a SAS data set can either be easy or challenging depending on which tool outside of SAS created the external file. This post covers the two most common ways to read an external file and provides solutions for seven scenarios that you might encounter when reading external data.
Two common methods for reading external files
The two most common ways to read an external file are using the IMPORT procedure or using a DATA step with INFILE and INPUT statements. For more information about reading external files using PROC IMPORT, see my earlier post, Tips for using the IMPORT procedure to read files that contain delimiters.
DATA step with INFILE and INPUT statements
Here is example code to read a comma-delimited file using the DATA step with INFILE and INPUT statements:
data one; infile 'c:\cars.csv' dlm=',' dsd firstobs=2 truncover; input make : $10. model : $20. type $ origin $ releasedt : mmddyy10. drivetrain $ msrp : comma10.; run;
Here are details about the options in the above INFILE statement:
- The DLM= option indicates that the file is comma delimited.
- The DSD option specifies that multiple commas in a row should be treated as a missing value.
- The FIRSTOBS=2 option designates that the first row of data is the header record and should be skipped.
- The TRUNCOVER option indicates that the program should not flow over to the next line to look for missing values that are not found on the current line being read.
Note: This code did not include the LRECL= option because the default record length in SAS® 9.4 is 32,767. My external file did not exceed a record length of 32,767, so I did not need to use that option. For more information about the options available to the INFILE statement, see the Infile Options section in SAS® 9.4 and SAS® Viya® 3.5 Programming Documentation.
A few caveats and clarifications
In the example INPUT statement, I used modified list input and list input based on the value of the variable that I am reading in. If the value is:
- Numeric, I do not have to put anything after the variable name.
- A character eight characters or less, then I use a $ after the variable name to indicate that it is a character variable.
- A character greater than eight characters, I use the colon (:) followed by $w.informat to indicate the maximum width of the value. The colon indicates that the value should be read from the next non-blank value until either the delimiter or the width specified by the informat is reached, whichever comes first.
For the variable RELEASEDT, the value is a date in the form of mm/dd/yyyy, so I use the colon and the MMDDYY10. informat to read in the value.
For the MSRP variable, the value is in the form of $35,345, so I use the colon and the COMMAw. informat to read it in. In both cases, the value can vary in length, so I used the colon and the maximum width for the w. portion of the informat.
Now, let’s talk about some troubleshooting that you might need to do when reading an external file using the DATA step with INFILE and INPUT statements.
Scenario 1: Specifying a tab-delimited file
Issue: How do I specify that I have a tab-delimited file instead of a comma-delimited file?
Solution: Indicating that a file is tab-delimited is different from specifying files that are comma- or pipe- (|) delimited. You cannot type a tab when using the DLM= option, so you must use the hexadecimal representation of the tab. If you are using SAS on Microsoft Windows, UNIX, or a Mac, use the following syntax when reading a tab-delimited file:
Infile ‘host-specific-path’ dlm=’09’x dsd ;
If you are using SAS on MVS or z/OS, use the following syntax when reading a tab-delimited file:
Infile ‘host-specific-path’ dlm=’05’x dsd ;
Scenario 2: Reading in all the records
Issue: The external file that I am reading contains 1,252 records. When I read in the external file using the DATA step with INFILE and INPUT statements, SAS creates a data set with 531 observations. Why didn’t SAS read all the records in the external file?
Solution: This situation is unique to the Windows operating system. Windows inserts three special characters to signal the end of the file. The characters are called an end-of-file (EOF) marker and the hexadecimal representation for the EOF marker is ‘1A’x. Sometimes an EOF marker can appear in the middle of your data instead of at the end of the external file. When SAS encounters the hexadecimal 1A, SAS considers this the end of the file and stops reading the data. You can use the IGNOREDOSEOF option on the INFILE statement or FILENAME statement to indicate that those premature EOF markers in the middle of your data can be ignored. Here is example code:
Infile ‘c:\abc.txt’ ignoredoseof ;
Scenario 3: Reading in files that contain multiple records
Issue: I’m reading in an external file that contains multiple records, but the SAS log shows the following:
NOTE: 1 record was read from the infile 'c:\temp\classfixed.txt'. The minimum record length was 380. The maximum record length was 380. NOTE: The data set WORK.ONE has 1 observations and 5 variables.
How can I read in all my records in the external file?
Solution: From the information in the SAS log, you have fixed records within one long stream of data. To address the issue, you need to use two options in the INFILE statement: RECFM=F and LRECL=xxx where xxx is the exact record length. When you specify these two options, SAS reads the first xxx characters based on the LRECL= value for the first observation. The next xxx characters will be the next observation and so on. In your INPUT statement, you need to list each variable and the number of columns (width) for each variable to ensure that it matches the length specified for the LRECL= value.
Here is an example:
data one; infile ' host-specific-path' recfm=f lrecl=25; input name $10. age 2. class $10. room 3.; run;
If you add up the widths specified after the NAME, AGE, CLASS, and ROOM variables, it equals 25, which is the value used with the LRECL= option.
Scenario 4: Preventing truncated values
Issue: I’m reading an external file with a long record length, which is causing some of my data values to be truncated. What do the following messages in the SAS log mean?
The minimum record length was 106. The maximum record length was 32767. One or more lines were truncated.
Solution: These notes indicate that some of the records in your external file exceed the default LRECL= value of 32,767. As a result, any data values past column 32,767 are not read in.
Add the LRECL= option to the INFILE statement and use a value greater than 32,767. Here is example code:
Infile ‘host-specific-path’ lrecl=50000 ;
Beginning in SAS 9.4, the default LRECL= value is 32,767. The LRECL= value can range from 1 to 1,073,741,823 (1 gigabyte) on Windows and UNIX.
Scenario 5: Preventing flowover
Issue: What does the following note in the SAS log mean?
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
Solution: This note indicates that SAS is flowing over to the next line to look for values not found on the current line being read. This is the default behavior when SAS is not able to find the values being read on the current record. To prevent a flowover, use the TRUNCOVER option in the INFILE statement, as shown below:
Infile ‘host-specific-path’ truncover ;
Scenario 6: Avoiding missing data
Issue: The last variable in my external file is numeric. When SAS reads in the variable, it is missing for every observation. Here is what I see in the SAS log:
NOTE: Invalid data for d in line 1 7-8. RULE: ----+----1----+----2----+----3 1 CHAR 1,2,3,4. 8 ZONE 32323230 NUMR 1C2C3C4D a=1 b=2 c=3 d=. _ERROR_=1 _N_=1
How do I read in the data correctly?
Solution: The most common cause for this situation is reading a Windows file on a UNIX operating system. A file that is created on Windows operating system has a Carriage Return/Line Feed (CR/LF) at the end of each record. A file that is created on UNIX operating system has a Line Feed (LF) at the end of each record. Therefore, a file that is created on Windows and transferred over to UNIX still has a CR/LF at the end of each record.
The Carriage Return (CR) becomes part of the value of the variable. Since the variable was being read in as numeric, you receive the above "Invalid data" message in the SAS log. In the SAS log above the message, the ruler, and the CHAR/ZONE/NUMR group show a period at the end of your number. Below the period in the ZONE line is a 0; below that in the NUMR line is a D. These are elements of ‘0D’x (the hexadecimal representation of a CR), but SAS is reading the hexadecimal value as actual data.
To address the issue, add the TERMSTR= option to the INFILE statement. Specifying TERMSTR=CRLF, as shown below, indicates that the file has the Windows End-of-Record (EOR) markers instead of UNIX EOR markers:
Infile 'host-specific-path' termstr=crlf ;
The TERMSTR= option is valid in the INFILE statement or the FILENAME statement.
Beginning in SAS 9.4, TERMSTR= is set to CRLF by default when running on UNIX. For more information, see SAS Note 66323, "A UNIX operating system can now read a Windows file without specifying the TERMSTR= option."
Scenario 7: Getting the correct number of observations
Issue: The comma-delimited file that I read in shows more observations in the SAS data set than the number of records in the external file. Also, some of my variable values in a new observation have values for variables from the previous observation.
Solution: You could have a quoted string for one of your variable values. The quoted string contains the EOR marker, such as a CR/LF or a LF, in the middle of the string as part of a data value. The hexadecimal representation for CR/LF is '0D0A'x. The hexadecimal representation for LF is '0A'x. SAS considers CR/LF or LF to represent the end of a record. Therefore, you end up with more observations and/or many "Invalid data" messages in the SAS log.
For example, sometimes a long comment field stops in the middle and continues to the next record in a CSV file. This problem occurs mostly in files that are created from Microsoft Excel. In Excel, columns contain soft returns to help with formatting. When you save the worksheet as a CSV file, the soft returns appear to SAS as EOR markers. This, in turn, causes the records to be split incorrectly.
If your data value that contains the EOR marker is enclosed in quotation marks, you can use the sample code in SAS Note 26065 to remove the erroneous CR/LF or LF that is part of the quoted value. After you use the code in the SAS Note to pre-process the external file and remove the erroneous CR/LF or LF, then you should be able to read the external file and return the correct number of observations.
If your data value that contains the EOR marker is NOT enclosed in quotation marks, you need to use a tool outside of SAS to pre-process the external file to remove the erroneous CR/LF or LF characters. In this situation, SAS cannot determine what is a true EOR marker versus an erroneous EOR marker.
The best way to look at the CSV file to determine whether your data values that contain the EOR marker are quoted is to use a tool like Notepad or TextPad. Excel does not show the data in a format that displays the erroneous EOR marker.
If you encounter an issue when reading in an external file, examine the SAS log for any notes, errors, or warnings. Hopefully, one of the scenarios above matches what you are encountering. If not, Technical Support is here to help—just give us a call or send us an email! If you send an email, include your site number, SAS Release including maintenance level, operating system, a brief description of the problem, your SAS log as an attachment from running the code in a new SAS session, and a sample of the external file (if possible) as an attachment.CONTACT TECH SUPPORT | CHOOSE A CHANNEL
WANT MORE GREAT INSIGHTS MONTHLY? | SUBSCRIBE TO THE SAS LEARNING REPORT