Many SAS users receive data in the form of CSV (Comma Separated Value) files, and need to convert them to SAS data sets. A typical record in a CSV file might look like this:
Note the following about the record above:
- There is no data for the fourth field, so there are two commas between the values for third and fifth fields.
- Because the value for last field contains embedded commas, it is enclosed in quotes.
If you were to open this file in Microsoft Excel, each field would appear in a separate column, with the column for End_Year being blank, and the column for Salary containing the value 22600000.
Assume you want to read records from this file into a SAS data set and create the variables Last ('Jeter'), First ('Derek'), Start_Year (1995), End_Year (missing), Home_Runs (234), and Salary (22600000).
To create a SAS data set from the file, you might use the following code:
data players; infile 'baseball.csv' dlm=','; input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.; run;
In the program above, the DLM= option on the INFILE statement tells SAS that the fields are separated by commas. The INPUT statement uses list input to tell SAS to read the fields from delimiter to delimiter. For the variable Salary, a colon and the comma informat are added to indicate that the number in the field contains commas, which should be ignored when interpreting the value in the field.
If we run the program to create the data set and run a PROC PRINT, we get the following results:
Obs Last First Start_Year End_Year Home_Runs Salary 1 Jeter Derek 1995 234 . 600
The values for First, Last and Start_Year are correct, but the values for the remaining variables are not.
What happened? By default SAS handles comma-delimited files differently than other applications, such as Excel. Specifically:
- Most applications treat two consecutive commas as a blank field, which would cause End_Year to be interpreted as a blank, or missing. However SAS treats two consecutive delimiters (commas in this case) as a single delimiter. This means that SAS reads the value for End_Year from the field containing the value for Home_Runs.
- By default, SAS treats delimiters (commas) within quotes as delimiters, unlike Excel and other applications. So, when SAS attempts to read the value of Home_Runs, it reads the characters (“22) between the next set of commas, rather than reading everything within quotes. This is invalid numeric data, so Home_Runs is set to missing. Finally, when SAS reads Salary, the value between the next set of commas is 600. This is the end of the INPUT statement, so the last field (000) is not read.
So, that’s what happened. What we want SAS to do is to treat the two consecutive commas as a field with a missing value, and to not interpret commas within quotes as delimiters. How do we do this?
The answer is surprisingly easy—simply add the option DSD to the INFILE statement, and remove the DLM= option. DSD stands for “delimiter sensitive data” and tells SAS that the input file is a “standard” comma delimited file, and that SAS should interpret it the same way as Excel and other applications do. Specifically, the DSD option does the following:
- Sets the default delimiter for the input file to a comma (so the DLM= option isn’t needed)
- Treats multiple consecutive commas as empty fields
- Ignores commas within quotes
So, if we modify the program thusly:
data players; infile 'baseball.csv' <strong>DSD</strong>; input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.; run;
we get the expected results from PROC PRINT:
Obs Last First Start_Year End_Year Home_Runs Salary 1 Jeter Derek 1995 . 234 22600000
The really nice thing about the DSD option is that it will give you correct results whether or not you have consecutive delimiters or commas within quotes. So, it’s a good idea to specify the DSD option on your INFILE statement whenever you are reading a CSV file.