Why Can't SAS Read My CSV file?

Many SAS users receive data in the form of CSV (Comma Separated Value) files, and need to convert them to SAS data sets. A typical record in a CSV file might look like this:

Jeter,Derek,1995,,234,”22,600,000”

Note the following about the record above:

  • There is no data for the fourth field, so there are two commas between the values for third and fifth fields.
  • Because the value for last field contains embedded commas, it is enclosed in quotes.

If you were to open this file in Microsoft Excel, each field would appear in a separate column, with the column for End_Year being blank, and the column for Salary containing the value 22600000.

Assume you want to read records from this file into a SAS data set and create the variables Last ('Jeter'), First ('Derek'), Start_Year (1995), End_Year (missing), Home_Runs (234), and Salary (22600000).
To create a SAS data set from the file, you might use the following code:

data players;
   infile 'baseball.csv' dlm=',';
   input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.;
run;

In the program above, the DLM= option on the INFILE statement tells SAS that the fields are separated by commas. The INPUT statement uses list input to tell SAS to read the fields from delimiter to delimiter. For the variable Salary, a colon and the comma informat are added to indicate that the number in the field contains commas, which should be ignored when interpreting the value in the field.

If we run the program to create the data set and run a PROC PRINT, we get the following results:

Obs    Last    First    Start_Year  End_Year  Home_Runs    Salary
  1    Jeter   Derek       1995        234        .        600

The values for First, Last and Start_Year are correct, but the values for the remaining variables are not.

What happened? By default SAS handles comma-delimited files differently than other applications, such as Excel. Specifically:

  • Most applications treat two consecutive commas as a blank field, which would cause End_Year to be interpreted as a blank, or missing. However SAS treats two consecutive delimiters (commas in this case) as a single delimiter. This means that SAS reads the value for End_Year from the field containing the value for Home_Runs.
  • By default, SAS treats delimiters (commas) within quotes as delimiters, unlike Excel and other applications. So, when SAS attempts to read the value of Home_Runs, it reads the characters (“22) between the next set of commas, rather than reading everything within quotes. This is invalid numeric data, so Home_Runs is set to missing. Finally, when SAS reads Salary, the value between the next set of commas is 600. This is the end of the INPUT statement, so the last field (000) is not read.

 

So, that’s what happened. What we want SAS to do is to treat the two consecutive commas as a field with a missing value, and to not interpret commas within quotes as delimiters. How do we do this?

The answer is surprisingly easy—simply add the option DSD to the INFILE statement, and remove the DLM= option. DSD stands for “delimiter sensitive data” and tells SAS that the input file is a “standard” comma delimited file, and that SAS should interpret it the same way as Excel and other applications do. Specifically, the DSD option does the following:

  • Sets the default delimiter for the input file to a comma (so the DLM= option isn’t needed)
  • Treats multiple consecutive commas as empty fields
  • Ignores commas within quotes

So, if we modify the program thusly:

data players;
   infile 'baseball.csv' <strong>DSD</strong>;
   input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.;
run;

we get the expected results from PROC PRINT:

Obs    Last    First    Start_Year  End_Year  Home_Runs    Salary
  1    Jeter   Derek       1995        .        234        22600000

The really nice thing about the DSD option is that it will give you correct results whether or not you have consecutive delimiters or commas within quotes. So, it’s a good idea to specify the DSD option on your INFILE statement whenever you are reading a CSV file.

tags: programming, tips and tricks

4 Comments

  1. rasal
    Posted March 1, 2012 at 11:15 am | Permalink

    Even I can't see it now.

    SAS still treat the two consecutive commas as a field with a missing value, and does not interpret commas within quotes as delimiters.

  2. Mike Kalt Mike Kalt
    Posted March 6, 2012 at 8:01 pm | Permalink

    I believe the problem is that when you downloaded the example from my blog, the double quotes in the CSV file were translated to "Smart Quotes", which means that the quote characters in the data have different hexadecimal values than what SAS was expecting. You can correct the problem by editing the CSV file in Windows Notepad (or any other ASCII text editor) and replacing the quote characters around 22,000,000 with the double quote characters from your keyboard. This fixed the problem for me.

    Let me know if that doesn't work.

  3. rasal
    Posted March 7, 2012 at 2:50 am | Permalink

    ya.its worked! Thank you for quick reply and support.

  4. raghu
    Posted November 12, 2013 at 8:11 am | Permalink

    I have a field in csv with two lines of string in a cell. When importing to sas using dsd dlm and missover option. The sas reads only the first line and followed by all the columns with blank values. So it doesn read the 2nd line of the cell.

    Solution plz.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>