Why Can't SAS Read My CSV file?

7

Many SAS users receive data in the form of CSV (Comma Separated Value) files, and need to convert them to SAS data sets. A typical record in a CSV file might look like this:

Jeter,Derek,1995,,234,”22,600,000”

Note the following about the record above:

  • There is no data for the fourth field, so there are two commas between the values for third and fifth fields.
  • Because the value for last field contains embedded commas, it is enclosed in quotes.

If you were to open this file in Microsoft Excel, each field would appear in a separate column, with the column for End_Year being blank, and the column for Salary containing the value 22600000.

Assume you want to read records from this file into a SAS data set and create the variables Last ('Jeter'), First ('Derek'), Start_Year (1995), End_Year (missing), Home_Runs (234), and Salary (22600000).
To create a SAS data set from the file, you might use the following code:

data players;
   infile 'baseball.csv' dlm=',';
   input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.;
run;

In the program above, the DLM= option on the INFILE statement tells SAS that the fields are separated by commas. The INPUT statement uses list input to tell SAS to read the fields from delimiter to delimiter. For the variable Salary, a colon and the comma informat are added to indicate that the number in the field contains commas, which should be ignored when interpreting the value in the field.

If we run the program to create the data set and run a PROC PRINT, we get the following results:

Obs    Last    First    Start_Year  End_Year  Home_Runs    Salary
  1    Jeter   Derek       1995        234        .        600

The values for First, Last and Start_Year are correct, but the values for the remaining variables are not.

What happened? By default SAS handles comma-delimited files differently than other applications, such as Excel. Specifically:

  • Most applications treat two consecutive commas as a blank field, which would cause End_Year to be interpreted as a blank, or missing. However SAS treats two consecutive delimiters (commas in this case) as a single delimiter. This means that SAS reads the value for End_Year from the field containing the value for Home_Runs.
  • By default, SAS treats delimiters (commas) within quotes as delimiters, unlike Excel and other applications. So, when SAS attempts to read the value of Home_Runs, it reads the characters (“22) between the next set of commas, rather than reading everything within quotes. This is invalid numeric data, so Home_Runs is set to missing. Finally, when SAS reads Salary, the value between the next set of commas is 600. This is the end of the INPUT statement, so the last field (000) is not read.

 

So, that’s what happened. What we want SAS to do is to treat the two consecutive commas as a field with a missing value, and to not interpret commas within quotes as delimiters. How do we do this?

The answer is surprisingly easy—simply add the option DSD to the INFILE statement, and remove the DLM= option. DSD stands for “delimiter sensitive data” and tells SAS that the input file is a “standard” comma delimited file, and that SAS should interpret it the same way as Excel and other applications do. Specifically, the DSD option does the following:

  • Sets the default delimiter for the input file to a comma (so the DLM= option isn’t needed)
  • Treats multiple consecutive commas as empty fields
  • Ignores commas within quotes

So, if we modify the program thusly:

data players;
   infile 'baseball.csv' <strong>DSD</strong>;
   input Last $ First $ Start_Year End_Year Home_Runs Salary : comma.;
run;

we get the expected results from PROC PRINT:

Obs    Last    First    Start_Year  End_Year  Home_Runs    Salary
  1    Jeter   Derek       1995        .        234        22600000

The really nice thing about the DSD option is that it will give you correct results whether or not you have consecutive delimiters or commas within quotes. So, it’s a good idea to specify the DSD option on your INFILE statement whenever you are reading a CSV file.

Share

About Author

Mike Kalt

Technical Training Specialist

Mike Kalt is an instructor and course developer at SAS Institute. His most recent course development project is "Producing Maps with SAS/GRAPH", which he highly recommends.

Related Posts

7 Comments

  1. Chandan Kumar Rai on

    When I am trying to access csv file it give the below error. I am using University Edition. Pls help

    ERROR: Physical file does not exist, /opt/sasinside/SASConfig/Lev1/SASApp/D:\folders\myfolders\prg1\sales3inv.csv.
    NOTE: The SAS System stopped processing this step because of errors.
    WARNING: The data set WORK.SALES may be incomplete. When this step was stopped there were 0 observations and 6 variables.
    NOTE: DATA statement used (Total process time):
    real time 0.02 seconds
    cpu time 0.02 seconds

  2. I am having some problem: The SAS could not read two variables from the data set (.dat file). I have enclosed one dimension of the data set below:
    Faculty Charles Porter charles.porter@csu.edu 555-8511 AERO
    out of 6 variables above(Group FirstName LastName EmailAddress PhoneNumber and Dept)
    EmailAddress and PhoneNumber donot show up in the output results. Under the EmailAddress column- output show like this:email:ch and PhoneNumber is blank.
    Please let me know what I have modify in the code to read these two variables correctly.
    Thank you,
    Sahib

  3. Very clear explaination!

    I have the same problem and more. I need to read many such csv files and I am lazy. I'd like to put a Proc Import in a macro so I don't need to define the varaibles. Is htere any trick or opitons to use?

  4. I have a field in csv with two lines of string in a cell. When importing to sas using dsd dlm and missover option. The sas reads only the first line and followed by all the columns with blank values. So it doesn read the 2nd line of the cell.

    Solution plz.

  5. Mike Kalt

    I believe the problem is that when you downloaded the example from my blog, the double quotes in the CSV file were translated to "Smart Quotes", which means that the quote characters in the data have different hexadecimal values than what SAS was expecting. You can correct the problem by editing the CSV file in Windows Notepad (or any other ASCII text editor) and replacing the quote characters around 22,000,000 with the double quote characters from your keyboard. This fixed the problem for me.

    Let me know if that doesn't work.

  6. Even I can't see it now.

    SAS still treat the two consecutive commas as a field with a missing value, and does not interpret commas within quotes as delimiters.

Back to Top