Old reliable: DATA step tips and tricks

The DATA step has been around for many years and regardless of how many new SAS® products and solutions are released, the DATA step remains a popular way to create and manipulate SAS data sets. The SAS language contains a wide variety of approaches that provide an endless number of ways to accomplish a goal. Whether you are reshaping a data set entirely or simply assigning values to a new variable, there are numerous tips and tricks that you can use to save time and keystrokes. Here are a few that Technical Support offers to customers on a regular basis.

Writing file contents to the SAS® log

Perhaps you are reading a file and seeing “invalid data” messages in the log or you are not sure which delimiter is used between variables. You can use the null INPUT statement to read a small sample of the data. Specify the amount of data that you want to read by adjusting options in the INFILE statement. This example reads one record with 500 bytes:

   data _null_;
      infile 'path' recfm=f lrecl=500 obs=1;
      input;
      list;
   run;

The LIST statement writes the current record to the log. In addition, it includes a ruled line above the record so that it is easier to see the data for each column of the file. If the data contains at least one unprintable character, you will see three lines of output for each record written.

For example:

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+
 
1   CHAR  this is a test.123.dog.. 24
    ZONE  766726726276770333066600
    NUMR  48930930104534912394F7DA

The line beginning with CHAR describes the character that is represented by the two hexadecimal characters underneath. When you see a period in that line, it might actually be a period, but it often means that the hexadecimal characters do not map to a specific key on the keyboard. Also, column 15 for this data is a tab ('09'x). By looking at this record in the log, you can see that the first variable contains spaces and that the file is tab delimited.

Writing hexadecimal output for all input

Since SAS® 9.4M6 (TS1M6), the HEXLISTALL option in the LIST statement enables all lines of input data to be written in hexadecimal format to the SAS log, regardless of whether there are unprintable characters in the data:

   data mylib.new / hexlistall;

Removing “invalid data” messages from the log

When SAS reads data into a data set, it is common to encounter values that are invalid based on the variable type or the informat that is specified for a variable. You should examine the SAS log to assess whether the informat is incorrect or if there are some records that are actually invalid. After you ensure that the data is being read correctly, you can dismiss any “invalid data” messages by using double question mark (??) modifiers after the variable name in question:

   input date ?? mmddyy10.;

You can also use the question mark modifiers when you convert a character variable to numeric with the INPUT function:

   newdate=input(olddate,?? mmddyy10.);

The above syntax reads all values with the MMDDYY10. informat and then dismisses the notes to the log when some values for the OLDDATE variable are invalid.

Sharing files between UNIX and Microsoft Windows operating systems

The end-of-record markers in text files are different for UNIX and Windows operating systems. If you needed to share a file between these systems, the file used to need preprocessing in order to change the markers to the desired type or perhaps specify a delimiter. Now, the TERMSTR= option in the INFILE statement enables you to specify the end-of-record marker in the incoming file.

If you are working in a UNIX environment and you need to read a file that was created in a Windows environment, use the TERMSTR=CRLF option:

   infile 'file-specification'  termstr=crlf ;

If you are in a Windows environment and you need to read a file that was created in a UNIX environment, use this syntax:

   infile 'file-specification'  termstr=lf ;

Adapting array values from an ARRAY statement

The VNAME function makes it very convenient to use the variable names from an ARRAY statement. You most often use an ARRAY statement to obtain the values from numerous variables. In this example, the SQL procedure makes it easy to store the unique values of the variable Product into the macro variable &VARLIST and that number of values into the macro variable &CT (another easy tip). Within the DO loop, you obtain the names of the variables from the array, and those values from the array then become variable names.

   proc sql noprint;
      select distinct(product) into :varlist separated by ' '
      from one;
      select count(distinct product) into :ct
      from one;
   quit;
 
   …more DATA step statements…
   array myvar(&ct) $ &varlist;
      do i=1 to &ct;
         if product=vname(myvar(i)) then do;
         myvar(i)=left(put(contract,8.));
   end;
   …more DATA step statements…

Using a mathematical equation in a DO loop

A typical DO loop has a beginning and an end, both represented by integers. Did you know you can use an equation to process data more easily? Suppose that you want to process every four observations as one unit. You can run code similar to the following:

   …more DATA step statements…
   J+1;
   do i=((j*4)-3) to (j*4);
      set data-set-name point=I;
      …more DATA step statements…
   end;

Using an equation to point to an array element

With small data sets, you might want to put all values for all observations into a single observation. Suppose that you have a data set with four variables and six observations. You can create an array to hold the existing variables and also create an array for the new variables. Here is partial code to illustrate how the equation dynamically points to the correct new variable name in the array:

   array old(4) a b c d;
   array test (24) var1-var24;
   retain var1-var24;
   do i=1 to nobs;
      set w nobs=nobs point=i;
      do j=1 to 4;
      test(((i-1)*4)+j)=old(j);
      end;
   end;

Creating a hexadecimal reference chart

How often have you wanted to know the hexadecimal equivalent for a character or vice versa? Sure, you can look up a reference chart online, but you can also create one with a short program. The BYTE function returns values in your computer’s character set. The PUT statement writes the decimal value, hexadecimal value, and equivalent character to the SAS log:

   data a;
      do i=0 to 255;
      x=byte(i);
      put i=z3. i=hex2. x=;
   end;
   run;

When the resulting character is unprintable, such as a carriage return and line feed (CRLF) character, the X value is blank.

Conclusion

Hopefully, these new tips will be useful in your future programming. If you know some additional tips, please comment them below so that readers of this blog can add even more DATA step tips to their arsenal! If you would like to learn more about DATA step, check out these other blogs:

Blogs