Lengths and formats: the long and short of it

9

A reader from Bejing commented on a recent post with a question about data lengths and formats. While that wasn't really related to my post, I thought I'd attempt to answer in a new entry, here.

The question is basically this: when I combine two data sets with a common-named column, why does the resulting data set seem to cut the length short on the shared column?

Let's start with some definitions related to columns.

Length: The column length, in SAS terms, is the amount of storage allocated in the data set to hold the column values. The length is specified in bytes. For numeric columns, the valid lengths are usually 3 through 8. The longer the length, the greater the precision allowed within the column values. For character columns, the length can be 1 through 32767. For single-byte data values, that equates to the number of characters the column can hold. For multibyte data values (DBCS, Unicode, or UTF-8), where a character can occupy more than one byte, the number of characters that fit might be less than the length value of the column.

Format: The column format, in SAS terms, is a basically an instruction for how to transform a raw value into an appearance that is suitable for a given purpose. A basic attribute of a format is the format length, which controls how much of the value is displayed. For example, a character column might have a storage length of 10 bytes, but a format length of 5 characters ($5. format), so when you see the formatted values you will see at most 5 characters for each record.

Another attribute of the format is the precision. For example, the DOLLAR8. format will show you up to 8 characters of a value (including a currency symbol and thousands separator) in dollars, but no cents. DOLLAR8.2 will show you the decimal point and the value to the nearest cent (2 decimal spaces). In each case, the value displayed will not exceed 8 characters.

What happens if the formatted value runs into trouble squeezing into 8 characters? SAS drops the thousands separator and currency symbol to free up 2 slots first. If that's not enough, SAS begins to drop precision by lopping off the pennies, then the dimes. If that's still not enough, SAS rounds the value and uses scientific notation and other tricks to save space -- all the while keeping the integrity of your original value for use in any calculations.

For example, run this program:

data dollars;
 length val1 8 val2 8 val3 8 val4 8;
 format val1 DOLLAR8. val2 DOLLAR8.2 val3 DOLLAR4.2 val4 DOLLAR10.2;
 val1=12345.67;
 val2=12345.67;
 val3=12345.67;
 val4=12345.67;
 
run;
proc print;
run;

You see these values:

val1    $12,346
val2    12345.67
val3    12E3
val4    $12,345.67

In the case where you are combining two data sources that have a common column, the results depend on the sequence of things. For example, consider this SAS program:

If you SET the table with the short length and format first, the resulting table assumes the shorter format -- and thus you "lose" precision. But your SAS log warns you:

WARNING: Multiple lengths were specified for the variable name by input data set(s). This may cause truncation of data.

data long;
  length name $ 8;
  format name $8.;
  infile datalines dsd;
  input name;
datalines4;
aaaaaaaa
bbbbbbbb
cccccccc
dddddddd
;;;;
run; 
 
data short;
  length name $ 4;
  format name $4.;
  infile datalines dsd;
  input name;
datalines4;
eeee
ffff
gggg
hhhh
;;;;
run;
 
data longfirst;
  set long short;
run;
proc contents; run;
proc print; run;
 
data shortfirst;
  set short long;
run;
proc contents; run;
proc print; run;

If you SET the table with the longer length and format first, then the result has the longer format. No problems or warnings with that, because you can always fit the smaller values into a larger space, just not vice versa.

SAS formats are extremely powerful. They allow you to reuse data values for all sorts of reporting and categorization purposes without having to actually change the underlying values. My simple explanations here only scrape the surface of the flexibility and nuances of SAS formats. For more information, you might check out The Power of PROC FORMAT or The Little SAS Book.

Share

About Author

Chris Hemedinger

Senior Manager, SAS Online Communities

+Chris Hemedinger is the manager of SAS Online Communities. He's also co-author of the popular SAS for Dummies book, author of Custom Tasks for SAS Enterprise Guide using Microsoft .NET, and a frequent participant on the SAS Enterprise Guide discussion forum.

Related Posts

9 Comments

  1. Pingback: Do me a favor: stop doing me favors - The SAS Dummy

  2. Pingback: Creating a SAS format from a data set - The SAS Dummy

  3. Pingback: Greatest hits from 2009 - The SAS Dummy

  4. Pingback: Do you mind if we dance with your DATEs (or DATETIMEs)? - The SAS Dummy

  5. Pingback: Numerical precision in SAS

  6. Pingback: Myths about 64-bit computing on Windows - The SAS Dummy

  7. Your example does not cover how the FORMAT is determined when one or more dataset has a format attached and others do not. SAS will use the first non empty FORMAT for the variable. This can result in the confusing situation where the length of the variable is long enough to store the data, but the format is short. Normal tools for displaying the data will make it look truncated.

Leave A Reply

Back to Top