Lengths and formats: the long and short of it

A reader from Bejing commented on a recent post with a question about data lengths and formats. While that wasn't really related to my post, I thought I'd attempt to answer in a new entry, here.

The question is basically this: when I combine two data sets with a common-named column, why does the resulting data set seem to cut the length short on the shared column?

Let's start with some definitions related to columns.

Length: The column length, in SAS terms, is the amount of storage allocated in the data set to hold the column values. The length is specified in bytes. For numeric columns, the valid lengths are usually 3 through 8. The longer the length, the greater the precision allowed within the column values. For character columns, the length can be 1 through 32767. For single-byte data values, that equates to the number of characters the column can hold. For multibyte data values (DBCS, Unicode, or UTF-8), where a character can occupy more than one byte, the number of characters that fit might be less than the length value of the column.

Format: The column format, in SAS terms, is a basically an instruction for how to transform a raw value into an appearance that is suitable for a given purpose. A basic attribute of a format is the format length, which controls how much of the value is displayed. For example, a character column might have a storage length of 10 bytes, but a format length of 5 characters ($5. format), so when you see the formatted values you will see at most 5 characters for each record.

Another attribute of the format is the precision. For example, the DOLLAR8. format will show you up to 8 characters of a value (including a currency symbol and thousands separator) in dollars, but no cents. DOLLAR8.2 will show you the decimal point and the value to the nearest cent (2 decimal spaces). In each case, the value displayed will not exceed 8 characters.

What happens if the formatted value runs into trouble squeezing into 8 characters? SAS drops the thousands separator and currency symbol to free up 2 slots first. If that's not enough, SAS begins to drop precision by lopping off the pennies, then the dimes. If that's still not enough, SAS rounds the value and uses scientific notation and other tricks to save space -- all the while keeping the integrity of your original value for use in any calculations.

For example, run this program:

data dollars;
 length val1 8 val2 8 val3 8 val4 8;
 format val1 DOLLAR8. val2 DOLLAR8.2 val3 DOLLAR4.2 val4 DOLLAR10.2;
 val1=12345.67;
 val2=12345.67;
 val3=12345.67;
 val4=12345.67;
 
run;
proc print;
run;

You see these values:

val1    $12,346
val2    12345.67
val3    12E3
val4    $12,345.67

In the case where you are combining two data sources that have a common column, the results depend on the sequence of things. For example, consider this SAS program:

If you SET the table with the short length and format first, the resulting table assumes the shorter format -- and thus you "lose" precision. But your SAS log warns you:

WARNING: Multiple lengths were specified for the variable name by input data set(s). This may cause truncation of data.

data long;
  length name $ 8;
  format name $8.;
  infile datalines dsd;
  input name;
datalines4;
aaaaaaaa
bbbbbbbb
cccccccc
dddddddd
;;;;
run; 
 
data short;
  length name $ 4;
  format name $4.;
  infile datalines dsd;
  input name;
datalines4;
eeee
ffff
gggg
hhhh
;;;;
run;
 
data longfirst;
  set long short;
run;
proc contents; run;
proc print; run;
 
data shortfirst;
  set short long;
run;
proc contents; run;
proc print; run;

If you SET the table with the longer length and format first, then the result has the longer format. No problems or warnings with that, because you can always fit the smaller values into a larger space, just not vice versa.

SAS formats are extremely powerful. They allow you to reuse data values for all sorts of reporting and categorization purposes without having to actually change the underlying values. My simple explanations here only scrape the surface of the flexibility and nuances of SAS formats. For more information, you might check out The Power of PROC FORMAT or The Little SAS Book.

tags: formats, informats, SAS tips

One Comment

  1. Shachindra pandey
    Posted February 2, 2009 at 5:46 am | Permalink

    the experience was cool and now concept is also clear for me.

6 Trackbacks

  1. By Do me a favor: stop doing me favors - The SAS Dummy on September 19, 2011 at 4:29 pm

    [...] FMTERR|NOFMTERR option specifies whether SAS should report an error when you attempt to reference a data column that has a SAS format applied, but the format definition cannot be [...]

  2. [...] to do before they can analyze and report on data. In SAS, the most elegant way to handle this is by applying a SAS format. A SAS format allows you to "bucket" a bunch of raw data values into more managable categories. For [...]

  3. By Greatest hits from 2009 - The SAS Dummy on October 3, 2011 at 3:28 pm

    [...] Lengths and Formats: the long and short of it What's the difference between LENGTH and FORMAT? This topic is so popular I've made sure to work it into SAS 9.2 for Dummies (due out this spring). [...]

  4. [...] person who wrote the above code forgot about the power of SAS formats. Using SAS formats, you can "recast" your variable for analysis to create a category from values [...]

  5. By Numerical precision in SAS on March 1, 2012 at 2:13 pm

    [...] Lengths and Formats: the long and short of it [...]

  6. [...] languages offered intrinsic data types with a 64-bit storage length. This is true in SAS, where the LENGTH of a numeric variable determines its capacity. The default length for a SAS numeric variable is 8 bytes, regardless of platform and bitness. 8 [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>