A reader from Bejing commented on a recent post with a question about data lengths and formats. While that wasn't really related to my post, I thought I'd attempt to answer in a new entry, here.
The question is basically this: when I combine two data sets with a common-named column, why does the resulting data set seem to cut the length short on the shared column?
Let's start with some definitions related to columns.
Length: The column length, in SAS terms, is the amount of storage allocated in the data set to hold the column values. The length is specified in bytes. For numeric columns, the valid lengths are usually 3 through 8. The longer the length, the greater the precision allowed within the column values. For character columns, the length can be 1 through 32767. For single-byte data values, that equates to the number of characters the column can hold. For multibyte data values (DBCS, Unicode, or UTF-8), where a character can occupy more than one byte, the number of characters that fit might be less than the length value of the column.
Format: The column format, in SAS terms, is a basically an instruction for how to transform a raw value into an appearance that is suitable for a given purpose. A basic attribute of a format is the format length, which controls how much of the value is displayed. For example, a character column might have a storage length of 10 bytes, but a format length of 5 characters ($5. format), so when you see the formatted values you will see at most 5 characters for each record.
Another attribute of the format is the precision. For example, the DOLLAR8. format will show you up to 8 characters of a value (including a currency symbol and thousands separator) in dollars, but no cents. DOLLAR8.2 will show you the decimal point and the value to the nearest cent (2 decimal spaces). In each case, the value displayed will not exceed 8 characters.
What happens if the formatted value runs into trouble squeezing into 8 characters? SAS drops the thousands separator and currency symbol to free up 2 slots first. If that's not enough, SAS begins to drop precision by lopping off the pennies, then the dimes. If that's still not enough, SAS rounds the value and uses scientific notation and other tricks to save space -- all the while keeping the integrity of your original value for use in any calculations.
For example, run this program:
data dollars; length val1 8 val2 8 val3 8 val4 8; format val1 DOLLAR8. val2 DOLLAR8.2 val3 DOLLAR4.2 val4 DOLLAR10.2; val1=12345.67; val2=12345.67; val3=12345.67; val4=12345.67; run; proc print; run;
You see these values:
val1 $12,346 val2 12345.67 val3 12E3 val4 $12,345.67
In the case where you are combining two data sources that have a common column, the results depend on the sequence of things. For example, consider this SAS program:
WARNING: Multiple lengths were specified for the variable name by input data set(s). This may cause truncation of data.
data long; length name $ 8; format name $8.; infile datalines dsd; input name; datalines4; aaaaaaaa bbbbbbbb cccccccc dddddddd ;;;; run; data short; length name $ 4; format name $4.; infile datalines dsd; input name; datalines4; eeee ffff gggg hhhh ;;;; run; data longfirst; set long short; run; proc contents; run; proc print; run; data shortfirst; set short long; run; proc contents; run; proc print; run;
If you SET the table with the longer length and format first, then the result has the longer format. No problems or warnings with that, because you can always fit the smaller values into a larger space, just not vice versa.
SAS formats are extremely powerful. They allow you to reuse data values for all sorts of reporting and categorization purposes without having to actually change the underlying values. My simple explanations here only scrape the surface of the flexibility and nuances of SAS formats. For more information, you might check out The Power of PROC FORMAT or The Little SAS Book.