Expanding lengths of all character variables in SAS data sets

In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables, on a case-by-case basis. But what if you need to change lengths for all character variables in a data set? Or for all data sets in a data library? For example, you may need to expand (increase) all your character variable lengths by 50%. In that situation, the case-by-case approach becomes too laborious and inefficient.

What is a character variable’s length attribute?

Before reading any further, let’s take a quick quiz:

Q: A character variable length attribute represents a number of:

  A. Bits
  B. Bytes
  C. Centimeters
  D. Characters

If your answer is anything but B (Bytes), it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

It is true, though, that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, as those systems were based on exactly one byte per character. They are even called Single-Byte Character Sets (SBCS) for that reason. The problem is that they can accommodate a maximum of only 2^8 = 256 symbols, which is not nearly enough to cover the variety of natural languages, special characters, emojis, and so on.
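
The byte-versus-character distinction can be seen directly in a SAS session. The sketch below assumes a session running with UTF-8 encoding and a program file saved as UTF-8; the variable names are made up for illustration. LENGTH() counts bytes, while KLENGTH() counts characters.

```sas
/* Assumes a UTF-8 SAS session; variable names are illustrative only */
data _null_;
   length word $ 10;
   word  = 'café';            /* 4 characters; the accented letter needs 2 bytes in UTF-8 */
   bytes = length(word);      /* number of bytes, trailing blanks excluded */
   chars = klength(word);     /* number of characters */
   put bytes= chars=;
run;
```

In a single-byte session the two values should agree; in a UTF-8 session the byte count is expected to be larger than the character count.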

Why would we want to expand character variable lengths?

Use case 1. Expanding character values range

For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in the 32-bit version 4 format (IPv4, e.g. ‘125.255.101.180’). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the longer character values.
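
To put numbers on this use case: a dotted-decimal IPv4 address takes at most 15 bytes as text, and a fully written IPv6 address takes 39 bytes, so the columns need 24 more bytes. Here is a minimal sketch using the CVPBYTES= option described later in this post. The folder paths and librefs are hypothetical, and it assumes the IP columns are currently defined as $15.

```sas
/* Hypothetical paths; assumes the IP columns are currently $15.          */
/* CVPBYTES= adds 24 bytes to EVERY character variable, so an IP column   */
/* goes from 15 to 15 + 24 = 39 bytes.                                    */
libname iplib  'c:\traffic_data' cvpbytes=24;
libname ip6lib 'c:\traffic_data_ipv6';

proc copy in=iplib out=ip6lib noclone;
run;
```

Keep in mind that the expansion is applied to all character variables in the library, not only to the IP address columns.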

Use case 2. Migrating SAS data to multi-byte encoding environment

In this scenario, you migrate/move SAS data sets from older SBCS environments to newer Multi-Byte Character Set (MBCS) encoding environments. In such a case, the ability to increase character variable lengths in bulk with a simple action becomes especially important.

Currently, the most commonly used multi-byte encodings are those of Unicode, which is supported by all modern operating systems, databases and web browsers. Of the different Unicode encodings (UTF-8, UTF-16, UTF-32), the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width encoding that uses from one to four one-byte (8-bit) code units per character. It can encode all 1,112,064 valid Unicode code points, which cover most modern languages, including Arabic and Hebrew characters, hieroglyphs, and emojis, as well as many other special characters.

Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.
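
Before migrating, it helps to know which encodings are actually involved. The following is a small sketch of how one might check the session encoding and a data set's encoding; LIBREF.DSNAME is a placeholder for your own data set.

```sas
/* Encoding of the current SAS session */
%put &=sysencoding;

proc options option=encoding;
run;

/* The encoding of an existing data set is listed in the PROC CONTENTS header. */
/* LIBREF.DSNAME is a placeholder, not a real data set.                        */
proc contents data=libref.dsname;
run;
```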

When SAS reads an SBCS-encoded data set and writes its records into a UTF-8-encoded data set, it writes an ERROR message to the log and stops execution:

ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set) it generates a WARNING message (with identical description as the above ERROR message) while continuing execution:

WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Either ERROR or WARNING is unacceptable and must be properly addressed.

How to expand all character variables lengths?

Regardless of character transcoding, SAS’ CVP Engine is a short and effective answer to this question. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does: it pads (expands) character variables by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.

The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP Engine to actually convert a data set or a whole data library to a version with expanded character variables. All we need to do is define our source library as a CVP library, for example:

libname inlib cvp 'c:\source_folder';

Then use PROC COPY to create expanded versions of our original data sets in a target library:

libname outlib 'c:\target_folder';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

It’s that easy. And the icing on the cake is that the CVP engine by default automatically adjusts the variables' format widths to match the expanded byte lengths for all converted character variables.
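
To confirm the result, one option is to compare the character variable lengths in the source and target libraries side by side. This is only a sketch: rawlib is a hypothetical plain (non-CVP) libref pointing at the same source folder, and the query relies on the standard DICTIONARY.COLUMNS table.

```sas
/* rawlib is a hypothetical plain (non-CVP) libref to the same source folder */
libname rawlib 'c:\source_folder';

proc sql;
   select a.memname,
          a.name,
          a.length as original_length,
          b.length as expanded_length,
          b.format as expanded_format
      from dictionary.columns as a
           inner join dictionary.columns as b
           on a.memname = b.memname and a.name = b.name
      where a.libname = 'RAWLIB'      /* librefs are stored in uppercase */
        and b.libname = 'OUTLIB'
        and a.type = 'char';
quit;
```

The expanded_format column shows whether a format width was adjusted along with the length.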

Avoiding character data truncation by using the CVP Engine

CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.

To avoid data loss from possible truncation during transcoding, we can use the above code with a slight but important modification: define the target library with the outencoding='UTF-8' option. As a result, our target data will be not only expanded lengthwise but properly encoded as well. We run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:

libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.
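
Before moving the files, it may be worth checking that the output really carries the new encoding. A minimal sketch, assuming the outlib.dataset1 data set created above, reads the encoding attribute with the ATTRC function:

```sas
/* Read the encoding attribute of one converted data set */
%let dsid = %sysfunc(open(outlib.dataset1));
%put Encoding of OUTLIB.DATASET1: %sysfunc(attrc(&dsid, encoding));
%let rc = %sysfunc(close(&dsid));
```

The same information also appears in the header of the PROC CONTENTS output.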

Code notes

  • The code above will create a new version of your original data sets with the desired encoding and with character variable lengths expanded by 50% (the default). As shown below, this default behavior can be changed by using the CVPBYTES= or CVPMULTIPLIER= options, which explicitly define the byte expansion rate.
  • It is important to note that the CVP option is specified on the input library, since the CVP engine is a read-only engine and is thus available for input (read) processing only.
  • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
  • The noclone option specifies not to copy data set attributes. This is needed to make sure the attributes are recreated rather than duplicated.
  • If you want to migrate your data sets using PROC MIGRATE, you should first expand column lengths using PROC COPY as shown above, since the CVP engine is not currently supported with PROC MIGRATE.
  • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

CVP Engine options

There are several options available with the CVP Engine. Here are the most widely used:

CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length. You can specify a value from 0 to 32766. However, expanded length will be automatically limited by the character variable maximum length of 32767.

Example: libname inlib 'SAS data-library' cvpbytes=5;

The CVPBYTES= option implicitly specifies the CVP engine. That is, if you specify the CVPBYTES= option, you don’t have to specify the CVP engine explicitly, as SAS will use it automatically.

CVPMULTIPLIER=multiplier - specifies a multiplier value by which to expand character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and then the CVP engine determines the multiplier automatically.

Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

The CVPMULTIPLIER= option also implicitly specifies the CVP engine. That is, if you specify the CVPMULTIPLIER= option, you don’t have to specify the CVP engine explicitly, as SAS will use it automatically.

Notes:

  • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
  • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.
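
The three variants can be tried side by side on a throwaway data set. The sketch below is illustrative only: the data set, variable and libref names are made up, and the lengths in the comments are simply what the rules above imply for a $20 variable.

```sas
/* Illustrative test data set with one $20 character variable */
data work.demo;
   length city $ 20;
   city = 'Cary';
run;

/* Three hypothetical librefs pointing at the WORK folder */
libname plus5 "%sysfunc(pathname(work))" cvpbytes=5;         /* 20 + 5   = 25 */
libname x25   "%sysfunc(pathname(work))" cvpmultiplier=2.5;  /* 20 * 2.5 = 50 */
libname dflt  cvp "%sysfunc(pathname(work))";                /* 20 * 1.5 = 30 */

/* Compare the reported length of CITY through each libref */
proc contents data=plus5.demo; run;
proc contents data=x25.demo;   run;
proc contents data=dflt.demo;  run;
```

Because the CVP engine is read-only, WORK.DEMO itself stays at $20; only the view through each libref changes.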

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries.

20 Comments

  1. Leonid,

    Nice post!
    A disadvantage of the CVP engine might be that it uses a fixed ratio of increased lengths without examining the data, therefore causing all character variables to become arbitrarily larger.
    Rick Langston wrote a nice paper (A Macro for Ensuring Data Integrity When Converting SAS® Data Sets) about examining the data to determine the maximum length needed for each character variable so that the transcoded data set will have the sufficient lengths to avoid any truncation.
    See: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1778-2018.pdf

    The %COPY_TO_NEW_ENCODING macro described in this paper is also documented:
    https://go.documentation.sas.com/?docsetId=nlsref&docsetTarget=p1g1d26os4w0von1cdfh827foo3r.htm&docsetVersion=9.4&locale=en

    We successfully used a process based on this macro to transcode terabytes of data for a customer.

    • Leonid Batkhan

      Thank you, Lex, for your comment and for pointing to another great resource. Yes, the CVP engine applies a simple blanket length expansion. However, there is a catch when the length increase is optimized based on the existing data values. As they say in the investment world, "Past performance is no guarantee of future results." Future data values (including newer characters) may be added to the data structure, and if it's over-optimized based on some past data snapshot it might not be able to accommodate the newer data...

  2. Leonid,

    Great post! And thanks for spurring a conversation around an incredibly useful answer to a very important topic for anyone involved with data on a day-to-day basis.

  3. I'm planning to pass this article along to my colleagues as a must-read. Its useful content on CVP and NLS issues and ensuing discussion are strategic to us since we often field calls on the truncation warning and errors that can be resolved using the CVP engine. I especially like the descriptive words and picture of the magnifying glass. Once I started writing blogs, I can never take for granted the small things like this that can add time but make for a fun read! Your article is definitely going to be a reference I will use often with customers in Technical Support.

  4. I haven't had a use case for something like this for a very long time, but it's good to know. I probably would have written a long and complicated program to do this not knowing about this feature. This one is definitely worth bookmarking just in case I ever need it.

  5. Fascinating. Would it be possible to use something like this to change datasets that included scripts in a foreign language to a roman transliteration?
    Sometimes we might need a longer length.
    For example, the Russian letter Щ is transliterated as SHCH, which is four characters instead of one.

    • Leonid Batkhan

      Thank you, John, for your feedback. Of course, it is possible to use CVP engine to expand character lengths to accommodate foreign language transliteration. In fact, CVP engine does not care why you are expanding character variable length - it just does it. But you provided a valid use case for it. To do the actual character transliteration/replacement, you can use TRANWRD() function. For example, to replace all occurrences of Russian letter Щ in variable x to SHCH, you can use tranwrd(x, 'Щ', 'SHCH') .

        • Leonid Batkhan

          No worries. With the CVP engine's CVPBYTES= option you can expand character variable length by anywhere from 0 to 32766 bytes, up to the maximum character length of 32767 bytes. This will allow for 32767/7 = 4681 letters Щ in German transliteration. 🙂

  6. Great blog
    Having old exposure to growing data becoming truncated we solved some of this by greatly increasing char lengths and using data set option compress=char.
    Would compression work without performance penalties on CVPMULTIPLIER= 4 or larger ?
