Demystifying and resolving common transcoding problems

Technical Support regularly receives incoming calls from customers who have encountered the following transcoding warning:

WARNING: Some character data was lost during transcoding in the data set xxx.xxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding

People are not always exactly sure what this warning means nor what to do about it. No worries! This blog provides background information about why this warning occurs and offers resources that can help you understand and resolve any encoding or truncation issues that are the source of the warning above.

In most Western character sets, each character occupies only one byte of computer memory or storage. These are known as single-byte character sets (SBCS). Character sets for languages such as Japanese, Korean, and Simplified and Traditional Chinese require mixed-width encoding. The character sets for these languages are double-byte character sets (DBCS). (The term DBCS can be misleading because not all of the characters in a double-byte character set are two bytes. Thus, the term multibyte character set [MBCS] is sometimes used instead of DBCS.)

Unicode is a multibyte character set that was created to support all languages. It includes all characters from most modern written languages and historic scripts, even hieroglyphs and cuneiform. UTF-8 is one of the more common encoding forms of the Unicode standard, and this encoding is recommended as the SAS session encoding for multilingual environments if your data sources contain characters from many regions. Each character can be up to four bytes in UTF-8. UTF-8 is the SAS session encoding for SAS® Viya™. Clients such as SAS Studio and SAS Visual Analytics typically execute statements in a server environment that run the UTF-8 encoding.

Now, let's look at how SAS handles data from different encodings along with two causes of the warning shown earlier.

Starting in SAS®9, SAS data sets store an encoding indicator in the descriptor portion. If the encoding value of the file differs from the encoding of the currently executing SAS session, the Cross Environment Data Access (CEDA) engine is invoked when SAS reads the data set. CEDA transcodes the data, by default. Transcoding is the process of converting the contents of SAS files from one encoding to another, and this process is necessary in order to read data from around the world.

When transcoding occurs, you receive an informational note about the use of CEDA. In addition, you might see the warning (shown above) in your SAS log.

The transcoding warning is issued when you have values for character variables that have either of the following characteristics:

The number of bytes for a character in one encoding do not match the number of bytes that are used for the same character in another encoding (as shown in the image below). If the length of the column is not wide enough to accommodate the additional bytes, truncation of the character data occurs.

The characters exist in one encoding, but they do not exist in another encoding. As a result, the transcoding is not successful.

If you use a procedure (for example, the PRINT or REPORT procedures), the procedure runs to completion. The variable is truncated if the additional bytes cause the value to exceed the length of the column. Alternatively, if a character from the data set does not also exist in the target encoding or if it occupies a different code point (after transcoding), the column might be blank or another character might be substituted for the original character. The procedure generates the transcoding warning.

If you have an output data set open, SAS transcodes the character data and stops writing observations to the file at the point it encounters the problematic characters. This behavior can result in either fewer observations than expected or a data set with zero observations. When this happens, SAS generates an error message instead of a warning. This error message contains the same content as the warning message.

ERROR: Some character data was lost during transcoding in the data set xxx.xxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding

New Video Available to Help You Resolve the Transcoding Error or Warning

Before you attempt to resolve the transcoding error or warning, it is helpful to understand a bit about encodings and how to handle multilingual SAS data. To help you in this endeavor, SAS Technical Support worked with Education and Training to create a new, short (12 minutes) video to help you quickly troubleshoot and resolve the most common reasons for the transcoding error or warning.

Click the image below to access the video. (Note: You will be prompted to create a SAS profile if you do not have one already.)

If you want to practice the techniques in this video, you can download the data sets that are used in the video here.

This video is based on solutions that have helped customers who called SAS Technical Support for assistance. The video offers these solutions in a visual format that is easy to follow. In addition to showing you how to launch SAS in different encodings to resolve the error, the video also introduces the Character Variable Padding (CVP) engine (Read-only) for processing SAS data files that contain multilingual data.

This video is one of a number of resources available to help you. The next section lists several tips that might help prevent or resolve transcoding warnings or errors.

Helpful Tips from the Field

The following tips have been tested and used in the field, and you might find them helpful when you work with multilingual data. The first two tips are usage tips; the others are programming tips.

Invoke SAS in more than one encoding. For this first tip, it is important to know that if you try to change the value of the ENCODING system option during a SAS session, SAS ignores the option. When this situation occurs, the following warning is generated:

WARNING 30-12: SAS option ENCODING is valid only at startup of the SAS System. The SAS option is ignored.

If you work with data that is stored in more than one encoding, you can invoke SAS in more than one session encoding and manage the sessions easily by setting the title bar to display the session encoding in the main SAS window.

Note: This solution only works in the Microsoft Windows operating environment. Display of UTF-8 encoded data in the SAS windowing environment is not fully supported. SAS Enterprise Guide and SAS Studio have full support for display of UTF-8 characters.

To customize the title bar of the main SAS window.

1. In a Windows environment, select Start ► All Programs ► SAS ► Additional Languages.

2. Under Additional Languages, select the shortcut for the session encoding that you want. For example, the following image shows that SAS 9.4 (Unicode Support) is selected

3. Right-click the shortcut and select Properties. This opens the properties dialog box for that shortcut.

4. On the Shortcut tab, place your cursor at the end of the current string in the Target text box. If it is not already listed in the box, enter the location of the configuration file that you want to use by using the –CONFIG option followed by the –AWSTITLE option.

Then include in quotation marks the text that you want to see in the main SAS window, as shown in the following example:

The entire line should look similar to the following, only it will appear as one long string:

"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\
u8\sasv9.cfg" -awstitle "SAS UTF8"

The following display shows the –AWSTITLE option added to the Target field of the SAS 9.4 (Unicode Support) Properties dialog box:

5. Click OK.

Now, when you open SAS, you can immediately recognize the session encoding based on the value in the title bar.

You can pin this shortcut to your task bar by dragging it from the Start menu to the task bar. Then you should see your custom title when you hover your cursor over the shortcut.

Prevent an unexpected transcoding warning or error: If you read or convert a SAS data set from LATIN1 or WLATIN1 encoding to UTF-8 that contains Microsoft Word smart quotes or dashes in character variables, it can cause truncation errors. Those are not ASCII characters, and they require more than one byte when converted to UTF-8. For example, performing the following steps causes an unexpected transcoding error or warning.

1. You use SAS in the WLATIN1 encoding.
2. You copy a quoted string from Microsoft Word and paste that string into the DATALINES section of a SAS program in the SAS windowing environment.
3. You create a new variable that uses the quoted string in its value.
4. You execute the SAS program to create a permanent SAS data set.
5. You invoke a UTF-8 SAS session to read the WLATIN1 data set in a procedure or DATA step.

You can prevent this issue by clearing the "Straight quotes" with "smart quotes" option in Word, as shown below. This option is available by selecting File ►Options ► Proofing ► AutoCorrect Option. On the AutoFormat tab, the option is listed under the Replace category.

Set the CVP engine explicitly (for UTF-8 sessions). The transcoding warning is common if you use a UTF-8 session encoding and the input data source is in the WLATIN1 encoding. The first 128 code points of the UTF-8 code page are identical to the 7-bit ASCII encoding. If your character data only contains characters within those 128 code points, SAS does not generate a warning or error when it transcodes the data. For code points above 128, the character variables must be expanded to two or more bytes. Many characters that are used in Western European languages fall into this category. You can address this issue and avoid the warning or error by adding the CVP engine to the LIBNAME statement for your input data source.

The following example LIBNAME statement illustrates how to specify the CVP engine explicitly:

libname mySAS cvp 'SAS-library-path';
data utf8dat;
set mySAS.wlatindat;
run;

In this example, the code is submitted in a UTF-8 SAS session. The CVP engine is Read-only, and it is used to read the input data set from the WLATIN1 encoding.

Set the CVP engine implicitly. The CVP engine provides an easy way to convert your files and avoid truncation problems by using a default value of 1.5 times the original length of the character variable. However, that value might not be sufficient for the length of the characters in your data. The expansion length should be based on the characters that are contained in your data.

For example, if your data consists of double-byte character (DBCS) data (Chinese, Japanese, and so on), most characters are three bytes and emoji characters are four bytes. Therefore, the default 1.5 value for the CVP engine might not be enough for the number of additional bytes. A value of 2.5 is a reasonable expansion for DBCS characters, but you might need more.

In such cases, you might want to try setting the CVP engine implicitly. When you set the CVP engine implicitly, you can use either the CVPBYTES= or CVPMULTIPLIER= options. The following example uses the CVPBYTES= option to accommodate the four bytes that are required for the winking emoticon in UTF-8:

libname enc 'c:\public\encdata' cvpbytes=4;

Convert text in a character variable to another encoding. Sometimes, the CVP engine does not correct the transcoding error or warning. In these cases, Technical Support usually finds that the destination encoding does not support all of the characters found in the data. For example, if you have a UTF-8 data set that contains French, Greek, Hindi, and other characters and you are running a WLATIN1 session, SAS will successfully transcode the French and other Western European characters to WLATIN1. However, the Greek and Hindi characters cannot be transcoded because there is no representation for them in the WLATIN1 encoding. As a result, an error is generated, and the data is lost.

If you have this kind of multilingual data, you can use the KPROPDATA() function to transcode a character string from and to any encoding and convert characters that are not represented in the target encoding to a character of your choice (a question mark or a space). Note that you must specify BINARY or ANY as the input encoding to prevent CEDA from transcoding your data. (See the next tip for information about suppressing transcoding.) A macro is available in the technical paper Multilingual Computing with SAS 9.4 (on page 7) that you can run to accomplish these tasks.

Suppress transcoding. There are certain cases in which you might want to bypass transcoding errors. For example, suppose that you know that your data contains only ASCII characters. You are providing data sets to users and you do not know whether the users will run SAS in WLATIN1 or UTF-8 session encodings. Because the characters are ASCII, SAS does not need to transcode them. You do not want CEDA notes nor the extra overhead that CEDA requires. In such a case, you can suppress transcoding without risking data loss. In the LIBNAME statement for the output data set, specify either the OUTENCODING=ASCIIANY option or the ENCODING=ASCIIANY data set option.

Example 1:

libname final 'SAS-library-path' outencoding=asciiany; 
proc copy in=sashelp out=final noclone;
   select dsname;
run;

Example 2:

libname final 'SAS-library-path'
data final.class(encoding=asciiany);
   set dsname;
run;

The value of ASCIIANY means no transcoding occurs on ASCII machines. When the data source is open on an EBCDIC machine, SAS assumes that the data is ASCII and transcodes it into the EBCDIC session encoding. Note: The CONTENTS procedure shows the encoding value: us-ascii ASCII (ANSI).

The encoding options shown above have two other values:

ANY: With this value, SAS does not transcode at all, which is, effectively, binary mode.
EBCDICANY: With this value, transcoding only occurs on ASCII machines. SAS assumes the data is EBCDIC and transcodes the character data into the ASCII session encoding.

Additional Self-Help Resources

This section lists several documents that provide more detail about encoding concepts and suggests ways to handle transcoding problems.

Encoding: Helping SAS speak your language. This blog by Bari Lawhorn briefly explains what an encoding is, how to determine the default encoding, what to do if your encoding differs from that of other people with whom you share SAS data, and where to find more in-depth information about encodings.
Multilingual Computing with SAS® 9.4: This paper, which describes new and enhanced features for handling multiple languages in SAS 9.4, covers topics such as installing and configuring the SAS System, changing the locale of a deployment, working with multibyte data, multilingual support in SAS^® Visual Analytics, and more
SAS® and UTF-8: Ultimately the Finest. Your Data and Applications Will Thank You!: This SAS Global Forum 2017 paper by Elizabeth Bales and Wei Zheng is a perfect go-to resource when you are migrating data files and formats from another encoding to UTF-8.
SAS^® Encoding: Understanding the Details: This SAS Press book, by Manfred Kiefer, explains the basics about character encoding that are required for creating, manipulating, and rendering any type of character. This book also provide examples for troubleshooting a variety of encoding problems.
SAS® 9.4 National Language Support (NLS): Reference Guide, Fifth Edition. This user guide provides detailed information about encodings, transcoding, and other NLS topics (including dictionaries of NLS autocall macros, data set options, system options, formats, informats, functions, and procedures)
SAS Note 15597, "How to convert SAS data set encoding." This note explains how you can convert the encoding for a SAS data set.
SAS Note 52716, "The error "Some character data was lost during transcoding in the data set" occurs when the data set encoding does not match the SAS® session encoding."
SAS Sample 55054, "PROC SQL can be used to identify data representation and encoding for all data sets in a library:" This sample enables you to query a library of data sets that are in mixed encodings so you can plan to use them in compatible session encodings or you can convert the data sets to another encoding. Using the SQL procedure in this sample, you can create output similar to the following that shows you the data representation and encoding for all data sets in your library.

1 Comment

Leonid Batkhan on May 19, 2017 2:35 pm

Great post, Jody! Isn't it time though to have UTF-8 as a default not just for SAS® Viya™, but across the board for all SAS products! Or maybe have a SAS session global option for output data DATAENCODE= ?

Blogs