Encoding: helping SAS speak your language


If you live in an English-speaking country, you are used to a relatively unadorned alphabet. Compare that with Spanish and French, where vowels carry accents, such as the acute accent in “acción” in Spanish or the circumflex (the “hat”) in “pâte” in French. Or look at the gorgeous script you get to use when you read and write the letter "a" in Japanese: あ. Nice looking, right?

If you work with data that originates from another country or is distributed across the globe, you need to know about the SAS system options that control how the characters in your data are stored. Two of these options are ENCODING and LOCALE. These options will help guarantee that if your Japanese counterpart sends you SAS information in Japanese, you see the appropriate output, and not a series of question marks or blank boxes in your SAS session, or worse, errors in your log window.

The ENCODING system option tells SAS how to store the data created in that session and how to read data from external sources. The LOCALE system option tells SAS how to represent currency, date, and time values, how to display menu items and tasks, and which default paper size and time zone values to use.
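To see which values your own session is using, a quick check along these lines can help. This is a minimal sketch that relies only on the GETOPTION function, called through %SYSFUNC, and writes the results to the log.

```sas
/* Write the current session encoding and locale to the log */
%put ENCODING=%sysfunc(getoption(encoding));
%put LOCALE=%sysfunc(getoption(locale));
```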

What exactly is encoding?

Encoding is not a term SAS invented. It is the way that characters are represented by computers. This W3C Internationalization page is an excellent non-SAS resource if you’d like to study the concept further. In a nutshell, the characters of a natural-language writing system together form an abstract character repertoire (like the one behind Wlatin1, which covers most Western European languages). Each character in the repertoire is mapped to a unique nonnegative integer, and an encoding defines the way those numbers are stored in the computer.

Many encoding values exist in SAS. In combination with the LOCALE option setting, they allow SAS to run in over 100 countries with SAS windows, pmenus, and log messages localized accordingly. The table SBCS, DBCS, and Unicode Encoding Values for Transcoding Data lists the current encoding values for SAS 9.4.

For example, if I invoke SAS with the LOCALE setting of Korean and an encoding value of euc-kr, I will see notes, warnings and errors written to my log in Korean:

NOTE: 변수 'a'이(가) 초기화되지 않았습니다.

The above text is translated to English as:

Note: Variable a is not initialized

Why is encoding important?

The ENCODING system option is important to SAS programmers because the session encoding determines how individual characters are stored internally. As an illustration, on Windows with an encoding of Wlatin1, the character “á” is stored as hexadecimal E1. However, in a Unicode (UTF-8) SAS session, the same character is stored internally as C3A1:

/* The hex format shows a difference in how á   */ 
/* is stored with different encoding settings   */ 
data test; 
 x='á'; 
 y=getoption('encoding'); 
 put x=; 
 put y=; 
 put x $hex4.; 
run;

Wlatin1 session     UTF-8 session
x=á                 x=á
y=WLATIN1           y=UTF-8
E1                  C3A1
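One practical consequence of this difference is that the number of bytes and the number of characters in a value can diverge. The sketch below is an added illustration, not part of the original example. It compares the LENGTH function, which works in bytes, with the KLENGTH function, which works in characters. In a Wlatin1 session both values should be 1 for á, while in a UTF-8 session the byte count should be 2 and the character count 1.

```sas
/* Compare byte length and character length of the same value */
data _null_;
  x = 'á';
  bytes = length(x);   /* length in bytes */
  chars = klength(x);  /* length in characters */
  put bytes= chars=;
run;
```

This matters whenever you declare character variable lengths, because lengths are given in bytes.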

Software Globalization specialist Manfred Kiefer wrote the following book to illustrate and demystify encoding and to help you plan a strategy for internationalizing your SAS installation: SAS® Encoding: Understanding the Details.

What is the default encoding value?

The default encoding value is based on the value of LOCALE and on the regional settings selected in the SAS Deployment Wizard during installation and deployment of SAS software, as shown here:

[Image: regional settings selection in the SAS Deployment Wizard]

On Windows and UNIX machines, the SAS 9.4 Intelligence Platform installation provides the Locale Setup Manager task in the SAS Deployment Manager to configure the language and region for SAS Foundation and certain SAS applications. See the SAS® 9.4 Intelligence Platform: Installation and Configuration Guide for more information.

The SAS Technical Paper Multilingual Computing with SAS® 9.4 explains how SAS is deployed for National Language Support. Three images are automatically deployed for SAS on all Windows and UNIX machines:

  • ‘English’ is a single-byte SAS image that displays an English user interface and English messages by default. The LOCALE and ENCODING options for the English image are set to match the Regional Settings selection. If that selection is an Asian language, LOCALE and ENCODING are instead set to support en_US.
  • ‘English with DBCS’ is a double-byte SAS image that displays an English user interface and English SAS messages by default. This image supports languages that require a double-byte character set, such as Chinese. If a double-byte language is selected later in the Regional Settings dialog, the LOCALE option in the ‘English with DBCS support’ config file is set to match. Otherwise, the LOCALE defaults to ja_JP.
  • ‘Unicode Support’ is installed for all Windows and UNIX deployments, even if the SAS server is not configured for Unicode support. The ENCODING option is set to utf-8. The LOCALE of the Unicode server is set to match the Regional Settings locale selection.

What if my encoding differs from others with whom I share SAS data?

If the encoding of a data set does not match the encoding of your SAS session, you could encounter this error:

ERROR: Some character data was lost during transcoding in the data set

Comparing the PROC OPTIONS group=LANGUAGECONTROL settings with the data set’s encoding will help determine what steps you should take to ensure you can access and modify the data set in question. SAS Note 52716 discusses the lost character data error in detail.
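A comparison like this can be sketched with two procedure calls. SASHELP.CLASS is used here only as a stand-in for the data set you are investigating; substitute your own libref and data set name.

```sas
/* Session side: language-related options, including ENCODING and LOCALE */
proc options group=languagecontrol;
run;

/* Data set side: the Encoding attribute appears in the PROC CONTENTS output. */
/* SASHELP.CLASS is only a placeholder for your own data set.                 */
proc contents data=sashelp.class;
run;
```

If the Encoding value reported by PROC CONTENTS differs from the session ENCODING value, SAS has to transcode the character data when it reads the data set, which is where characters can be lost.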

The most common method of preventing this error is to launch SAS using a different configuration file, so that the encoding for the SAS session matches that of the data set. Alternatively, you can ask for the data to be supplied in a different encoding.

SAS Note 15597 shows logic to convert the encoding of a SAS data set. However, if you will be sharing data across different languages and encoding values, it is imperative that you communicate with those with whom you share data and SAS files, so that the SAS system settings you all use consistently allow seamless access to the shared data.
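As a rough idea of what such a conversion can look like, here is a minimal sketch that uses the ENCODING= data set option on the output data set. It is not necessarily the logic from SAS Note 15597, and the data set names are placeholders chosen so that the step runs in any session.

```sas
/* Write a copy of the data set in UTF-8.                      */
/* SASHELP.CLASS and WORK.CLASS_UTF8 are placeholder names.    */
data work.class_utf8 (encoding='utf-8');
  set sashelp.class;
run;

/* Confirm the encoding recorded for the new copy */
proc contents data=work.class_utf8;
run;
```

When you convert from a single-byte encoding to UTF-8, character values can grow in bytes, so character variables may need longer lengths to avoid truncation. The CVP engine on the LIBNAME statement is one way to expand those lengths when reading the source data.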

What if I have non-SAS data sources?

If you read or access data from a database such as Oracle, it is imperative that your database client communicates with SAS in a way that lets native characters be interpreted correctly. SAS Note 51411 explains how to correct a potential problem on both Windows and UNIX systems; the fix may require input from your database administrator. The SAS Technical Paper Multilingual Computing with SAS® 9.4 describes configuration steps for many other database clients.

Where can I find more information on this topic?


About Author

Bari Lawhorn

Sr Principal Technical Support Analyst

Bari Lawhorn is a Senior Principal Technical Support Engineer in the Programming Clients and Interfaces group in Technical Support, where she has provided support for the DATA step and Base procedures for more than 25 years. Bari supports SAS Studio, has specialized in support for the SAS Output Delivery System since its inception and is part of a Technical Support NLS (National Language Support) "virtual team."

4 Comments

  1. Really useful!!
    Unfortunately on a recent contract we were using SAS on a locked down Citrix environment where we weren't able to change the locale of the session (we couldn't get to the config file).

    I was able to use the Unicode function to work the data, despite the fact that we couldn't view the actual text. There are some really powerful encoding related facilities in SAS - the KCVT function for changing the encoding of a specific field for one!

    • Bari Lawhorn

      Rob, thanks for the feedback. You are correct that the K functions are really powerful, I am so glad you found them useful (and that you have given us ideas for a future blog post!).
