Have you ever seen question marks in ODS output, diamonds with question marks in them where a copyright symbol should be, or unexpected symbols in your Web browser? In other words, do you sometimes see garbage instead of text?
Manfred Kiefer’s new book, SAS Encoding: Understanding the Details, can help. It offers answers to the question, “Why do my characters get garbled on the computer, and how do I fix the problem?”
We asked Kiefer to explain some of the basics behind encoding: what are the most important ideas programmers need to understand when working with encodings? Here’s what he told us:
- First of all, users need to understand that characters of natural-language writing systems are ultimately represented digitally as combinations of zeros and ones. An encoding tells the computer how to interpret these zeros and ones as characters.
- Encodings are available for almost any language. However, the basic problems persist: languages use several often incompatible encodings; software needs to support many different encodings; and data transfer across platforms always runs the risk of data loss or corruption.
- Programmers need to remember that a character is not a byte, at least not always. For Chinese, Japanese, or Korean, with their thousands of characters, legacy encodings need two bytes to represent most characters; these are referred to as double-byte character sets (DBCS). In UTF-8 (a form of Unicode), a single character can take as many as four bytes!
- Transcoding is the process of mapping data from one encoding to another. In order to avoid data loss or data corruption, the encodings involved need to be compatible. Transcoding succeeds only if every character in the data exists in both the source and the target encoding.
- You do not need to understand binary, but you should be able to check and interpret the hexadecimal values of characters. This will tell you how a character has really been stored in a file, no matter what the display shows you.
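To make a few of these points concrete, here is a minimal sketch in Python rather than SAS. It is not taken from Kiefer’s book, and the helper names (byte_lengths, hex_dump, can_transcode) are made up for illustration. It shows that one character can occupy several bytes, how to look at the hexadecimal values that are actually stored, and how a missing character in the target encoding makes transcoding fail.

```python
# Illustrative helpers only; the function names are hypothetical, not part of SAS or any library.

def byte_lengths(text, encoding):
    """Return how many bytes each character of `text` occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


def hex_dump(text, encoding):
    """Return the bytes stored for `text` in `encoding` as space-separated hex values."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def can_transcode(text, target_encoding):
    """Return True if every character of `text` exists in `target_encoding`."""
    try:
        text.encode(target_encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    copyright_sign = "\u00a9"   # the copyright symbol
    kanji = "\u65e5"            # a Japanese character (U+65E5)
    emoji = "\U0001F600"        # a character outside the Basic Multilingual Plane

    # A character is not always one byte: 1, 2, 3 and 4 bytes in UTF-8.
    print(byte_lengths("A" + "\u00e9" + kanji + emoji, "utf-8"))  # [1, 2, 3, 4]

    # The same character is stored differently depending on the encoding.
    print(hex_dump(copyright_sign, "latin-1"))  # a9
    print(hex_dump(copyright_sign, "utf-8"))    # c2 a9

    # Transcoding works only if the target encoding contains the character.
    print(can_transcode(copyright_sign, "latin-1"))  # True
    print(can_transcode(kanji, "latin-1"))           # False

    # Reading bytes with the wrong encoding produces the familiar garbage:
    # a Latin-1 copyright byte read as UTF-8 becomes the replacement character,
    # and forcing the symbol into ASCII leaves a plain question mark.
    print(repr(b"\xa9".decode("utf-8", errors="replace")))
    print(copyright_sign.encode("ascii", errors="replace"))
```

The last two lines reproduce the symptoms from the opening paragraph: U+FFFD is what many browsers draw as a diamond with a question mark, and the plain question mark is what remains when a character has no counterpart in the target encoding.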
Have you encountered encoding issues in your work? Anything you’d add to this list? Tell us about it! Or read a free chapter or order your copy of Kiefer’s book to learn more about the basic concepts of characters, encodings, glyphs, and fonts, and how to troubleshoot encoding problems.