Character Encoding

An Introduction

Computer Science Department

bernstdh@jmu.edu

Background

Digital Computers:
- Most digital computers are binary
The Real World:
- Sets (even discrete sets) are rarely binary
The Implication:
- The representation of a real-world set in a digital computer requires an encoding scheme
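As a rough illustration of the implication above, the sketch below assigns fixed-width bit patterns to the elements of a small discrete set. The names `bits_needed` and `fixed_width_codes`, and the choice of the days of the week as the set, are assumptions made for this example only.

```python
def bits_needed(set_size: int) -> int:
    """Smallest number of bits whose patterns can label every element."""
    if set_size < 1:
        raise ValueError("the set must contain at least one element")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2; use at least 1 bit.
    return max(1, (set_size - 1).bit_length())


def fixed_width_codes(items):
    """Map each item to a distinct bit string of the same width."""
    width = bits_needed(len(items))
    return {item: format(index, f"0{width}b") for index, item in enumerate(items)}


# Hypothetical example set: seven days need 3 bits (one pattern goes unused).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
codes = fixed_width_codes(days)
for day, code in codes.items():
    print(day, code)
```

A fixed-width assignment like this is only one possible encoding scheme; variable-length schemes trade the simplicity for fewer bits on common elements.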

Some History of Character Encodings

The 1800s:
- Morse Code - A variable-length scheme with a maximum of 4 symbols (dots or dashes) for Latin letters and 5 symbols for Arabic numerals (1836)
- Baudot Code - A 5-bit code (1870)
The 1900s:
- International Telegraph Alphabet No. 2 (ITA2) - A 5-bit code that included control characters like carriage return, line feed, and bell (1930)
- American Standard Code for Information Interchange (ASCII) - 7 bits, 95 printable and 33 unprintable characters (1963)
- Extended Binary Coded Decimal Interchange Code (EBCDIC) - 8 bits (1963)

The Modern Era

Some Important Realizations:
- There are a lot of writing systems in the world
- Adding more bits to an encoding scheme is wasteful if they aren't going to be needed by a large number of users
Use Decomposition:
- Characters are mapped to code points (i.e., numerical values in the code space)
- Code points are then encoded in different ways with different numbers of bits per character
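The two-step decomposition can be observed with Python's built-in `ord` and `str.encode`. This is a minimal sketch: the helper name `encoded_sizes` and the four sample characters are assumptions chosen for illustration.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of octets one character occupies in three encodings."""
    # The "-be" codec variants are used so that no byte order mark is added.
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in "A\u00e9\u20ac\U0001F600":
    code_point = ord(ch)  # step 1: character -> code point
    sizes = encoded_sizes(ch)  # step 2: code point -> octets
    print(f"U+{code_point:04X}", sizes)
```

The same code point occupies a different number of octets depending on the encoding, which is what lets users whose text is mostly ASCII avoid paying for bits they do not need.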

The Modern Approach

The Decomposition:
- The character repertoire (i.e., the set of supported characters)
- The coded character set (i.e., the integer values, or code points, for each character)
- The character encoding form (i.e., how the integer values will be represented in binary form using values with a fixed number of bits)
- The character encoding scheme (i.e., how the fixed-size integer values, which may not use octets, should be mapped into octets)
Examples:
- The Universal Character Set (ISO 10646) contains over 120,000 different characters
- ISO 8859-1 contains code points for 191 characters from the Latin alphabet used in Western Europe
- UTF-8 maps code points to variable-length sequences of 8-bit words and UTF-16 maps code points to variable-length sequences of 16-bit words
- UTF-16 encodings can, in principle, put the octets in either big-endian or little-endian order
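The examples above can be checked with Python's standard codecs. This is a small sketch rather than a full treatment; the sample characters and variable names are assumptions made for illustration.

```python
euro = "\u20ac"  # EURO SIGN, code point U+20AC

# Encoding scheme: the same 16-bit code unit, serialized in two octet orders.
big_endian = euro.encode("utf-16-be")
little_endian = euro.encode("utf-16-le")
print(big_endian.hex(), little_endian.hex())

# Coded character set: ISO 8859-1 covers e-acute but has no euro sign.
print("\u00e9".encode("latin-1").hex())
try:
    euro.encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print("euro representable in ISO 8859-1:", euro_in_latin1)
```

Because the octet order is ambiguous, a UTF-16 stream either declares its order out of band or begins with a byte order mark.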

8-Bit Unicode Transformation Format (UTF-8)

Properties:
- Code points are mapped to between 1 and 4 octets
- The mapping is backward compatible with 7-bit ASCII
The Algorithm:
- 1-Octet: The high-order bit is 0 and the other bits contain the value in the interval \([0, 127]\)
- Multiple-Octets: The number of leading 1s (which are always followed by a 0) in the first octet indicates the number of octets used, and each subsequent octet begins with 10 (so continuation octets can be easily identified as such)
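The algorithm above can be sketched as a hand-written encoder. The function name `utf8_encode` is hypothetical, and the range boundaries (0x80, 0x800, 0x10000) and the rejection of surrogates follow the well-formed subset of UTF-8 rather than anything stated on the slide. In practice, `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as 1 to 4 octets (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")

    # 1-octet form: high-order bit 0, value in [0, 127].
    if code_point < 0x80:
        return bytes([code_point])

    # Choose the number of octets from the size of the value.
    if code_point < 0x800:
        count = 2
    elif code_point < 0x10000:
        count = 3
    else:
        count = 4

    # Continuation octets: 10xxxxxx, six payload bits each, lowest bits first.
    tail = []
    for _ in range(count - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # Lead octet: `count` leading 1s, then a 0, then the remaining bits.
    lead_marker = (0xFF << (8 - count)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

Comparing the result with the built-in codec for a handful of code points is a quick way to check a sketch like this.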

UTF-8 (cont.)

Theoretical Byte Sequences
(Note: Some of the following sequences are not considered well-formed in the specification.)
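To illustrate the note about well-formedness, the sketch below asks Python's strict UTF-8 decoder whether a few octet sequences are accepted. The helper name `is_well_formed_utf8` and the sample sequences are assumptions chosen for this example; they are not necessarily the sequences listed on the slide.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Each sample follows the leading-1s pattern, but only the first is well-formed.
samples = {
    "two-octet U+00E9": b"\xc3\xa9",
    "overlong form of U+0000": b"\xc0\x80",
    "surrogate U+D800": b"\xed\xa0\x80",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
    "five-octet pattern": b"\xf8\x88\x80\x80\x80",
}
for label, octets in samples.items():
    print(f"{label:26} {octets.hex(' '):16} {is_well_formed_utf8(octets)}")
```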

The Unicode Standard

The Unicode Consortium:
- Got it all started in the 1990s
- "Synchronizes" with the ISO
The Standard:
- "[T]he official way to implement ISO/IEC 10646."
- "[T]he Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646."
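As a small illustration of the character data and algorithms the Standard supplies, Python's standard `unicodedata` module exposes character names, general categories, and the normalization forms. The sample strings below are assumptions chosen for this sketch.

```python
import unicodedata

composed = "\u00e9"      # one code point: e with acute accent
decomposed = "e\u0301"   # two code points: e + combining acute accent

# The two strings look the same when rendered but compare unequal.
print(composed == decomposed)

# Normalization (an algorithm defined by the Unicode Standard) reconciles them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# Character data: names and general categories.
print(unicodedata.name(composed))
print(unicodedata.category("A"))
```

Normalizing text before comparing or hashing it is one common way applications arrange to treat such equivalent sequences uniformly.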


Nerd Humor

(Courtesy of xkcd)

There's Always More to Learn