Character Encoding
An Introduction


Prof. David Bernstein
James Madison University

Computer Science Department
bernstdh@jmu.edu

Background
  • Digital Computers:
    • Most digital computers are binary
  • The Real World:
    • Sets (even discrete sets) are rarely binary
  • The Implication:
    • The representation of a real-world set in a digital computer requires an encoding scheme
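As a rough illustration of the implication above, the following Python sketch assigns fixed-width binary code words to the members of a small discrete set. The set of card suits and the name `build_encoding` are assumptions made for the example, not part of the original slides.

```python
# Hypothetical real-world set: the four suits in a deck of cards.
SUITS = ["clubs", "diamonds", "hearts", "spades"]


def build_encoding(items):
    """Assign each item a fixed-width binary code word (as a string of 0s and 1s)."""
    # Smallest number of bits that can distinguish len(items) values (at least 1).
    width = max(1, (len(items) - 1).bit_length())
    return {item: format(index, f"0{width}b") for index, item in enumerate(items)}


suit_codes = build_encoding(SUITS)
for suit, code in suit_codes.items():
    print(f"{suit}: {code}")
```

Four suits fit in 2 bits; a fifth member would force every code word to grow to 3 bits, which is the kind of trade-off an encoding scheme has to manage.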
Some History of Character Encodings
  • The 1800s:
    • Morse Code - A variable-length scheme that uses at most 4 bits (dots and dashes) for Latin letters and 5 bits for Arabic numerals (1836)
    • Baudot Code - A 5-bit code (1870)
  • The 1900s:
    • International Telegraph Alphabet No. 2 (ITA2) - A 5-bit code that included control characters like carriage return, line feed, and bell (1930)
    • American Standard Code for Information Interchange (ASCII) - 7 bits, 95 printable and 33 unprintable characters (1963)
    • Extended Binary Coded Decimal Interchange Code (EBCDIC) - 8 bits (1963)
The Modern Era
  • Some Important Realizations:
    • There are a lot of writing systems in the world
    • Adding more bits to an encoding scheme is wasteful if they aren't going to be needed by a large number of users
  • Use Decomposition:
    • Characters are mapped to code points (i.e., numerical values in the code space)
    • Code points are then encoded in different ways with different numbers of bits per character
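The two steps of the decomposition can be seen with Python's built-in `ord` and `str.encode`. This is only a sketch: the sample character and the selection of encodings are choices made for illustration.

```python
# Step 1: a character is mapped to a code point (an integer in the code space).
text = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(text)   # 233, written U+00E9

# Step 2: the same code point is encoded in different ways,
# using different numbers of bits.
encodings = {
    name: text.encode(name)
    for name in ("latin-1", "utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encodings.items():
    print(f"U+{code_point:04X} as {name}: {data.hex()} ({len(data) * 8} bits)")
```

The code point stays the same in every case; only the number and arrangement of octets changes.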
The Modern Approach
  • The Decomposition:
    • The character repertoire (i.e., the set of supported characters)
    • The coded character set (i.e., the integer values, or code points, for each character)
    • The character encoding form (i.e., how the integer values will be represented in binary form using values with a fixed number of bits)
    • The character encoding scheme (i.e., how the fixed-size integer values, which may not use octets, should be mapped into octets)
  • Examples:
    • The Universal Character Set (ISO/IEC 10646) contains over 120,000 different characters
    • ISO 8859-1 contains code points for 191 characters from the Latin alphabet used in Western Europe
    • UTF-8 maps each code point to a sequence of one to four 8-bit code units and UTF-16 maps each code point to a sequence of one or two 16-bit code units
    • UTF-16 encodings can, in principle, put the octets in either big-endian or little-endian order
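The difference between the encoding form and the encoding scheme can be sketched for UTF-16 in Python. The helper names `utf16_code_units` and `serialize` are hypothetical, and the sketch assumes its input is a single character that is not a lone surrogate.

```python
import struct


def utf16_code_units(ch):
    """Encoding form: map one code point to a list of 16-bit code units."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                       # one code unit is enough
    cp -= 0x10000                         # 20 bits remain, split 10/10
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]   # surrogate pair


def serialize(units, byte_order):
    """Encoding scheme: map 16-bit code units to octets.

    byte_order is '>' for big-endian or '<' for little-endian.
    """
    return struct.pack(f"{byte_order}{len(units)}H", *units)


euro = "\u20ac"                           # EURO SIGN
units = utf16_code_units(euro)
big = serialize(units, ">")               # octets 20 ac
little = serialize(units, "<")            # octets ac 20
print(big.hex(), little.hex())
```

Both results carry the same code unit; a reader has to know (or be told, for instance by a byte order mark) which order was used.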
8-Bit Unicode Transformation Format (UTF-8)
  • Properties:
    • Code points are mapped to between 1 and 4 octets
    • The mapping is backward compatible with 7-bit ASCII
  • The Algorithm:
    • 1-Octet: The high-order bit is 0 and the other 7 bits contain the value, which is in the interval \([0, 127]\)
    • Multiple-Octets: The number of leading 1s (which are always followed by a 0) in the first octet indicates the number of octets used, and each subsequent octet begins with 10 (so that continuation octets can be easily identified as such)
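The algorithm above can be sketched as a small hand-written encoder in Python. The function name `utf8_encode` is an assumption made for illustration, and the sketch does not reject surrogate code points, which a conforming encoder would; in practice `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 octets, following the rules above."""
    if not 0 <= code_point < 0x110000:
        raise ValueError("code point outside the Unicode code space")
    if code_point < 0x80:
        return bytes([code_point])          # 0xxxxxxx
    if code_point < 0x800:
        count, lead = 2, 0xC0               # 110xxxxx
    elif code_point < 0x10000:
        count, lead = 3, 0xE0               # 1110xxxx
    else:
        count, lead = 4, 0xF0               # 11110xxx

    octets = []
    for _ in range(count - 1):
        # Each continuation octet is 10xxxxxx and carries the low 6 bits.
        octets.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The remaining high bits go into the first octet, after the length marker.
    octets.append(lead | code_point)
    return bytes(reversed(octets))


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

Because code points below 128 pass through as a single octet with a leading 0, any 7-bit ASCII text is already valid UTF-8.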
UTF-8 (cont.)

Theoretical Byte Sequences
(Note: Some of the following sequences are not considered well-formed in the specification.)

(Figure: table of UTF-8 byte sequences)
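The gap between theoretical and well-formed sequences can be explored with a short Python sketch. It assumes the theoretical patterns extend to 5 and 6 octets, as early definitions of UTF-8 allowed. The names `sequence_length` and `is_well_formed` are illustrative, and well-formedness is delegated to Python's strict built-in UTF-8 decoder.

```python
def sequence_length(first_octet):
    """Return the octet count implied by a leading octet.

    Returns None if the octet cannot start a sequence (a continuation
    octet, or one with no 0 after at most six leading 1s).
    """
    if first_octet & 0x80 == 0:
        return 1                      # 0xxxxxxx
    ones = 0
    mask = 0x80
    while mask and first_octet & mask:
        ones += 1                     # count the leading 1s
        mask >>= 1
    if ones == 1 or ones > 6:
        return None                   # 10xxxxxx, 0xFE, or 0xFF
    return ones


def is_well_formed(data):
    """Report whether Python's strict decoder accepts the octets as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(sequence_length(0xE2), is_well_formed(b"\xe2\x82\xac"))
print(sequence_length(0xF8), is_well_formed(b"\xf8\x88\x80\x80\x80"))
```

A leading octet such as 0xF8 fits the theoretical 5-octet pattern, yet the decoder rejects it, along with overlong forms such as the two-octet sequence c0 80.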
The Unicode Standard
  • The Unicode Consortium:
    • Got it all started in the 1990s
    • "Synchronizes" with the ISO
  • The Standard:
    • "[T]he official way to implement ISO/IEC 10646."
    • "[T]he Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646."
The Unicode Standard (cont.)

Nerd Humor

(Figure: comic strip)
(Courtesy of xkcd)
There's Always More to Learn