Unicode and UTF-8

A computer works directly with bits, so we need an encoding scheme to convert between bit strings and human-readable characters. In ASCII, each character is encoded in 7 bits (usually stored in one byte), giving a set of 128 characters. That is far too few to represent the characters of every language, hence the need for Unicode and encoding schemes such as UTF-8.

Unicode:

  • It is the current standard for mapping characters and symbols to code points (think of these as integers). It defines 1,114,112 code points.
  • Unicode is not an encoding. It does not tell you how to go from a code point to a corresponding bit string of 0s and 1s. To encode code points as bit strings, we use UTF (see the sketch below).
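
For example, in Python (used here purely for illustration), ord gives the code point of a character and chr goes the other way:

    # Characters map to Unicode code points (integers) and back.
    print(ord("A"))     # 65      (U+0041)
    print(ord("€"))     # 8364    (U+20AC)
    print(ord("😀"))    # 128512  (U+1F600)
    print(chr(128512))  # 😀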

There are a few ways to encode Unicode code points into bit strings: UTF-8, UTF-16, UTF-32.

  • UTF stands for “Unicode Transformation Format”.
  • UTF-32 always uses 4 bytes per code point, which wastes a lot of space.
  • UTF-8 and UTF-16 are variable-length encodings. UTF-8 encodes each Unicode code point using 1, 2, 3, or 4 bytes; UTF-16 uses 2 or 4 bytes (see the comparison below).
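
To make the difference concrete, here is one character encoded three ways in Python (the "-le" codec names pick an explicit byte order so Python does not prepend a byte-order mark):

    # Byte length of the euro sign (U+20AC) under each encoding.
    ch = "€"
    print(len(ch.encode("utf-8")))      # 3 bytes
    print(len(ch.encode("utf-16-le")))  # 2 bytes
    print(len(ch.encode("utf-32-le")))  # 4 bytes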

UTF-8:

  • UTF-8 is the most popular encoding for Unicode thanks to its space efficiency. It represents each Unicode character using 1, 2, 3, or 4 bytes.
  • If a character can be represented using a single byte (because its Unicode code point is below 128), UTF-8 encodes it with a single byte.
  • The first 128 Unicode code points correspond exactly to ASCII, and UTF-8 encodes each of them in one byte. Non-ASCII characters are represented with 2 or more bytes. By using fewer bytes for the most common characters (i.e. ASCII characters), UTF-8 saves memory (see the example below).
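
For instance, again in Python, an ASCII character encodes to a single UTF-8 byte while non-ASCII characters need two to four:

    # ASCII characters take 1 byte in UTF-8; other characters take 2, 3, or 4.
    for ch in ["A", "é", "€", "😀"]:
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), encoded)
    # A 1 b'A'
    # é 2 b'\xc3\xa9'
    # € 3 b'\xe2\x82\xac'
    # 😀 4 b'\xf0\x9f\x98\x80'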

UTF-16 is only more space-efficient than UTF-8 for some non-English text. For characters further back in the Unicode table, specifically code points from U+0800 to U+FFFF (which include most Chinese, Japanese, and Korean characters), UTF-8 needs 3 bytes per character while UTF-16 needs only 2; for code points above U+FFFF, both encodings use 4 bytes.
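
A quick comparison in Python shows the trade-off (the sample strings are arbitrary):

    # English text is smaller in UTF-8; Japanese text is smaller in UTF-16.
    english = "hello"
    japanese = "日本語"
    print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))    # 5 10
    print(len(japanese.encode("utf-8")), len(japanese.encode("utf-16-le")))  # 9 6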

Written on December 12, 2022