WTF is UTF? – An Integrated World

UTF stands for Unicode Transformation Format. The most popular version is UTF-8, where the 8 refers to its 8-bit code units: each unit is one byte, i.e. eight ones or zeros (00000000 to 11111111). The secret, though, is that a single character can take anywhere from 1 to 4 bytes. We know that 1 byte = 8 bits, so 4 bytes gives us 32 bits.
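To see the variable width in action, here is a minimal Python sketch using the built-in `str.encode`. The four sample characters and the helper name `utf8_length` are my own picks for illustration, one for each possible width.

```python
# Sample characters chosen for illustration, one per UTF-8 width.
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, grinning face emoji
]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex()}")
```

The first character comes out as a single byte (41), the accented letter as two (c3a9), the euro sign as three (e282ac) and the emoji as four (f09f9880).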

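8 bits can represent 256 values (2^8), numbered 0 to 255. Standard ASCII only uses 128 of them (upper- and lower-case versions of the Latin alphabet, the digits 0 to 9, and a bunch of symbols and control codes), so it fits in 7 bits. The other 128 values are what single-byte "extended ASCII" encodings use for extra characters; UTF-8 instead uses them to build its multi-byte sequences.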

But as we know, the world doesn’t work purely in ABCs, so anything beyond ASCII needs 2 or more bytes. The catch is that 2 bytes do not give us 2^16 (65,536) options. The UTF-8 encoding scheme reserves some bits in every byte as markers (how many depends on the length of the sequence), meaning that for 2 bytes only 11 bits (not 16) are available for the character itself. 2^11 = 2048. However, only 1920 of those are used, because the first 128 values are already covered by the 1-byte form. Those 1920 characters cover the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana.
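The 2-byte layout can be sketched by hand. This is only an illustration, not how any real library does it: the function name `encode_two_byte` is made up, and the bit patterns assumed are the standard UTF-8 markers (`110` at the start of the first byte, `10` at the start of the second), which leave 5 + 6 = 11 bits for the character.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in the 2-byte range U+0080..U+07FF."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    # First byte: marker bits 110, then the top 5 bits of the code point.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker bits 10, then the low 6 bits of the code point.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


# 2^11 bit patterns, minus the 128 values that one byte already covers.
usable = 2**11 - 2**7
print(usable)  # 1920

# U+00E9 (e with acute accent) should match Python's own encoder.
print(encode_two_byte(0x00E9).hex())  # c3a9
print(encode_two_byte(0x00E9) == "\u00e9".encode("utf-8"))  # True
```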

3 bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. 4 bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji. (Source: https://en.wikipedia.org/wiki/UTF-8)
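Putting the four widths together, the byte count depends only on which range the code point falls in. Below is a rough Python sketch of that rule, checked against the built-in encoder; `expected_utf8_length` is a hypothetical helper name and the sample code points are my own choices.

```python
def expected_utf8_length(code_point: int) -> int:
    """Predict the UTF-8 byte count from the code point's range."""
    if code_point < 0x80:       # ASCII
        return 1
    if code_point < 0x800:      # the 1920 two-byte characters
        return 2
    if code_point < 0x10000:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # the other planes, up to U+10FFFF


# Illustrative code points: a common CJK character, an emoji,
# and a less common CJK character from outside the BMP.
for cp in (0x4E2D, 0x1F600, 0x20000):
    actual = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: predicted {expected_utf8_length(cp)}, actual {actual}")
```

The predicted and actual counts agree: 3 bytes for U+4E2D, and 4 bytes each for U+1F600 and U+20000.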
