UTF-16

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units (see also comparison of Unicode encodings for a comparison of UTF-8, UTF-16, and UTF-32). UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that more than 2^16 (65,536) code points were needed.

UTF-16 is used internally by systems such as Windows and Java and by JavaScript, and often for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered 'the mandatory encoding for all ' by WHATWG). UTF-16 is used by under 0.01% of web pages. WHATWG recommends that, for security reasons, browser apps should not use UTF-16.

In the late 1980s, work began on developing a uniform encoding for a 'Universal Character Set' (UCS) that would replace earlier language-specific encodings with one coordinated system. The goal was to include all required characters from most of the world's languages, as well as symbols from technical domains such as science, mathematics, and music. The original idea was to replace the typical 256-character encodings, which required 1 byte per character, with an encoding using 65,536 (2^16) values, which would require 2 bytes per character. Two groups worked on this in parallel: ISO/IEC JTC 1/SC 2 and the Unicode Consortium, the latter representing mostly manufacturers of computing equipment.
The two groups attempted to synchronize their character assignments so that the developing encodings would be mutually compatible. The early 2-byte encoding was usually called 'Unicode', but is now called 'UCS-2'. UCS-2 differs from UTF-16 in being a fixed-length encoding that can encode only characters in the Basic Multilingual Plane (BMP).

Early in this process it became increasingly clear that 2^16 characters would not suffice, and IEEE introduced a larger 31-bit space and an encoding (UCS-4) that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of disk space and memory, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise to resolve this impasse and was introduced in version 2.0 of the Unicode Standard in July 1996; it is fully specified in RFC 2781, published by the IETF in 2000.

In UTF-16, code points greater than or equal to 2^16 are encoded using two 16-bit code units. The standards organizations chose the largest available block of unallocated 16-bit code points to use as these code units. Unlike with UTF-8, they did not provide a means to encode these code points.

UTF-16 is specified in the latest versions of both the international standard ISO/IEC 10646 and the Unicode Standard. 'UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.' There are no plans to extend UTF-16 to support a higher number of code points, or the codes replaced by surrogates, as allocating code points for this would violate the Unicode Stability Policy with respect to general category or surrogate code points.
An example idea would be to allocate another BMP value to prefix a triple of low, low, high surrogates (the order swapped so that it cannot match a surrogate pair in searches), allowing 2^30 more code points to be encoded, but changing the purpose of a code point is disallowed (using no prefix is also not allowed, as two of these characters next to each other would match a surrogate pair).

Both UTF-16 and UCS-2 encode code points in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF as single 16-bit code units that are numerically equal to the corresponding code points. These code points in the Basic Multilingual Plane (BMP) are the only code points that can be represented in UCS-2. As of Unicode 9.0, some modern non-Latin Asian, Middle-Eastern, and African scripts fall outside this range, as do most emoji characters.

Code points from the other planes (called Supplementary Planes) are encoded as two 16-bit code units called a surrogate pair, by the following scheme:
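
The surrogate-pair scheme can be sketched in Python as below. This is a minimal illustration, not a reference implementation: the function names `encode_utf16_code_units` and `decode_surrogate_pair` are hypothetical, and the constants (0x10000 as the offset, 0xD800 and 0xDC00 as the bases of the high and low surrogate ranges, 10 bits carried per surrogate) are taken from the Unicode Standard's definition of UTF-16 rather than from the text above.

```python
from typing import List


def encode_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (one or two 16-bit integers) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are reserved and cannot be encoded themselves.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # BMP code point: a single code unit numerically equal to the code point.
        return [code_point]
    # Supplementary plane: subtract 0x10000 to get a 20-bit value,
    # then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate, 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate, 0xDC00..0xDFFF
    return [high, low]


def decode_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into a supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `encode_utf16_code_units(0x1F600)` returns `[0xD83D, 0xDE00]`, while a BMP code point such as U+20AC is returned unchanged as the single code unit `[0x20AC]`.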
