
Punycode

Punycode is a representation of Unicode using the limited ASCII character subset permitted in Internet host names. With Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and the hyphen, known as the Letter-Digit-Hyphen (LDH) subset. For example, München (the German name for Munich) is encoded as Mnchen-3ya.

The encoding uses a number system with little-endian ordering, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks the most significant digit, and hence the end of the number. To increase efficiency, the threshold value depends on the position of the digit in the number and also on previous insertions. The weights of the digits vary correspondingly.

While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name labels, the DNS standards recommend the LDH subset of ASCII conventionally used for host names, and require that comparisons between DNS domain names be case-insensitive. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments (RFC) 3492.

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called Bootstring, which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set."
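The München example above can be reproduced with the "punycode" codec that ships in Python's standard library. The snippet below is only a quick illustration; the variable names label and roundtrip are chosen here for the example.

```python
# Encode the example from the text with Python's built-in "punycode" codec.
# The codec returns bytes because the result is pure ASCII.
label = "München".encode("punycode")
print(label)        # b'Mnchen-3ya'

# Decoding the label restores the original Unicode string.
roundtrip = label.decode("punycode")
print(roundtrip)    # München
```

The basic characters (M, n, c, h, e, n) appear unchanged before the hyphen, and the digits after it describe where the ü is inserted.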
Punycode defines parameters for the general Bootstring algorithm to match the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding using the string 'bücher' (Bücher is German for books), which is translated into the label 'bcher-kva'.

First, all basic (ASCII) characters in the string are copied from input to output in order, skipping over any other characters. For example, 'bücher' is copied to 'bcher'. If any characters were copied, an ASCII hyphen is appended to the output (e.g., 'bücher' → 'bcher-'). Because the hyphen is itself a basic character, it may already appear in the copied portion before this added delimiter. The added hyphen nevertheless causes no ambiguity, since no later part of the encoding process can introduce another hyphen: the last ASCII hyphen in the output, if any, marks the end of the basic characters.

The next part of the encoding process first requires an understanding of the decoder, which is a finite-state machine with two state variables, i and n. The variable i is an index into the string, ranging from zero (a potential insertion at the start) to the current length of the extended string (a potential insertion at the end); n is the code point that may be inserted. i starts at zero, and n starts at 128 (the first non-ASCII code point).

The state progression is monotonic. A state change either increments i or, if i is at its maximum, resets i to zero and increments n by 1; subsequent state changes then go back to incrementing i. At each state change, the code point denoted by n is either inserted or not inserted.
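The steps above can be sketched in Python as follows. This is a minimal illustration that follows the pseudocode and parameter values given in RFC 3492 (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72), not a hardened implementation: it omits overflow checks and input validation, and the function names are chosen here for the example. The number of state changes skipped between insertions is called delta, as in the RFC.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35

def threshold(k, bias):
    # Threshold for the digit at position k (k = 36, 72, ...), clamped to [TMIN, TMAX].
    return min(TMAX, max(TMIN, k - bias))

def adapt(delta, numpoints, first_time):
    # Bias adaptation from RFC 3492: tunes thresholds after each insertion.
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

def encode_varint(q, bias):
    # Little-endian variable-length integer: the final digit is below its threshold.
    out, k = [], BASE
    while q >= (t := threshold(k, bias)):
        out.append(DIGITS[t + (q - t) % (BASE - t)])
        q = (q - t) // (BASE - t)
        k += BASE
    return "".join(out) + DIGITS[q]

def decode_varint(digits, pos, bias):
    # Read one integer starting at digits[pos]; return (value, next position).
    value, weight, k = 0, 1, BASE
    while True:
        d = DIGITS.index(digits[pos].lower())
        pos += 1
        value += d * weight
        t = threshold(k, bias)
        if d < t:                 # digit below threshold: most significant digit
            return value, pos
        weight *= BASE - t
        k += BASE

def punycode_encode(text):
    basic = [c for c in text if ord(c) < 128]
    output = "".join(basic)
    h = b = len(basic)            # h counts code points handled so far
    if b > 0:
        output += "-"             # delimiter after the copied basic characters
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while h < len(text):
        m = min(ord(c) for c in text if ord(c) >= n)  # next code point to insert
        delta += (m - n) * (h + 1)                    # skip full passes over i
        n = m
        for c in text:
            if ord(c) < n:
                delta += 1        # a state change with no insertion
            elif ord(c) == n:
                output += encode_varint(delta, bias)
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return output

def punycode_decode(label):
    basic, _, ext = label.rpartition("-")  # last hyphen ends the basic part
    out = list(basic)
    n, i, bias, pos, first = INITIAL_N, 0, INITIAL_BIAS, 0, True
    while pos < len(ext):
        delta, pos = decode_varint(ext, pos, bias)
        i += delta
        bias = adapt(delta, len(out) + 1, first)
        first = False
        n += i // (len(out) + 1)  # each wrap of i past its maximum increments n
        i %= len(out) + 1
        out.insert(i, chr(n))
        i += 1
    return "".join(out)

print(punycode_encode("bücher"))     # bcher-kva
print(punycode_decode("bcher-kva"))  # bücher
```

For 'bücher', the decoder must pass through 745 state changes before inserting ü (code point 252) at index 1, and 745 is what the digits 'kva' encode under the initial bias.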
