Online UTF-8 encoding and decoding tool
Introduction to UTF-8
UTF-8 is a variable-length character encoding for Unicode (Unicode is also known as Universal Code).
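Under the current specification (RFC 3629), UTF-8 encodes each Unicode character in 1 to 4 bytes; the original specification allowed sequences of up to 6 bytes.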
UTF-8 encoding rules
If a character is encoded in a single byte, the highest bit of that byte is 0;
If a character is encoded in multiple bytes, the number of consecutive 1 bits at the top of the first byte gives the total number of bytes in the sequence, and every following byte starts with 10.
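The two rules above can be sketched as a small lookup on the first byte. This is a minimal illustration, not a validator: the function name `utf8_sequence_length` is made up for this example, and it checks only the leading bit pattern, so it does not reject lead bytes such as 0xC0, 0xC1, or 0xF5 to 0xF7, which match a pattern but cannot occur in valid UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the
    byte does not match any 1- to 4-byte lead pattern."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx: single byte
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx: two bytes
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx: three bytes
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110xxx: four bytes
        return 4
    # Either a continuation byte (10xxxxxx) or a 5/6-byte lead pattern.
    return 0


# Sample text with one character of each length (1, 2, 3 and 4 bytes).
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
lengths = [utf8_sequence_length(b) for b in data]
print(lengths)  # [1, 2, 0, 3, 0, 0, 4, 0, 0, 0]
```

Each nonzero entry marks the start of a character, and the zeros are the continuation bytes that start with 10.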
Unicode/UCS-4 range (hex) | Number of bits | UTF-8 byte pattern (binary) | Number of bytes | Remark |
0000~007F | 0~7 | 0XXXXXXX | 1 | |
0080~07FF | 8~11 | 110XXXXX 10XXXXXX | 2 | |
0800~FFFF | 12~16 | 1110XXXX 10XXXXXX 10XXXXXX | 3 | Basic definition range: 0~FFFF |
10000~1FFFFF | 17~21 | 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX | 4 | Unicode 6.1 definition range: 0~10FFFF |
200000~3FFFFFF | 22~26 | 111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX | 5 | This range is outside Unicode and belongs to UCS-4. The early UTF-8 specification allowed sequences of up to 6 bytes, covering up to 31 bits (the original limit of the Universal Character Set). In November 2003, RFC 3629 re-specified UTF-8 so that it can only encode the range originally defined by Unicode, U+0000 to U+10FFFF. Under that specification, these byte values do not appear in a legal UTF-8 sequence |
4000000~7FFFFFFF | 27~31 | 1111110X 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX | 6 | Same as the 5-byte row |
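As a rough sketch of how the 1- to 4-byte rows of the table translate into masking and shifting, the function below encodes a single code point by hand. The name `encode_code_point` is hypothetical, and the sketch assumes the RFC 3629 limits: it rejects values above 10FFFF and the surrogate range D800 to DFFF rather than producing 5- or 6-byte sequences.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (RFC 3629 range only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp <= 0x7F:        # 0XXXXXXX
        return bytes([cp])
    if cp <= 0x7FF:       # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


euro = encode_code_point(0x20AC)
print(euro.hex())  # e282ac
```

The result can be compared against Python's built-in codec, for example `chr(0x20AC).encode("utf-8")`, to check the hand-written version.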
UTF-8 advantages
UTF-8 can be read and written quickly using bit masks and shift operations.
Comparing UTF-8 strings byte by byte with strcmp() gives the same ordering as comparing the equivalent wide-character (UCS-4) strings with wcscmp(), which makes sorting easier.
The bytes FF and FE never appear in UTF-8, so they can be used to indicate UTF-16 or UTF-32 text (see BOM).
UTF-8 is byte-order independent. Its byte sequence is the same on all systems, so it does not really need a BOM.
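A few of these properties can be spot-checked with a short script. This is only an illustration on a hand-picked sample list (an assumption of this example, not an exhaustive proof): it compares code-point order with byte order, looks for the bytes FE and FF, and shows that the BOM character U+FEFF has two different UTF-16 byte orders.

```python
# Sample strings chosen to span 1- to 4-byte encodings.
samples = ["z", "a", "\u00e9", "\u07ff", "\u20ac", "\uffff",
           "\U00010000", "\U0010ffff"]

# Python compares str values by code point and bytes values by
# unsigned byte value, so these two sorts mirror wcscmp() and strcmp().
by_code_point = sorted(samples)
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
same_order = by_code_point == by_utf8_bytes

# Collect every byte value that occurs in the encoded samples.
seen = {b for s in samples for b in s.encode("utf-8")}
has_fe_or_ff = 0xFE in seen or 0xFF in seen

# UTF-16 has two byte orders, so the BOM encodes differently in each.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")

print(same_order, has_fe_or_ff, bom_le.hex(), bom_be.hex())
# True False fffe feff
```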
UTF-8 disadvantages
You cannot tell the number of bytes in a UTF-8 text from the number of Unicode characters, because UTF-8 is a variable-length encoding.
Characters that take only 1 byte in the extended ASCII character set take 2 bytes in UTF-8. ISO Latin-1 is a subset of Unicode, but not a subset of UTF-8.
The 8-bit bytes of UTF-8 may be filtered by email gateways, because Internet messages were originally designed for 7-bit ASCII. Hence the UTF-7 encoding.
UTF-8 uses byte values of the form 100xxxxx in its representation more than 50% of the time, and existing implementations such as ISO 2022, 4873, 6429, and 8859 systems mistake them for C1 control codes. Hence the UTF-7.5 encoding.
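The length mismatch and the C1 overlap can be seen on a small example. The sketch below uses a hypothetical helper, `count_code_points`, which assumes its input is already valid UTF-8 and counts characters by skipping continuation bytes; the sample text is also just an assumption of this example.

```python
def count_code_points(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes
    (those of the form 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# "cafe" with an accented e, a space, a euro sign and the digit 5.
text = "caf\u00e9 \u20ac5"
utf8 = text.encode("utf-8")

# The accented word alone: 1 byte per character in Latin-1,
# but the accented letter needs 2 bytes in UTF-8.
latin1_len = len("caf\u00e9".encode("latin-1"))
utf8_len = len("caf\u00e9".encode("utf-8"))

# Bytes that fall in the C1 control range 0x80..0x9F (100xxxxx).
c1_like = [hex(b) for b in utf8 if 0x80 <= b <= 0x9F]

print(len(text), len(utf8), count_code_points(utf8))  # 7 10 7
print(latin1_len, utf8_len)                           # 4 5
print(c1_like)                                        # ['0x82']
```

Recovering the character count requires a pass over the bytes, which is the cost the first point above describes.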