Online UTF-8 Encode and Decode

Online UTF-8 encoding and decoding tool

Introduction to UTF-8

UTF-8 is a variable-length character encoding for Unicode, which is also known as the Universal Code.

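UTF-8 as currently specified (RFC 3629) encodes each Unicode character in 1 to 4 bytes; the original specification allowed sequences of up to 6 bytes.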

UTF-8 encoding rules

If a character is encoded in a single byte, the highest bit of that byte is 0.

If a character is encoded in several bytes, the number of consecutive 1 bits at the start of the first byte, counted from the highest bit, gives the total number of bytes in the sequence, and each of the remaining bytes starts with the bits 10.
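
As an illustration of these two rules, here is a minimal Python sketch that reads the length of a sequence from its first byte. The function names utf8_sequence_length and is_continuation are hypothetical helpers chosen for this example, and the sketch only covers the 1 to 4 byte forms.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte (1 to 4 bytes only).

    Hypothetical helper: it looks at the bit pattern only and does not
    reject lead bytes that are otherwise disallowed (for example 0xC0).
    """
    if first_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: single byte
    if first_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx: two leading 1 bits
    if first_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: three leading 1 bits
    if first_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx: four leading 1 bits
    raise ValueError(f"0x{first_byte:02X} is not a valid lead byte")


def is_continuation(byte: int) -> bool:
    """Return True if the byte has the 10xxxxxx form of a trailing byte."""
    return byte & 0b1100_0000 == 0b1000_0000


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), utf8_sequence_length(encoded[0]))
```

A decoder can use the length to know how many continuation bytes to expect before it starts on the next character.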

The UTF-8 conversion table is as follows:
Unicode/UCS-4 range (hex)	Number of bits	UTF-8 byte pattern	Number of bytes	Remark
0000~007F	0~7	0XXXXXXX	1
0080~07FF	8~11	110XXXXX 10XXXXXX	2
0800~FFFF	12~16	1110XXXX 10XXXXXX 10XXXXXX	3	Basic definition range: 0~FFFF
10000~1FFFFF	17~21	11110XXX 10XXXXXX 10XXXXXX 10XXXXXX	4	Unicode 6.1 definition range: 0~10FFFF
200000~3FFFFFF	22~26	111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX	5	Outside the Unicode range (see note)
4000000~7FFFFFFF	27~31	1111110X 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX	6	Outside the Unicode range (see note)

Note: The 5-byte and 6-byte ranges are not part of Unicode; they belong to UCS-4. The early UTF-8 specification allowed sequences of up to 6 bytes, which could cover up to 31 bits (the original limit of the Universal Character Set). In November 2003, however, UTF-8 was re-specified by RFC 3629 to use only the range defined by Unicode, U+0000 to U+10FFFF. Under that specification, the byte values that start 5-byte and 6-byte sequences never appear in a legal UTF-8 sequence.
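
The rows of the table that fall inside the Unicode range can be written out as a small Python encoder. This is a sketch for illustration rather than a replacement for the built-in codec; the function name encode_code_point is hypothetical, and the check that rejects surrogates and values above U+10FFFF follows the RFC 3629 limits.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes).

    Hypothetical helper that follows the table row by row.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp <= 0x7F:
        # 0XXXXXXX
        return bytes([cp])
    if cp <= 0x7FF:
        # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the sketch with Python's built-in UTF-8 codec.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ->", encode_code_point(ord(ch)).hex(" "))
```

Each branch fills the X positions of one table row: the lead byte takes the highest bits of the code point, and every continuation byte takes the next 6 bits.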

UTF-8 advantages

UTF-8 can be read and written quickly using bit masks and shift operations.

Byte-wise comparison with strcmp() orders UTF-8 strings the same way wcscmp() orders the corresponding wide-character strings, which makes sorting easier.

Bytes FF and FE never appear in UTF-8 text, so they can be used to indicate UTF-16 or UTF-32 text (see BOM).

UTF-8 is byte-order independent. Its byte sequence is the same on all systems, so it doesn't really need a BOM.
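
The following Python sketch illustrates three of these points: decoding with masks and shifts, the matching sort order, and the absence of bytes FE and FF. The function name decode_first and the sample word list are assumptions made for this example, and the decoder assumes well-formed input limited to 1 to 4 byte sequences.

```python
from typing import Tuple


def decode_first(data: bytes) -> Tuple[int, int]:
    """Decode the first code point of well-formed UTF-8 data.

    Hypothetical helper: returns (code_point, bytes_consumed) and does
    no validation beyond recognising the lead byte pattern.
    """
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1
    if b0 >> 5 == 0b110:
        length, cp = 2, b0 & 0x1F  # keep the 5 payload bits
    elif b0 >> 4 == 0b1110:
        length, cp = 3, b0 & 0x0F  # keep the 4 payload bits
    elif b0 >> 3 == 0b11110:
        length, cp = 4, b0 & 0x07  # keep the 3 payload bits
    else:
        raise ValueError(f"0x{b0:02X} is not a valid lead byte")
    for b in data[1:length]:
        cp = (cp << 6) | (b & 0x3F)  # append 6 bits per continuation byte
    return cp, length


# Sample data chosen for illustration only.
words = ["zebra", "\u00e9clair", "\u20acuro", "\U0001F600", "apple"]

by_code_point = sorted(words)  # Python compares str values by code point
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # byte-wise order

has_fe_or_ff = any(b in (0xFE, 0xFF) for w in words for b in w.encode("utf-8"))

print(by_code_point == by_bytes, has_fe_or_ff)
```

The byte-wise sort stands in for strcmp() here, and the code point sort stands in for a wide-character comparison.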

UTF-8 Disadvantages

You can't tell the number of bytes in a UTF-8 text from the number of Unicode characters, because UTF-8 is a variable-length encoding.

It takes 2 bytes to encode those characters that occupy only 1 byte in the extended ASCII character sets.

ISO Latin-1 is a subset of Unicode, but its byte encoding is not a subset of UTF-8.

The 8-bit bytes of UTF-8 may be filtered by email gateways, because Internet messages were originally designed for 7-bit ASCII. Hence the UTF-7 encoding.

UTF-8 uses byte values of the form 100xxxxx in its representation more than 50% of the time, and existing implementations such as ISO 2022, 4873, 6429, and 8859 systems mistake them for C1 control codes. Hence the UTF-7.5 encoding.
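
The first three points can be checked with a short Python sketch. The sample string is an assumption chosen for illustration; it contains two characters that take 1 byte in ISO Latin-1 and 2 bytes in UTF-8.

```python
# Sample text (assumed for this example): 10 characters, two of them non-ASCII.
text = "na\u00efve caf\u00e9"

char_count = len(text)
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The character count matches the Latin-1 byte count but not the UTF-8 one.
print(char_count, len(latin1_bytes), len(utf8_bytes))

# Latin-1 bytes above 0x7F are not valid UTF-8 on their own.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_is_valid_utf8)
```

Text that stays within 7-bit ASCII does not show this difference, since its bytes are identical in both encodings.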

Tool Introduction