UTF-8
Latest revision as of 09:46, 19 August 2009
Unicode Transformation Format, 8-bit representation, or UTF-8, is a variable-length encoding of Unicode code points into sequences of octets (8-bit bytes). It was originally developed for Bell Labs' Plan 9 operating system by Ken Thompson (co-creator of Unix) and Rob Pike in 1992. It is widely used on Unix-like systems and for XML documents.
Some advantages of UTF-8:
- byte-order independent
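- subsumes 7-bit ASCII: every ASCII string is also valid UTF-8, byte for byte
- one can detect the start of a character from any position: continuation bytes always have the bit pattern 10xxxxxx, and the first byte of a character never does
- one can scan characters in both directions, forward and backward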
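- the original design can encode code points up to 31 bits long, using up to six bytes; since RFC 3629 (2003), UTF-8 is restricted to the Unicode range U+0000 to U+10FFFF, which needs at most four bytes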
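
The start-of-character and backward-scanning properties can be illustrated with a minimal Python sketch. The helper names `is_continuation`, `char_start`, and `scan_backward` are made up for this example, and the code assumes its input is already valid UTF-8; it is not a validating decoder.

```python
def is_continuation(byte: int) -> bool:
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    # Step left over continuation bytes until a lead byte (or index 0) is reached.
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def scan_backward(data: bytes):
    """Yield the characters of valid UTF-8 data from last to first."""
    end = len(data)
    while end > 0:
        start = char_start(data, end - 1)
        yield data[start:end].decode("utf-8")
        end = start


text = "aé日"
data = text.encode("utf-8")        # 1 + 2 + 3 = 6 bytes

print(char_start(data, 5))         # 3: byte 5 is the last byte of "日"
print(list(scan_backward(data)))   # ['日', 'é', 'a']

# ASCII text has the same bytes in both encodings.
print("abc".encode("ascii") == "abc".encode("utf-8"))  # True
```

Because a lead byte can never be mistaken for a continuation byte, a reader that starts in the middle of a character needs to step back at most a few bytes to resynchronise.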
Challenges:
- characters do not have a fixed size, so the byte length of a string does not give its length in characters; one needs to walk the entire string to count them.
- biased towards ASCII and European scripts. Most Japanese characters take three bytes in UTF-8 but only two in UTF-16 or UCS-2, so Japanese text is stored more compactly in those encodings.
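
Both challenges can be seen in a short Python sketch. The function names `count_code_points` and `encoded_sizes` are hypothetical, the input is assumed to be valid UTF-8, and the UTF-16 size is measured without a byte-order mark.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by walking every byte."""
    # Each character contributes exactly one byte that is not of the form 10xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


japanese = "日本語"
english = "hello"

utf8 = japanese.encode("utf-8")
print(len(utf8), count_code_points(utf8))  # 9 3: nine bytes, three characters

print(encoded_sizes(japanese))  # (9, 6): UTF-16 is more compact here
print(encoded_sizes(english))   # (5, 10): UTF-8 is more compact here
```

The count is linear in the length of the string, which is the cost the first point describes; the size comparison shows why the trade-off depends on the script being stored.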