UTF-8

From Rosetta Code

Latest revision as of 09:46, 19 August 2009

UTF-8 (Unicode Transformation Format, 8-bit) is an encoding of Unicode code points as sequences of eight-bit octets. It was originally developed for Bell Labs' Plan 9 operating system by Ken Thompson (co-creator of Unix) and Rob Pike in 1992. It is widely used on Unix-like systems and for XML documents.

Some advantages of UTF-8:

  • byte-order independent
  • subsumes 7-bit ASCII
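  • one can detect the start of characters: continuation bytes always have the form 10xxxxxx, so any other byte begins a character
  • one can scan characters in both directions, forward and backward
  • can encode every Unicode code point (up to U+10FFFF) in at most four bytes; the original design used up to six bytes to cover 31-bit values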
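
The start-detection and bidirectional-scanning properties listed above can be illustrated with a short Python sketch. The helper names (`is_continuation`, `char_starts`, `previous_char_start`) are made up for this example and are not part of any library; the sketch also assumes its input is already well-formed UTF-8 and does no validation.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never begin a character."""
    return (byte & 0xC0) == 0x80


def char_starts(data: bytes) -> list:
    """Byte offsets at which a character begins (forward scan)."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


def previous_char_start(data: bytes, index: int) -> int:
    """Scan backward from `index` to the start of the character before it.

    Assumes 0 < index <= len(data) and that `data` is well-formed UTF-8.
    """
    i = index - 1
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


# 'a' takes 1 byte, 'é' takes 2 bytes, '€' takes 3 bytes.
data = "aé€".encode("utf-8")

forward = char_starts(data)

# Walk backward from the end of the buffer, one character at a time.
backward = []
pos = len(data)
while pos > 0:
    pos = previous_char_start(data, pos)
    backward.append(pos)

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_compatible = "abc".encode("utf-8") == "abc".encode("ascii")
```

Because a continuation byte can never be mistaken for a lead byte, the backward walk visits the same offsets as the forward scan, in reverse order.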

Challenges:

  • characters do not have a fixed size. One needs to walk the entire string to count its characters, and the n-th character cannot be located by simple indexing.
  • biased towards European scripts. Most Japanese characters take three bytes in UTF-8 but only two in UTF-16 or UCS-2.
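
Both challenges can be seen in a small Python sketch. The function names `count_code_points` and `encoded_sizes` are hypothetical helpers written for this illustration, and the input is assumed to be well-formed UTF-8. The UTF-16 size is measured with the `utf-16-le` codec so that no byte-order mark is included in the count.

```python
def count_code_points(data: bytes) -> int:
    """Count characters by walking every byte and skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


japanese = "日本語"   # three characters, each three bytes in UTF-8
english = "hello"     # five ASCII characters, one byte each in UTF-8

japanese_count = count_code_points(japanese.encode("utf-8"))
japanese_sizes = encoded_sizes(japanese)
english_sizes = encoded_sizes(english)
```

For the Japanese sample UTF-8 needs 9 bytes against 6 for UTF-16, while for the ASCII sample the comparison reverses: 5 bytes against 10.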