UTF-8 encode and decode: Difference between revisions

← Older edit

UTF-8 encode and decode (view source)

Revision as of 15:30, 5 May 2024

88,921 bytes added , 13 days ago

→‎{{header|langur}}

Langurmonkey

885

edits

Revision as of 11:22, 13 May 2020 (view source) Noel (talk \| contribs) No edit summary ← Older edit		Latest revision as of 15:30, 5 May 2024 (view source) Langurmonkey (talk \| contribs) (→‎{{header\|langur}})
(32 intermediate revisions by 22 users not shown)
Line 3: As described in [[UTF-8]] and in [[wp:UTF-8\|Wikipedia]], UTF-8 is a popular encoding of (multi-byte) [[Unicode]] code-points into eight-bit octets. The goal of this task is to write a encoder that takes a unicode code-point (an integer representing a unicode character) and returns a sequence of ~~1-4~~1–4 bytes representing that character in the UTF-8 encoding. Then you have to write the corresponding decoder that takes a sequence of ~~1-4~~1–4 UTF-8 encoded bytes and return the corresponding unicode character. Demonstrate the functionality of your encoder and decoder on the following five characters: Line 20: Provided below is a reference implementation in Common Lisp. =={{header\|11l}}== {{trans\|Python}} <syntaxhighlight lang="11l">F unicode_code(ch) R ‘U+’hex(ch.code).zfill(4) F utf8hex(ch) R ch.encode(‘utf-8’).map(c -> hex(c)).join(‘ ’) print(‘#<11 #<15 #<15’.format(‘Character’, ‘Unicode’, ‘UTF-8 encoding (hex)’)) V chars = [‘A’, ‘ö’, ‘Ж’, ‘€’] L(char) chars print(‘#<11 #<15 #<15’.format(char, unicode_code(char), utf8hex(char)))</syntaxhighlight> {{out}} <pre> Character Unicode UTF-8 encoding (hex) A U+0041 41 ö U+00F6 C3 B6 Ж U+0416 D0 96 € U+20AC E2 82 AC </pre> =={{header\|8th}}== <~~lang~~syntaxhighlight lang="8th"> hex \ so bytes print nicely Line 58 ⟶ 81: bye </syntaxhighlight> ~~</lang>~~ Output:<pre> A 41 Line 71 ⟶ 94: [e2,82,ac] € [f0,9d,84,9e] 𝄞 </pre> =={{header\|Action!}}== <syntaxhighlight lang="action!">TYPE Unicode=[BYTE bc1,bc2,bc3] BYTE ARRAY hex=['0 '1 '2 '3 '4 '5 '6 '7 '8 '9 'A 'B 'C 'D 'E 'F] BYTE FUNC DecodeHex(CHAR c) BYTE i FOR i=0 TO 15 DO IF c=hex(i) THEN RETURN (i) FI OD Break() RETURN (255) BYTE FUNC DecodeHex2(CHAR c1,c2) BYTE h1,h2,res h1=DecodeHex(c1) h2=DecodeHex(c2) res=(h1 LSH 4)%h2 RETURN (res) PROC ValUnicode(CHAR ARRAY s Unicode POINTER u) BYTE i,len len=s(0) IF len<6 AND len>8 THEN Break() FI IF s(1)#'U OR s(2)#'+ THEN Break() FI IF len=6 THEN u.bc1=0 ELSEIF len=7 THEN u.bc1=DecodeHex(s(3)) IF u.bc1>$10 THEN Break() FI ELSE u.bc1=DecodeHex2(s(3),s(4)) FI u.bc2=DecodeHex2(s(len-3),s(len-2)) u.bc3=DecodeHex2(s(len-1),s(len)) RETURN PROC PrintHex2(BYTE x) Put(hex(x RSH 4)) Put(hex(x&$0F)) RETURN PROC StrUnicode(Unicode POINTER u) Print("U+") IF u.bc1>$F THEN PrintHex2(u.bc1) ELSEIF u.bc1>0 THEN Put(hex(u.bc1)) FI PrintHex2(u.bc2) PrintHex2(u.bc3) RETURN PROC PrintArray(BYTE ARRAY a BYTE len) BYTE i Put('[) FOR i=0 TO len-1 DO IF i>0 THEN Put(32 )FI PrintHex2(a(i)) OD Put(']) RETURN PROC Encode(Unicode POINTER u BYTE ARRAY buf BYTE POINTER len) IF u.bc1>0 THEN len^=4 buf(0)=$F0 % (u.bc1 RSH 2) buf(1)=$80 % ((u.bc1 & $03) LSH 4) % (u.bc2 RSH 4) buf(2)=$80 % ((u.bc2 & $0F) LSH 2) % (u.bc3 RSH 6) buf(3)=$80 % (u.bc3 & $3F) ELSEIF u.bc2>=$08 THEN len^=3 buf(0)=$E0 % (u.bc2 RSH 4) buf(1)=$80 % ((u.bc2 & $0F) LSH 2) % (u.bc3 RSH 6) buf(2)=$80 % (u.bc3 & $3F) ELSEIF u.bc2>0 OR u.bc3>=$80 THEN len^=2 buf(0)=$C0 % (u.bc2 LSH 2) % (u.bc3 RSH 6) buf(1)=$80 % (u.bc3 & $3F) ELSE len^=1 buf(0)=u.bc3 FI RETURN PROC Decode(BYTE ARRAY buf BYTE len Unicode POINTER u) IF len=1 THEN u.bc1=0 u.bc2=0 u.bc3=buf(0) ELSEIF len=2 THEN u.bc1=0 u.bc2=(buf(0) & $1F) RSH 2 u.bc3=(buf(0) LSH 6) % (buf(1) & $3F) ELSEIF len=3 THEN u.bc1=0 u.bc2=(buf(0) LSH 4) % ((buf(1) & $3F) RSH 2) u.bc3=(buf(1) LSH 6) % (buf(2) & $3F) ELSEIF len=4 THEN u.bc1=((buf(0) & $07) LSH 2) % ((buf(1) & $3F) RSH 4) u.bc2=(buf(1) LSH 4) % ((buf(2) & $3F) RSH 2) u.bc3=((buf(2) & $03) LSH 6) % (buf(3) & $3F) ELSE Break() FI RETURN PROC Main() DEFINE PTR="CARD" DEFINE COUNT="11" PTR ARRAY case(COUNT) Unicode uni,res BYTE ARRAY buf(4) BYTE i,len case(0)="U+0041" case(1)="U+00F6" case(2)="U+0416" case(3)="U+20AC" case(4)="U+1D11E" case(5)="U+0024" case(6)="U+00A2" case(7)="U+0939" case(8)="U+20AC" case(9)="U+D55C" case(10)="U+10348" FOR i=0 TO COUNT-1 DO IF i=0 THEN PrintE("From RosettaCode:") ELSEIF i=5 THEN PutE() PrintE("From Wikipedia:") FI ValUnicode(case(i),uni) Encode(uni,buf,@len) Decode(buf,len,res) StrUnicode(uni) Print(" -> ") PrintArray(buf,len) Print(" -> ") StrUnicode(res) PutE() OD RETURN</syntaxhighlight> {{out}} [https://gitlab.com/amarok8bit/action-rosetta-code/-/raw/master/images/UTF-8_encode_and_decode.png Screenshot from Atari 8-bit computer] <pre> From RosettaCode: U+0041 -> [41] -> U+0041 U+00F6 -> [C3 B6] -> U+00F6 U+0416 -> [D0 96] -> U+0416 U+20AC -> [E2 82 AC] -> U+20AC U+1D11E -> [F0 9D 84 9E] -> U+1D11E From Wikipedia: U+0024 -> [24] -> U+0024 U+00A2 -> [C2 A2] -> U+00A2 U+0939 -> [E0 A4 B9] -> U+0939 U+20AC -> [E2 82 AC] -> U+20AC U+D55C -> [ED 95 9C] -> U+D55C U+10348 -> [F0 90 8D 88] -> U+10348 </pre> Line 76 ⟶ 269: {{works with\|Ada\|Ada\|2012}} <~~lang~~syntaxhighlight ~~Ada~~lang="ada">with Ada.Strings.Fixed; use Ada.Strings.Fixed; with Ada.Strings.UTF_Encoding.Wide_Wide_Strings; with Ada.Integer_Text_IO; Line 124 ⟶ 317: end; end loop; end UTF8_Encode_And_Decode;</~~lang~~syntaxhighlight> {{out}} <pre>Character Unicode UTF-8 encoding (hex) Line 134 ⟶ 327: 𝄞 U+1D11E F0 9D 84 9E </pre> =={{header\|ATS}}== The following code is long but consists largely of proofs. UTF-8 is a complicated encoding, but also a well defined one that lends itself to compile-time verification methods. Despite this complexity, what actually gets generated is highly optimizable C code. Note that the following demonstration requires no ATS-specific library support whatsoever. <syntaxhighlight lang="ats">(* UTF-8 encoding and decoding in ATS2. This is adapted from library code I wrote long ago and actually handles the original 1-to-6-byte encoding, as well. Valid Unicode requires only 1 to 4 bytes. What is remarkable about the following rather lengthy code is its use of proofs of what is and what is not valid UTF-8. Much of what follows is proofs rather than executable code. It seemed simpler to "mostly copy" my old code, than to try to pare that code down to a minimum that still demonstrated some of what makes ATS different from all but a few (if any) other languages. ) #define ATS_EXTERN_PREFIX "utf8_encoding_" #include "share/atspre_define.hats" #include "share/atspre_staload.hats" ( A variant of ‘andalso’, with dependent types. ) fn {} andalso1 {b1, b2 : bool} (b1 : bool b1, b2 : bool b2) :<> [b3 : bool \| b3 == (b1 && b2)] bool b3 = if b1 then b2 else false infixl (&&) &&& macdef &&& = andalso1 (###################### C CODE ####################################) %{^ _Static_assert (4 <= sizeof (int), "sizeof(int) must equal at least 4"); #define utf8_encoding_2_entries(v) \ (v), (v) #define utf8_encoding_4_entries(v) \ utf8_encoding_2_entries (v), \ utf8_encoding_2_entries (v) #define utf8_encoding_8_entries(v) \ utf8_encoding_4_entries (v), \ utf8_encoding_4_entries (v) #define utf8_encoding_16_entries(v) \ utf8_encoding_8_entries (v), \ utf8_encoding_8_entries (v) #define utf8_encoding_32_entries(v) \ utf8_encoding_16_entries (v), \ utf8_encoding_16_entries (v) #define utf8_encoding_64_entries(v) \ utf8_encoding_32_entries (v), \ utf8_encoding_32_entries (v) #define utf8_encoding_128_entries(v) \ utf8_encoding_64_entries (v), \ utf8_encoding_64_entries (v) static const atstype_int8 utf8_encoding_extended_utf8_lengths__[256] = { utf8_encoding_128_entries (1), utf8_encoding_64_entries (-1), utf8_encoding_32_entries (2), utf8_encoding_16_entries (3), utf8_encoding_8_entries (4), utf8_encoding_4_entries (5), utf8_encoding_2_entries (6), utf8_encoding_2_entries (-1) }; static const atstype_int8 utf8_encoding_utf8_lengths__[256] = { utf8_encoding_128_entries (1), utf8_encoding_64_entries (-1), utf8_encoding_32_entries (2), utf8_encoding_16_entries (3), utf8_encoding_8_entries (4), utf8_encoding_4_entries (-1), utf8_encoding_2_entries (-1), utf8_encoding_2_entries (-1) }; #define utf8_encoding_extended_utf8_character_length(c) \ (utf8_encoding_extended_utf8_lengths__[(atstype_uint8) (c)]) #define utf8_encoding_utf8_character_length(c) \ (utf8_encoding_utf8_lengths__[(atstype_uint8) (c)]) %} (###################### INTERFACE #################################) stadef is_valid_unicode_code_point (u : int) : bool = (0x0 <= u && u < 0xD800) \|\| (0xE000 <= u && u <= 0x10FFFF) extern fun {} is_valid_unicode_code_point : {u : int} int u -<> [b : bool \| b == is_valid_unicode_code_point u] bool b (------------------------------------------------------------------) stadef is_extended_utf8_1byte_first_byte (c0 : int) : bool = 0x00 <= c0 && c0 <= 0x7F stadef is_extended_utf8_2byte_first_byte (c0 : int) : bool = 0xC0 <= c0 && c0 <= 0xDF stadef is_extended_utf8_3byte_first_byte (c0 : int) : bool = 0xE0 <= c0 && c0 <= 0xEF stadef is_extended_utf8_4byte_first_byte (c0 : int) : bool = 0xF0 <= c0 && c0 <= 0xF7 stadef is_extended_utf8_5byte_first_byte (c0 : int) : bool = 0xF8 <= c0 && c0 <= 0xFB stadef is_extended_utf8_6byte_first_byte (c0 : int) : bool = 0xFC <= c0 && c0 <= 0xFD stadef extended_utf8_character_length (c0 : int) : int = ifint (is_extended_utf8_1byte_first_byte c0, 1, ifint (is_extended_utf8_2byte_first_byte c0, 2, ifint (is_extended_utf8_3byte_first_byte c0, 3, ifint (is_extended_utf8_4byte_first_byte c0, 4, ifint (is_extended_utf8_5byte_first_byte c0, 5, ifint (is_extended_utf8_6byte_first_byte c0, 6, ~1)))))) stadef extended_utf8_char_length_relation (c0 : int, n : int) : bool = (n == 1 && is_extended_utf8_1byte_first_byte c0) \|\| (n == 2 && is_extended_utf8_2byte_first_byte c0) \|\| (n == 3 && is_extended_utf8_3byte_first_byte c0) \|\| (n == 4 && is_extended_utf8_4byte_first_byte c0) \|\| (n == 5 && is_extended_utf8_5byte_first_byte c0) \|\| (n == 6 && is_extended_utf8_6byte_first_byte c0) extern prfun extended_utf8_char_length_relation_to_length : {c0 : int} {n : int \| extended_utf8_char_length_relation (c0, n) \|\| (n == ~1 && ((c0 < 0x00) \|\| (0x7F < c0 && c0 < 0xC0) \|\| (0xFD < c0)))} () -<prf> [n == extended_utf8_character_length c0] void extern prfun extended_utf8_char_length_to_length_relation : {c0 : int} {n : int \| n == extended_utf8_character_length c0} () -<prf> [extended_utf8_char_length_relation (c0, n) \|\| (n == ~1 && ((c0 < 0x00) \|\| (0x7F < c0 && c0 < 0xC0) \|\| (0xFD < c0)))] void // extended_utf8_character_length: // // Return value = 1, 2, 3, or 4 indicates that there may be a valid // UTF-8 character, of the given length. It may also be a `valid' // overlong sequence. Otherwise there definitely is not a valid // character of any sort starting with the given byte. Return // value = 5 or 6 indicates the possible start of an `extended // UTF-8' character of the given length, including code points up // to 0xffffffff. Return value = ~1 means the byte is not the // start of a valid sequence. extern fun extended_utf8_character_length : {c0 : int \| 0x00 <= c0; c0 <= 0xFF} int c0 -<> [n : int \| n == extended_utf8_character_length c0] int n = "mac#%" stadef utf8_character_length (c0 : int) : int = ifint (is_extended_utf8_1byte_first_byte c0, 1, ifint (is_extended_utf8_2byte_first_byte c0, 2, ifint (is_extended_utf8_3byte_first_byte c0, 3, ifint (is_extended_utf8_4byte_first_byte c0, 4, ~1)))) stadef utf8_char_length_relation (c0 : int, n : int) : bool = (n == 1 && is_extended_utf8_1byte_first_byte c0) \|\| (n == 2 && is_extended_utf8_2byte_first_byte c0) \|\| (n == 3 && is_extended_utf8_3byte_first_byte c0) \|\| (n == 4 && is_extended_utf8_4byte_first_byte c0) extern prfun utf8_char_length_relation_to_length : {c0 : int} {n : int \| utf8_char_length_relation (c0, n) \|\| (n == ~1 && ((c0 < 0x00) \|\| (0x7F < c0 && c0 < 0xC0) \|\| (0xF7 < c0)))} () -<prf> [n == utf8_character_length c0] void extern prfun utf8_char_length_to_length_relation : {c0 : int} {n : int \| n == utf8_character_length c0} () -<prf> [utf8_char_length_relation (c0, n) \|\| (n == ~1 && ((c0 < 0x00) \|\| (0x7F < c0 && c0 < 0xC0) \|\| (0xF7 < c0)))] void extern fun utf8_character_length : {c0 : int \| 0x00 <= c0; c0 <= 0xFF} int c0 -<> [n : int \| n == utf8_character_length c0] int n = "mac#%" (------------------------------------------------------------------) stadef is_valid_utf8_continuation_byte (c : int) : bool = 0x80 <= c && c <= 0xBF stadef is_invalid_utf8_continuation_byte (c : int) : bool = c < 0x80 \|\| 0xBF < c extern fun {} is_valid_utf8_continuation_byte : {c : int} int c -<> [b : bool \| b == is_valid_utf8_continuation_byte c] bool b (------------------------------------------------------------------) stadef extended_utf8_char_1byte_decoding (c0 : int) : int = c0 stadef extended_utf8_char_2byte_decoding (c0 : int, c1 : int) : int = 64 (c0 - 0xC0) + (c1 - 0x80) stadef extended_utf8_char_3byte_decoding (c0 : int, c1 : int, c2 : int) : int = 64 * 64 * (c0 - 0xE0) + 64 * (c1 - 0x80) + (c2 - 0x80) stadef extended_utf8_char_4byte_decoding (c0 : int, c1 : int, c2 : int, c3 : int) : int = 64 * 64 * 64 * (c0 - 0xF0) + 64 * 64 * (c1 - 0x80) + 64 * (c2 - 0x80) + (c3 - 0x80) stadef extended_utf8_char_5byte_decoding (c0 : int, c1 : int, c2 : int, c3 : int, c4 : int) : int = 64 * 64 * 64 * 64 * (c0 - 0xF8) + 64 * 64 * 64 * (c1 - 0x80) + 64 * 64 * (c2 - 0x80) + 64 * (c3 - 0x80) + (c4 - 0x80) stadef extended_utf8_char_6byte_decoding (c0 : int, c1 : int, c2 : int, c3 : int, c4 : int, c5 : int) : int = 64 * 64 * 64 * 64 * 64 * (c0 - 0xFC) + 64 * 64 * 64 * 64 * (c1 - 0x80) + 64 * 64 * 64 * (c2 - 0x80) + 64 * 64 * (c3 - 0x80) + 64 * (c4 - 0x80) + (c5 - 0x80) dataprop EXTENDED_UTF8_CHAR (length : int, u : int, c0 : int, c1 : int, c2 : int, c3 : int, c4 : int, c5 : int) = \| {u, c0 : int \| 0 <= u; u <= 0x7F; is_extended_utf8_1byte_first_byte c0; u == extended_utf8_char_1byte_decoding (c0)} EXTENDED_UTF8_CHAR_1byte (1, u, c0, ~1, ~1, ~1, ~1, ~1) // \| {u, c0, c1 : int \| 0 <= u; u <= 0x7FF; is_extended_utf8_2byte_first_byte c0; is_valid_utf8_continuation_byte c1; u == extended_utf8_char_2byte_decoding (c0, c1)} EXTENDED_UTF8_CHAR_2byte (2, u, c0, c1, ~1, ~1, ~1, ~1) // \| {u, c0, c1, c2 : int \| 0 <= u; u <= 0xFFFF; is_extended_utf8_3byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; u == extended_utf8_char_3byte_decoding (c0, c1, c2)} EXTENDED_UTF8_CHAR_3byte (3, u, c0, c1, c2, ~1, ~1, ~1) // \| {u, c0, c1, c2, c3 : int \| 0 <= u; u <= 0x1FFFFF; is_extended_utf8_4byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)} EXTENDED_UTF8_CHAR_4byte (4, u, c0, c1, c2, c3, ~1, ~1) // \| {u, c0, c1, c2, c3, c4 : int \| 0 <= u; u <= 0x3FFFFFF; is_extended_utf8_5byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; is_valid_utf8_continuation_byte c4; u == extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4)} EXTENDED_UTF8_CHAR_5byte (5, u, c0, c1, c2, c3, c4, ~1) // \| {u, c0, c1, c2, c3, c4, c5 : int \| 0 <= u; u <= 0x7FFFFFFF; is_extended_utf8_6byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; is_valid_utf8_continuation_byte c4; is_valid_utf8_continuation_byte c5; u == extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5)} EXTENDED_UTF8_CHAR_6byte (6, u, c0, c1, c2, c3, c4, c5) extern prfun decode_extended_utf8_istot : {n : int \| 1 <= n; n <= 6} {c0, c1, c2, c3, c4, c5 : int \| extended_utf8_char_length_relation (c0, n); n <= 1 \|\| is_valid_utf8_continuation_byte c1; n <= 2 \|\| is_valid_utf8_continuation_byte c2; n <= 3 \|\| is_valid_utf8_continuation_byte c3; n <= 4 \|\| is_valid_utf8_continuation_byte c4; n <= 5 \|\| is_valid_utf8_continuation_byte c5; 1 < n \|\| c1 == ~1; 2 < n \|\| c2 == ~1; 3 < n \|\| c3 == ~1; 4 < n \|\| c4 == ~1; 5 < n \|\| c5 == ~1} () -<prf> [u : int] EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) extern prfun decode_extended_utf8_isfun : {na : int \| 1 <= na; na <= 6} {ua : int} {c0a, c1a, c2a, c3a, c4a, c5a : int} {nb : int \| nb == na} {ub : int} {c0b, c1b, c2b, c3b, c4b, c5b : int \| c0b == c0a; c1b == c1a; c2b == c2a; c3b == c3a; c4b == c4a; c5b == c5a} (EXTENDED_UTF8_CHAR (na, ua, c0a, c1a, c2a, c3a, c4a, c5a), EXTENDED_UTF8_CHAR (nb, ub, c0b, c1b, c2b, c3b, c4b, c5b)) -<prf> [ua == ub] void extern prfun lemma_extended_utf8_char_length : {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int} EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) -<prf> [n == extended_utf8_character_length c0] void extern fun {} decode_extended_utf8_1byte : {c0 : int \| 0x00 <= c0; c0 <= 0x7F} int c0 -<> [u : int] (EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1) \| int u) extern fun {} decode_extended_utf8_2byte : {c0, c1 : int \| 0xC0 <= c0; c0 <= 0xDF; is_valid_utf8_continuation_byte c1} (int c0, int c1) -<> [u : int] (EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1) \| int u) extern fun {} decode_extended_utf8_3byte : {c0, c1, c2 : int \| 0xE0 <= c0; c0 <= 0xEF; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2} (int c0, int c1, int c2) -<> [u : int] (EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1) \| int u) extern fun {} decode_extended_utf8_4byte : {c0, c1, c2, c3 : int \| 0xF0 <= c0; c0 <= 0xF7; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3} (int c0, int c1, int c2, int c3) -<> [u : int] (EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1) \| int u) extern fun {} decode_extended_utf8_5byte : {c0, c1, c2, c3, c4 : int \| 0xF8 <= c0; c0 <= 0xFB; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; is_valid_utf8_continuation_byte c4} (int c0, int c1, int c2, int c3, int c4) -<> [u : int] (EXTENDED_UTF8_CHAR (5, u, c0, c1, c2, c3, c4, ~1) \| int u) extern fun {} decode_extended_utf8_6byte : {c0, c1, c2, c3, c4, c5 : int \| 0xFC <= c0; c0 <= 0xFD; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; is_valid_utf8_continuation_byte c4; is_valid_utf8_continuation_byte c5} (int c0, int c1, int c2, int c3, int c4, int c5) -<> [u : int] (EXTENDED_UTF8_CHAR (6, u, c0, c1, c2, c3, c4, c5) \| int u) (------------------------------------------------------------------) stadef extended_utf8_shortest_length (u : int) = ifint (u < 0, ~1, ifint (u <= 0x7F, 1, ifint (u <= 0x7FF, 2, ifint (u <= 0xFFFF, 3, ifint (u <= 0x1FFFFF, 4, ifint (u <= 0x3FFFFFF, 5, ifint (u <= 0x7FFFFFFF, 6, ~1))))))) dataprop EXTENDED_UTF8_SHORTEST (length : int, u : int, c0 : int, c1 : int, c2 : int, c3 : int, c4 : int, c5 : int) = \| {u, c0 : int \| 0 <= u; u <= 0x7F; c0 == u} EXTENDED_UTF8_SHORTEST_1byte (1, u, c0, ~1, ~1, ~1, ~1, ~1) of EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1) // \| {u, c0, c1 : int \| 0x7F < u; u <= 0x7FF; c0 == 0xC0 + (u \ndiv 64); c1 == 0x80 + (u \nmod 64)} EXTENDED_UTF8_SHORTEST_2byte (2, u, c0, c1, ~1, ~1, ~1, ~1) of EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1) // \| {u, c0, c1, c2 : int \| 0x7FF < u; u <= 0xFFFF; c0 == 0xE0 + (u \ndiv (64 * 64)); c1 == 0x80 + ((u \ndiv 64) \nmod 64); c2 == 0x80 + (u \nmod 64)} EXTENDED_UTF8_SHORTEST_3byte (3, u, c0, c1, c2, ~1, ~1, ~1) of EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1) // \| {u, c0, c1, c2, c3 : int \| 0xFFFF < u; u <= 0x1FFFFF; c0 == 0xF0 + (u \ndiv (64 * 64 * 64)); c1 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64); c2 == 0x80 + ((u \ndiv 64) \nmod 64); c3 == 0x80 + (u \nmod 64)} EXTENDED_UTF8_SHORTEST_4byte (4, u, c0, c1, c2, c3, ~1, ~1) of EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1) // \| {u, c0, c1, c2, c3, c4 : int \| 0x1FFFFF < u; u <= 0x3FFFFFF; c0 == 0xF8 + (u \ndiv (64 * 64 * 64 * 64)); c1 == 0x80 + ((u \ndiv (64 * 64 * 64)) \nmod 64); c2 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64); c3 == 0x80 + ((u \ndiv 64) \nmod 64); c4 == 0x80 + (u \nmod 64)} EXTENDED_UTF8_SHORTEST_5byte (5, u, c0, c1, c2, c3, c4, ~1) of EXTENDED_UTF8_CHAR (5, u, c0, c1, c2, c3, c4, ~1) // \| {u, c0, c1, c2, c3, c4, c5 : int \| 0x3FFFFFF < u; u <= 0x7FFFFFFF; c0 == 0xFC + (u \ndiv (64 * 64 * 64 * 64 * 64)); c1 == 0x80 + ((u \ndiv (64 * 64 * 64 * 64)) \nmod 64); c2 == 0x80 + ((u \ndiv (64 * 64 * 64)) \nmod 64); c3 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64); c4 == 0x80 + ((u \ndiv 64) \nmod 64); c5 == 0x80 + (u \nmod 64)} EXTENDED_UTF8_SHORTEST_6byte (6, u, c0, c1, c2, c3, c4, c5) of EXTENDED_UTF8_CHAR (6, u, c0, c1, c2, c3, c4, c5) extern prfun extended_utf8_shortest_is_char : {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int} EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) -<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) extern prfun lemma_extended_utf8_shortest_length : {n : int} {u : int} {c0, c1, c2, c3, c4, c5 : int} EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) -<prf> [n == extended_utf8_shortest_length u] void extern fun {} encode_extended_utf8_character : {u : nat \| u <= 0x7FFFFFFF} int u -<> [n : int] [c0, c1, c2, c3, c4, c5 : int] @(EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, c4, c5) \| int n, int c0, int c1, int c2, int c3, int c4, int c5) (------------------------------------------------------------------) // // A valid UTF-8 character is one that encodes a valid Unicode // code point and is not overlong. // dataprop UTF8_CHAR_INVALID_CASES (c0 : int, c1 : int, c2 : int, c3 : int) = // The cases are not mutually exclusive. \| {c0, c1, c2, c3 : int \| extended_utf8_character_length c0 == ~1} // We might not be using this case, presently (has that changed?), // but it is included for completeness. UTF8_CHAR_INVALID_bad_c0 (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| 2 <= extended_utf8_character_length c0; ~(is_valid_utf8_continuation_byte c1)} UTF8_CHAR_INVALID_bad_c1 (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| 3 <= extended_utf8_character_length c0; ~(is_valid_utf8_continuation_byte c2)} UTF8_CHAR_INVALID_bad_c2 (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| extended_utf8_character_length c0 == 4; ~(is_valid_utf8_continuation_byte c3)} UTF8_CHAR_INVALID_bad_c3 (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| extended_utf8_character_length c0 == 2; extended_utf8_char_2byte_decoding (c0, c1) <= 0x7F} UTF8_CHAR_INVALID_invalid_2byte (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| extended_utf8_character_length c0 == 3 && (extended_utf8_char_3byte_decoding (c0, c1, c2) <= 0x7FF \|\| ~(is_valid_unicode_code_point (extended_utf8_char_3byte_decoding (c0, c1, c2))))} UTF8_CHAR_INVALID_invalid_3byte (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| extended_utf8_character_length c0 == 4 && (extended_utf8_char_4byte_decoding (c0, c1, c2, c3) <= 0xFFFF \|\| ~(is_valid_unicode_code_point (extended_utf8_char_4byte_decoding (c0, c1, c2, c3))))} UTF8_CHAR_INVALID_invalid_4byte (c0, c1, c2, c3) dataprop UTF8_CHAR_VALID_BYTES (c0 : int, c1 : int, c2 : int, c3 : int) = \| {c0 : int \| 0 <= c0 && c0 <= 0x7F} UTF8_CHAR_VALID_BYTES_1byte (c0, ~1, ~1, ~1) \| {c0, c1 : int \| 0xC2 <= c0 && c0 <= 0xDF; is_valid_utf8_continuation_byte c1} UTF8_CHAR_VALID_BYTES_2byte (c0, c1, ~1, ~1) \| {c0, c1, c2 : int \| (0xE1 <= c0 && c0 <= 0xEC) \|\| c0 == 0xEE \|\| c0 == 0xEF \|\| (c0 == 0xE0 && 0xA0 <= c1) \|\| (c0 == 0xED && c1 < 0xA0); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2} UTF8_CHAR_VALID_BYTES_3byte (c0, c1, c2, ~1) \| {c0, c1, c2, c3 : int \| (0xF1 <= c0 && c0 <= 0xF3) \|\| (c0 == 0xF0 && 0x90 <= c1) \|\| (c0 == 0xF4 && c1 < 0x90); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3} UTF8_CHAR_VALID_BYTES_4byte (c0, c1, c2, c3) dataprop UTF8_CHAR_INVALID_BYTES (c0 : int, c1 : int, c2 : int, c3 : int) = \| // This should never occur. {c0 : int \| is_extended_utf8_1byte_first_byte c0} UTF8_CHAR_INVALID_BYTES_1byte (c0, ~1, ~1, ~1) \| {c0, c1 : int \| is_extended_utf8_2byte_first_byte c0; c0 == 0xC0 \|\| c0 == 0xC1 \|\| is_invalid_utf8_continuation_byte c1} UTF8_CHAR_INVALID_BYTES_2byte (c0, c1, ~1, ~1) \| {c0, c1, c2 : int \| is_extended_utf8_3byte_first_byte c0; (c0 == 0xE0 && c1 < 0xA0) \|\| (c0 == 0xED && 0xA0 <= c1) \|\| is_invalid_utf8_continuation_byte c1 \|\| is_invalid_utf8_continuation_byte c2} UTF8_CHAR_INVALID_BYTES_3byte (c0, c1, c2, ~1) \| {c0, c1, c2, c3 : int \| is_extended_utf8_4byte_first_byte c0; 0xF4 < c0 \|\| (c0 == 0xF0 && c1 < 0x90) \|\| (c0 == 0xF4 && 0x90 <= c1) \|\| is_invalid_utf8_continuation_byte c1 \|\| is_invalid_utf8_continuation_byte c2 \|\| is_invalid_utf8_continuation_byte c3} UTF8_CHAR_INVALID_BYTES_4byte (c0, c1, c2, c3) \| {c0, c1, c2, c3 : int \| ~(is_extended_utf8_1byte_first_byte c0); ~(is_extended_utf8_2byte_first_byte c0); ~(is_extended_utf8_3byte_first_byte c0); ~(is_extended_utf8_4byte_first_byte c0)} UTF8_CHAR_INVALID_BYTES_bad_c0 (c0, c1, c2, c3) dataprop UTF8_CHAR_VALIDITY (n : int, u : int, c0 : int, c1 : int, c2 : int, c3 : int, b : bool) = \| {n : int} {u : int \| is_valid_unicode_code_point u} {c0, c1, c2, c3 : int} UTF8_CHAR_valid (n, u, c0, c1, c2, c3, true) of (EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, ~1, ~1), UTF8_CHAR_VALID_BYTES (c0, c1, c2, c3)) \| {n : int} {u : int} {c0, c1, c2, c3 : int} UTF8_CHAR_invalid (n, u, c0, c1, c2, c3, false) of (UTF8_CHAR_INVALID_CASES (c0, c1, c2, c3), UTF8_CHAR_INVALID_BYTES (c0, c1, c2, c3)) propdef UTF8_CHAR_VALID (n : int, u : int, c0 : int, c1 : int, c2 : int, c3 : int) = UTF8_CHAR_VALIDITY (n, u, c0, c1, c2, c3, true) propdef UTF8_CHAR_INVALID (c0 : int, c1 : int, c2 : int, c3 : int) = [n : int] [u : int] UTF8_CHAR_VALIDITY (n, u, c0, c1, c2, c3, false) extern prfun utf8_char_valid_implies_shortest : {n : int} {u : int} {c0, c1, c2, c3 : int} UTF8_CHAR_VALID (n, u, c0, c1, c2, c3) -<prf> EXTENDED_UTF8_SHORTEST (n, u, c0, c1, c2, c3, ~1, ~1) extern prfun lemma_valid_utf8_character_1byte : {u, c0 : int \| is_extended_utf8_1byte_first_byte c0; u == extended_utf8_char_1byte_decoding c0; 0 <= u; u <= 0x7F} () -<prf> UTF8_CHAR_VALID (1, u, c0, ~1, ~1, ~1) extern prfun lemma_valid_utf8_character_2byte : {u, c0, c1 : int \| is_extended_utf8_2byte_first_byte c0; is_valid_utf8_continuation_byte c1; u == extended_utf8_char_2byte_decoding (c0, c1); 0x7F < u; u <= 0x7FF} () -<prf> UTF8_CHAR_VALID (2, u, c0, c1, ~1, ~1) extern prfun lemma_valid_utf8_character_3byte : {u, c0, c1, c2 : int \| is_extended_utf8_3byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; u == extended_utf8_char_3byte_decoding (c0, c1, c2); 0x7FF < u; u <= 0xFFFF; // // Exclude the UTF-16 surrogate halves. // ~(0xD800 <= u && u < 0xE000)} () -<prf> UTF8_CHAR_VALID (3, u, c0, c1, c2, ~1) extern prfun lemma_valid_utf8_character_4byte : {u, c0, c1, c2, c3 : int \| is_extended_utf8_4byte_first_byte c0; is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3; u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3); 0xFFFF < u; u <= 0x10FFFF} () -<prf> UTF8_CHAR_VALID (4, u, c0, c1, c2, c3) extern prfun utf8_character_1byte_valid_bytes : // This does not really do anything, but is included // for completeness. {u, c0 : int \| is_extended_utf8_1byte_first_byte c0} UTF8_CHAR_VALID (1, u, c0, ~1, ~1, ~1) -<prf> void extern prfun utf8_character_2byte_valid_bytes : {u, c0, c1 : int \| is_extended_utf8_2byte_first_byte c0} UTF8_CHAR_VALID (2, u, c0, c1, ~1, ~1) -<prf> [0xC2 <= c0; c0 <= 0xDF; is_valid_utf8_continuation_byte c1] void extern prfun utf8_character_3byte_valid_bytes : {u, c0, c1, c2 : int \| is_extended_utf8_3byte_first_byte c0} UTF8_CHAR_VALID (3, u, c0, c1, c2, ~1) -<prf> [(0xE1 <= c0 && c0 <= 0xEC) \|\| c0 == 0xEE \|\| c0 == 0xEF \|\| (c0 == 0xE0 && 0xA0 <= c1) \|\| (c0 == 0xED && c1 < 0xA0); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2] void extern prfun utf8_character_4byte_valid_bytes : {u, c0, c1, c2, c3 : int \| is_extended_utf8_4byte_first_byte c0} UTF8_CHAR_VALID (4, u, c0, c1, c2, c3) -<prf> [(0xF1 <= c0 && c0 <= 0xF3) \|\| (c0 == 0xF0 && 0x90 <= c1) \|\| (c0 == 0xF4 && c1 < 0x90); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3] void extern prfun utf8_character_1byte_invalid_bytes : // This does not really do anything, but is included // for completeness. {c0 : int \| is_extended_utf8_1byte_first_byte c0} UTF8_CHAR_INVALID (c0, ~1, ~1, ~1) -<prf> void extern prfun utf8_character_2byte_invalid_bytes : {c0, c1 : int \| is_extended_utf8_2byte_first_byte c0} UTF8_CHAR_INVALID (c0, c1, ~1, ~1) -<prf> [c0 == 0xC0 \|\| c0 == 0xc1 \|\| is_invalid_utf8_continuation_byte c1] void extern prfun utf8_character_3byte_invalid_bytes : {c0, c1, c2 : int \| is_extended_utf8_3byte_first_byte c0} UTF8_CHAR_INVALID (c0, c1, c2, ~1) -<prf> [(c0 == 0xE0 && c1 < 0xA0) \|\| (c0 == 0xED && 0xA0 <= c1) \|\| is_invalid_utf8_continuation_byte c1 \|\| is_invalid_utf8_continuation_byte c2] void extern prfun utf8_character_4byte_invalid_bytes : {c0, c1, c2, c3 : int \| is_extended_utf8_4byte_first_byte c0} UTF8_CHAR_INVALID (c0, c1, c2, c3) -<prf> [0xF4 < c0 \|\| (c0 == 0xF0 && c1 < 0x90) \|\| (c0 == 0xF4 && 0x90 <= c1) \|\| is_invalid_utf8_continuation_byte c1 \|\| is_invalid_utf8_continuation_byte c2 \|\| is_invalid_utf8_continuation_byte c3] void extern fun {} is_valid_utf8_character_1byte : {c0 : int \| is_extended_utf8_1byte_first_byte c0} int c0 -<> [b : bool \| b == true] [u : int] (UTF8_CHAR_VALIDITY (1, u, c0, ~1, ~1, ~1, b) \| bool b) extern fun {} is_valid_utf8_character_2byte : {c0, c1 : int \| is_extended_utf8_2byte_first_byte c0} (int c0, int c1) -<> [b : bool] [u : int] (UTF8_CHAR_VALIDITY (2, u, c0, c1, ~1, ~1, b) \| bool b) extern fun {} is_valid_utf8_character_3byte : {c0, c1, c2 : int \| is_extended_utf8_3byte_first_byte c0} (int c0, int c1, int c2) -<> [b : bool] [u : int] (UTF8_CHAR_VALIDITY (3, u, c0, c1, c2, ~1, b) \| bool b) extern fun {} is_valid_utf8_character_4byte : {c0, c1, c2, c3 : int \| is_extended_utf8_4byte_first_byte c0} (int c0, int c1, int c2, int c3) -<> [b : bool] [u : int] (UTF8_CHAR_VALIDITY (4, u, c0, c1, c2, c3, b) \| bool b) extern fun {} decode_utf8_1byte : {c0 : int \| is_extended_utf8_1byte_first_byte c0} int c0 -<> [u : int \| 0 <= u; u <= 0x7F] int u extern fun {} decode_utf8_2byte : {c0, c1 : int \| 0xC2 <= c0; c0 <= 0xDF; is_valid_utf8_continuation_byte c1} (int c0, int c1) -<> [u : int \| 0x7F < u; u <= 0x7FF] int u extern fun {} decode_utf8_3byte : {c0, c1, c2 : int \| (0xE1 <= c0 && c0 <= 0xEC) \|\| c0 == 0xEE \|\| c0 == 0xEF \|\| (c0 == 0xE0 && 0xA0 <= c1) \|\| (c0 == 0xED && c1 < 0xA0); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2} (int c0, int c1, int c2) -<> [u : int \| 0x7FF < u; u <= 0xFFFF; u < 0xD800 \|\| 0xE000 <= u] int u extern fun {} decode_utf8_4byte : {c0, c1, c2, c3 : int \| (0xF1 <= c0 && c0 <= 0xF3) \|\| (c0 == 0xF0 && 0x90 <= c1) \|\| (c0 == 0xF4 && c1 < 0x90); is_valid_utf8_continuation_byte c1; is_valid_utf8_continuation_byte c2; is_valid_utf8_continuation_byte c3} (int c0, int c1, int c2, int c3) -<> [u : int \| 0xFFFF < u; u <= 0x10FFFF] int u extern fun {} encode_utf8_character : {u : int \| is_valid_unicode_code_point u} int u -<> [n : int] [c0, c1, c2, c3 : int \| extended_utf8_char_length_relation (c0, n)] @(UTF8_CHAR_VALID (n, u, c0, c1, c2, c3) \| int n, int c0, int c1, int c2, int c3) (------------------------------------------------------------------) #define utf8_decode_next_error_char ~1 (* Returns @(utf8_decode_next_error_char, ..) on error. ) extern fun utf8_array_decode_next {utf8len : int \| 0 < utf8len} {n_utf8arr : int \| utf8len <= n_utf8arr} {i : int \| i < utf8len} (utf8len : size_t utf8len, utf8arr : &(@[char][n_utf8arr]), i : size_t i) :<> [c : int \| is_valid_unicode_code_point c \|\| c == utf8_decode_next_error_char] [i_next : int \| i_next == i + 1 \|\| i_next == i + 2 \|\| i_next == i + 3 \|\| i_next == i + 4; i_next <= utf8len] @(int c, size_t i_next) extern fun utf8_string_decode_next {utf8len : int \| 0 < utf8len} {n_utf8str : int \| utf8len <= n_utf8str} {i : int \| i < utf8len} (utf8len : size_t utf8len, utf8str : string n_utf8str, i : size_t i) :<> [c : int \| is_valid_unicode_code_point c \|\| c == utf8_decode_next_error_char] [i_next : int \| i_next == i + 1 \|\| i_next == i + 2 \|\| i_next == i + 3 \|\| i_next == i + 4; i_next <= utf8len] @(int c, size_t i_next) overload utf8_decode_next with utf8_array_decode_next overload utf8_decode_next with utf8_string_decode_next (###################### IMPLEMENTATION ############################) // Integer division by 64 is equivalent to shifting right // by 6 bits. extern prfun _shift_right_twelve : {x, y, z : nat \| y == (x \ndiv 64); z == (y \ndiv 64)} (int x, int y, int z) -<prf> [z == (x \ndiv (64 64))] void primplement _shift_right_twelve (x, y, z) = () extern prfun _shift_right_eighteen : {x, y, z, u : nat \| y == (x \ndiv 64); z == (y \ndiv 64); u == (z \ndiv 64)} (int x, int y, int z, int u) -<prf> [u == (x \ndiv (64 * 64 * 64))] void primplement _shift_right_eighteen (x, y, z, u) = () extern prfun _shift_right_twenty_four : {x, y, z, u, v : nat \| y == (x \ndiv 64); z == (y \ndiv 64); u == (z \ndiv 64); v == (u \ndiv 64)} (int x, int y, int z, int u, int v) -<prf> [v == (x \ndiv (64 * 64 * 64 * 64))] void primplement _shift_right_twenty_four (x, y, z, u, v) = () (------------------------------------------------------------------) implement {} is_valid_unicode_code_point u = ((0x0 <= u) * (u < 0xD800)) + ((0xE000 <= u) * (u <= 0x10FFFF)) (------------------------------------------------------------------) primplement extended_utf8_char_length_relation_to_length () = () primplement extended_utf8_char_length_to_length_relation () = () primplement utf8_char_length_relation_to_length () = () primplement utf8_char_length_to_length_relation () = () (------------------------------------------------------------------) implement {} is_valid_utf8_continuation_byte c = (0x80 <= c) * (c <= 0xBF) (------------------------------------------------------------------) primplement decode_extended_utf8_istot {n} {c0, c1, c2, c3, c4, c5} () = sif n == 1 then let prfn make_pf {u : int \| u == extended_utf8_char_1byte_decoding (c0)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_1byte () stadef u = extended_utf8_char_1byte_decoding (c0) in make_pf {u} () end else sif n == 2 then let prfn make_pf {u : int \| u == extended_utf8_char_2byte_decoding (c0, c1)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_2byte () stadef u = extended_utf8_char_2byte_decoding (c0, c1) in make_pf {u} () end else sif n == 3 then let prfn make_pf {u : int \| u == extended_utf8_char_3byte_decoding (c0, c1, c2)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_3byte () stadef u = extended_utf8_char_3byte_decoding (c0, c1, c2) in make_pf {u} () end else sif n == 4 then let prfn make_pf {u : int \| u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_4byte () stadef u = extended_utf8_char_4byte_decoding (c0, c1, c2, c3) in make_pf {u} () end else sif n == 5 then let prfn make_pf {u : int \| u == extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_5byte () stadef u = extended_utf8_char_5byte_decoding (c0, c1, c2, c3, c4) in make_pf {u} () end else let prfn make_pf {u : int \| u == extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5)} () :<prf> EXTENDED_UTF8_CHAR (n, u, c0, c1, c2, c3, c4, c5) = EXTENDED_UTF8_CHAR_6byte () stadef u = extended_utf8_char_6byte_decoding (c0, c1, c2, c3, c4, c5) in make_pf {u} () end primplement decode_extended_utf8_isfun {na} (pf_a, pf_b) = sif na == 1 then let prval EXTENDED_UTF8_CHAR_1byte () = pf_a prval EXTENDED_UTF8_CHAR_1byte () = pf_b in end else sif na == 2 then let prval EXTENDED_UTF8_CHAR_2byte () = pf_a prval EXTENDED_UTF8_CHAR_2byte () = pf_b in end else sif na == 3 then let prval EXTENDED_UTF8_CHAR_3byte () = pf_a prval EXTENDED_UTF8_CHAR_3byte () = pf_b in end else sif na == 4 then let prval EXTENDED_UTF8_CHAR_4byte () = pf_a prval EXTENDED_UTF8_CHAR_4byte () = pf_b in end else sif na == 5 then let prval EXTENDED_UTF8_CHAR_5byte () = pf_a prval EXTENDED_UTF8_CHAR_5byte () = pf_b in end else let prval EXTENDED_UTF8_CHAR_6byte () = pf_a prval EXTENDED_UTF8_CHAR_6byte () = pf_b in end primplement lemma_extended_utf8_char_length pf_char = case+ pf_char of \| EXTENDED_UTF8_CHAR_1byte () => () \| EXTENDED_UTF8_CHAR_2byte () => () \| EXTENDED_UTF8_CHAR_3byte () => () \| EXTENDED_UTF8_CHAR_4byte () => () \| EXTENDED_UTF8_CHAR_5byte () => () \| EXTENDED_UTF8_CHAR_6byte () => () implement {} decode_extended_utf8_1byte c0 = (EXTENDED_UTF8_CHAR_1byte () \| c0) implement {} decode_extended_utf8_2byte (c0, c1) = let val u0 = c0 - 0xC0 val u1 = c1 - 0x80 in (EXTENDED_UTF8_CHAR_2byte () \| 64 * u0 + u1) end implement {} decode_extended_utf8_3byte (c0, c1, c2) = let val u0 = c0 - 0xE0 val u1 = c1 - 0x80 val u2 = c2 - 0x80 in (EXTENDED_UTF8_CHAR_3byte () \| 64 * (64 * u0 + u1) + u2) end implement {} decode_extended_utf8_4byte (c0, c1, c2, c3) = let val u0 = c0 - 0xF0 val u1 = c1 - 0x80 val u2 = c2 - 0x80 val u3 = c3 - 0x80 in (EXTENDED_UTF8_CHAR_4byte () \| 64 * (64 * (64 * u0 + u1) + u2) + u3) end implement {} decode_extended_utf8_5byte (c0, c1, c2, c3, c4) = let val u0 = c0 - 0xF8 val u1 = c1 - 0x80 val u2 = c2 - 0x80 val u3 = c3 - 0x80 val u4 = c4 - 0x80 in (EXTENDED_UTF8_CHAR_5byte () \| 64 * (64 * (64 * (64 * u0 + u1) + u2) + u3) + u4) end implement {} decode_extended_utf8_6byte (c0, c1, c2, c3, c4, c5) = let val u0 = c0 - 0xFC val u1 = c1 - 0x80 val u2 = c2 - 0x80 val u3 = c3 - 0x80 val u4 = c4 - 0x80 val u5 = c5 - 0x80 in (EXTENDED_UTF8_CHAR_6byte () \| 64 * (64 * (64 * (64 * (64 * u0 + u1) + u2) + u3) + u4) + u5) end (------------------------------------------------------------------) primplement extended_utf8_shortest_is_char pf_shortest = case+ pf_shortest of \| EXTENDED_UTF8_SHORTEST_1byte pf_char => pf_char \| EXTENDED_UTF8_SHORTEST_2byte pf_char => pf_char \| EXTENDED_UTF8_SHORTEST_3byte pf_char => pf_char \| EXTENDED_UTF8_SHORTEST_4byte pf_char => pf_char \| EXTENDED_UTF8_SHORTEST_5byte pf_char => pf_char \| EXTENDED_UTF8_SHORTEST_6byte pf_char => pf_char primplement lemma_extended_utf8_shortest_length pf_shortest = case+ pf_shortest of \| EXTENDED_UTF8_SHORTEST_1byte pf_char => { prval EXTENDED_UTF8_CHAR_1byte () = pf_char } \| EXTENDED_UTF8_SHORTEST_2byte pf_char => { prval EXTENDED_UTF8_CHAR_2byte () = pf_char } \| EXTENDED_UTF8_SHORTEST_3byte pf_char => { prval EXTENDED_UTF8_CHAR_3byte () = pf_char } \| EXTENDED_UTF8_SHORTEST_4byte pf_char => { prval EXTENDED_UTF8_CHAR_4byte () = pf_char } \| EXTENDED_UTF8_SHORTEST_5byte pf_char => { prval EXTENDED_UTF8_CHAR_5byte () = pf_char } \| EXTENDED_UTF8_SHORTEST_6byte pf_char => { prval EXTENDED_UTF8_CHAR_6byte () = pf_char } implement {} encode_extended_utf8_character u = if u <= 0x7F then let val c0 = u in @(EXTENDED_UTF8_SHORTEST_1byte (EXTENDED_UTF8_CHAR_1byte ()) \| 1, c0, ~1, ~1, ~1, ~1, ~1) end else if u <= 0x7FF then let val c1 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c0 = 0xC0 + u1 in @(EXTENDED_UTF8_SHORTEST_2byte (EXTENDED_UTF8_CHAR_2byte ()) \| 2, c0, c1, ~1, ~1, ~1, ~1) end else if u <= 0xFFFF then let val c2 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c1 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c0 = 0xE0 + u2 in @(EXTENDED_UTF8_SHORTEST_3byte (EXTENDED_UTF8_CHAR_3byte ()) \| 3, c0, c1, c2, ~1, ~1, ~1) end else if u <= 0x1FFFFF then let val c3 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c2 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c1 = 0x80 + (u2 \nmod 64) val u3 = u2 \ndiv 64 val c0 = 0xF0 + u3 prval () = _shift_right_twelve (u, u1, u2) in @(EXTENDED_UTF8_SHORTEST_4byte (EXTENDED_UTF8_CHAR_4byte ()) \| 4, c0, c1, c2, c3, ~1, ~1) end else if u <= 0x3FFFFFF then let val c4 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c3 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c2 = 0x80 + (u2 \nmod 64) val u3 = u2 \ndiv 64 val c1 = 0x80 + (u3 \nmod 64) val u4 = u3 \ndiv 64 val c0 = 0xF8 + u4 prval () = _shift_right_twelve (u, u1, u2) prval () = _shift_right_eighteen (u, u1, u2, u3) in @(EXTENDED_UTF8_SHORTEST_5byte (EXTENDED_UTF8_CHAR_5byte ()) \| 5, c0, c1, c2, c3, c4, ~1) end else let val c5 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c4 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c3 = 0x80 + (u2 \nmod 64) val u3 = u2 \ndiv 64 val c2 = 0x80 + (u3 \nmod 64) val u4 = u3 \ndiv 64 val c1 = 0x80 + (u4 \nmod 64) val u5 = u4 \ndiv 64 val c0 = 0xFC + u5 prval () = _shift_right_twelve (u, u1, u2) prval () = _shift_right_eighteen (u, u1, u2, u3) prval () = _shift_right_twenty_four (u, u1, u2, u3, u4) in @(EXTENDED_UTF8_SHORTEST_6byte (EXTENDED_UTF8_CHAR_6byte ()) \| 6, c0, c1, c2, c3, c4, c5) end (------------------------------------------------------------------) primplement utf8_char_valid_implies_shortest pf_valid = case+ pf_valid of \| UTF8_CHAR_valid (pf_shortest, _) => pf_shortest primplement lemma_valid_utf8_character_1byte {u, c0} () = let prfn make_pf {u : int \| u == extended_utf8_char_1byte_decoding (c0)} () :<prf> EXTENDED_UTF8_CHAR (1, u, c0, ~1, ~1, ~1, ~1, ~1) = EXTENDED_UTF8_CHAR_1byte () prval pf_char = make_pf {u} () prval pf_shortest = EXTENDED_UTF8_SHORTEST_1byte pf_char prval pf_bytes = UTF8_CHAR_VALID_BYTES_1byte () prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes) in pf_valid end primplement lemma_valid_utf8_character_2byte {u, c0, c1} () = let prfn make_pf {u : int \| u == extended_utf8_char_2byte_decoding (c0, c1)} () :<prf> EXTENDED_UTF8_CHAR (2, u, c0, c1, ~1, ~1, ~1, ~1) = EXTENDED_UTF8_CHAR_2byte () prval pf_char = make_pf {u} () prval pf_shortest = EXTENDED_UTF8_SHORTEST_2byte pf_char prval pf_bytes = UTF8_CHAR_VALID_BYTES_2byte () prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes) in pf_valid end primplement lemma_valid_utf8_character_3byte {u, c0, c1, c2} () = let prfn make_pf {u : int \| u == extended_utf8_char_3byte_decoding (c0, c1, c2)} () :<prf> EXTENDED_UTF8_CHAR (3, u, c0, c1, c2, ~1, ~1, ~1) = EXTENDED_UTF8_CHAR_3byte () prval pf_char = make_pf {u} () prfn lemma_c1 () :<prf> [c1 == 0x80 + ((u \ndiv 64) \nmod 64)] void = { // Convert to Horner form. stadef u1 = 64 * (64 * (c0 - 0xE0) + (c1 - 0x80)) + (c2 - 0x80) prval EQINT () = eqint_make {u, u1} () } prval () = lemma_c1 () prval pf_shortest = EXTENDED_UTF8_SHORTEST_3byte pf_char prval pf_bytes = UTF8_CHAR_VALID_BYTES_3byte () prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes) in pf_valid end primplement lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () = let prfn make_pf {u : int \| u == extended_utf8_char_4byte_decoding (c0, c1, c2, c3)} () :<prf> EXTENDED_UTF8_CHAR (4, u, c0, c1, c2, c3, ~1, ~1) = EXTENDED_UTF8_CHAR_4byte () prval pf_char = make_pf {u} () prfn lemma_c1 () :<prf> [c1 == 0x80 + ((u \ndiv (64 * 64)) \nmod 64)] void = { prval EQINT () = eqint_make {(64 * 64 * (c0 - 0xE0)) \ndiv (64 * 64), c0 - 0xE0} () prval pfd = divmod_istot {u \ndiv (64 * 64), 64} () prval pfm = divmod_elim pfd prval () = mul_elim pfm } prval () = lemma_c1 () prfn lemma_c2 () :<prf> [c2 == 0x80 + ((u \ndiv 64) \nmod 64)] void = { // Convert to Horner form. stadef u1 = 64 * (64 * (64 * (c0 - 0xF0) + (c1 - 0x80)) + (c2 - 0x80)) + (c3 - 0x80) prval EQINT () = eqint_make {u, u1} () } prval () = lemma_c2 () prval pf_shortest = EXTENDED_UTF8_SHORTEST_4byte pf_char prval pf_bytes = UTF8_CHAR_VALID_BYTES_4byte () prval pf_valid = UTF8_CHAR_valid (pf_shortest, pf_bytes) in pf_valid end primplement utf8_character_1byte_valid_bytes pf_valid = () primplement utf8_character_2byte_valid_bytes pf_valid = { prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid prval UTF8_CHAR_VALID_BYTES_2byte () = pf_bytes } primplement utf8_character_3byte_valid_bytes pf_valid = { prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid prval UTF8_CHAR_VALID_BYTES_3byte () = pf_bytes } primplement utf8_character_4byte_valid_bytes pf_valid = { prval UTF8_CHAR_valid (_, pf_bytes) = pf_valid prval UTF8_CHAR_VALID_BYTES_4byte () = pf_bytes } primplement utf8_character_1byte_invalid_bytes pf_invalid = { prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid prval UTF8_CHAR_INVALID_BYTES_1byte () = pf_bytes } primplement utf8_character_2byte_invalid_bytes pf_invalid = { prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid prval UTF8_CHAR_INVALID_BYTES_2byte () = pf_bytes } primplement utf8_character_3byte_invalid_bytes pf_invalid = { prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid prval UTF8_CHAR_INVALID_BYTES_3byte () = pf_bytes } primplement utf8_character_4byte_invalid_bytes pf_invalid = { prval UTF8_CHAR_invalid (_, pf_bytes) = pf_invalid prval UTF8_CHAR_INVALID_BYTES_4byte () = pf_bytes } implement {} is_valid_utf8_character_1byte {c0} c0 = let stadef u = extended_utf8_char_1byte_decoding c0 in (lemma_valid_utf8_character_1byte {u, c0} () \| true) end implement {} is_valid_utf8_character_2byte {c0, c1} (c0, c1) = let stadef u = extended_utf8_char_2byte_decoding (c0, c1) in if not (is_valid_utf8_continuation_byte c1) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (), UTF8_CHAR_INVALID_BYTES_2byte ()) \| false) else if c0 < 0xC2 then // The sequence is overlong. (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_2byte (), UTF8_CHAR_INVALID_BYTES_2byte ()) \| false) else (lemma_valid_utf8_character_2byte {u, c0, c1} () \| true) end implement {} is_valid_utf8_character_3byte {c0, c1, c2} (c0, c1, c2) = let stadef u = extended_utf8_char_3byte_decoding (c0, c1, c2) in if not (is_valid_utf8_continuation_byte c1) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (), UTF8_CHAR_INVALID_BYTES_3byte ()) \| false) else if not (is_valid_utf8_continuation_byte c2) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c2 (), UTF8_CHAR_INVALID_BYTES_3byte ()) \| false) else if (0xE1 <= c0) * (c0 <= 0xEC) then (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| true) else if c0 = 0xEE then (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| true) else if c0 = 0xEF then (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| true) else if (c0 = 0xE0) * (0xA0 <= c1) then (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| true) else if (c0 = 0xED) * (c1 < 0xA0) then (lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| true) else // Either the sequence is overlong or it decodes to // an invalid code point. (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_3byte (), UTF8_CHAR_INVALID_BYTES_3byte ()) \| false) end implement {} is_valid_utf8_character_4byte {c0, c1, c2, c3} (c0, c1, c2, c3) = let stadef u = extended_utf8_char_4byte_decoding (c0, c1, c2, c3) in if not (is_valid_utf8_continuation_byte c1) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c1 (), UTF8_CHAR_INVALID_BYTES_4byte ()) \| false) else if not (is_valid_utf8_continuation_byte c2) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c2 (), UTF8_CHAR_INVALID_BYTES_4byte ()) \| false) else if not (is_valid_utf8_continuation_byte c3) then (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_bad_c3 (), UTF8_CHAR_INVALID_BYTES_4byte ()) \| false) else if (0xF1 <= c0) * (c0 <= 0xF3) then (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () \| true) else if (c0 = 0xF0) * (0x90 <= c1) then (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () \| true) else if (c0 = 0xF4) * (c1 < 0x90) then (lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () \| true) else // Either the sequence is overlong or it decodes to // an invalid code point. (UTF8_CHAR_invalid (UTF8_CHAR_INVALID_invalid_4byte (), UTF8_CHAR_INVALID_BYTES_4byte ()) \| false) end implement {} decode_utf8_1byte c0 = c0 implement {} decode_utf8_2byte (c0, c1) = let val u0 = c0 - 0xC0 val u1 = c1 - 0x80 in 64 * u0 + u1 end implement {} decode_utf8_3byte (c0, c1, c2) = let val u0 = c0 - 0xE0 val u1 = c1 - 0x80 val u2 = c2 - 0x80 in 64 * (64 * u0 + u1) + u2 end implement {} decode_utf8_4byte (c0, c1, c2, c3) = let val u0 = c0 - 0xF0 val u1 = c1 - 0x80 val u2 = c2 - 0x80 val u3 = c3 - 0x80 in 64 * (64 * (64 * u0 + u1) + u2) + u3 end implement {} encode_utf8_character {u} u = if u <= 0x7F then let val c0 = u stadef c0 = u in @(lemma_valid_utf8_character_1byte {u, c0} () \| 1, c0, ~1, ~1, ~1) end else if u <= 0x7FF then let val c1 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c0 = 0xC0 + u1 stadef c1 = 0x80 + (u \nmod 64) stadef u1 = u \ndiv 64 stadef c0 = 0xC0 + u1 in @(lemma_valid_utf8_character_2byte {u, c0, c1} () \| 2, c0, c1, ~1, ~1) end else if u <= 0xFFFF then let val c2 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c1 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c0 = 0xE0 + u2 stadef c2 = 0x80 + (u \nmod 64) stadef u1 = u \ndiv 64 stadef c1 = 0x80 + (u1 \nmod 64) stadef u2 = u1 \ndiv 64 stadef c0 = 0xE0 + u2 in @(lemma_valid_utf8_character_3byte {u, c0, c1, c2} () \| 3, c0, c1, c2, ~1) end else let val c3 = 0x80 + (u \nmod 64) val u1 = u \ndiv 64 val c2 = 0x80 + (u1 \nmod 64) val u2 = u1 \ndiv 64 val c1 = 0x80 + (u2 \nmod 64) val u3 = u2 \ndiv 64 val c0 = 0xF0 + u3 stadef c3 = 0x80 + (u \nmod 64) stadef u1 = u \ndiv 64 stadef c2 = 0x80 + (u1 \nmod 64) stadef u2 = u1 \ndiv 64 stadef c1 = 0x80 + (u2 \nmod 64) stadef u3 = u2 \ndiv 64 stadef c0 = 0xF0 + u3 prval () = _shift_right_twelve (u, u1, u2) in @(lemma_valid_utf8_character_4byte {u, c0, c1, c2, c3} () \| 4, c0, c1, c2, c3) end (------------------------------------------------------------------) implement utf8_array_decode_next (utf8len, utf8arr, i) = let macdef getchr (j) = g1ofg0 (char2u2i (utf8arr[,(j)])) macdef error_return = @(utf8_decode_next_error_char, i + i2sz 1) prval _ = lemma_g1uint_param (i) val c0 = getchr (i) in if 0x00 <= c0 &&& c0 <= 0xFF then let val seqlen = utf8_character_length c0 in case+ seqlen of \| 1 => @(c0, i + i2sz 1) \| 2 => if utf8len < i + i2sz 2 then error_return else let val c1 = getchr (i + i2sz 1) val (pf \| valid) = is_valid_utf8_character_2byte (c0, c1) in if valid then let prval _ = utf8_character_2byte_valid_bytes pf val code_point = decode_utf8_2byte (c0, c1) in @(code_point, i + i2sz 2) end else error_return end \| 3 => if utf8len < i + i2sz 3 then error_return else let val c1 = getchr (i + i2sz 1) val c2 = getchr (i + i2sz 2) val (pf \| valid) = is_valid_utf8_character_3byte (c0, c1, c2) in if valid then let prval _ = utf8_character_3byte_valid_bytes pf val code_point = decode_utf8_3byte (c0, c1, c2) in @(code_point, i + i2sz 3) end else error_return end \| 4 => if utf8len < i + i2sz 4 then error_return else let val c1 = getchr (i + i2sz 1) val c2 = getchr (i + i2sz 2) val c3 = getchr (i + i2sz 3) val (pf \| valid) = is_valid_utf8_character_4byte (c0, c1, c2, c3) in if valid then let prval _ = utf8_character_4byte_valid_bytes pf val code_point = decode_utf8_4byte (c0, c1, c2, c3) in @(code_point, i + i2sz 4) end else error_return end \| _ => error_return end else (* This branch should never be run on a system with 8-bit char. ) error_return end implement utf8_string_decode_next {..} {n_utf8str} (utf8len, utf8str, i) = let val [p : addr] p = string2ptr utf8str val (pf, consume_pf \| p) = $UNSAFE.ptr1_vtake{@[char][n_utf8str]} p val result = utf8_array_decode_next (utf8len, !p, i) prval _ = consume_pf pf in result end (###################### DEMONSTRATION #############################) fn encode_LATIN_CAPITAL_LETTER_A () : void = { val u = 0x0041 ( Return both a proof of valid encoding and the encoding. ) val (pf_valid \| n, c0, c1, c2, c3) = encode_utf8_character (u) ( Verify the encoding. ) val _ = assertloc (n = 1) val _ = assertloc (c0 = 0x41) } fn decode_LATIN_CAPITAL_LETTER_A () : void = { val str = "\x41\0" val n = length str val (c, i) = utf8_decode_next (n, str, i2sz 0) ( Verify that the decoding is correct. ) val _ = assertloc (c = 0x0041) ( Verify that the next index is 1. ) val _ = assertloc (i = i2sz 1) } fn encode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS () : void = { val u = 0x00F6 ( Return both a proof of valid encoding and the encoding. ) val (pf_valid \| n, c0, c1, c2, c3) = encode_utf8_character (u) ( Verify the encoding. ) val _ = assertloc (n = 2) val _ = assertloc (c0 = 0xC3) val _ = assertloc (c1 = 0xB6) } fn decode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS () : void = { val str = "\xC3\xB6\0" val n = length str val (c, i) = utf8_decode_next (n, str, i2sz 0) ( Verify that the decoding is correct. ) val _ = assertloc (c = 0x00F6) ( Verify that the next index is 2. ) val _ = assertloc (i = i2sz 2) } fn encode_CYRILLIC_CAPITAL_LETTER_ZHE () : void = { val u = 0x0416 ( Return both a proof of valid encoding and the encoding. ) val (pf_valid \| n, c0, c1, c2, c3) = encode_utf8_character (u) ( Verify the encoding. ) val _ = assertloc (n = 2) val _ = assertloc (c0 = 0xD0) val _ = assertloc (c1 = 0x96) } fn decode_CYRILLIC_CAPITAL_LETTER_ZHE () : void = { val str = "\xD0\x96\0" val n = length str val (c, i) = utf8_decode_next (n, str, i2sz 0) ( Verify that the decoding is correct. ) val _ = assertloc (c = 0x0416) ( Verify that the next index is 2. ) val _ = assertloc (i = i2sz 2) } fn encode_EURO_SIGN () : void = { val u = 0x20AC ( Return both a proof of valid encoding and the encoding. ) val (pf_valid \| n, c0, c1, c2, c3) = encode_utf8_character (u) ( Verify the encoding. ) val _ = assertloc (n = 3) val _ = assertloc (c0 = 0xE2) val _ = assertloc (c1 = 0x82) val _ = assertloc (c2 = 0xAC) } fn decode_EURO_SIGN () : void = { val str = "\xE2\x82\xAC\0" val n = length str val (c, i) = utf8_decode_next (n, str, i2sz 0) ( Verify that the decoding is correct. ) val _ = assertloc (c = 0x20AC) ( Verify that the next index is 3. ) val _ = assertloc (i = i2sz 3) } fn encode_MUSICAL_SYMBOL_G_CLEF () : void = { val u = 0x1D11E ( Return both a proof of valid encoding and the encoding. ) val (pf_valid \| n, c0, c1, c2, c3) = encode_utf8_character (u) ( Verify the encoding. ) val _ = assertloc (n = 4) val _ = assertloc (c0 = 0xF0) val _ = assertloc (c1 = 0x9D) val _ = assertloc (c2 = 0x84) val _ = assertloc (c3 = 0x9E) } fn decode_MUSICAL_SYMBOL_G_CLEF () : void = { val str = "\xF0\x9D\x84\x9E\0" val n = length str val (c, i) = utf8_decode_next (n, str, i2sz 0) ( Verify that the decoding is correct. ) val _ = assertloc (c = 0x1D11E) ( Verify that the next index is 4. ) val _ = assertloc (i = i2sz 4) } implement main0 () = begin encode_LATIN_CAPITAL_LETTER_A (); decode_LATIN_CAPITAL_LETTER_A (); encode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS (); decode_LATIN_SMALL_LETTER_O_WITH_DIAERESIS (); encode_CYRILLIC_CAPITAL_LETTER_ZHE (); decode_CYRILLIC_CAPITAL_LETTER_ZHE (); encode_EURO_SIGN (); decode_EURO_SIGN (); encode_MUSICAL_SYMBOL_G_CLEF (); decode_MUSICAL_SYMBOL_G_CLEF (); println! ("SUCCESS") end</syntaxhighlight> {{out}} <pre>$ patscc -O2 utf8_encoding.dats && ./a.out SUCCESS</pre> =={{header\|AutoHotkey}}== <syntaxhighlight lang="autohotkey">Encode_UTF(hex){ Bytes := hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0 Prefix := [0, 0xC0, 0xE0, 0xF0] loop % Bytes { if (A_Index < Bytes) UTFCode := Format("{:X}", (hex&0x3F) + 0x80) . UTFCode ; 3F=00111111, 80=10000000 else UTFCode := Format("{:X}", hex + Prefix[Bytes]) . UTFCode ; C0=11000000, E0=11100000, F0=11110000 hex := hex>>6 } return "0x" UTFCode } ;---------------------------------------------------------------------------------------- Decode_UTF(hex){ Bytes := hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0 bin := ConvertBase(16, 2, hex) loop, % Bytes { B := SubStr(bin, -7) if Bytes > 1 B := LTrim(B, 1) , B := StrReplace(B, 0,,, 1) bin := SubStr(bin, 1, StrLen(bin)-8) Uni := B . Uni } return "0x" ConvertBase(2, 16, Uni) } ;---------------------------------------------------------------------------------------- ; www.autohotkey.com/boards/viewtopic.php?f=6&t=3607#p18985 ConvertBase(InputBase, OutputBase, number){ static u := A_IsUnicode ? "_wcstoui64" : "_strtoui64" static v := A_IsUnicode ? "_i64tow" : "_i64toa" VarSetCapacity(s, 65, 0) value := DllCall("msvcrt.dll\" u, "Str", number, "UInt", 0, "UInt", InputBase, "CDECL Int64") DllCall("msvcrt.dll\" v, "Int64", value, "Str", s, "UInt", OutputBase, "CDECL") return s }</syntaxhighlight> Examples:<syntaxhighlight lang="autohotkey">data = (comment 0x0041 0x00F6 0x0416 0x20AC 0x1D11E ) output := "unicode`t`tUTF`t`tunicode`n" for i, Hex in StrSplit(data, "`n", "`r"){ UTFCode := Encode_UTF(Hex) output .= Hex "`t`t" UTFCode "`t`t" Decode_UTF(UTFCode) "`n" } MsgBox % output return</syntaxhighlight> {{out}} <pre> Unicode Encode_UTF Decode_UTF 0x0041 0x41 0x41 0x00F6 0xC3B6 0xf6 0x0416 0xD096 0x416 0x20AC 0xE282AC 0x20ac 0x1D11E 0xF09D849E 0x1d11e </pre> =={{header\|BaCon}}== BaCon supports UTF8 natively. <syntaxhighlight lang="bacon">DECLARE x TYPE STRING CONST letter$ = "A ö Ж € 𝄞" PRINT "Char", TAB$(1), "Unicode", TAB$(2), "UTF-8 (hex)" PRINT "-----------------------------------" FOR x IN letter$ PRINT x, TAB$(1), "U+", HEX$(UCS(x)), TAB$(2), COIL$(LEN(x), HEX$(x[_-1] & 255)) NEXT</syntaxhighlight> {{out}} <pre>Char Unicode UTF-8 (hex) ----------------------------------- A U+41 41 ö U+F6 C3 B6 Ж U+416 D0 96 € U+20AC E2 82 AC 𝄞 U+1D11E F0 9D 84 9E</pre> =={{header\|C}}== <syntaxhighlight lang="c"> ~~<lang C>~~ #include <stdio.h> #include <stdlib.h> Line 247 ⟶ 2,312: return 0; } </syntaxhighlight> ~~</lang>~~ Output <syntaxhighlight lang="text"> Character Unicode UTF-8 encoding (hex) ---------------------------------------- Line 258 ⟶ 2,323: 𝄞 U+1d11e f0 9d 84 9e </syntaxhighlight> ~~</lang>~~ =={{header\|C sharp\|C#}}== <syntaxhighlight lang="csharp">using System; using System.Text; namespace Rosetta { class Program { static byte[] MyEncoder(int codepoint) => Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codepoint)); static string MyDecoder(byte[] utf8bytes) => Encoding.UTF8.GetString(utf8bytes); static void Main(string[] args) { Console.OutputEncoding = Encoding.UTF8; // makes sure it doesn't print rectangles... foreach (int unicodePoint in new int[] { 0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) { byte[] asUtf8bytes = MyEncoder(unicodePoint); string theCharacter = MyDecoder(asUtf8bytes); Console.WriteLine("{0,8} {1,5} {2,-15}", unicodePoint.ToString("X4"), theCharacter, BitConverter.ToString(asUtf8bytes)); } } } } / Output: * 0041 A 41 00F6 ö C3-B6 0416 Ж D0-96 20AC € E2-82-AC 1D11E 𝄞 F0-9D-84-9E / </syntaxhighlight> =={{header\|Common Lisp}}== Line 264 ⟶ 2,362: Helper functions <~~lang~~syntaxhighlight lang="lisp"> (defun ascii-byte-p (octet) "Return t if octet is a single-byte 7-bit ASCII char. Line 305 ⟶ 2,403: when (= (nth i templates) (logand (nth i bitmasks) octet)) return i))) </syntaxhighlight> ~~</lang>~~ Encoder <~~lang~~syntaxhighlight lang="lisp"> (defun unicode-to-utf-8 (int) "Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)." Line 340 ⟶ 2,438: ;; return the list of UTF-8 encoded bytes. byte-list)) </syntaxhighlight> ~~</lang>~~ Decoder <~~lang~~syntaxhighlight lang="lisp"> (defun utf-8-to-unicode (byte-list) "Take a list of one to four utf-8 encoded bytes (octets), return a code point." Line 364 ⟶ 2,462: (error "calculated number of bytes doesnt match the length of the byte list"))) (error "first byte in the list isnt a lead byte")))))) </syntaxhighlight> ~~</lang>~~ The test <~~lang~~syntaxhighlight lang="lisp"> (defun test-utf-8 () "Return t if the chosen unicode points are encoded and decoded correctly." Line 383 ⟶ 2,481: ;; return t if all are t (every #'= unicodes-orig unicodes-test))) </syntaxhighlight> ~~</lang>~~ Test output <~~lang~~syntaxhighlight lang="lisp"> CL-USER> (test-utf-8) character A, code point: 41, utf-8: 41 Line 395 ⟶ 2,493: character 𝄞, code point: 1D11E, utf-8: F0 9D 84 9E T </syntaxhighlight> ~~</lang>~~ =={{header\|D}}== <~~lang~~syntaxhighlight Dlang="d">import std.conv; import std.stdio; Line 410 ⟶ 2,508: writefln("%s %7X [%(%X, %)]", c, unicode, bytes); } }</~~lang~~syntaxhighlight> {{out}} Line 421 ⟶ 2,519: =={{header\|Elena}}== ELENA 46.x : <~~lang~~syntaxhighlight lang="elena">import system'routines; import extensions; Line 434 ⟶ 2,532: string printAsUTF8Array() { self.toByteArray().forEach::(b){ console.print(b.toString(16)," ") } } string printAsUTF32() { self.toArray().forEach::(c){ console.print("U+",c.toInt().toString(16)," ") } } } Line 459 ⟶ 2,557: "𝄞".printAsString().printAsUTF8Array().printAsUTF32(); console.printLine(); }</~~lang~~syntaxhighlight> {{out}} <pre> Line 467 ⟶ 2,565: € E2 82 AC U+20AC 𝄞 F0 9D 84 9E U+1D11E</pre> =={{header\|FreeBASIC}}== {{trans\|VBScript}} <syntaxhighlight lang="vbnet">Function unicode_2_utf8(x As Long) As String Dim As String y Dim As Long r Select Case x Case 0 To &H7F y = Chr(x) Case &H80 To &H7FF y = Chr(192 + x \ 64) + Chr(128 + x Mod 64) Case &H800 To &H7FFF, 32768 To 65535 r = x \ 64 y = Chr(224 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + x Mod 64) Case &H10000 To &H10FFFF r = x \ 4096 y = Chr(240 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + (x \ 64) Mod 64) + Chr(128 + x Mod 64) Case Else Print "what else? " & x & " " & Hex(x) End Select Return y End Function Function utf8_2_unicode(x As String) As Long Dim As Long primero, segundo, tercero, cuarto Dim As Long total Select Case Len(x) Case 1 'one byte If Asc(x) < 128 Then total = Asc(x) Else Print "highest bit set error" End If Case 2 'two bytes and assume primero byte is leading byte If Asc(x) \ 32 = 6 Then primero = Asc(x) Mod 32 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 Else Print "mask error" End If Else Print "leading byte error" End If total = 64 primero + segundo Case 3 'three bytes and assume primero byte is leading byte If Asc(x) \ 16 = 14 Then primero = Asc(x) Mod 16 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 If Asc(Mid(x, 3, 1)) \ 64 = 2 Then tercero = Asc(Mid(x, 3, 1)) Mod 64 Else Print "mask error last byte" End If Else Print "mask error middle byte" End If Else Print "leading byte error" End If total = 4096 * primero + 64 * segundo + tercero Case 4 'four bytes and assume primero byte is leading byte If Asc(x) \ 8 = 30 Then primero = Asc(x) Mod 8 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 If Asc(Mid(x, 3, 1)) \ 64 = 2 Then tercero = Asc(Mid(x, 3, 1)) Mod 64 If Asc(Mid(x, 4, 1)) \ 64 = 2 Then cuarto = Asc(Mid(x, 4, 1)) Mod 64 Else Print "mask error last byte" End If Else Print "mask error tercero byte" End If Else Print "mask error second byte" End If Else Print "mask error leading byte" End If total = Clng(262144 * primero + 4096 * segundo + 64 * tercero + cuarto) Case Else Print "more bytes than expected" End Select Return total End Function Dim As Long cp(4) = {65, 246, 1046, 8364, 119070} '[{&H0041,&H00F6,&H0416,&H20AC,&H1D11E}] Dim As String r, s Dim As integer i, j Print "ch unicode UTF-8 encoded decoded" For i = Lbound(cp) To Ubound(cp) Dim As Long cpi = cp(i) r = unicode_2_utf8(cpi) s = Hex(cpi) Print Chr(cpi); Space(10 - Len(s)); s; s = "" For j = 1 To Len(r) s &= Hex(Asc(Mid(r, j, 1))) & " " Next j Print Space(16 - Len(s)); s; s = Hex(utf8_2_unicode(r)) Print Space(8 - Len(s)); s Next i Sleep</syntaxhighlight> =={{header\|F_Sharp\|F#}}== <~~lang~~syntaxhighlight lang="fsharp"> // Unicode character point to UTF8. Nigel Galloway: March 19th., 2018 let fN g = match List.findIndex (fun n->n>g) [0x80;0x800;0x10000;0x110000] with Line 476 ⟶ 2,683: \|2->[0xe0+(g&&&0xf000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)] \|_->[0xf0+(g&&&0x1c0000>>>18);0x80+(g&&&0x3f000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)] </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 492 ⟶ 2,699: {{works with\|gforth\|0.7.9_20191121}} {{works with\|lxf\|1.6-982-823}} <~~lang~~syntaxhighlight lang="forth">: showbytes ( c-addr u -- ) over + swap ?do i c@ 3 .r loop ; Line 504 ⟶ 2,711: \ can also be written as \ 'A' test 'ö' test 'Ж' test '€' test '𝄞' test </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 516 ⟶ 2,723: If you also want to see the implementation of <code>xc!+</code> and <code>xc@+</code>, here it is (<code>u8!+</code> is the UTF-8 implementation of <code>xc!+</code>, and likewise for <code>u8@+</code>): <~~lang~~syntaxhighlight lang="forth">-77 Constant UTF-8-err $80 Constant max-single-byte Line 537 ⟶ 2,744: REPEAT $7F xor 2* or r> BEGIN over $80 u>= WHILE tuck c! 1+ REPEAT nip ; </syntaxhighlight> ~~</lang>~~ =={{header\|Go}}== ===Implementation=== This implementation is missing all checks for invalid data and so is not production-ready, but illustrates the basic UTF-8 encoding scheme. <~~lang~~syntaxhighlight lang="go">package main import ( Line 643 ⟶ 2,850: rune(b[3]&mbMask) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 654 ⟶ 2,861: ===Library/language=== <~~lang~~syntaxhighlight lang="go">package main import ( Line 679 ⟶ 2,886: fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 691 ⟶ 2,898: Alternately: <~~lang~~syntaxhighlight lang="go">package main import ( Line 712 ⟶ 2,919: fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 725 ⟶ 2,932: =={{header\|Groovy}}== {{trans\|Java}} <~~lang~~syntaxhighlight lang="groovy">import java.nio.charset.StandardCharsets class UTF8EncodeDecode { Line 751 ⟶ 2,958: } } }</~~lang~~syntaxhighlight> {{out}} <pre>Char Name Unicode UTF-8 encoded Decoded Line 764 ⟶ 2,971: Example makes use of [http://hackage.haskell.org/package/bytestring <tt>bytestring</tt>] and [http://hackage.haskell.org/package/text <tt>text</tt>] packages: <~~lang~~syntaxhighlight lang="haskell">module Main (main) where import qualified Data.ByteString as ByteString (pack, unpack) Line 791 ⟶ 2,998: (printf "U+%04X" codepoint :: String) (intercalate " " (map (printf "%02X") values)) codepoint'</~~lang~~syntaxhighlight> {{out}} <pre> Line 805 ⟶ 3,012: =={{header\|J}}== '''Solution:''' <~~lang~~syntaxhighlight lang="j">utf8=: 8&u: NB. converts to UTF-8 from unicode or unicode codepoint integer ucp=: 9&u: NB. converts to unicode from UTF-8 or unicode codepoint integer ucp_hex=: hfd@(3 u: ucp) NB. converts to unicode codepoint hexadecimal from UTF-8, unicode or unicode codepoint integer</~~lang~~syntaxhighlight> '''Examples:''' <~~lang~~syntaxhighlight lang="j"> utf8 65 246 1046 8364 119070 AöЖ€𝄞 ucp 65 246 1046 8364 119070 Line 825 ⟶ 3,032: 1d11e utf8@dfh ucp_hex utf8 65 246 1046 8364 119070 AöЖ€𝄞</~~lang~~syntaxhighlight> =={{header\|Java}}== {{works with\|Java\|7+}} <~~lang~~syntaxhighlight lang="java">import java.nio.charset.StandardCharsets; import java.util.Formatter; Line 858 ⟶ 3,065: } } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 871 ⟶ 3,078: =={{header\|JavaScript}}== An implementation in ECMAScript 2015 (ES6): <~~lang~~syntaxhighlight lang="javascript"> /**************************************************************************\ \| Pure UTF-8 handling without detailed error reporting functionality. \| Line 931 ⟶ 3,138: ?( m&0x07)<<18\|( n&0x3f)<<12\|( o&0x3f)<<6\|( p&0x3f)<<0 :(()=>{throw'Invalid UTF-8 encoding!'})() </syntaxhighlight> ~~</lang>~~ The testing inputs: <~~lang~~syntaxhighlight lang="javascript"> const str= Line 952 ⟶ 3,159: :[ [ a,b,c]] ,inputs=zip3(str,cps,cus); </syntaxhighlight> ~~</lang>~~ The testing code: <~~lang~~syntaxhighlight lang="javascript"> console.log(`\ ${'Character'.padEnd(16)}\ Line 970 ⟶ 3,177: ${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\ ${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`) </syntaxhighlight> ~~</lang>~~ and finally, the output from the test: <pre> Line 981 ⟶ 3,188: </pre>Note that the misalign there on the last line is caused by the string length of astral characters being 2 so the padding functions break. =={{header\|~~Julia~~jq}}== {{works with\|~~Julia\|0.6~~jq}} '''Works with gojq, the Go implementation of jq''' '''Preliminaries''' ~~Julia supports by default UTF-8 encoding.~~ <syntaxhighlight lang="jq"># input: a decimal integer # output: the corresponding binary array, most significant bit first def binary_digits: if . == 0 then 0 else [recurse( if . == 0 then empty else ./2 \| floor end ) % 2] \| reverse \| .[1:] # remove the leading 0 end ; # Input: an array of binary digits, msb first. ~~<lang julia>for t in ("A", "ö", "Ж", "€", "𝄞")~~ def binary_to_decimal: ~~enc = Vector{UInt8}(t)~~ reduce reverse[] as $b ({power:1, result:0}; ~~dec = String(enc)~~ .result += .power $b ~~println(dec, " → ", enc)~~ \| .power = 2) ~~end</lang>~~ \| .result;</syntaxhighlight> '''Encode to UTF-8''' <syntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint. # output: the corresponding decimal number of that codepoint. def utf8_encode: def lpad($width): [(range(0;8)\|0), .[]][- $width:]; def multibyte: [1,0, (.[-6: ]\|lpad(6))[]]; def firstOf2: [1,1,0, (.[: -6]\|lpad(5))[]]; def firstOf3: [1,1,1,0, (.[:-12]\|lpad(4))[]]; def firstOf4: [1,1,1,1,0, (.[:-18]\|lpad(3))[]]; . as $n \| binary_digits \| length as $len \| if $len <8 then [$n] else if $len <= 12 then [ firstOf2, multibyte ] elif $len <= 16 then [ firstOf3, (.[:-6] \| multibyte), multibyte ] else [firstOf4, (.[ :-12] \| multibyte), (.[-12: -6] \| multibyte), multibyte] end \| map(binary_to_decimal) end;</syntaxhighlight> '''Decode an array of UTF-8 bytes''' <syntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint. # output: the corresponding decimal number of that codepoint. def utf8_decode: # Magic numbers: # x80: 128, # 10000000 # xe0: 224, # 11100000 # xf0: 240 # 11110000 (-6) as $mb # non-first bytes start 10 and carry 6 bits of data # first byte of a 2-byte encoding starts 110 and carries 5 bits of data # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data \| map(binary_digits) as $d \| .[0] \| if . < 128 then $d[0] elif . < 224 then $d[0][-5:] + $d[1][$mb:] elif . < 240 then $d[0][-4:] + $d[1][$mb:] + $d[2][$mb:] else $d[0][-3:] + $d[1][$mb:] + $d[2][$mb:] + $d[3][$mb:] end \| binary_to_decimal ;</syntaxhighlight> '''Task''' <syntaxhighlight lang="jq">def task: [ "A", "ö", "Ж", "€", "𝄞" ][] \| . as $glyph \| explode[] \| utf8_encode as $encoded \| ($encoded\|utf8_decode) as $decoded \| "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]\|implode)" ; task</syntaxhighlight> {{out}} <pre> Glyph A => [65] => 65 => A Glyph ö => [195,182] => 246 => ö Glyph Ж => [208,150] => 1046 => Ж Glyph € => [226,130,172] => 8364 => € Glyph 𝄞 => [240,157,132,158] => 119070 => 𝄞 </pre> =={{header\|Julia}}== Julia supports the UTF-8 encoding (and others through packages). <syntaxhighlight lang="julia"> for t in ("A", "ö", "Ж", "€", "𝄞") println(t, " → ", codeunits(t)) end </syntaxhighlight> {{out}} Line 1,000 ⟶ 3,286: =={{header\|Kotlin}}== <~~lang~~syntaxhighlight lang="scala">// version 1.1.2 fun utf8Encode(codePoint: Int) = String(intArrayOf(codePoint), 0, 1).toByteArray(Charsets.UTF_8) Line 1,019 ⟶ 3,305: System.out.printf("%-${n}s %c\n", s, decoded) } }</~~lang~~syntaxhighlight> {{out}} Line 1,032 ⟶ 3,318: =={{header\|langur}}== <syntaxhighlight lang="langur">writeln "character Unicode UTF-8 encoding (hex)" ~~{{works with\|langur\|0.8.4}}~~ ~~<lang langur>writeln "character Unicode UTF-8 encoding (hex)"~~ for .cp in "AöЖ€𝄞" { val .utf8 = s2b cp2s .cp val .cpstr = b2s .utf8 val .utf8rep = join " ", map ffn .b: $"\{{.b:X02;}}", .utf8 writeln $"\{{.cpstr:-11;}} U+\{{.cp:X04:-8;}} \{{.utf8rep;}}" } ~~}</lang>~~ </syntaxhighlight> {{out}} Line 1,058 ⟶ 3,344: - byteArray.toHexString (intStart, intLen): returns hex string representation of byte array (e.g. for printing)<br /> - byteArray.readRawString (intLen, [strCharSet="UTF-8"]): reads a fixed number of bytes as a string <~~lang~~syntaxhighlight ~~Lingo~~lang="lingo">chars = ["A", "ö", "Ж", "€", "𝄞"] put "Character Unicode (int) UTF-8 (hex) Decoded" repeat with c in chars ba = bytearray(c) put col(c, 12) & col(charToNum(c), 16) & col(ba.toHexString(1, ba.length), 14) & ba.readRawString(ba.length) end repeat</~~lang~~syntaxhighlight> Helper function for table formatting <~~lang~~syntaxhighlight ~~Lingo~~lang="lingo">on col (val, len) str = string(val) repeat with i = str.length+1 to len Line 1,071 ⟶ 3,357: end repeat return str end</~~lang~~syntaxhighlight> {{out}} <pre> Line 1,084 ⟶ 3,370: =={{header\|Lua}}== {{works with\|Lua\|5.3}} <syntaxhighlight lang="lua"> ~~<lang Lua>~~ -- Accept an integer representing a codepoint. -- Return the values of the individual octets. Line 1,119 ⟶ 3,405: end end </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 1,140 ⟶ 3,426: =={{header\|M2000 Interpreter}}== <syntaxhighlight lang="m2000 interpreter"> ~~<lang M2000 Interpreter>~~ Module EncodeDecodeUTF8 { a$=string$("Hello" as UTF8enc) Line 1,160 ⟶ 3,446: } EncodeDecodeUTF8 </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 1,174 ⟶ 3,460: </pre > =={{header\|Mathematica}}/{{header\|Wolfram Language}}== <~~lang~~syntaxhighlight ~~Mathematica~~lang="mathematica">utf = ToCharacterCode[ToString["AöЖ€", CharacterEncoding -> "UTF8"]] ToCharacterCode[FromCharacterCode[utf, "UTF8"]]</~~lang~~syntaxhighlight> {{out}} <pre>{65, 195, 182, 208, 150, 226, 130, 172} {65, 246, 1046, 8364} </pre> =={{header\|Nim}}== === Using the standard library === Nim strings are encoded in UTF-8. The natural way to deal with UTF-8 and Unicode code points consists to use the module “unicode” which provides procedures to convert from strings to sequences of code points (names “runes”) and conversely. For this purpose, using sequences or bytes is not natural. Here is a way to proceed using the module “unicode”. <syntaxhighlight lang="nim">import unicode, sequtils, strformat, strutils const UChars = ["\u0041", "\u00F6", "\u0416", "\u20AC", "\u{1D11E}"] proc toSeqByte(r: Rune): seq[byte] = let s = r.toUTF8 result = @(s.toOpenArrayByte(0, s.high)) proc toRune(s: seq[byte]): Rune = s.mapIt(chr(it)).join().toRunes[0] echo "Character Unicode UTF-8 encoding (hex)" for uchar in UChars: # Convert the UTF-8 string to a rune (codepoint). var r = uchar.toRunes[0] # Convert the rune to a sequence of bytes. let s = r.toSeqByte # Convert back the sequence of bytes to a rune. r = s.toRune # Display. echo &"""{uchar:>5} U+{r.int.toHex(5)} {s.map(toHex).join(" ")}"""</syntaxhighlight> {{out}} <pre>Character Unicode UTF-8 encoding (hex) A U+00041 41 ö U+000F6 C3 B6 Ж U+00416 D0 96 € U+020AC E2 82 AC 𝄞 U+1D11E F0 9D 84 9E</pre> === Implementation === In this section, we provide two procedures to convert a Unicode code point to a UTF-8 sequence of bytes and conversely, without using the module “unicode”. We provide also a procedure to convert a sequence of bytes to a string in order to print it. The algorithm is the one used by the Go solution. <syntaxhighlight lang="nim">import sequtils, strformat, strutils const # First byte of a 2-byte encoding starts 110 and carries 5 bits of data. B2Lead = 0xC0 # 1100 0000 B2Mask = 0x1F # 0001 1111 # First byte of a 3-byte encoding starts 1110 and carries 4 bits of data. B3Lead = 0xE0 # 1110 0000 B3Mask = 0x0F # 0000 1111 # First byte of a 4-byte encoding starts 11110 and carries 3 bits of data. B4Lead = 0xF0 # 1111 0000 B4Mask = 0x07 # 0000 0111 # Non-first bytes start 10 and carry 6 bits of data. MbLead = 0x80 # 1000 0000 MbMask = 0x3F # 0011 1111 type CodePoint = distinct int32 proc toUtf8(c: CodePoint): seq[byte] = let i = int32(c) result = if i <= 1 shl 7 - 1: @[byte(i)] elif i <= 1 shl 11 - 1: @[B2Lead or byte(i shr 6), MbLead or byte(i) and MbMask] elif i <= 1 shl 16 - 1: @[B3Lead or byte(i shr 12), MbLead or byte(i shr 6) and MbMask, MbLead or byte(i) and MbMask] else: @[B4Lead or byte(i shr 18), MbLead or byte(i shr 12) and MbMask, MbLead or byte(i shr 6) and MbMask, MbLead or byte(i) and MbMask] proc toCodePoint(b: seq[byte]): CodePoint = let b0 = b[0].int32 result = CodePoint( if b0 < 0x80: b0 elif b0 < 0xE0: (b0 and B2Mask) shl 6 or b[1].int32 and MbMask elif b0 < 0xF0: (b0 and B3Mask) shl 12 or (b[1].int32 and MbMask) shl 6 or b[2].int32 and MbMask else: (b0 and B4Mask) shl 18 or (b[1].int32 and MbMask) shl 12 or (b[2].int32 and MbMask) shl 6 or b[3].int32 and MbMask) proc toString(s: seq[byte]): string = s.mapIt(chr(it)).join() const UChars = [CodePoint(0x00041), CodePoint(0x000F6), CodePoint(0x00416), CodePoint(0x020AC), CodePoint(0x1D11E)] echo "Character Unicode UTF-8 encoding (hex)" for uchar in UChars: # Convert the code point to a sequence of bytes. let s = uchar.toUtf8 # Convert back the sequence of bytes to a code point. let c = s.toCodePoint # Display. echo &"""{s.toString:>5} U+{c.int.toHex(5)} {s.map(toHex).join(" ")}""" </syntaxhighlight> {{out}} Same output as in the previous solution. =={{header\|Perl}}== <~~lang~~syntaxhighlight lang="perl">#!/usr/bin/perl use strict; use warnings; Line 1,202 ⟶ 3,604: } split //, $utf8; print "\n"; } @chars;</~~lang~~syntaxhighlight> {{out}} Line 1,215 ⟶ 3,617: =={{header\|Phix}}== {{libheader\|Phix/basics}} Standard autoinclude, see the manual and/or builtins/utfconv.e ( http://phix.x10.mx/docs/html/utfconv.htm and/or https://~~bitbucket~~github.~~org~~com/petelomax/~~phix~~Phix/~~src~~blob/master/builtins/utfconv.e )<br> As requested in the task description: ~~<lang Phix>constant tests = {#0041, #00F6, #0416, #20AC, #1D11E}~~ <!--<syntaxhighlight lang="phix">--> ~~function hex(sequence s, string fmt) -- output helper~~ <span style="color: #008080;">constant</span> <span style="color: #000000;">tests</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">#0041</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#00F6</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#0416</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#20AC</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#1D11E</span><span style="color: #0000FF;">}</span> ~~for i=1 to length(s) do~~ ~~s[i] = sprintf(fmt,s[i])~~ <span style="color: #008080;">function</span> <span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">s</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">string</span> <span style="color: #000000;">fmt</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- output helper</span> ~~end for~~ <span style="color: #008080;">return</span> <span style="color: #7060A8;">join</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">apply</span><span style="color: #0000FF;">(</span><span style="color: #004600;">true</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">sprintf</span><span style="color: #0000FF;">,{{</span><span style="color: #000000;">fmt</span><span style="color: #0000FF;">},</span><span style="color: #000000;">s</span><span style="color: #0000FF;">}),</span><span style="color: #008000;">','</span><span style="color: #0000FF;">)</span> ~~return join(s,',')~~ <span style="color: #008080;">end</span> <span style="color: #008080;">function</span> ~~end function~~ <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">tests</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">codepoint</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">tests</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">s</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">utf32_to_utf8</span><span style="color: #0000FF;">({</span><span style="color: #000000;">codepoint</span><span style="color: #0000FF;">}),</span> <span style="color: #000000;">r</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">utf8_to_utf32</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">)</span> <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x -> {%s} -> {%s}\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">codepoint</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%02x"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x"</span><span style="color: #0000FF;">)})</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <!--</syntaxhighlight>--> ~~for i=1 to length(tests) do~~ ~~integer codepoint = tests[i]~~ ~~sequence s = utf32_to_utf8({codepoint}),~~ ~~r = utf8_to_utf32(s)~~ ~~printf(1,"#%04x -> {%s} -> {%s}\n",{codepoint, hex(s,"#%02x"),hex(r,"#%04x")})~~ ~~end for</lang>~~ {{out}} <pre> Line 1,243 ⟶ 3,647: =={{header\|Processing}}== <~~lang~~syntaxhighlight lang="java">import java.nio.charset.StandardCharsets; Integer[] code_points = {0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}; Line 1,270 ⟶ 3,674: tel_1 += 30; tel_2 = 50; } }</~~lang~~syntaxhighlight> Line 1,277 ⟶ 3,681: The encoding and decoding procedure are kept simple and designed to work with an array of 5 elements for input/output of the UTF-8 encoding for a single code point at a time. It was decided not to use a more elaborate example that would have been able to operate on a buffer to encode/decode more than one code point at a time. <~~lang~~syntaxhighlight lang="purebasic">#UTF8_codePointMaxByteCount = 4 ;UTF-8 encoding uses only a maximum of 4 bytes to encode a codepoint Procedure UTF8_encode(x, Array encoded_codepoint.a(1)) ;x is codepoint to encode, the array will contain output Line 1,388 ⟶ 3,792: Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input() CloseConsole() EndIf</~~lang~~syntaxhighlight> Sample output: <pre> Unicode UTF-8 Decoded Line 1,401 ⟶ 3,805: =={{header\|Python}}== <~~lang~~syntaxhighlight lang="python"> #!/usr/bin/env python3 from unicodedata import name Line 1,418 ⟶ 3,822: chars = ['A', 'ö', 'Ж', '€', '𝄞'] for char in chars: print('{:<11} {:<36} {:<15} {:<15}'.format(char, name(char), unicode_code(char), utf8hex(char)))</~~lang~~syntaxhighlight> {{out}} <pre>Character Name Unicode UTF-8 encoding (hex) Line 1,428 ⟶ 3,832: =={{header\|Racket}}== <~~lang~~syntaxhighlight lang="racket">#lang racket (define char-map Line 1,444 ⟶ 3,848: (map (curryr number->string 16) bites) (bytes->string/utf-8 (list->bytes bites)) name)))</~~lang~~syntaxhighlight> {{out}} <pre>#\A A (41) A LATIN-CAPITAL-LETTER-A Line 1,456 ⟶ 3,860: {{works with\|Rakudo\|2017.02}} Pretty much all built in to the language. <syntaxhighlight lang="raku" ~~perl6~~line>say sprintf("%-18s %-36s\|%8s\| %7s \|%14s \| %s\n", 'Character\|', 'Name', 'Ordinal', 'Unicode', 'UTF-8 encoded', 'decoded'), '-' x 100; for < A ö Ж € 𝄞 😜 👨‍👩‍👧‍👦> -> $char { printf " %-5s \| %-43s \| %6s \| %-7s \| %12s \|%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '), ('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode; }</~~lang~~syntaxhighlight> {{out}} <pre>Character\| Name \| Ordinal\| Unicode \| UTF-8 encoded \| decoded Line 1,476 ⟶ 3,880: =={{header\|Ruby}}== <~~lang~~syntaxhighlight lang="ruby"> character_arr = ["A","ö","Ж","€","𝄞"] for c in character_arr do Line 1,484 ⟶ 3,888: puts "" end </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 1,507 ⟶ 3,911: Code-Units: F0 9D 84 9E </pre> =={{header\|Rust}}== <syntaxhighlight lang="rust">fn main() { let chars = vec!('A', 'ö', 'Ж', '€', '𝄞'); chars.iter().for_each(\|c\| { let mut encoded = vec![0; c.len_utf8()]; c.encode_utf8(&mut encoded); let decoded = String::from_utf8(encoded.to_vec()).unwrap(); let encoded_string = encoded.iter().fold(String::new(), \|acc, val\| format!("{}{:X}", acc, val)); println!("Character: {}, Unicode:{}, UTF-8 encoded:{}, Decoded: {}", c, c.escape_unicode(), encoded_string , decoded); }); } </syntaxhighlight> {{out}} <pre> Character: A, Unicode:\u{41}, UTF-8 encoded:41, Decoded: A Character: ö, Unicode:\u{f6}, UTF-8 encoded:C3B6, Decoded: ö Character: Ж, Unicode:\u{416}, UTF-8 encoded:D096, Decoded: Ж Character: €, Unicode:\u{20ac}, UTF-8 encoded:E282AC, Decoded: € Character: 𝄞, Unicode:\u{1d11e}, UTF-8 encoded:F09D849E, Decoded: 𝄞 </pre> =={{header\|Scala}}== === Imperative solution=== <~~lang~~syntaxhighlight lang="scala">object UTF8EncodeAndDecode extends App { val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E) Line 1,533 ⟶ 3,958: printf(s"%-${w}c %-36s %-7s %-${16 - w}s%c%n", codePoint, Character.getName(codePoint), leftAlignedHex, s, utf8Decode(bytes)) }</~~lang~~syntaxhighlight> === Functional solution=== <~~lang~~syntaxhighlight lang="scala">import java.nio.charset.StandardCharsets object UTF8EncodeAndDecode extends App { Line 1,560 ⟶ 3,985: println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]") }</~~lang~~syntaxhighlight> === Composable and testable solution=== <~~lang~~syntaxhighlight lang="scala">package example object UTF8EncodeAndDecode extends TheMeat with App { Line 1,595 ⟶ 4,020: } </syntaxhighlight> ~~</lang>~~ =={{header\|Seed7}}== <~~lang~~syntaxhighlight lang="seed7">$ include "seed7_05.s7i"; include "unicode.s7i"; include "console.s7i"; Line 1,612 ⟶ 4,037: writeln("-------------------------------------------------"); for ch range "AöЖ€𝄞" do utf8 := ~~striToUtf8~~toUtf8(str(ch)); writeln(ch rpad 11 <& "U+" <& ord(ch) radix 16 lpad0 4 rpad 7 <& hex(utf8) rpad 22 <& ~~utf8ToStri~~fromUtf8(utf8)); end for; end func;</~~lang~~syntaxhighlight> {{out}} Line 1,630 ⟶ 4,055: =={{header\|Sidef}}== <~~lang~~syntaxhighlight lang="ruby">func utf8_encoder(Number code) { code.chr.encode('UTF-8').bytes.map{.chr} } Line 1,643 ⟶ 4,068: assert_eq(n, decoded.ord) say "#{decoded} -> #{encoded}" }</~~lang~~syntaxhighlight> {{out}} <pre> Line 1,655 ⟶ 4,080: =={{header\|Swift}}== In Swift there's a difference between UnicodeScalar, which is a single unicode code point, and Character which may consist out of multiple UnicodeScalars, usually because of combining characters. <~~lang~~syntaxhighlight ~~Swift~~lang="swift">import Foundation func encode(_ scalar: UnicodeScalar) -> Data { Line 1,676 ⟶ 4,101: print("character: \(decoded), code point: U+\(String(scalar.value, radix: 16)), \tutf-8: \(formattedBytes)") } </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 1,688 ⟶ 4,113: =={{header\|Tcl}}== Note: Tcl can handle Unicodes only up to U+FFFD, i.e. the Basic Multilingual Plane (BMP, 16 bits wide). Therefore, the fifth test fails as expected. <~~lang~~syntaxhighlight ~~Tcl~~lang="tcl">proc encoder int { set u [format %c $int] set bytes {} Line 1,707 ⟶ 4,132: lappend res [encoder $test] -> [decoder [encoder $test]] puts $res }</~~lang~~syntaxhighlight> <pre> 0x0041 41 -> A Line 1,719 ⟶ 4,144: While perhaps not as readable as the above, this version handles beyond-BMP codepoints by manually composing the utf-8 byte sequences and emitting raw bytes to the console. <tt>encoding convertto utf-8</tt> command still does the heavy lifting where it can. <~~lang~~syntaxhighlight ~~Tcl~~lang="tcl">proc utf8 {codepoint} { scan $codepoint %llx cp if {$cp < 0x10000} { Line 1,746 ⟶ 4,171: set utf8 [utf8 $codepoint] puts "[format U+%04s $codepoint]\t$utf8\t[hexchars $utf8]" }</~~lang~~syntaxhighlight> {{out}}<pre>U+0041 A 41 Line 1,754 ⟶ 4,179: U+1D11E 𝄞 f0 9d 84 9e </pre> =={{header\|UNIX Shell}}== {{works with\|Bourne Again SHell}} {{works with\|Korn Shell}} {{works with\|Zsh}} Works with locale set to UTF-8. <syntaxhighlight lang="bash">function encode { typeset -i code_point=$1 printf "$(printf '\\U%08X\\n' "$code_point")" } function decode { typeset character=$1 printf 'U+%04X\n' "'$character" set +x } printf 'Char\tCode Point\tUTF-8 Bytes\n' for test in A ö Ж € 𝄞; do code_point=$(decode "$test") utf8=$(encode "$(( 16#${code_point#U+} ))") bytes=$(printf '%b' "$utf8" \| od -An -tx1 \| sed -nE '/./s/^ \| $//p') printf '%-4b\t%-10s\t%s\n' "$utf8" "$code_point" "$bytes" done</syntaxhighlight> {{Out}} <pre style="font-family: Consolas,Courier,monospace">Char Code Point UTF-8 Bytes A U+0041 41 ö U+00F6 c3 b6 Ж U+0416 d0 96 € U+20AC e2 82 ac 𝄞 U+1D11E f0 9d 84 9e</pre> =={{header\|VBA}}== <~~lang~~syntaxhighlight VBlang="vb">Private Function unicode_2_utf8(x As Long) As Byte() Dim y() As Byte Dim r As Long Line 1,877 ⟶ 4,334: Debug.Print String$(8 - Len(s), " "); s Next cpi End Sub</~~lang~~syntaxhighlight>{{out}}<pre>ch unicode UTF-8 encoded decoded A 41 41 41 ö F6 C3 B6 F6 Line 1,883 ⟶ 4,340: € 20AC E2 82 AC 20AC ? 1D11E F0 9D 84 9E 1D11E </pre> =={{header\|V (Vlang)}}== <syntaxhighlight lang="v (vlang)">import encoding.hex fn decode(s string) ?[]u8 { return hex.decode(s) } fn main() { println("${'Char':-7} ${'Unicode':7}\tUTF-8 encoded\tDecoded") for codepoint in [`A`, `ö`, `Ж`, `€`, `𝄞`] { encoded := codepoint.bytes().hex() decoded := decode(encoded)? println("${codepoint:-7} U+${codepoint:04X}\t${encoded:-12}\t${decoded.bytestr()}") } }</syntaxhighlight> {{out}} <pre>Char Unicode UTF-8 encoded Decoded A U+0041 41 A ö U+00F6 c3b6 ö Ж U+0416 d096 Ж € U+20AC e282ac € 𝄞 U+1D11E f09d849e 𝄞 </pre> =={{header\|VBScript}}== <syntaxhighlight lang="vb"> Option Explicit Dim m_1,m_2,m_3,m_4 Dim d_2,d_3,d_4 Dim h_0,h_2,h_3,h_4 Dim mc_0,mc_2,mc_3,mc_4 m_1=&h3F d_2=m_1+1 m_2=m_1 d_2 d_3= (m_2 Or m_1)+1 m_3= m_2* d_2 d_4=(m_3 Or m_2 Or m_1)+1 h_0=&h80 h_2=&hC0 h_3=&hE0 h_4=&hF0 mc_0=&h3f mc_2=&h1F mc_3=&hF mc_4=&h7 Function cp2utf8(cp) 'cp as long, returns string If cp<&h80 Then cp2utf8=Chr(cp) ElseIf (cp <=&H7FF) Then cp2utf8=Chr(h_2 or (cp \ d_2) )&Chr(h_0 Or (cp And m_1)) ElseIf (cp <=&Hffff&) Then cp2utf8= Chr(h_3 Or (cp\ d_3)) & Chr(h_0 Or (cp And m_2)\d_2) & Chr(h_0 Or (cp And m_1)) Else cp2utf8= Chr(h_4 Or (cp\d_4))& Chr(h_0 Or ((cp And m_3) \d_3))& Chr(h_0 Or ((cp And m_2)\d_2)) & Chr(h_0 Or (cp And m_1)) End if End Function Function utf82cp(utf) 'utf as string, returns long Dim a,b,m m=strreverse(utf) b= Len(utf) a=asc(mid(m,1,1)) utf82cp=a And &h7f if b=1 Then Exit Function a=asc(mid(m,2,1)) If b=2 Then utf82cp= utf82cp Or (a And mc_2)d_2 :Exit function utf82cp= utf82cp Or (a And m_1)d_2 a=asc(mid(m,3,1)) If b=3 Then utf82cp= utf82cp Or (a And mc_3)d_3 :Exit function utf82cp= utf82cp Or (a And m_1)d_3 Or (a=asc(mid(m,4,1)) And mc_4)*d_4 End Function Sub print(s): On Error Resume Next WScript.stdout.Write (s) If err= &h80070006& Then WScript.Echo " Please run this script with CScript": WScript.quit End Sub Function utf8displ(utf) Dim s,i s="" For i=1 To Len(utf) s=s &" "& Hex(Asc(Mid(utf,i,1))) Next utf8displ= pad(s,12) End function function pad(s,n) if n<0 then pad= right(space(-n) & s ,-n) else pad= left(s& space(n),n) end if :end function Sub check(i) Dim c,c0,c1,c2,u c=b(i):c0=pad(c(0),29) :c1=c(1) :c2=pad(c(2),12):u=cp2utf8(c1) print c0 & " CP:" & pad("U+" & Hex(c1),-8) & " my utf8:" & utf8displ (u) & " should be:" & c2 & " back to CP:" & pad("U+" & Hex(utf82cp(u)),-8)& vbCrLf End Sub Dim b b=Array(_ Array("LATIN CAPITAL LETTER A ",&h41," 41"),_ Array("LATIN SMALL LETTER O WITH DIAERESIS ",&hF6," C3 B6"),_ Array("CYRILLIC CAPITAL LETTER ZHE ",&h416," D0 96"),_ Array("EURO SIGN",&h20AC," E2 82 AC "),_ Array("MUSICAL SYMBOL G CLEF ",&h1D11E," F0 9D 84 9E")) check 0 check 1 check 2 check 3 check 4 </syntaxhighlight> {{out}} <small> <pre> LATIN CAPITAL LETTER A CP: U+41 my utf8: 41 should be: 41 back to CP: U+41 LATIN SMALL LETTER O WITH DIA CP: U+F6 my utf8: C3 B6 should be: C3 B6 back to CP: U+F6 CYRILLIC CAPITAL LETTER ZHE CP: U+416 my utf8: D0 96 should be: D0 96 back to CP: U+416 EURO SIGN CP: U+20AC my utf8: E2 82 AC should be: E2 82 AC back to CP: U+20AC MUSICAL SYMBOL G CLEF CP: U+1D11E my utf8: F0 9D 84 9E should be: F0 9D 84 9E back to CP: U+1D11E </pre> </small> =={{header\|Wren}}== The utf8_decode function was translated from the Go entry. <syntaxhighlight lang="wren">import "./fmt" for Fmt var utf8_encode = Fn.new { \|cp\| String.fromCodePoint(cp).bytes.toList } var utf8_decode = Fn.new { \|b\| var mbMask = 0x3f // non-first bytes start 10 and carry 6 bits of data var b0 = b[0] if (b0 < 0x80) { return b0 } else if (b0 < 0xe0) { var b2Mask = 0x1f // first byte of a 2-byte encoding starts 110 and carries 5 bits of data return (b0 & b2Mask) << 6 \| (b[1] & mbMask) } else if (b0 < 0xf0) { var b3Mask = 0x0f // first byte of a 3-byte encoding starts 1110 and carries 4 bits of data return (b0 & b3Mask) << 12 \| (b[1] & mbMask) << 6 \| (b[2] & mbMask) } else { var b4Mask = 0x07 // first byte of a 4-byte encoding starts 11110 and carries 3 bits of data return (b0 & b4Mask) << 18 \| (b[1] & mbMask) << 12 \| (b[2] & mbMask) << 6 \| (b[3] & mbMask) } } var tests = [ ["LATIN CAPITAL LETTER A", 0x41], ["LATIN SMALL LETTER O WITH DIAERESIS", 0xf6], ["CYRILLIC CAPITAL LETTER ZHE", 0x416], ["EURO SIGN", 0x20ac], ["MUSICAL SYMBOL G CLEF", 0x1d11e] ] System.print("Character Name Unicode UTF-8 encoding (hex)") System.print("---------------------------------------------------------------------------------") for (test in tests) { var cp = test[1] var bytes = utf8_encode.call(cp) var utf8 = bytes.map { \|b\| Fmt.Xz(2, b) }.join(" ") var cp2 = utf8_decode.call(bytes) var uni = String.fromCodePoint(cp2) System.print("%(Fmt.s(-11, uni)) %(Fmt.s(-37, test[0])) U+%(Fmt.s(-8, Fmt.Xz(4, cp2))) %(utf8)") }</syntaxhighlight> {{out}} <pre> Character Name Unicode UTF-8 encoding (hex) --------------------------------------------------------------------------------- A LATIN CAPITAL LETTER A U+0041 41 ö LATIN SMALL LETTER O WITH DIAERESIS U+00F6 C3 B6 Ж CYRILLIC CAPITAL LETTER ZHE U+0416 D0 96 € EURO SIGN U+20AC E2 82 AC 𝄞 MUSICAL SYMBOL G CLEF U+1D11E F0 9D 84 9E </pre> =={{header\|zkl}}== <~~lang~~syntaxhighlight lang="zkl">println("Char Unicode UTF-8"); foreach utf,unicode_int in (T( T("\U41;",0x41), T("\Uf6;",0xf6), T("\U416;",0x416), T("\U20AC;",0x20ac), T("\U1D11E;",0x1d11e))){ Line 1,895 ⟶ 4,529: println("%s %s %9s %x".fmt(char,char2,"U+%x".fmt(unicode_int),utf_int)); }</~~lang~~syntaxhighlight> Int.len() --> number of bytes in int. This could be hard coded because UTF-8 has a max of 6 bytes and (0x41).toBigEndian(6) --> 0x41,0,0,0,0,0 which is