UTF-8 encode and decode: Difference between revisions

Revision as of 15:30, 5 May 2024

10,351 bytes added , 13 days ago

→‎{{header|langur}}

Revision as of 22:17, 2 May 2022 (view source) Chemoelectric (talk \| contribs) (→‎{{header\|ATS}}) ← Older edit		Latest revision as of 15:30, 5 May 2024 (view source) Langurmonkey (talk \| contribs) (→‎{{header\|langur}})
(17 intermediate revisions by 12 users not shown)
Line 3: As described in [[UTF-8]] and in [[wp:UTF-8\|Wikipedia]], UTF-8 is a popular encoding of (multi-byte) [[Unicode]] code-points into eight-bit octets. The goal of this task is to write an encoder that takes a unicode code-point (an integer representing a unicode character) and returns a sequence of ~~1-4~~1–4 bytes representing that character in the UTF-8 encoding. Then you have to write the corresponding decoder that takes a sequence of ~~1-4~~1–4 UTF-8 encoded bytes and returns the corresponding unicode character. Demonstrate the functionality of your encoder and decoder on the following five characters: Line 24: {{trans\|Python}} <~~lang~~syntaxhighlight lang="11l">F unicode_code(ch) R ‘U+’hex(ch.code).zfill(4) Line 33: V chars = [‘A’, ‘ö’, ‘Ж’, ‘€’] L(char) chars print(‘#<11 #<15 #<15’.format(char, unicode_code(char), utf8hex(char)))</~~lang~~syntaxhighlight> {{out}} Line 45: =={{header\|8th}}== <~~lang~~syntaxhighlight lang="8th"> hex \ so bytes print nicely Line 81: bye </syntaxhighlight> ~~</lang>~~ Output:<pre> A 41 Line 97: =={{header\|Action!}}== <~~lang~~syntaxhighlight ~~Action~~lang="action!">TYPE Unicode=[BYTE bc1,bc2,bc3] BYTE ARRAY hex=['0 '1 '2 '3 '4 '5 '6 '7 '8 '9 'A 'B 'C 'D 'E 'F] Line 246: StrUnicode(res) PutE() OD RETURN</~~lang~~syntaxhighlight> {{out}} [https://gitlab.com/amarok8bit/action-rosetta-code/-/raw/master/images/UTF-8_encode_and_decode.png Screenshot from Atari 8-bit computer] Line 269: {{works with\|Ada\|Ada\|2012}} <~~lang~~syntaxhighlight ~~Ada~~lang="ada">with Ada.Strings.Fixed; use Ada.Strings.Fixed; with Ada.Strings.UTF_Encoding.Wide_Wide_Strings; with Ada.Integer_Text_IO; Line 317: end; end loop; end UTF8_Encode_And_Decode;</~~lang~~syntaxhighlight> {{out}} <pre>Character Unicode UTF-8 encoding (hex) Line 330: =={{header\|ATS}}== The following code is ~~quite~~ long but consists largely of proofs. 
UTF-8 is a complicated encoding, but also a well defined ~~encoding,~~one ~~which~~that lends itself to compile-time verification methods. Despite this complexity, what actually gets generated is highly optimizable C code. Note that the following demonstration requires no ATS-specific library support whatsoever. <lang ATS>(* <syntaxhighlight lang="ats">(* UTF-8 encoding and decoding in ATS2. Line 2,109 ⟶ 2,111: println! ("SUCCESS") end</~~lang~~syntaxhighlight> {{out}} Line 2,116 ⟶ 2,118: =={{header\|AutoHotkey}}== <~~lang~~syntaxhighlight ~~AutoHotkey~~lang="autohotkey">Encode_UTF(hex){ Bytes := hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0 Prefix := [0, 0xC0, 0xE0, 0xF0] Line 2,150 ⟶ 2,152: DllCall("msvcrt.dll\" v, "Int64", value, "Str", s, "UInt", OutputBase, "CDECL") return s }</~~lang~~syntaxhighlight> Examples:<~~lang~~syntaxhighlight ~~AutoHotkey~~lang="autohotkey">data = (comment 0x0041 Line 2,166 ⟶ 2,168: } MsgBox % output return</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,179 ⟶ 2,181: =={{header\|BaCon}}== BaCon supports UTF8 natively. 
<~~lang~~syntaxhighlight lang="bacon">DECLARE x TYPE STRING CONST letter$ = "A ö Ж € 𝄞" Line 2,188 ⟶ 2,190: FOR x IN letter$ PRINT x, TAB$(1), "U+", HEX$(UCS(x)), TAB$(2), COIL$(LEN(x), HEX$(x[_-1] & 255)) NEXT</~~lang~~syntaxhighlight> {{out}} <pre>Char Unicode UTF-8 (hex) Line 2,199 ⟶ 2,201: =={{header\|C}}== <syntaxhighlight lang="c"> ~~<lang C>~~ #include <stdio.h> #include <stdlib.h> Line 2,310 ⟶ 2,312: return 0; } </syntaxhighlight> ~~</lang>~~ Output <syntaxhighlight lang="text"> Character Unicode UTF-8 encoding (hex) ---------------------------------------- Line 2,321 ⟶ 2,323: 𝄞 U+1d11e f0 9d 84 9e </syntaxhighlight> ~~</lang>~~ =={{header\|C sharp\|C#}}== <~~lang~~syntaxhighlight lang="csharp">using System; using System.Text; Line 2,352 ⟶ 2,354: 20AC € E2-82-AC 1D11E 𝄞 F0-9D-84-9E / </syntaxhighlight> ~~</lang>~~ Line 2,360 ⟶ 2,362: Helper functions <~~lang~~syntaxhighlight lang="lisp"> (defun ascii-byte-p (octet) "Return t if octet is a single-byte 7-bit ASCII char. Line 2,401 ⟶ 2,403: when (= (nth i templates) (logand (nth i bitmasks) octet)) return i))) </syntaxhighlight> ~~</lang>~~ Encoder <~~lang~~syntaxhighlight lang="lisp"> (defun unicode-to-utf-8 (int) "Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)." Line 2,436 ⟶ 2,438: ;; return the list of UTF-8 encoded bytes. byte-list)) </syntaxhighlight> ~~</lang>~~ Decoder <~~lang~~syntaxhighlight lang="lisp"> (defun utf-8-to-unicode (byte-list) "Take a list of one to four utf-8 encoded bytes (octets), return a code point." Line 2,460 ⟶ 2,462: (error "calculated number of bytes doesnt match the length of the byte list"))) (error "first byte in the list isnt a lead byte")))))) </syntaxhighlight> ~~</lang>~~ The test <~~lang~~syntaxhighlight lang="lisp"> (defun test-utf-8 () "Return t if the chosen unicode points are encoded and decoded correctly." 
Line 2,479 ⟶ 2,481: ;; return t if all are t (every #'= unicodes-orig unicodes-test))) </syntaxhighlight> ~~</lang>~~ Test output <~~lang~~syntaxhighlight lang="lisp"> CL-USER> (test-utf-8) character A, code point: 41, utf-8: 41 Line 2,491 ⟶ 2,493: character 𝄞, code point: 1D11E, utf-8: F0 9D 84 9E T </syntaxhighlight> ~~</lang>~~ =={{header\|D}}== <~~lang~~syntaxhighlight Dlang="d">import std.conv; import std.stdio; Line 2,506 ⟶ 2,508: writefln("%s %7X [%(%X, %)]", c, unicode, bytes); } }</~~lang~~syntaxhighlight> {{out}} Line 2,517 ⟶ 2,519: =={{header\|Elena}}== ELENA 46.x : <~~lang~~syntaxhighlight lang="elena">import system'routines; import extensions; Line 2,530 ⟶ 2,532: string printAsUTF8Array() { self.toByteArray().forEach::(b){ console.print(b.toString(16)," ") } } string printAsUTF32() { self.toArray().forEach::(c){ console.print("U+",c.toInt().toString(16)," ") } } } Line 2,555 ⟶ 2,557: "𝄞".printAsString().printAsUTF8Array().printAsUTF32(); console.printLine(); }</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,563 ⟶ 2,565: € E2 82 AC U+20AC 𝄞 F0 9D 84 9E U+1D11E</pre> =={{header\|FreeBASIC}}== {{trans\|VBScript}} <syntaxhighlight lang="vbnet">Function unicode_2_utf8(x As Long) As String Dim As String y Dim As Long r Select Case x Case 0 To &H7F y = Chr(x) Case &H80 To &H7FF y = Chr(192 + x \ 64) + Chr(128 + x Mod 64) Case &H800 To &H7FFF, 32768 To 65535 r = x \ 64 y = Chr(224 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + x Mod 64) Case &H10000 To &H10FFFF r = x \ 4096 y = Chr(240 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + (x \ 64) Mod 64) + Chr(128 + x Mod 64) Case Else Print "what else? " & x & " " & Hex(x) End Select Return y End Function Function utf8_2_unicode(x As String) As Long Dim As Long primero, segundo, tercero, cuarto Dim As Long total Select Case Len(x) Case 1 'one byte If Asc(x) < 128 Then total = Asc(x) Else Print "highest bit set error" End If Case 2 'two bytes and assume primero byte is leading byte If Asc(x) \ 32 = 6 Then primero = Asc(x) Mod 32 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 Else Print "mask error" End If Else Print "leading byte error" End If total = 64 * primero + segundo Case 3 'three bytes and assume primero byte is leading byte If Asc(x) \ 16 = 14 Then primero = Asc(x) Mod 16 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 If Asc(Mid(x, 3, 1)) \ 64 = 2 Then tercero = Asc(Mid(x, 3, 1)) Mod 64 Else Print "mask error last byte" End If Else Print "mask error middle byte" End If Else Print "leading byte error" End If total = 4096 * primero + 64 * segundo + tercero Case 4 'four bytes and assume primero byte is leading byte If Asc(x) \ 8 = 30 Then primero = Asc(x) Mod 8 If Asc(Mid(x, 2, 1)) \ 64 = 2 Then segundo = Asc(Mid(x, 2, 1)) Mod 64 If Asc(Mid(x, 3, 1)) \ 64 = 2 Then tercero = Asc(Mid(x, 3, 1)) Mod 64 If Asc(Mid(x, 4, 1)) \ 64 = 2 Then cuarto = Asc(Mid(x, 4, 1)) Mod 64 Else Print "mask error last byte" End If Else Print "mask error tercero byte" End If Else Print "mask error second byte" End If Else Print "mask error leading byte" End If total = Clng(262144 * primero + 4096 * segundo + 64 * tercero + cuarto) Case Else Print "more bytes than expected" End Select Return total End Function Dim As Long cp(4) = {65, 246, 1046, 8364, 119070} '[{&H0041,&H00F6,&H0416,&H20AC,&H1D11E}] Dim As String r, s Dim As integer i, j Print "ch unicode UTF-8 encoded decoded" For i = Lbound(cp) To Ubound(cp) Dim As Long cpi = cp(i) r = unicode_2_utf8(cpi) s = Hex(cpi) Print Chr(cpi); Space(10 - Len(s)); s; s = "" For j = 1 To Len(r) s &= Hex(Asc(Mid(r, j, 1))) & " " Next j Print Space(16 - 
Len(s)); s; s = Hex(utf8_2_unicode(r)) Print Space(8 - Len(s)); s Next i Sleep</syntaxhighlight> =={{header\|F_Sharp\|F#}}== <~~lang~~syntaxhighlight lang="fsharp"> // Unicode character point to UTF8. Nigel Galloway: March 19th., 2018 let fN g = match List.findIndex (fun n->n>g) [0x80;0x800;0x10000;0x110000] with Line 2,572 ⟶ 2,683: \|2->[0xe0+(g&&&0xf000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)] \|_->[0xf0+(g&&&0x1c0000>>>18);0x80+(g&&&0x3f000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)] </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 2,588 ⟶ 2,699: {{works with\|gforth\|0.7.9_20191121}} {{works with\|lxf\|1.6-982-823}} <~~lang~~syntaxhighlight lang="forth">: showbytes ( c-addr u -- ) over + swap ?do i c@ 3 .r loop ; Line 2,600 ⟶ 2,711: \ can also be written as \ 'A' test 'ö' test 'Ж' test '€' test '𝄞' test </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 2,612 ⟶ 2,723: If you also want to see the implementation of <code>xc!+</code> and <code>xc@+</code>, here it is (<code>u8!+</code> is the UTF-8 implementation of <code>xc!+</code>, and likewise for <code>u8@+</code>): <~~lang~~syntaxhighlight lang="forth">-77 Constant UTF-8-err $80 Constant max-single-byte Line 2,633 ⟶ 2,744: REPEAT $7F xor 2* or r> BEGIN over $80 u>= WHILE tuck c! 1+ REPEAT nip ; </syntaxhighlight> ~~</lang>~~ =={{header\|Go}}== ===Implementation=== This implementation is missing all checks for invalid data and so is not production-ready, but illustrates the basic UTF-8 encoding scheme. 
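The checks that the implementation in this section leaves out can be illustrated separately. The following is a minimal sketch, written in Python for brevity rather than Go, of a decoder for a single sequence that rejects malformed input. The function name `utf8_decode_checked` and the error messages are illustrative assumptions and are not part of the Go entry.

```python
def utf8_decode_checked(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence, raising ValueError on malformed input."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # The lead byte fixes the sequence length, the payload bits it carries,
    # and the smallest code point that legitimately needs that length.
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0
    elif 0xC0 <= lead < 0xE0:
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        # 0x80-0xBF is a stray continuation byte; 0xF8-0xFF is never valid.
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    if cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    return cp


if __name__ == "__main__":
    for seq in (b"\x41", b"\xc3\xb6", b"\xd0\x96", b"\xe2\x82\xac", b"\xf0\x9d\x84\x9e"):
        print(seq.hex(" "), "->", "U+%04X" % utf8_decode_checked(seq))
```

The overlong check matters in practice because a lenient decoder would accept `C0 81` as U+0001, giving the same character more than one byte representation.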
<~~lang~~syntaxhighlight lang="go">package main import ( Line 2,739 ⟶ 2,850: rune(b[3]&mbMask) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,750 ⟶ 2,861: ===Library/language=== <~~lang~~syntaxhighlight lang="go">package main import ( Line 2,775 ⟶ 2,886: fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,787 ⟶ 2,898: Alternately: <~~lang~~syntaxhighlight lang="go">package main import ( Line 2,808 ⟶ 2,919: fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded) } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,821 ⟶ 2,932: =={{header\|Groovy}}== {{trans\|Java}} <~~lang~~syntaxhighlight lang="groovy">import java.nio.charset.StandardCharsets class UTF8EncodeDecode { Line 2,847 ⟶ 2,958: } } }</~~lang~~syntaxhighlight> {{out}} <pre>Char Name Unicode UTF-8 encoded Decoded Line 2,860 ⟶ 2,971: Example makes use of [http://hackage.haskell.org/package/bytestring <tt>bytestring</tt>] and [http://hackage.haskell.org/package/text <tt>text</tt>] packages: <~~lang~~syntaxhighlight lang="haskell">module Main (main) where import qualified Data.ByteString as ByteString (pack, unpack) Line 2,887 ⟶ 2,998: (printf "U+%04X" codepoint :: String) (intercalate " " (map (printf "%02X") values)) codepoint'</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,901 ⟶ 3,012: =={{header\|J}}== '''Solution:''' <~~lang~~syntaxhighlight lang="j">utf8=: 8&u: NB. converts to UTF-8 from unicode or unicode codepoint integer ucp=: 9&u: NB. converts to unicode from UTF-8 or unicode codepoint integer ucp_hex=: hfd@(3 u: ucp) NB. 
converts to unicode codepoint hexadecimal from UTF-8, unicode or unicode codepoint integer</~~lang~~syntaxhighlight> '''Examples:''' <~~lang~~syntaxhighlight lang="j"> utf8 65 246 1046 8364 119070 AöЖ€𝄞 ucp 65 246 1046 8364 119070 Line 2,921 ⟶ 3,032: 1d11e utf8@dfh ucp_hex utf8 65 246 1046 8364 119070 AöЖ€𝄞</~~lang~~syntaxhighlight> =={{header\|Java}}== {{works with\|Java\|7+}} <~~lang~~syntaxhighlight lang="java">import java.nio.charset.StandardCharsets; import java.util.Formatter; Line 2,954 ⟶ 3,065: } } }</~~lang~~syntaxhighlight> {{out}} <pre> Line 2,967 ⟶ 3,078: =={{header\|JavaScript}}== An implementation in ECMAScript 2015 (ES6): <~~lang~~syntaxhighlight lang="javascript"> /**************************************************************************\ \| Pure UTF-8 handling without detailed error reporting functionality. \| Line 3,027 ⟶ 3,138: ?( m&0x07)<<18\|( n&0x3f)<<12\|( o&0x3f)<<6\|( p&0x3f)<<0 :(()=>{throw'Invalid UTF-8 encoding!'})() </syntaxhighlight> ~~</lang>~~ The testing inputs: <~~lang~~syntaxhighlight lang="javascript"> const str= Line 3,048 ⟶ 3,159: :[ [ a,b,c]] ,inputs=zip3(str,cps,cus); </syntaxhighlight> ~~</lang>~~ The testing code: <~~lang~~syntaxhighlight lang="javascript"> console.log(`\ ${'Character'.padEnd(16)}\ Line 3,066 ⟶ 3,177: ${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\ ${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`) </syntaxhighlight> ~~</lang>~~ and finally, the output from the test: <pre> Line 3,082 ⟶ 3,193: '''Preliminaries''' <~~lang~~syntaxhighlight lang="jq"># input: a decimal integer # output: the corresponding binary array, most significant bit first def binary_digits: Line 3,096 ⟶ 3,207: .result += .power $b \| .power = 2) \| .result;</~~lang~~syntaxhighlight> '''Encode to UTF-8''' <~~lang~~syntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint. # output: the corresponding decimal number of that codepoint. 
def utf8_encode: Line 3,118 ⟶ 3,229: end \| map(binary_to_decimal) end;</~~lang~~syntaxhighlight> '''Decode an array of UTF-8 bytes''' <~~lang~~syntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint. # output: the corresponding decimal number of that codepoint. def utf8_decode: Line 3,138 ⟶ 3,249: else $d[0][-3:] + $d[1][$mb:] + $d[2][$mb:] + $d[3][$mb:] end \| binary_to_decimal ;</~~lang~~syntaxhighlight> '''Task''' <~~lang~~syntaxhighlight lang="jq">def task: [ "A", "ö", "Ж", "€", "𝄞" ][] \| . as $glyph Line 3,148 ⟶ 3,259: \| "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]\|implode)" ; task</~~lang~~syntaxhighlight> {{out}} <pre> Line 3,159 ⟶ 3,270: =={{header\|Julia}}== Julia supports the UTF-8 encoding (and others through packages). ~~{{works with\|Julia\|0.6}}~~ <syntaxhighlight lang="julia"> ~~Julia supports by default UTF-8 encoding.~~ for t in ("A", "ö", "Ж", "€", "𝄞") ~~<lang~~ ~~julia>for~~ t in println(~~"A"~~t, "~~ö",~~ ~~"Ж",~~→ "€", ~~"𝄞"~~codeunits(t)) end ~~enc = Vector{UInt8}(t)~~ </syntaxhighlight> ~~dec = String(enc)~~ ~~println(dec, " → ", enc)~~ ~~end</lang>~~ {{out}} Line 3,177 ⟶ 3,286: =={{header\|Kotlin}}== <~~lang~~syntaxhighlight lang="scala">// version 1.1.2 fun utf8Encode(codePoint: Int) = String(intArrayOf(codePoint), 0, 1).toByteArray(Charsets.UTF_8) Line 3,196 ⟶ 3,305: System.out.printf("%-${n}s %c\n", s, decoded) } }</~~lang~~syntaxhighlight> {{out}} Line 3,209 ⟶ 3,318: =={{header\|langur}}== <syntaxhighlight lang="langur">writeln "character Unicode UTF-8 encoding (hex)" ~~{{works with\|langur\|0.8.4}}~~ ~~<lang langur>writeln "character Unicode UTF-8 encoding (hex)"~~ for .cp in "AöЖ€𝄞" { val .utf8 = s2b cp2s .cp val .cpstr = b2s .utf8 val .utf8rep = join " ", map ffn .b: $"\{{.b:X02;}}", .utf8 writeln $"\{{.cpstr:-11;}} U+\{{.cp:X04:-8;}} \{{.utf8rep;}}" } ~~}</lang>~~ </syntaxhighlight> {{out}} Line 3,235 ⟶ 3,344: - byteArray.toHexString (intStart, 
intLen): returns hex string representation of byte array (e.g. for printing)<br /> - byteArray.readRawString (intLen, [strCharSet="UTF-8"]): reads a fixed number of bytes as a string <~~lang~~syntaxhighlight ~~Lingo~~lang="lingo">chars = ["A", "ö", "Ж", "€", "𝄞"] put "Character Unicode (int) UTF-8 (hex) Decoded" repeat with c in chars ba = bytearray(c) put col(c, 12) & col(charToNum(c), 16) & col(ba.toHexString(1, ba.length), 14) & ba.readRawString(ba.length) end repeat</~~lang~~syntaxhighlight> Helper function for table formatting <~~lang~~syntaxhighlight ~~Lingo~~lang="lingo">on col (val, len) str = string(val) repeat with i = str.length+1 to len Line 3,248 ⟶ 3,357: end repeat return str end</~~lang~~syntaxhighlight> {{out}} <pre> Line 3,261 ⟶ 3,370: =={{header\|Lua}}== {{works with\|Lua\|5.3}} <syntaxhighlight lang="lua"> ~~<lang Lua>~~ -- Accept an integer representing a codepoint. -- Return the values of the individual octets. Line 3,296 ⟶ 3,405: end end </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 3,317 ⟶ 3,426: =={{header\|M2000 Interpreter}}== <syntaxhighlight lang="m2000 interpreter"> ~~<lang M2000 Interpreter>~~ Module EncodeDecodeUTF8 { a$=string$("Hello" as UTF8enc) Line 3,337 ⟶ 3,446: } EncodeDecodeUTF8 </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 3,352 ⟶ 3,461: =={{header\|Mathematica}}/{{header\|Wolfram Language}}== <~~lang~~syntaxhighlight ~~Mathematica~~lang="mathematica">utf = ToCharacterCode[ToString["AöЖ€", CharacterEncoding -> "UTF8"]] ToCharacterCode[FromCharacterCode[utf, "UTF8"]]</~~lang~~syntaxhighlight> {{out}} <pre>{65, 195, 182, 208, 150, 226, 130, 172} Line 3,365 ⟶ 3,474: For this purpose, using sequences or bytes is not natural. Here is a way to proceed using the module “unicode”. <~~lang~~syntaxhighlight ~~Nim~~lang="nim">import unicode, sequtils, strformat, strutils const UChars = ["\u0041", "\u00F6", "\u0416", "\u20AC", "\u{1D11E}"] Line 3,385 ⟶ 3,494: r = s.toRune # Display. 
echo &"""{uchar:>5} U+{r.int.toHex(5)} {s.map(toHex).join(" ")}"""</~~lang~~syntaxhighlight> {{out}} Line 3,398 ⟶ 3,507: In this section, we provide two procedures to convert a Unicode code point to a UTF-8 sequence of bytes and conversely, without using the module “unicode”. We provide also a procedure to convert a sequence of bytes to a string in order to print it. The algorithm is the one used by the Go solution. <~~lang~~syntaxhighlight ~~Nim~~lang="nim">import sequtils, strformat, strutils const Line 3,470 ⟶ 3,579: # Display. echo &"""{s.toString:>5} U+{c.int.toHex(5)} {s.map(toHex).join(" ")}""" </syntaxhighlight> ~~</lang>~~ {{out}} Line 3,476 ⟶ 3,585: =={{header\|Perl}}== <~~lang~~syntaxhighlight lang="perl">#!/usr/bin/perl use strict; use warnings; Line 3,495 ⟶ 3,604: } split //, $utf8; print "\n"; } @chars;</~~lang~~syntaxhighlight> {{out}} Line 3,513 ⟶ 3,622: As requested in the task description: <!--<~~lang~~syntaxhighlight ~~Phix~~lang="phix">--> <span style="color: #008080;">constant</span> <span style="color: #000000;">tests</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">#0041</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#00F6</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#0416</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#20AC</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#1D11E</span><span style="color: #0000FF;">}</span> Line 3,526 ⟶ 3,635: <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x -> {%s} -> {%s}\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">codepoint</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span 
style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%02x"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x"</span><span style="color: #0000FF;">)})</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <!--</~~lang~~syntaxhighlight>--> {{out}} Line 3,538 ⟶ 3,647: =={{header\|Processing}}== <~~lang~~syntaxhighlight lang="java">import java.nio.charset.StandardCharsets; Integer[] code_points = {0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}; Line 3,565 ⟶ 3,674: tel_1 += 30; tel_2 = 50; } }</~~lang~~syntaxhighlight> Line 3,572 ⟶ 3,681: The encoding and decoding procedure are kept simple and designed to work with an array of 5 elements for input/output of the UTF-8 encoding for a single code point at a time. It was decided not to use a more elaborate example that would have been able to operate on a buffer to encode/decode more than one code point at a time. 
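For comparison, the more elaborate buffer-oriented variant mentioned above could look roughly like the following Python sketch, which encodes a list of code points into one byte buffer and walks that buffer to recover them. The helper names `utf8_encode`, `encode_buffer` and `decode_buffer` are hypothetical and do not correspond to the PureBasic procedures; validation is deliberately limited to the lead byte and truncation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point; no range or surrogate validation in this sketch."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_buffer(code_points) -> bytes:
    """Concatenate the encodings of several code points into one buffer."""
    out = bytearray()
    for cp in code_points:
        out += utf8_encode(cp)
    return bytes(out)


def decode_buffer(buf: bytes) -> list:
    """Walk the buffer, using each lead byte to find where the next sequence ends."""
    result = []
    i = 0
    while i < len(buf):
        lead = buf[i]
        if lead < 0x80:
            n, cp = 1, lead
        elif lead >> 5 == 0b110:
            n, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            n, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            n, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + n > len(buf):
            raise ValueError("truncated sequence at offset %d" % i)
        for b in buf[i + 1:i + n]:
            cp = (cp << 6) | (b & 0x3F)  # append six payload bits per continuation byte
        result.append(cp)
        i += n
    return result


if __name__ == "__main__":
    points = [0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E]
    buf = encode_buffer(points)
    print(buf.hex(" "))
    print(["U+%04X" % cp for cp in decode_buffer(buf)])
```

Because every lead byte announces its own sequence length, the decoder needs no separator between code points, which is what makes the single-code-point procedures easy to extend to a buffer.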
<~~lang~~syntaxhighlight lang="purebasic">#UTF8_codePointMaxByteCount = 4 ;UTF-8 encoding uses only a maximum of 4 bytes to encode a codepoint Procedure UTF8_encode(x, Array encoded_codepoint.a(1)) ;x is codepoint to encode, the array will contain output Line 3,683 ⟶ 3,792: Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input() CloseConsole() EndIf</~~lang~~syntaxhighlight> Sample output: <pre> Unicode UTF-8 Decoded Line 3,696 ⟶ 3,805: =={{header\|Python}}== <~~lang~~syntaxhighlight lang="python"> #!/usr/bin/env python3 from unicodedata import name Line 3,713 ⟶ 3,822: chars = ['A', 'ö', 'Ж', '€', '𝄞'] for char in chars: print('{:<11} {:<36} {:<15} {:<15}'.format(char, name(char), unicode_code(char), utf8hex(char)))</~~lang~~syntaxhighlight> {{out}} <pre>Character Name Unicode UTF-8 encoding (hex) Line 3,723 ⟶ 3,832: =={{header\|Racket}}== <~~lang~~syntaxhighlight lang="racket">#lang racket (define char-map Line 3,739 ⟶ 3,848: (map (curryr number->string 16) bites) (bytes->string/utf-8 (list->bytes bites)) name)))</~~lang~~syntaxhighlight> {{out}} <pre>#\A A (41) A LATIN-CAPITAL-LETTER-A Line 3,751 ⟶ 3,860: {{works with\|Rakudo\|2017.02}} Pretty much all built in to the language. 
<syntaxhighlight lang="raku" ~~perl6~~line>say sprintf("%-18s %-36s\|%8s\| %7s \|%14s \| %s\n", 'Character\|', 'Name', 'Ordinal', 'Unicode', 'UTF-8 encoded', 'decoded'), '-' x 100; for < A ö Ж € 𝄞 😜 👨‍👩‍👧‍👦> -> $char { printf " %-5s \| %-43s \| %6s \| %-7s \| %12s \|%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '), ('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode; }</~~lang~~syntaxhighlight> {{out}} <pre>Character\| Name \| Ordinal\| Unicode \| UTF-8 encoded \| decoded Line 3,771 ⟶ 3,880: =={{header\|Ruby}}== <~~lang~~syntaxhighlight lang="ruby"> character_arr = ["A","ö","Ж","€","𝄞"] for c in character_arr do Line 3,779 ⟶ 3,888: puts "" end </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 3,805 ⟶ 3,914: =={{header\|Rust}}== <~~lang~~syntaxhighlight lang="rust">fn main() { let chars = vec!('A', 'ö', 'Ж', '€', '𝄞'); chars.iter().for_each(\|c\| { Line 3,815 ⟶ 3,924: }); } </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 3,827 ⟶ 3,936: =={{header\|Scala}}== === Imperative solution=== <~~lang~~syntaxhighlight lang="scala">object UTF8EncodeAndDecode extends App { val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E) Line 3,849 ⟶ 3,958: printf(s"%-${w}c %-36s %-7s %-${16 - w}s%c%n", codePoint, Character.getName(codePoint), leftAlignedHex, s, utf8Decode(bytes)) }</~~lang~~syntaxhighlight> === Functional solution=== <~~lang~~syntaxhighlight lang="scala">import java.nio.charset.StandardCharsets object UTF8EncodeAndDecode extends App { Line 3,876 ⟶ 3,985: println(s"\nSuccessfully completed without errors. 
[total ${scala.compat.Platform.currentTime - executionStart} ms]") }</~~lang~~syntaxhighlight> === Composable and testable solution=== <~~lang~~syntaxhighlight lang="scala">package example object UTF8EncodeAndDecode extends TheMeat with App { Line 3,911 ⟶ 4,020: } </syntaxhighlight> ~~</lang>~~ =={{header\|Seed7}}== <~~lang~~syntaxhighlight lang="seed7">$ include "seed7_05.s7i"; include "unicode.s7i"; include "console.s7i"; Line 3,928 ⟶ 4,037: writeln("-------------------------------------------------"); for ch range "AöЖ€𝄞" do utf8 := ~~striToUtf8~~toUtf8(str(ch)); writeln(ch rpad 11 <& "U+" <& ord(ch) radix 16 lpad0 4 rpad 7 <& hex(utf8) rpad 22 <& ~~utf8ToStri~~fromUtf8(utf8)); end for; end func;</~~lang~~syntaxhighlight> {{out}} Line 3,946 ⟶ 4,055: =={{header\|Sidef}}== <~~lang~~syntaxhighlight lang="ruby">func utf8_encoder(Number code) { code.chr.encode('UTF-8').bytes.map{.chr} } Line 3,959 ⟶ 4,068: assert_eq(n, decoded.ord) say "#{decoded} -> #{encoded}" }</~~lang~~syntaxhighlight> {{out}} <pre> Line 3,971 ⟶ 4,080: =={{header\|Swift}}== In Swift there's a difference between UnicodeScalar, which is a single unicode code point, and Character which may consist out of multiple UnicodeScalars, usually because of combining characters. <~~lang~~syntaxhighlight ~~Swift~~lang="swift">import Foundation func encode(_ scalar: UnicodeScalar) -> Data { Line 3,992 ⟶ 4,101: print("character: \(decoded), code point: U+\(String(scalar.value, radix: 16)), \tutf-8: \(formattedBytes)") } </syntaxhighlight> ~~</lang>~~ {{out}} <pre> Line 4,004 ⟶ 4,113: =={{header\|Tcl}}== Note: Tcl can handle Unicodes only up to U+FFFD, i.e. the Basic Multilingual Plane (BMP, 16 bits wide). Therefore, the fifth test fails as expected. 
<~~lang~~syntaxhighlight ~~Tcl~~lang="tcl">proc encoder int { set u [format %c $int] set bytes {} Line 4,023 ⟶ 4,132: lappend res [encoder $test] -> [decoder [encoder $test]] puts $res }</~~lang~~syntaxhighlight> <pre> 0x0041 41 -> A Line 4,035 ⟶ 4,144: While perhaps not as readable as the above, this version handles beyond-BMP codepoints by manually composing the utf-8 byte sequences and emitting raw bytes to the console. <tt>encoding convertto utf-8</tt> command still does the heavy lifting where it can. <~~lang~~syntaxhighlight ~~Tcl~~lang="tcl">proc utf8 {codepoint} { scan $codepoint %llx cp if {$cp < 0x10000} { Line 4,062 ⟶ 4,171: set utf8 [utf8 $codepoint] puts "[format U+%04s $codepoint]\t$utf8\t[hexchars $utf8]" }</~~lang~~syntaxhighlight> {{out}}<pre>U+0041 A 41 Line 4,070 ⟶ 4,179: U+1D11E 𝄞 f0 9d 84 9e </pre> =={{header\|UNIX Shell}}== {{works with\|Bourne Again SHell}} {{works with\|Korn Shell}} {{works with\|Zsh}} Works with locale set to UTF-8. <syntaxhighlight lang="bash">function encode { typeset -i code_point=$1 printf "$(printf '\\U%08X\\n' "$code_point")" } function decode { typeset character=$1 printf 'U+%04X\n' "'$character" set +x } printf 'Char\tCode Point\tUTF-8 Bytes\n' for test in A ö Ж € 𝄞; do code_point=$(decode "$test") utf8=$(encode "$(( 16#${code_point#U+} ))") bytes=$(printf '%b' "$utf8" \| od -An -tx1 \| sed -nE '/./s/^ \| $//p') printf '%-4b\t%-10s\t%s\n' "$utf8" "$code_point" "$bytes" done</syntaxhighlight> {{Out}} <pre style="font-family: Consolas,Courier,monospace">Char Code Point UTF-8 Bytes A U+0041 41 ö U+00F6 c3 b6 Ж U+0416 d0 96 € U+20AC e2 82 ac 𝄞 U+1D11E f0 9d 84 9e</pre> =={{header\|VBA}}== <~~lang~~syntaxhighlight VBlang="vb">Private Function unicode_2_utf8(x As Long) As Byte() Dim y() As Byte Dim r As Long Line 4,193 ⟶ 4,334: Debug.Print String$(8 - Len(s), " "); s Next cpi End Sub</~~lang~~syntaxhighlight>{{out}}<pre>ch unicode UTF-8 encoded decoded A 41 41 41 ö F6 C3 B6 F6 Line 4,200 ⟶ 4,341: ? 
1D11E F0 9D 84 9E 1D11E </pre> =={{header\|V (Vlang)}}== <syntaxhighlight lang="v (vlang)">import encoding.hex fn decode(s string) ?[]u8 { return hex.decode(s) } fn main() { println("${'Char':-7} ${'Unicode':7}\tUTF-8 encoded\tDecoded") for codepoint in [`A`, `ö`, `Ж`, `€`, `𝄞`] { encoded := codepoint.bytes().hex() decoded := decode(encoded)? println("${codepoint:-7} U+${codepoint:04X}\t${encoded:-12}\t${decoded.bytestr()}") } }</syntaxhighlight> {{out}} <pre>Char Unicode UTF-8 encoded Decoded A U+0041 41 A ö U+00F6 c3b6 ö Ж U+0416 d096 Ж € U+20AC e282ac € 𝄞 U+1D11E f09d849e 𝄞 </pre> =={{header\|VBScript}}== <syntaxhighlight lang="vb"> Option Explicit Dim m_1,m_2,m_3,m_4 Dim d_2,d_3,d_4 Dim h_0,h_2,h_3,h_4 Dim mc_0,mc_2,mc_3,mc_4 m_1=&h3F d_2=m_1+1 m_2=m_1*d_2 d_3= (m_2 Or m_1)+1 m_3= m_2* d_2 d_4=(m_3 Or m_2 Or m_1)+1 h_0=&h80 h_2=&hC0 h_3=&hE0 h_4=&hF0 mc_0=&h3f mc_2=&h1F mc_3=&hF mc_4=&h7 Function cp2utf8(cp) 'cp as long, returns string If cp<&h80 Then cp2utf8=Chr(cp) ElseIf (cp <=&H7FF) Then cp2utf8=Chr(h_2 or (cp \ d_2) )&Chr(h_0 Or (cp And m_1)) ElseIf (cp <=&Hffff&) Then cp2utf8= Chr(h_3 Or (cp\ d_3)) & Chr(h_0 Or (cp And m_2)\d_2) & Chr(h_0 Or (cp And m_1)) Else cp2utf8= Chr(h_4 Or (cp\d_4))& Chr(h_0 Or ((cp And m_3) \d_3))& Chr(h_0 Or ((cp And m_2)\d_2)) & Chr(h_0 Or (cp And m_1)) End if End Function Function utf82cp(utf) 'utf as string, returns long Dim a,b,m m=strreverse(utf) b= Len(utf) a=asc(mid(m,1,1)) utf82cp=a And &h7f if b=1 Then Exit Function a=asc(mid(m,2,1)) If b=2 Then utf82cp= utf82cp Or (a And mc_2)*d_2 :Exit function utf82cp= utf82cp Or (a And m_1)*d_2 a=asc(mid(m,3,1)) If b=3 Then utf82cp= utf82cp Or (a And mc_3)*d_3 :Exit function utf82cp= utf82cp Or (a And m_1)*d_3 Or (asc(mid(m,4,1)) And mc_4)*d_4 End Function Sub print(s): On Error Resume Next WScript.stdout.Write (s) If err= &h80070006& Then WScript.Echo " Please run this script with CScript": WScript.quit End Sub Function utf8displ(utf) Dim s,i s="" For i=1 To Len(utf) s=s &" "& 
Hex(Asc(Mid(utf,i,1))) Next utf8displ= pad(s,12) End function function pad(s,n) if n<0 then pad= right(space(-n) & s ,-n) else pad= left(s& space(n),n) end if :end function Sub check(i) Dim c,c0,c1,c2,u c=b(i):c0=pad(c(0),29) :c1=c(1) :c2=pad(c(2),12):u=cp2utf8(c1) print c0 & " CP:" & pad("U+" & Hex(c1),-8) & " my utf8:" & utf8displ (u) & " should be:" & c2 & " back to CP:" & pad("U+" & Hex(utf82cp(u)),-8)& vbCrLf End Sub Dim b b=Array(_ Array("LATIN CAPITAL LETTER A ",&h41," 41"),_ Array("LATIN SMALL LETTER O WITH DIAERESIS ",&hF6," C3 B6"),_ Array("CYRILLIC CAPITAL LETTER ZHE ",&h416," D0 96"),_ Array("EURO SIGN",&h20AC," E2 82 AC "),_ Array("MUSICAL SYMBOL G CLEF ",&h1D11E," F0 9D 84 9E")) check 0 check 1 check 2 check 3 check 4 </syntaxhighlight> {{out}} <small> <pre> LATIN CAPITAL LETTER A CP: U+41 my utf8: 41 should be: 41 back to CP: U+41 LATIN SMALL LETTER O WITH DIA CP: U+F6 my utf8: C3 B6 should be: C3 B6 back to CP: U+F6 CYRILLIC CAPITAL LETTER ZHE CP: U+416 my utf8: D0 96 should be: D0 96 back to CP: U+416 EURO SIGN CP: U+20AC my utf8: E2 82 AC should be: E2 82 AC back to CP: U+20AC MUSICAL SYMBOL G CLEF CP: U+1D11E my utf8: F0 9D 84 9E should be: F0 9D 84 9E back to CP: U+1D11E </pre> </small> =={{header\|Wren}}== The utf8_decode function was translated from the Go entry. 
<~~lang~~syntaxhighlight ~~ecmascript~~lang="wren">import "./fmt" for Fmt var utf8_encode = Fn.new { \|cp\| String.fromCodePoint(cp).bytes.toList } Line 4,242 ⟶ 4,506: var uni = String.fromCodePoint(cp2) System.print("%(Fmt.s(-11, uni)) %(Fmt.s(-37, test[0])) U+%(Fmt.s(-8, Fmt.Xz(4, cp2))) %(utf8)") }</~~lang~~syntaxhighlight> {{out}} Line 4,256 ⟶ 4,520: =={{header\|zkl}}== <~~lang~~syntaxhighlight lang="zkl">println("Char Unicode UTF-8"); foreach utf,unicode_int in (T( T("\U41;",0x41), T("\Uf6;",0xf6), T("\U416;",0x416), T("\U20AC;",0x20ac), T("\U1D11E;",0x1d11e))){ Line 4,265 ⟶ 4,529: println("%s %s %9s %x".fmt(char,char2,"U+%x".fmt(unicode_int),utf_int)); }</~~lang~~syntaxhighlight> Int.len() --> number of bytes in int. This could be hard coded because UTF-8 has a max of 6 bytes and (0x41).toBigEndian(6) --> 0x41,0,0,0,0,0 which is