UTF-8 encode and decode: Difference between revisions

 
(17 intermediate revisions by 12 users not shown)
Line 3:
As described in [[UTF-8]] and in [[wp:UTF-8|Wikipedia]], UTF-8 is a popular encoding of (multi-byte) [[Unicode]] code-points into eight-bit octets.
 
The goal of this task is to write a encoder that takes a unicode code-point (an integer representing a unicode character) and returns a sequence of 1-41–4 bytes representing that character in the UTF-8 encoding.
 
Then you have to write the corresponding decoder that takes a sequence of 1-41–4 UTF-8 encoded bytes and return the corresponding unicode character.
 
Demonstrate the functionality of your encoder and decoder on the following five characters:
Line 24:
{{trans|Python}}
 
<langsyntaxhighlight lang="11l">F unicode_code(ch)
R ‘U+’hex(ch.code).zfill(4)
 
Line 33:
V chars = [‘A’, ‘ö’, ‘Ж’, ‘€’]
L(char) chars
print(‘#<11 #<15 #<15’.format(char, unicode_code(char), utf8hex(char)))</langsyntaxhighlight>
 
{{out}}
Line 45:
 
=={{header|8th}}==
<langsyntaxhighlight lang="8th">
hex \ so bytes print nicely
 
Line 81:
 
bye
</syntaxhighlight>
</lang>
Output:<pre>
A 41
Line 97:
 
=={{header|Action!}}==
<langsyntaxhighlight Actionlang="action!">TYPE Unicode=[BYTE bc1,bc2,bc3]
BYTE ARRAY hex=['0 '1 '2 '3 '4 '5 '6 '7 '8 '9 'A 'B 'C 'D 'E 'F]
 
Line 246:
StrUnicode(res) PutE()
OD
RETURN</langsyntaxhighlight>
{{out}}
[https://gitlab.com/amarok8bit/action-rosetta-code/-/raw/master/images/UTF-8_encode_and_decode.png Screenshot from Atari 8-bit computer]
Line 269:
{{works with|Ada|Ada|2012}}
 
<langsyntaxhighlight Adalang="ada">with Ada.Strings.Fixed; use Ada.Strings.Fixed;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Integer_Text_IO;
Line 317:
end;
end loop;
end UTF8_Encode_And_Decode;</langsyntaxhighlight>
{{out}}
<pre>Character Unicode UTF-8 encoding (hex)
Line 330:
=={{header|ATS}}==
 
The following code is quite long but consists largely of proofs. UTF-8 is a complicated encoding, but also a well defined encoding,one whichthat lends itself to compile-time verification methods.
 
Despite this complexity, what actually gets generated is highly optimizable C code. Note that the following demonstration requires no ATS-specific library support whatsoever.
<lang ATS>(*
 
<syntaxhighlight lang="ats">(*
 
UTF-8 encoding and decoding in ATS2.
Line 2,109 ⟶ 2,111:
 
println! ("SUCCESS")
end</langsyntaxhighlight>
 
{{out}}
Line 2,116 ⟶ 2,118:
 
=={{header|AutoHotkey}}==
<langsyntaxhighlight AutoHotkeylang="autohotkey">Encode_UTF(hex){
Bytes := hex>=0x10000 ? 4 : hex>=0x0800 ? 3 : hex>=0x0080 ? 2 : hex>=0x0001 ? 1 : 0
Prefix := [0, 0xC0, 0xE0, 0xF0]
Line 2,150 ⟶ 2,152:
DllCall("msvcrt.dll\" v, "Int64", value, "Str", s, "UInt", OutputBase, "CDECL")
return s
}</langsyntaxhighlight>
Examples:<langsyntaxhighlight AutoHotkeylang="autohotkey">data =
(comment
0x0041
Line 2,166 ⟶ 2,168:
}
MsgBox % output
return</langsyntaxhighlight>
{{out}}
<pre>
Line 2,179 ⟶ 2,181:
=={{header|BaCon}}==
BaCon supports UTF8 natively.
<langsyntaxhighlight lang="bacon">DECLARE x TYPE STRING
 
CONST letter$ = "A ö Ж € 𝄞"
Line 2,188 ⟶ 2,190:
FOR x IN letter$
PRINT x, TAB$(1), "U+", HEX$(UCS(x)), TAB$(2), COIL$(LEN(x), HEX$(x[_-1] & 255))
NEXT</langsyntaxhighlight>
{{out}}
<pre>Char Unicode UTF-8 (hex)
Line 2,199 ⟶ 2,201:
 
=={{header|C}}==
<syntaxhighlight lang="c">
<lang C>
#include <stdio.h>
#include <stdlib.h>
Line 2,310 ⟶ 2,312:
return 0;
}
</syntaxhighlight>
</lang>
Output
<syntaxhighlight lang="text">
Character Unicode UTF-8 encoding (hex)
----------------------------------------
Line 2,321 ⟶ 2,323:
𝄞 U+1d11e f0 9d 84 9e
 
</syntaxhighlight>
</lang>
 
=={{header|C sharp|C#}}==
 
<langsyntaxhighlight lang="csharp">using System;
using System.Text;
 
Line 2,352 ⟶ 2,354:
20AC € E2-82-AC
1D11E 𝄞 F0-9D-84-9E */
</syntaxhighlight>
</lang>
 
 
Line 2,360 ⟶ 2,362:
Helper functions
 
<langsyntaxhighlight lang="lisp">
(defun ascii-byte-p (octet)
"Return t if octet is a single-byte 7-bit ASCII char.
Line 2,401 ⟶ 2,403:
when (= (nth i templates) (logand (nth i bitmasks) octet))
return i)))
</syntaxhighlight>
</lang>
 
Encoder
 
<langsyntaxhighlight lang="lisp">
(defun unicode-to-utf-8 (int)
"Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)."
Line 2,436 ⟶ 2,438:
;; return the list of UTF-8 encoded bytes.
byte-list))
</syntaxhighlight>
</lang>
 
Decoder
 
<langsyntaxhighlight lang="lisp">
(defun utf-8-to-unicode (byte-list)
"Take a list of one to four utf-8 encoded bytes (octets), return a code point."
Line 2,460 ⟶ 2,462:
(error "calculated number of bytes doesnt match the length of the byte list")))
(error "first byte in the list isnt a lead byte"))))))
</syntaxhighlight>
</lang>
 
The test
 
<langsyntaxhighlight lang="lisp">
(defun test-utf-8 ()
"Return t if the chosen unicode points are encoded and decoded correctly."
Line 2,479 ⟶ 2,481:
;; return t if all are t
(every #'= unicodes-orig unicodes-test)))
</syntaxhighlight>
</lang>
 
Test output
 
<langsyntaxhighlight lang="lisp">
CL-USER> (test-utf-8)
character A, code point: 41, utf-8: 41
Line 2,491 ⟶ 2,493:
character 𝄞, code point: 1D11E, utf-8: F0 9D 84 9E
T
</syntaxhighlight>
</lang>
 
=={{header|D}}==
<langsyntaxhighlight Dlang="d">import std.conv;
import std.stdio;
 
Line 2,506 ⟶ 2,508:
writefln("%s %7X [%(%X, %)]", c, unicode, bytes);
}
}</langsyntaxhighlight>
 
{{out}}
Line 2,517 ⟶ 2,519:
 
=={{header|Elena}}==
ELENA 46.x :
<langsyntaxhighlight lang="elena">import system'routines;
import extensions;
Line 2,530 ⟶ 2,532:
string printAsUTF8Array()
{
self.toByteArray().forEach::(b){ console.print(b.toString(16)," ") }
}
string printAsUTF32()
{
self.toArray().forEach::(c){ console.print("U+",c.toInt().toString(16)," ") }
}
}
Line 2,555 ⟶ 2,557:
"𝄞".printAsString().printAsUTF8Array().printAsUTF32();
console.printLine();
}</langsyntaxhighlight>
{{out}}
<pre>
Line 2,563 ⟶ 2,565:
€ E2 82 AC U+20AC
𝄞 F0 9D 84 9E U+1D11E</pre>
 
=={{header|FreeBASIC}}==
{{trans|VBScript}}
<syntaxhighlight lang="vbnet">Function unicode_2_utf8(x As Long) As String
Dim As String y
Dim As Long r
Select Case x
Case 0 To &H7F
y = Chr(x)
Case &H80 To &H7FF
y = Chr(192 + x \ 64) + Chr(128 + x Mod 64)
Case &H800 To &H7FFF, 32768 To 65535
r = x \ 64
y = Chr(224 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + x Mod 64)
Case &H10000 To &H10FFFF
r = x \ 4096
y = Chr(240 + r \ 64) + Chr(128 + r Mod 64) + Chr(128 + (x \ 64) Mod 64) + Chr(128 + x Mod 64)
Case Else
Print "what else? " & x & " " & Hex(x)
End Select
Return y
End Function
 
Function utf8_2_unicode(x As String) As Long
Dim As Long primero, segundo, tercero, cuarto
Dim As Long total
Select Case Len(x)
Case 1 'one byte
If Asc(x) < 128 Then
total = Asc(x)
Else
Print "highest bit set error"
End If
Case 2 'two bytes and assume primero byte is leading byte
If Asc(x) \ 32 = 6 Then
primero = Asc(x) Mod 32
If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
segundo = Asc(Mid(x, 2, 1)) Mod 64
Else
Print "mask error"
End If
Else
Print "leading byte error"
End If
total = 64 * primero + segundo
Case 3 'three bytes and assume primero byte is leading byte
If Asc(x) \ 16 = 14 Then
primero = Asc(x) Mod 16
If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
segundo = Asc(Mid(x, 2, 1)) Mod 64
If Asc(Mid(x, 3, 1)) \ 64 = 2 Then
tercero = Asc(Mid(x, 3, 1)) Mod 64
Else
Print "mask error last byte"
End If
Else
Print "mask error middle byte"
End If
Else
Print "leading byte error"
End If
total = 4096 * primero + 64 * segundo + tercero
Case 4 'four bytes and assume primero byte is leading byte
If Asc(x) \ 8 = 30 Then
primero = Asc(x) Mod 8
If Asc(Mid(x, 2, 1)) \ 64 = 2 Then
segundo = Asc(Mid(x, 2, 1)) Mod 64
If Asc(Mid(x, 3, 1)) \ 64 = 2 Then
tercero = Asc(Mid(x, 3, 1)) Mod 64
If Asc(Mid(x, 4, 1)) \ 64 = 2 Then
cuarto = Asc(Mid(x, 4, 1)) Mod 64
Else
Print "mask error last byte"
End If
Else
Print "mask error tercero byte"
End If
Else
Print "mask error second byte"
End If
Else
Print "mask error leading byte"
End If
total = Clng(262144 * primero + 4096 * segundo + 64 * tercero + cuarto)
Case Else
Print "more bytes than expected"
End Select
Return total
End Function
 
Dim As Long cp(4) = {65, 246, 1046, 8364, 119070} '[{&H0041,&H00F6,&H0416,&H20AC,&H1D11E}]
Dim As String r, s
Dim As integer i, j
Print "ch unicode UTF-8 encoded decoded"
For i = Lbound(cp) To Ubound(cp)
Dim As Long cpi = cp(i)
r = unicode_2_utf8(cpi)
s = Hex(cpi)
Print Chr(cpi); Space(10 - Len(s)); s;
s = ""
For j = 1 To Len(r)
s &= Hex(Asc(Mid(r, j, 1))) & " "
Next j
Print Space(16 - Len(s)); s;
s = Hex(utf8_2_unicode(r))
Print Space(8 - Len(s)); s
Next i
 
Sleep</syntaxhighlight>
 
=={{header|F_Sharp|F#}}==
<langsyntaxhighlight lang="fsharp">
// Unicode character point to UTF8. Nigel Galloway: March 19th., 2018
let fN g = match List.findIndex (fun n->n>g) [0x80;0x800;0x10000;0x110000] with
Line 2,572 ⟶ 2,683:
|2->[0xe0+(g&&&0xf000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
|_->[0xf0+(g&&&0x1c0000>>>18);0x80+(g&&&0x3f000>>>12);0x80+(g&&&0xfc0>>>6);0x80+(g&&&0x3f)]
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 2,588 ⟶ 2,699:
{{works with|gforth|0.7.9_20191121}}
{{works with|lxf|1.6-982-823}}
<langsyntaxhighlight lang="forth">: showbytes ( c-addr u -- )
over + swap ?do
i c@ 3 .r loop ;
Line 2,600 ⟶ 2,711:
\ can also be written as
\ 'A' test 'ö' test 'Ж' test '€' test '𝄞' test
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 2,612 ⟶ 2,723:
If you also want to see the implementation of <code>xc!+</code> and <code>xc@+</code>, here it is (<code>u8!+</code> is the UTF-8 implementation of <code>xc!+</code>, and likewise for <code>u8@+</code>):
 
<langsyntaxhighlight lang="forth">-77 Constant UTF-8-err
 
$80 Constant max-single-byte
Line 2,633 ⟶ 2,744:
REPEAT $7F xor 2* or r>
BEGIN over $80 u>= WHILE tuck c! 1+ REPEAT nip ;
</syntaxhighlight>
</lang>
 
=={{header|Go}}==
===Implementation===
This implementation is missing all checks for invalid data and so is not production-ready, but illustrates the basic UTF-8 encoding scheme.
<langsyntaxhighlight lang="go">package main
 
import (
Line 2,739 ⟶ 2,850:
rune(b[3]&mbMask)
}
}</langsyntaxhighlight>
{{out}}
<pre>
Line 2,750 ⟶ 2,861:
 
===Library/language===
<langsyntaxhighlight lang="go">package main
 
import (
Line 2,775 ⟶ 2,886:
fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
}
}</langsyntaxhighlight>
{{out}}
<pre>
Line 2,787 ⟶ 2,898:
 
Alternately:
<langsyntaxhighlight lang="go">package main
 
import (
Line 2,808 ⟶ 2,919:
fmt.Printf("%-7c U+%04X\t%-12X\t%c\n", codepoint, codepoint, encoded, decoded)
}
}</langsyntaxhighlight>
{{out}}
<pre>
Line 2,821 ⟶ 2,932:
=={{header|Groovy}}==
{{trans|Java}}
<langsyntaxhighlight lang="groovy">import java.nio.charset.StandardCharsets
 
class UTF8EncodeDecode {
Line 2,847 ⟶ 2,958:
}
}
}</langsyntaxhighlight>
{{out}}
<pre>Char Name Unicode UTF-8 encoded Decoded
Line 2,860 ⟶ 2,971:
Example makes use of [http://hackage.haskell.org/package/bytestring <tt>bytestring</tt>] and [http://hackage.haskell.org/package/text <tt>text</tt>] packages:
 
<langsyntaxhighlight lang="haskell">module Main (main) where
import qualified Data.ByteString as ByteString (pack, unpack)
Line 2,887 ⟶ 2,998:
(printf "U+%04X" codepoint :: String)
(intercalate " " (map (printf "%02X") values))
codepoint'</langsyntaxhighlight>
{{out}}
<pre>
Line 2,901 ⟶ 3,012:
=={{header|J}}==
'''Solution:'''
<langsyntaxhighlight lang="j">utf8=: 8&u: NB. converts to UTF-8 from unicode or unicode codepoint integer
ucp=: 9&u: NB. converts to unicode from UTF-8 or unicode codepoint integer
ucp_hex=: hfd@(3 u: ucp) NB. converts to unicode codepoint hexadecimal from UTF-8, unicode or unicode codepoint integer</langsyntaxhighlight>
 
'''Examples:'''
<langsyntaxhighlight lang="j"> utf8 65 246 1046 8364 119070
AöЖ€𝄞
ucp 65 246 1046 8364 119070
Line 2,921 ⟶ 3,032:
1d11e
utf8@dfh ucp_hex utf8 65 246 1046 8364 119070
AöЖ€𝄞</langsyntaxhighlight>
 
=={{header|Java}}==
{{works with|Java|7+}}
<langsyntaxhighlight lang="java">import java.nio.charset.StandardCharsets;
import java.util.Formatter;
 
Line 2,954 ⟶ 3,065:
}
}
}</langsyntaxhighlight>
{{out}}
<pre>
Line 2,967 ⟶ 3,078:
=={{header|JavaScript}}==
An implementation in ECMAScript 2015 (ES6):
<langsyntaxhighlight lang="javascript">
/***************************************************************************\
|* Pure UTF-8 handling without detailed error reporting functionality. *|
Line 3,027 ⟶ 3,138:
?( m&0x07)<<18|( n&0x3f)<<12|( o&0x3f)<<6|( p&0x3f)<<0
:(()=>{throw'Invalid UTF-8 encoding!'})()
</syntaxhighlight>
</lang>
The testing inputs:
<langsyntaxhighlight lang="javascript">
const
str=
Line 3,048 ⟶ 3,159:
:[ [ a,b,c]]
,inputs=zip3(str,cps,cus);
</syntaxhighlight>
</lang>
The testing code:
<langsyntaxhighlight lang="javascript">
console.log(`\
${'Character'.padEnd(16)}\
Line 3,066 ⟶ 3,177:
${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`)
</syntaxhighlight>
</lang>
and finally, the output from the test:
<pre>
Line 3,082 ⟶ 3,193:
 
'''Preliminaries'''
<langsyntaxhighlight lang="jq"># input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
Line 3,096 ⟶ 3,207:
.result += .power * $b
| .power *= 2)
| .result;</langsyntaxhighlight>
'''Encode to UTF-8'''
<langsyntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_encode:
Line 3,118 ⟶ 3,229:
end
| map(binary_to_decimal)
end;</langsyntaxhighlight>
'''Decode an array of UTF-8 bytes'''
<langsyntaxhighlight lang="jq"># input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
Line 3,138 ⟶ 3,249:
else $d[0][-3:] + $d[1][$mb:] + $d[2][$mb:] + $d[3][$mb:]
end
| binary_to_decimal ;</langsyntaxhighlight>
'''Task'''
<langsyntaxhighlight lang="jq">def task:
[ "A", "ö", "Ж", "€", "𝄞" ][]
| . as $glyph
Line 3,148 ⟶ 3,259:
| "Glyph \($glyph) => \($encoded) => \($decoded) => \([$decoded]|implode)" ;
 
task</langsyntaxhighlight>
{{out}}
<pre>
Line 3,159 ⟶ 3,270:
 
=={{header|Julia}}==
Julia supports the UTF-8 encoding (and others through packages).
{{works with|Julia|0.6}}
 
<syntaxhighlight lang="julia">
Julia supports by default UTF-8 encoding.
for t in ("A", "ö", "Ж", "€", "𝄞")
 
<lang julia>for t in println("A"t, "ö", "Ж", "€", "𝄞"codeunits(t))
end
enc = Vector{UInt8}(t)
</syntaxhighlight>
dec = String(enc)
println(dec, " → ", enc)
end</lang>
 
{{out}}
Line 3,177 ⟶ 3,286:
 
=={{header|Kotlin}}==
<langsyntaxhighlight lang="scala">// version 1.1.2
 
fun utf8Encode(codePoint: Int) = String(intArrayOf(codePoint), 0, 1).toByteArray(Charsets.UTF_8)
Line 3,196 ⟶ 3,305:
System.out.printf("%-${n}s %c\n", s, decoded)
}
}</langsyntaxhighlight>
 
{{out}}
Line 3,209 ⟶ 3,318:
 
=={{header|langur}}==
<syntaxhighlight lang="langur">writeln "character Unicode UTF-8 encoding (hex)"
{{works with|langur|0.8.4}}
<lang langur>writeln "character Unicode UTF-8 encoding (hex)"
 
for .cp in "AöЖ€𝄞" {
val .utf8 = s2b cp2s .cp
val .cpstr = b2s .utf8
val .utf8rep = join " ", map ffn .b: $"\{{.b:X02;}}", .utf8
writeln $"\{{.cpstr:-11;}} U+\{{.cp:X04:-8;}} \{{.utf8rep;}}"
}
}</lang>
</syntaxhighlight>
 
{{out}}
Line 3,235 ⟶ 3,344:
- byteArray.toHexString (intStart, intLen): returns hex string representation of byte array (e.g. for printing)<br />
- byteArray.readRawString (intLen, [strCharSet="UTF-8"]): reads a fixed number of bytes as a string
<langsyntaxhighlight Lingolang="lingo">chars = ["A", "ö", "Ж", "€", "𝄞"]
put "Character Unicode (int) UTF-8 (hex) Decoded"
repeat with c in chars
ba = bytearray(c)
put col(c, 12) & col(charToNum(c), 16) & col(ba.toHexString(1, ba.length), 14) & ba.readRawString(ba.length)
end repeat</langsyntaxhighlight>
Helper function for table formatting
<langsyntaxhighlight Lingolang="lingo">on col (val, len)
str = string(val)
repeat with i = str.length+1 to len
Line 3,248 ⟶ 3,357:
end repeat
return str
end</langsyntaxhighlight>
{{out}}
<pre>
Line 3,261 ⟶ 3,370:
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
<lang Lua>
-- Accept an integer representing a codepoint.
-- Return the values of the individual octets.
Line 3,296 ⟶ 3,405:
end
end
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 3,317 ⟶ 3,426:
 
=={{header|M2000 Interpreter}}==
<syntaxhighlight lang="m2000 interpreter">
<lang M2000 Interpreter>
Module EncodeDecodeUTF8 {
a$=string$("Hello" as UTF8enc)
Line 3,337 ⟶ 3,446:
}
EncodeDecodeUTF8
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 3,352 ⟶ 3,461:
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<langsyntaxhighlight Mathematicalang="mathematica">utf = ToCharacterCode[ToString["AöЖ€", CharacterEncoding -> "UTF8"]]
ToCharacterCode[FromCharacterCode[utf, "UTF8"]]</langsyntaxhighlight>
{{out}}
<pre>{65, 195, 182, 208, 150, 226, 130, 172}
Line 3,365 ⟶ 3,474:
For this purpose, using sequences or bytes is not natural. Here is a way to proceed using the module “unicode”.
 
<langsyntaxhighlight Nimlang="nim">import unicode, sequtils, strformat, strutils
 
const UChars = ["\u0041", "\u00F6", "\u0416", "\u20AC", "\u{1D11E}"]
Line 3,385 ⟶ 3,494:
r = s.toRune
# Display.
echo &"""{uchar:>5} U+{r.int.toHex(5)} {s.map(toHex).join(" ")}"""</langsyntaxhighlight>
 
{{out}}
Line 3,398 ⟶ 3,507:
In this section, we provide two procedures to convert a Unicode code point to a UTF-8 sequence of bytes and conversely, without using the module “unicode”. We provide also a procedure to convert a sequence of bytes to a string in order to print it. The algorithm is the one used by the Go solution.
 
<langsyntaxhighlight Nimlang="nim">import sequtils, strformat, strutils
 
const
Line 3,470 ⟶ 3,579:
# Display.
echo &"""{s.toString:>5} U+{c.int.toHex(5)} {s.map(toHex).join(" ")}"""
</syntaxhighlight>
</lang>
 
{{out}}
Line 3,476 ⟶ 3,585:
 
=={{header|Perl}}==
<langsyntaxhighlight lang="perl">#!/usr/bin/perl
use strict;
use warnings;
Line 3,495 ⟶ 3,604:
} split //, $utf8;
print "\n";
} @chars;</langsyntaxhighlight>
 
{{out}}
Line 3,513 ⟶ 3,622:
As requested in the task description:
 
<!--<langsyntaxhighlight Phixlang="phix">-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">tests</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">#0041</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#00F6</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#0416</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#20AC</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#1D11E</span><span style="color: #0000FF;">}</span>
Line 3,526 ⟶ 3,635:
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x -> {%s} -> {%s}\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">codepoint</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%02x"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">hex</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"#%04x"</span><span style="color: #0000FF;">)})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<!--</langsyntaxhighlight>-->
 
{{out}}
Line 3,538 ⟶ 3,647:
 
=={{header|Processing}}==
<langsyntaxhighlight lang="java">import java.nio.charset.StandardCharsets;
 
Integer[] code_points = {0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E};
Line 3,565 ⟶ 3,674:
tel_1 += 30; tel_2 = 50;
}
}</langsyntaxhighlight>
 
 
Line 3,572 ⟶ 3,681:
The encoding and decoding procedure are kept simple and designed to work with an array of 5 elements for input/output of the UTF-8 encoding for a single code point at a time. It was decided not to use a more elaborate example that would have been able to operate on a buffer to encode/decode more than one code point at a time.
 
<langsyntaxhighlight lang="purebasic">#UTF8_codePointMaxByteCount = 4 ;UTF-8 encoding uses only a maximum of 4 bytes to encode a codepoint
 
Procedure UTF8_encode(x, Array encoded_codepoint.a(1)) ;x is codepoint to encode, the array will contain output
Line 3,683 ⟶ 3,792:
Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
CloseConsole()
EndIf</langsyntaxhighlight>
Sample output:
<pre> Unicode UTF-8 Decoded
Line 3,696 ⟶ 3,805:
 
=={{header|Python}}==
<langsyntaxhighlight lang="python">
#!/usr/bin/env python3
from unicodedata import name
Line 3,713 ⟶ 3,822:
chars = ['A', 'ö', 'Ж', '€', '𝄞']
for char in chars:
print('{:<11} {:<36} {:<15} {:<15}'.format(char, name(char), unicode_code(char), utf8hex(char)))</langsyntaxhighlight>
{{out}}
<pre>Character Name Unicode UTF-8 encoding (hex)
Line 3,723 ⟶ 3,832:
 
=={{header|Racket}}==
<langsyntaxhighlight lang="racket">#lang racket
 
(define char-map
Line 3,739 ⟶ 3,848:
(map (curryr number->string 16) bites)
(bytes->string/utf-8 (list->bytes bites))
name)))</langsyntaxhighlight>
{{out}}
<pre>#\A A (41) A LATIN-CAPITAL-LETTER-A
Line 3,751 ⟶ 3,860:
{{works with|Rakudo|2017.02}}
Pretty much all built in to the language.
<syntaxhighlight lang="raku" perl6line>say sprintf("%-18s %-36s|%8s| %7s |%14s | %s\n", 'Character|', 'Name', 'Ordinal', 'Unicode', 'UTF-8 encoded', 'decoded'), '-' x 100;
 
for < A ö Ж € 𝄞 😜 👨‍👩‍👧‍👦> -> $char {
printf " %-5s | %-43s | %6s | %-7s | %12s |%4s\n", $char, $char.uninames.join(','), $char.ords.join(' '),
('U+' X~ $char.ords».base(16)).join(' '), $char.encode('UTF8').list».base(16).Str, $char.encode('UTF8').decode;
}</langsyntaxhighlight>
{{out}}
<pre>Character| Name | Ordinal| Unicode | UTF-8 encoded | decoded
Line 3,771 ⟶ 3,880:
=={{header|Ruby}}==
 
<langsyntaxhighlight lang="ruby">
character_arr = ["A","ö","Ж","€","𝄞"]
for c in character_arr do
Line 3,779 ⟶ 3,888:
puts ""
end
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 3,805 ⟶ 3,914:
 
=={{header|Rust}}==
<langsyntaxhighlight lang="rust">fn main() {
let chars = vec!('A', 'ö', 'Ж', '€', '𝄞');
chars.iter().for_each(|c| {
Line 3,815 ⟶ 3,924:
});
}
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 3,827 ⟶ 3,936:
=={{header|Scala}}==
=== Imperative solution===
<langsyntaxhighlight lang="scala">object UTF8EncodeAndDecode extends App {
 
val codePoints = Seq(0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E)
Line 3,849 ⟶ 3,958:
printf(s"%-${w}c %-36s %-7s %-${16 - w}s%c%n",
codePoint, Character.getName(codePoint), leftAlignedHex, s, utf8Decode(bytes))
}</langsyntaxhighlight>
=== Functional solution===
<langsyntaxhighlight lang="scala">import java.nio.charset.StandardCharsets
 
object UTF8EncodeAndDecode extends App {
Line 3,876 ⟶ 3,985:
 
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
}</langsyntaxhighlight>
=== Composable and testable solution===
<langsyntaxhighlight lang="scala">package example
 
object UTF8EncodeAndDecode extends TheMeat with App {
Line 3,911 ⟶ 4,020:
 
}
</syntaxhighlight>
</lang>
 
=={{header|Seed7}}==
<langsyntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "unicode.s7i";
include "console.s7i";
Line 3,928 ⟶ 4,037:
writeln("-------------------------------------------------");
for ch range "AöЖ€𝄞" do
utf8 := striToUtf8toUtf8(str(ch));
writeln(ch rpad 11 <& "U+" <& ord(ch) radix 16 lpad0 4 rpad 7 <&
hex(utf8) rpad 22 <& utf8ToStrifromUtf8(utf8));
end for;
end func;</langsyntaxhighlight>
 
{{out}}
Line 3,946 ⟶ 4,055:
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">func utf8_encoder(Number code) {
code.chr.encode('UTF-8').bytes.map{.chr}
}
Line 3,959 ⟶ 4,068:
assert_eq(n, decoded.ord)
say "#{decoded} -> #{encoded}"
}</langsyntaxhighlight>
{{out}}
<pre>
Line 3,971 ⟶ 4,080:
=={{header|Swift}}==
In Swift there's a difference between UnicodeScalar, which is a single unicode code point, and Character which may consist out of multiple UnicodeScalars, usually because of combining characters.
<langsyntaxhighlight Swiftlang="swift">import Foundation
 
func encode(_ scalar: UnicodeScalar) -> Data {
Line 3,992 ⟶ 4,101:
print("character: \(decoded), code point: U+\(String(scalar.value, radix: 16)), \tutf-8: \(formattedBytes)")
}
</syntaxhighlight>
</lang>
{{out}}
<pre>
Line 4,004 ⟶ 4,113:
=={{header|Tcl}}==
Note: Tcl can handle Unicodes only up to U+FFFD, i.e. the Basic Multilingual Plane (BMP, 16 bits wide). Therefore, the fifth test fails as expected.
<langsyntaxhighlight Tcllang="tcl">proc encoder int {
set u [format %c $int]
set bytes {}
Line 4,023 ⟶ 4,132:
lappend res [encoder $test] -> [decoder [encoder $test]]
puts $res
}</langsyntaxhighlight>
<pre>
0x0041 41 -> A
Line 4,035 ⟶ 4,144:
While perhaps not as readable as the above, this version handles beyond-BMP codepoints by manually composing the utf-8 byte sequences and emitting raw bytes to the console. <tt>encoding convertto utf-8</tt> command still does the heavy lifting where it can.
 
<langsyntaxhighlight Tcllang="tcl">proc utf8 {codepoint} {
scan $codepoint %llx cp
if {$cp < 0x10000} {
Line 4,062 ⟶ 4,171:
set utf8 [utf8 $codepoint]
puts "[format U+%04s $codepoint]\t$utf8\t[hexchars $utf8]"
}</langsyntaxhighlight>
 
{{out}}<pre>U+0041 A 41
Line 4,070 ⟶ 4,179:
U+1D11E 𝄞 f0 9d 84 9e
</pre>
 
=={{header|UNIX Shell}}==
{{works with|Bourne Again SHell}}
{{works with|Korn Shell}}
{{works with|Zsh}}
 
Works with locale set to UTF-8.
 
<syntaxhighlight lang="bash">function encode {
typeset -i code_point=$1
printf "$(printf '\\U%08X\\n' "$code_point")"
}
function decode {
typeset character=$1
printf 'U+%04X\n' "'$character"
set +x
}
printf 'Char\tCode Point\tUTF-8 Bytes\n'
for test in A ö Ж € 𝄞; do
code_point=$(decode "$test")
utf8=$(encode "$(( 16#${code_point#U+} ))")
bytes=$(printf '%b' "$utf8" | od -An -tx1 | sed -nE '/./s/^ *| *$//p')
printf '%-4b\t%-10s\t%s\n' "$utf8" "$code_point" "$bytes"
done</syntaxhighlight>
 
{{Out}}
<pre style="font-family: Consolas,Courier,monospace">Char Code Point UTF-8 Bytes
A U+0041 41
ö U+00F6 c3 b6
Ж U+0416 d0 96
€ U+20AC e2 82 ac
𝄞 U+1D11E f0 9d 84 9e</pre>
 
=={{header|VBA}}==
<langsyntaxhighlight VBlang="vb">Private Function unicode_2_utf8(x As Long) As Byte()
Dim y() As Byte
Dim r As Long
Line 4,193 ⟶ 4,334:
Debug.Print String$(8 - Len(s), " "); s
Next cpi
End Sub</langsyntaxhighlight>{{out}}<pre>ch unicode UTF-8 encoded decoded
A 41 41 41
ö F6 C3 B6 F6
Line 4,200 ⟶ 4,341:
? 1D11E F0 9D 84 9E 1D11E
</pre>
 
=={{header|V (Vlang)}}==
<syntaxhighlight lang="v (vlang)">import encoding.hex
fn decode(s string) ?[]u8 {
return hex.decode(s)
}
 
fn main() {
println("${'Char':-7} ${'Unicode':7}\tUTF-8 encoded\tDecoded")
for codepoint in [`A`, `ö`, `Ж`, `€`, `𝄞`] {
encoded := codepoint.bytes().hex()
decoded := decode(encoded)?
println("${codepoint:-7} U+${codepoint:04X}\t${encoded:-12}\t${decoded.bytestr()}")
}
}</syntaxhighlight>
{{out}}
<pre>Char Unicode UTF-8 encoded Decoded
A U+0041 41 A
ö U+00F6 c3b6 ö
Ж U+0416 d096 Ж
€ U+20AC e282ac €
𝄞 U+1D11E f09d849e 𝄞
</pre>
 
=={{header|VBScript}}==
<syntaxhighlight lang="vb">
Option Explicit
Dim m_1,m_2,m_3,m_4
Dim d_2,d_3,d_4
Dim h_0,h_2,h_3,h_4
Dim mc_0,mc_2,mc_3,mc_4
 
m_1=&h3F
d_2=m_1+1
m_2=m_1 * d_2
d_3= (m_2 Or m_1)+1
m_3= m_2* d_2
d_4=(m_3 Or m_2 Or m_1)+1
 
h_0=&h80
h_2=&hC0
h_3=&hE0
h_4=&hF0
 
mc_0=&h3f
mc_2=&h1F
mc_3=&hF
mc_4=&h7
 
Function cp2utf8(cp) 'cp as long, returns string
If cp<&h80 Then
cp2utf8=Chr(cp)
ElseIf (cp <=&H7FF) Then
cp2utf8=Chr(h_2 or (cp \ d_2) )&Chr(h_0 Or (cp And m_1))
ElseIf (cp <=&Hffff&) Then
cp2utf8= Chr(h_3 Or (cp\ d_3)) & Chr(h_0 Or (cp And m_2)\d_2) & Chr(h_0 Or (cp And m_1))
Else
cp2utf8= Chr(h_4 Or (cp\d_4))& Chr(h_0 Or ((cp And m_3) \d_3))& Chr(h_0 Or ((cp And m_2)\d_2)) & Chr(h_0 Or (cp And m_1))
End if
End Function
 
Function utf82cp(utf) 'utf as string, returns long
Dim a,b,m
m=strreverse(utf)
b= Len(utf)
a=asc(mid(m,1,1))
utf82cp=a And &h7f
if b=1 Then Exit Function
a=asc(mid(m,2,1))
If b=2 Then utf82cp= utf82cp Or (a And mc_2)*d_2 :Exit function
utf82cp= utf82cp Or (a And m_1)*d_2
a=asc(mid(m,3,1))
If b=3 Then utf82cp= utf82cp Or (a And mc_3)*d_3 :Exit function
utf82cp= utf82cp Or (a And m_1)*d_3 Or (a=asc(mid(m,4,1)) And mc_4)*d_4
End Function
 
Sub print(s):
On Error Resume Next
WScript.stdout.Write (s)
If err= &h80070006& Then WScript.Echo " Please run this script with CScript": WScript.quit
End Sub
Function utf8displ(utf)
Dim s,i
s=""
For i=1 To Len(utf)
s=s &" "& Hex(Asc(Mid(utf,i,1)))
Next
utf8displ= pad(s,12)
End function
 
function pad(s,n) if n<0 then pad= right(space(-n) & s ,-n) else pad= left(s& space(n),n) end if :end function
 
Sub check(i)
Dim c,c0,c1,c2,u
c=b(i):c0=pad(c(0),29) :c1=c(1) :c2=pad(c(2),12):u=cp2utf8(c1)
print c0 & " CP:" & pad("U+" & Hex(c1),-8) & " my utf8:" & utf8displ (u) & " should be:" & c2 & " back to CP:" & pad("U+" & Hex(utf82cp(u)),-8)& vbCrLf
End Sub
 
Dim b
b=Array(_
Array("LATIN CAPITAL LETTER A ",&h41," 41"),_
Array("LATIN SMALL LETTER O WITH DIAERESIS ",&hF6," C3 B6"),_
Array("CYRILLIC CAPITAL LETTER ZHE ",&h416," D0 96"),_
Array("EURO SIGN",&h20AC," E2 82 AC "),_
Array("MUSICAL SYMBOL G CLEF ",&h1D11E," F0 9D 84 9E"))
 
check 0
check 1
check 2
check 3
check 4
</syntaxhighlight>
{{out}}
<small>
<pre>
LATIN CAPITAL LETTER A CP: U+41 my utf8: 41 should be: 41 back to CP: U+41
LATIN SMALL LETTER O WITH DIA CP: U+F6 my utf8: C3 B6 should be: C3 B6 back to CP: U+F6
CYRILLIC CAPITAL LETTER ZHE CP: U+416 my utf8: D0 96 should be: D0 96 back to CP: U+416
EURO SIGN CP: U+20AC my utf8: E2 82 AC should be: E2 82 AC back to CP: U+20AC
MUSICAL SYMBOL G CLEF CP: U+1D11E my utf8: F0 9D 84 9E should be: F0 9D 84 9E back to CP: U+1D11E
</pre>
</small>
 
=={{header|Wren}}==
The utf8_decode function was translated from the Go entry.
<langsyntaxhighlight ecmascriptlang="wren">import "./fmt" for Fmt
 
var utf8_encode = Fn.new { |cp| String.fromCodePoint(cp).bytes.toList }
Line 4,242 ⟶ 4,506:
var uni = String.fromCodePoint(cp2)
System.print("%(Fmt.s(-11, uni)) %(Fmt.s(-37, test[0])) U+%(Fmt.s(-8, Fmt.Xz(4, cp2))) %(utf8)")
}</langsyntaxhighlight>
 
{{out}}
Line 4,256 ⟶ 4,520:
 
=={{header|zkl}}==
<langsyntaxhighlight lang="zkl">println("Char Unicode UTF-8");
foreach utf,unicode_int in (T( T("\U41;",0x41), T("\Uf6;",0xf6),
T("\U416;",0x416), T("\U20AC;",0x20ac), T("\U1D11E;",0x1d11e))){
Line 4,265 ⟶ 4,529:
 
println("%s %s %9s %x".fmt(char,char2,"U+%x".fmt(unicode_int),utf_int));
}</langsyntaxhighlight>
Int.len() --> number of bytes in int. This could be hard coded because UTF-8
has a max of 6 bytes and (0x41).toBigEndian(6) --> 0x41,0,0,0,0,0 which is
885

edits