Unicode strings

As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend.
 
It is a very capable and [https://www.youtube.com/watch?v=MijmeoH9LT4 remarkable] tool, but also quite complex compared to older single- and double-byte character encodings.
 
How well prepared is your programming language for Unicode?
*   [[Terminal control/Display an extended character]]
<br><br>
 
=={{header|11l}}==
11l source code is specified to be UTF-8 encoded.
 
All strings in 11l are UTF-16 encoded.
 
=={{header|80386 Assembly}}==
=={{header|ALGOL 68}}==
{{works with|ALGOL 68G|Any - tested with release [http://sourceforge.net/projects/algol68/files/algol68g/algol68g-1.18.0/algol68g-1.18.0-9h.tiny.el5.centos.fc11.i386.rpm/download 1.18.0-9h.tiny].}}
{{wont work with|ELLA ALGOL 68|Any (with appropriate job cards) - tested with release [http://sourceforge.net/projects/algol68/files/algol68toc/algol68toc-1.8.8d/algol68toc-1.8-8d.fc9.i386.rpm/download 1.8-8d] - due to extensive use of '''format'''[ted] ''transput''.}}
<langsyntaxhighlight lang="algol68">#!/usr/local/bin/a68g --script #
# -*- coding: utf-8 -*- #
 
))
 
)</syntaxhighlight>
{{out}}
<pre>
1 < 2 is T, 111 < 11 is F, 111 < 12 is T, ♥ < ♦ is T, ♥Q < ♥K is F & ♥J < ♥K is T
</pre>
 
=={{header|Arturo}}==
 
<syntaxhighlight lang="rebol">text: "你好"
 
print ["text:" text]
print ["length:" size text]
print ["contains string '好'?:" contains? text "好"]
print ["contains character '平'?:" contains? text `平`]
print ["text as ascii:" as.ascii text]</syntaxhighlight>
 
{{out}}
 
<pre>text: 你好
length: 2
contains string '好'?: true
contains character '平'?: false
text as ascii: Ni Hao</pre>
 
=={{header|AutoHotkey}}==
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
 
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
* How easy is it to present Unicode strings in source code?
'''Code example:'''
(whether this listing displays correctly will depend on your browser)
<langsyntaxhighlight lang="bbcbasic"> VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*FONT Times New Roman, 20
B$ += CHR$?A%
NEXT
= LEFT$(B$)</syntaxhighlight>
[[Image:unicode_bbc.gif]]
 
 
=={{header|C}}==
C is not the most Unicode-friendly language, to put it mildly. Generally, using Unicode in C requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source might be a bad idea, too; it's safer to use their hex values. Here's a short example of doing the simplest string handling: printing it.<syntaxhighlight lang="c">#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

/* Wide strings; what wchar_t is internally depends on the compiler. */
wchar_t poker[] = L"♥♦♣♠";
wchar_t four_two[] = L"四十二";
 
int main() {
    /* Set the locale to alert C's multibyte output routines */
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Locale failure, check your env vars\n");
        return 1;
    }

#ifdef __STDC_ISO_10646__
    /* C99 compilers should understand these */
    printf("%lc\n", 0x2708);   /* ✈ */
    printf("%ls\n", poker);    /* ♥♦♣♠ */
    printf("%ls\n", four_two); /* 四十二 */
#else
    /* oh well */
    printf("airplane\n");
    printf("club diamond club spade\n");
    printf("forty two\n");
#endif
    return 0;
}</syntaxhighlight>
 
=={{header|C sharp|C#}}==
In C#, the native string representation is determined by the Common Language Runtime. In the CLR, the string data type is a sequence of char, and the char data type represents a UTF-16 code unit. The native string representation is essentially UTF-16, except that strings may contain ill-formed UTF-16, i.e. sequences with incorrectly paired high and low surrogates.
 
C# string literals support the \u escape sequence for 4-digit hexadecimal Unicode code points and \U for 8-digit code points (allowing characters outside the Basic Multilingual Plane). UTF-encoded source code is also supported, so that "Unicode strings" can be included in the source code as-is.
 
C# benefits from the extensive support for Unicode in the .NET Base Class Library, including
* Various UTF encodings
* String normalization
* Unicode character database subset
* Breaking strings into text elements
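
A minimal sketch of the escape forms and the BCL normalization support mentioned above:
<syntaxhighlight lang="csharp">using System;
using System.Text;

class UnicodeDemo
{
    static void Main()
    {
        // \u takes 4 hex digits; \U takes 8, covering supplementary planes
        string plane = "\u2708";              // ✈
        string han = "\U0002A6A5";            // 𪚥, outside the BMP
        Console.WriteLine(plane + " " + han);

        // Length counts UTF-16 code units, not characters
        Console.WriteLine(han.Length);        // 2 (a surrogate pair)

        // String normalization from the Base Class Library
        string decomposed = "voila\u0300";    // "a" + combining grave accent
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(composed == "voilà"); // True
    }
}</syntaxhighlight>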
 
=={{header|Common Lisp}}==
Unicode strings are the default for most implementations. Unicode chars can be used in variable and function names.
Tested in SBCL 1.2.7 and ECL 13.5.1
<langsyntaxhighlight lang="lisp">
(defvar ♥♦♣♠ "♥♦♣♠")
(defun ✈ () "a plane unicode function")
</syntaxhighlight>
 
=={{header|D}}==
<syntaxhighlight lang="d">import std.stdio;
import std.uni; // standard package for normalization, composition/decomposition, etc..
import std.utf; // standard package for decoding/encoding, etc...
 
void main() {
    // normal identifiers are allowed
    int a;
    // unicode characters are allowed for identifiers
    int δ;

    char c;  // 1 to 4 byte unicode character
    wchar w; // 2 or 4 byte unicode character
    dchar d; // 4 byte unicode character

    writeln("some text");     // strings by default are UTF8
    writeln("some text"c);    // optional suffix for UTF8
    writeln("こんにちは"c);    // unicode characters are just fine (stored in the string type)
    writeln("Здравствуйте"w); // also available are UTF16 strings (stored in the wstring type)
    writeln("שלום"d);         // and UTF32 strings (stored in the dstring type)

    // escape sequences like what is defined in C are also allowed inside of strings and characters.
}</syntaxhighlight>
 
=={{header|DWScript}}==
 
Strings are UTF-16.
 
=={{header|Elena}}==
ELENA supports both UTF8 and UTF16 strings; Unicode identifiers are also supported:
 
ELENA 6.x:
<syntaxhighlight lang="elena">public program()
{
    var 四十二 := "♥♦♣♠"; // UTF8 string
    var строка := "Привет"w; // UTF16 string

    console.writeLine(строка);
    console.writeLine(四十二);
}</syntaxhighlight>
{{out}}
<pre>
Привет
♥♦♣♠
</pre>
 
=={{header|Elixir}}==
Elixir has exceptionally good Unicode support in Strings. Its String module is fully compliant with the Unicode Standard, version 6.3.0. Internally, Strings are encoded in UTF-8. As source files are also typically Unicode encoded, String literals can be either written directly or via escape sequences. However, non-ASCII Unicode identifiers (variables, functions, ...) are not allowed.
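
A minimal sketch of the String module's Unicode-aware functions:
<syntaxhighlight lang="elixir">s = "voilà"
IO.puts(String.length(s))          # 5 — counted in graphemes
IO.puts(byte_size(s))              # 6 — UTF-8 bytes
IO.inspect(String.codepoints(s))   # ["v", "o", "i", "l", "à"]
IO.puts(String.upcase(s))          # VOILÀ — full Unicode case mapping</syntaxhighlight>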
 
=={{header|Erlang}}==
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
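
A small shell session sketching Unicode handling in data (exact shell output may vary by OTP release):
<syntaxhighlight lang="erlang">1> Bin = unicode:characters_to_binary("åäö").
<<195,165,195,164,195,182>>
2> unicode:characters_to_list(Bin, utf8).
"åäö"
3> string:length("åäö").
3</syntaxhighlight>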
 
=={{header|FreeBASIC}}==
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
 
* How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These files can be freely mixed with other source or header files in the same project.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet; you just need to use a text editor that supports one of the mentioned Unicode formats.
 
* How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings can be more complicated because many string functions become more complex.
 
* Is it good at input/output with Unicode?
The <code>Open</code> function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier.
The <code>Input#</code> and <code>Line Input#</code> functions as well as <code>Print#</code> <code>Write#</code> can be used normally, and any conversion between Unicode and ASCII is done automatically if necessary. The <code>Print</code> function also supports Unicode output.
 
* Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in a string, it does not support dynamic wide strings. However, there are some libraries included with FreeBASIC to decode UTF-8 to wstring.
 
* How broad/deep does the language support Unicode?
Unicode support in FreeBASIC is quite extensive, but not as deep as in other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
 
* Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
 
For example, <syntaxhighlight lang="freebasic">' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
 
' Print the Unicode string
Print unicodeString
 
' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>
 
 
=={{header|Go}}==
The <code>string</code> data type represents a read-only sequence of bytes, conventionally but not necessarily representing UTF-8-encoded text.
A number of built-in features interpret <code>string</code>s as UTF-8. For example,
<langsyntaxhighlight lang="go"> var i int
var u rune
for i, u = range "voilà" {
fmt.Println(i, u)
}</langsyntaxhighlight>
{{out}}
<pre>
0 118
1 111
2 105
3 108
4 224
</pre>
 
In contrast,
<langsyntaxhighlight lang="go"> w := "voilà"
for i := 0; i < len(w); i++ {
fmt.Println(i, w[i])
}
</syntaxhighlight>
</lang>
{{out}}
<pre>
0 118
1 111
2 105
3 108
4 195
5 160
</pre>

=={{header|Haskell}}==
 
Unicode is built in to Haskell, so it can be used in strings and function names.
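
A minimal sketch: the default String type is a list of Char, and each Char is a Unicode code point (a UTF-8 locale is assumed for output):
<syntaxhighlight lang="haskell">main :: IO ()
main = do
  let s = "voilà ♥"
  putStrLn s
  print (length s)  -- 7, counted in code points, not bytes</syntaxhighlight>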
 
 
=={{header|J}}==
Unicode characters can be represented directly in J strings:
 
<langsyntaxhighlight lang="j"> '♥♦♣♠'
♥♦♣♠</langsyntaxhighlight>
 
By default, they are represented as utf-8:
 
<langsyntaxhighlight lang="j"> #'♥♦♣♠'
12</langsyntaxhighlight>
 
The above string requires 12 literal elements to represent the four characters using utf-8.
However, they can be represented as utf-16 instead:
 
<langsyntaxhighlight lang="j"> 7 u:'♥♦♣♠'
♥♦♣♠
#7 u:'♥♦♣♠'
4</syntaxhighlight>
 
The above string requires 4 literal elements to represent the four characters using utf-16. (7 u: string produces a utf-16 result.)
These forms are not treated as equivalent:
 
<langsyntaxhighlight lang="j"> '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</langsyntaxhighlight>
 
The utf-8 string of literals is a different string of literals from the utf-16 string.
unless the character literals themselves are equivalent:
 
<langsyntaxhighlight lang="j"> 'abcd'-:7 u:'abcd'
1</langsyntaxhighlight>
 
Here, we were dealing with ascii characters, so the four literals needed to represent the characters using utf-8 matched the four literals needed to represent the characters using utf-16.
When this is likely to be an issue, you should enforce a single representation. For example:
 
<langsyntaxhighlight lang="j"> '♥♦♣♠' -:&(7&u:) 7 u:'♥♦♣♠'
1
'♥♦♣♠' -:&(8&u:) 7 u:'♥♦♣♠'
1</langsyntaxhighlight>
 
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8, utf-16 or utf-32 and the resulting literal strings would match. (8 u: string produces a utf-8 result.)
 
Output uses characters in whatever format they happen to be in.

=={{header|Java}}==
Starting in J2SE 5 (1.5), Java has fairly convenient methods for dealing with true Unicode characters, even supplementary ones. Many methods that deal with characters have versions for both <code>char</code> and <code>int</code>. For example, <code>String</code> has the <code>codePointAt</code> method, analogous to the <code>charAt</code> method.
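
A minimal sketch of the char/int method pairs described above, using a supplementary character:
<syntaxhighlight lang="java">public class CodePoints {
    public static void main(String[] args) {
        String s = "𪚥";  // U+2A6A5, outside the Basic Multilingual Plane
        System.out.println(s.length());                       // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 real character
        System.out.printf("U+%X%n", s.codePointAt(0));        // U+2A6A5
        // charAt sees only half of the surrogate pair
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}</syntaxhighlight>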
 
How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization?
 
=={{header|jq}}==
=={{header|Julia}}==
Non-ASCII strings in Julia are UTF8-encoded by default, and Unicode identifiers are also supported:
<syntaxhighlight lang="julia">julia> 四十二 = "voilà";
julia> println(四十二)
voilà</syntaxhighlight>
And you can also specify unicode characters by ordinal:
<syntaxhighlight lang="julia">julia> println("\u2708")
✈</syntaxhighlight>
 
=={{header|Kotlin}}==
In the version of Kotlin targeting the JVM, Kotlin strings are mapped to Java strings and so everything that has already been said in the Java entry for this task applies equally to Kotlin.
 
I would only add that normalization of strings is supported in both languages via the java.text.Normalizer class.
 
Here's a simple example of using both unicode identifiers and unicode strings in Kotlin:
<syntaxhighlight lang="scala">// version 1.1.2
 
fun main(args: Array<String>) {
val åäö = "as⃝df̅ ♥♦♣♠ 頰"
println(åäö)
}</syntaxhighlight>
 
{{out}}
<pre>
as⃝df̅ ♥♦♣♠ 頰
</pre>
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
 
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with b2s() and s2b() functions.
 
The len() function returns the number of code points in a string.
 
Normalization can be handled with the functions nfc(), nfd(), nfkc(), and nfkd().
 
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.
 
=={{header|Lasso}}==
Variable names can not contain anything but ASCII.
 
<syntaxhighlight lang="lasso">local(unicode = '♥♦♣♠')
#unicode -> append('\u9830')
#unicode
Line 753 ⟶ 887:
#unicode -> get (2)
'<br />'
#unicode -> get (4) -> integer</syntaxhighlight>
{{out}}
<pre>♥♦♣♠頰
</pre>

=={{header|LFE}}==
 
Here is an example of UTF-8 encoding:
<langsyntaxhighlight lang="lisp">
> (set encoded (binary ("åäö ð" utf8)))
#B(195 165 195 164 195 182 32 195 176)
</syntaxhighlight>
 
Display it in native Erlang format:
 
<langsyntaxhighlight lang="lisp">
> (io:format "~tp~n" (list encoded))
<<"åäö ð"/utf8>>
</syntaxhighlight>
 
Example of UTF-8 decoding:
<langsyntaxhighlight lang="lisp">
> (unicode:characters_to_list encoded 'utf8)
"åäö ð"
</syntaxhighlight>
 
=={{header|Lingo}}==
In recent versions (since v11.5) of Lingo's only implementation, Director, UTF-8 is the default encoding for both scripts and strings. Therefore Unicode string literals can be specified directly in the code, and variable names also support Unicode. To represent or deal with string data in other encodings, you have to use the ByteArray data type. Various ByteArray as well as FileIO methods support an optional 'charSet' parameter that allows transcoding data to/from UTF-8 on the fly. The supported 'charSet' strings can be displayed like this:
<syntaxhighlight lang ="lingo">put _system.getInstalledCharSets()
-- ["big5", "cp1026", "cp866", "ebcdic-cp-us", "gb2312", "ibm437", "ibm737",
-- ["big5", "cp1026", "cp866", "ebcdic-cp-us", "gb2312", "ibm437", "ibm737", "ibm775", "ibm850", "ibm852", "ibm857", "ibm861", "ibm869", "iso-8859-1", "iso-8859-15", "iso-8859-2", "iso-8859-4", "iso-8859-5", "iso-8859-7", "iso-8859-9", "johab", "koi8-r", "koi8-u", "ks_c_5601-1987", "macintosh", "shift_jis", "us-ascii", "utf-16", "utf-16be", "utf-7", "utf-8", "windows-1250", "windows-1251", "windows-1252", "windows-1253", "windows-1254", "windows-1255", "windows-1256", "windows-1257", "windows-1258", "windows-874", "x-ebcdic-greekmodern", "x-mac-ce", "x-mac-cyrillic", "x-mac-greek", "x-mac-icelandic", "x-mac-turkish"]</lang>
"ibm775", "ibm850", "ibm852", "ibm857", "ibm861", "ibm869", "iso-8859-1",
"iso-8859-15", "iso-8859-2", "iso-8859-4", "iso-8859-5", "iso-8859-7",
"iso-8859-9", "johab", "koi8-r", "koi8-u", "ks_c_5601-1987", "macintosh",
"shift_jis", "us-ascii", "utf-16", "utf-16be", "utf-7", "utf-8", "windows-1250",
"windows-1251", "windows-1252", "windows-1253", "windows-1254", "windows-1255",
"windows-1256", "windows-1257", "windows-1258", "windows-874",
"x-ebcdic-greekmodern", "x-mac-ce", "x-mac-cyrillic", "x-mac-greek",
"x-mac-icelandic", "x-mac-turkish"]</syntaxhighlight>
 
=={{header|Locomotive Basic}}==
It should be added however that the character set can be easily redefined from BASIC with the SYMBOL and SYMBOL AFTER commands, so the CPC character set can be turned into e.g. Latin-1. As two-byte UTF-8 characters can be converted to Latin-1, at least a subset of Unicode can be printed in this way:
 
<langsyntaxhighlight lang="locobasic">10 CLS:DEFINT a-z
20 ' define German umlauts as in Latin-1
30 SYMBOL AFTER 196
200 ' zero-terminated UTF-8 string
210 DATA &48,&C3,&A4,&6C,&6C,&C3,&B6,&20,&4C,&C3,&BC,&64,&77,&69,&67,&2E,&20,&C3,&84,&C3,&96,&C3,&9C
220 DATA &20,&C3,&A4,&C3,&B6,&C3,&BC,&20,&56,&69,&65,&6C,&65,&20,&47,&72,&C3,&BC,&C3,&9F,&65,&21,&00</syntaxhighlight>
 
Produces this (slightly nonsensical) output:
 
[[File:Unicode print locomotive basic.png]]
 
=={{header|Lua}}==
 
By default, Lua doesn't support Unicode. Most string methods operate correctly on the ASCII range only, like [[String case#Lua|case transformation]]. However, there is a [https://www.lua.org/manual/5.4/manual.html#6.5 <code>utf8</code>] module that adds some very basic support with a very limited number of functions. For example, this module brings a new [[String length#Lua|length method]] adapted to UTF-8, but there is no method to transform the case of Unicode strings correctly. So overall, Unicode support is very limited and not the default.
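
A minimal sketch of the contrast between the byte-oriented string functions and the utf8 module (Lua 5.3+):
<syntaxhighlight lang="lua">local s = "héllo"
print(#s)            --> 6: the # operator counts bytes
print(utf8.len(s))   --> 5: utf8.len counts code points
print(s:upper())     --> HéLLO: case mapping only touches ASCII (in the C locale)
for _, cp in utf8.codes(s) do io.write(cp, " ") end  --> 104 233 108 108 111
print()</syntaxhighlight>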
 
=={{header|M2000 Interpreter}}==
* How easy is it to present Unicode strings in source code?
 
We copy them in the M2000 editor. Internally M2000 uses UTF-16LE, but programs are saved in UTF-8 format.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
 
Yes
 
* How well can the language communicate with the rest of the world?
 
GUI supports Unicode. We can use filenames with names from any language. Text files are UTF-16LE or ANSI (we can use WIDE to specify Unicode, and we can specify the locale for ANSI), and we can load/save a Document object using UTF-8, UTF-16LE, UTF-16BE and ANSI (we can specify the locale). The clipboard is Unicode also.
 
* Is it good at input/output with Unicode?
 
If we use proportional text we are OK, but simple text output/input to the console breaks each word into letters and then sends them to the console, so we always get left-to-right output. We can use combining diacritical marks on the same letter.
 
* Is it convenient to manipulate Unicode strings in the language?
 
Strings are the same as Visual Basic 6 strings (the M2000 interpreter is written in VB6). We can get the length of a string as display length, which accounts for diacritical marks. We can make strings of json type, using the same symbols to represent Unicode letters (we can do the reverse, producing the proper json string from Unicode).
 
* How broad/deep does the language support Unicode?
 
From variables/keys to files and screen/printer output. Also we can use external COM objects using Unicode strings.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
 
A string may contain a one-byte char array or a two-byte char array. The Len() function always returns the two-byte length, so a 3-byte string returns a length of 1.5. Encoding is not bound to the string but to the function which processes it. So there are functions to process in UTF-16LE, others to process in ANSI, and one function for conversions from and to UTF-8. UTF-16BE is supported only for loading/saving a Document object (internally it is always UTF-16LE).
 
* Does it support normalization?
 
No. A string may contain any value including zero. Max size is 2GB. Also we can make strings in buffers with specific length, and any value.
 
 
<syntaxhighlight lang="m2000 interpreter">
Font "Arial"
Mode 32
' M2000 internal editor can display right-to-left languages if text is in same line, and same color.
a$="لم أجد هذا الكتاب القديم"
' We can use console to display text, using proportional spacing
Print Part $(4), a$
Print
' We can display right to left using
' the legend statement which render text at a given
' graphic point, specify the font type and font size
' and optional:
' rotation angle, justification (1 for right, 2 for center, 3 for left)
' quality (0 or non-0, where 0 means no antialiasing)
' letter spacing in twips (not good for arabic language)
move 6000,6000
legend a$, "Arial", 32, pi/4, 2, 0
' Variables can use any unicode letter.
' Here we can't display it as in M2000 editor.
' in the editor we see at the left the variable name
' and at the right the value
القديم=10
Print القديم+1=11 ' true
</syntaxhighlight>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
Mathematica supports full Unicode throughout -- in strings, symbols, graphics and external operations -- allowing immediate streamlined use of all standard international character sets, integrated with native text entry.
The global variable $CharacterEncodings is an option for input and output functions which specifies what raw character encoding should be used.
 
=={{header|Nim}}==
 
Strings are assumed to be UTF-8 in Nim.
::– How easy is it to present Unicode strings in source code?

It is very easy, provided that the editor understands UTF-8. Indeed, Nim considers that source is encoded in UTF-8.

::– Can Unicode literals be written directly, or be part of identifiers/keywords/etc?

Unicode literals can be written directly in strings, again provided that the editor (and the font) are able to display them. It is of course possible to use the \uXXXX form. Identifiers may contain Unicode characters but the restrictions regarding the allowed characters apply. That means that the first character must be a letter (for instance “é” is allowed as is “Δ”) and the following characters may be letters, “_” or digits. For instance “x²” is allowed.
 
::– How well can the language communicate with the rest of the world?
Nim strings may contain any UTF-8 combination. Note however that, by default, no check is done regarding the validity of the string. This is the usual behavior for system languages for obvious performance reasons.
 
Note also that Nim strings are easily converted to C strings. Interoperability with C is an important feature of Nim (helped by the fact that C is one of the intermediate languages used to produce native code).
 
::– Is it good at input/output with Unicode?
 
Again, this is more a question regarding the environment than the language itself. For instance, if a terminal accepts Unicode strings as input and output, Nim programs will be able to read and write them as UTF-8 strings. So, in practice, with modern systems, reading and writing Unicode is flawless (I write this as a user of Unicode myself).
 
::– Is it convenient to manipulate Unicode strings in the language?
 
It depends. If manipulation consists of reading and writing, this is easy. But strings are not aware that their content is encoded in UTF-8. That means that if you loop on the string, you get eight-bit character values and not code points.
 
To operate on code points, Nim provides a module “unicode”. In this module, there are only two types: strings, which contain values encoded in UTF-8, and runes, which are, in fact, UTF-32 values. The module provides procedures to convert strings to and from sequences of runes. It also provides iterators to get the code points of a UTF-8 encoded string, a set of operations such as find or append and, of course, a procedure to check the validity of a string (which is not done by default).
 
::– How broad/deep does the language support Unicode?
 
The module “unicode” provides only basic functionalities. For more advanced Unicode handling, third-party modules are available from the community.
 
::– What encodings (e.g. UTF-8, UTF-16, etc) can be used?
 
UTF-8 is the default encoding. UTF-32 is available with the module “unicode”. The standard module “encodings” gives access to many other encodings. The available encodings depend on the operating system. On Unix/Linux, the “iconv” library is used. On Windows, this is the Windows API.
 
::– Does it support normalization?
 
The “unicode” module provides only basic Unicode handling. It doesn’t provide normalization. But there exists at least one library which offers this functionality: “nim-normalize” (https://github.com/nitely/nim-normalize).
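
A minimal sketch of the “unicode” module described above:
<syntaxhighlight lang="nim">import std/unicode

let s = "voilà"
echo s.len                  # 6 — length in bytes
echo s.runeLen              # 5 — length in code points
for r in s.runes:           # iterate over code points (runes)
  echo r
echo validateUtf8(s) == -1  # true — -1 means the string is valid UTF-8</syntaxhighlight>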
 
=={{header|Oforth}}==
 
All methods on strings are UTF8 manipulations.
 
=={{header|Ol}}==
 
* How easy is it to present Unicode strings in source code?
 
Easy. Just write the Unicode characters, or use the "\x...;" escape sequence if you wish.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
 
Yes, it can. Moreover, the short "λ" (the Unicode lambda character) is widely used in source code instead of the long "lambda" keyword; it is part of the language.
 
* How well can the language communicate with the rest of the world?
 
The FFI language extension automatically converts strings into external ANSI or UTF-16 strings (and vice versa).
 
* Is it good at input/output with Unicode?
 
Yes.
 
* Is it convenient to manipulate Unicode strings in the language?
 
Pretty well. There are no separate string functions for ANSI and Unicode strings. The only way to distinguish Unicode from ANSI is to check the string type directly (type-string for ANSI and type-unicode for Unicode).
 
* How broad/deep does the language support Unicode?
 
Deep.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
 
The internal Unicode string representation is plain Unicode (code points) without any encoding. Two built-in functions support UTF-8 encoding ('string->bytes' and 'bytes->string').
 
=={{header|Perl}}==
In Perl, "Unicode" means "UTF-8". If you want to include utf8 characters in your source file, unless you have set <code>PERL_UNICODE</code> environment correctly, you should do<syntaxhighlight lang Perl="perl">use utf8;</langsyntaxhighlight> or you risk the parser treating the file as raw bytes.
 
Inside the script, utf8 characters can be used both as identifiers and literal strings, and built-in string functions will respect it:<syntaxhighlight lang="perl">$四十二 = "voilà";
print "$四十二"; # voilà
print uc($四十二); # VOILÀ</syntaxhighlight>
or you can specify unicode characters by name or ordinal:<syntaxhighlight lang="perl">use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</syntaxhighlight>
 
Regular expressions also have support for unicode based on properties, for example, finding characters that are normally written from right to left:<syntaxhighlight lang="perl">print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</syntaxhighlight>
 
When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<syntaxhighlight lang="perl">open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</syntaxhighlight>
The default IO behavior can also be set via <code>PERL_UNICODE</code>.
 
However, when your program interacts with the environment, you may still run into tricky spots if you have incompatible locale settings or your OS is not using unicode; that's not what Perl has control over, unfortunately.
 
=={{header|Phix}}==
{{libheader|Phix/basics}}
Source code files can be ANSI or UTF-8.<br>
Strings containing the escape sequences \xHH, \uHHHH, or \UHHHHHHHH are permitted, with 16 and 32-bit unicode points converted to a UTF-8 substring.<br>
=={{header|PicoLisp}}==
PicoLisp can directly handle _only_ Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: They must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in 8859 encoding:
<syntaxhighlight lang="picolisp">(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</syntaxhighlight>
 
=={{header|Pike}}==
All strings in Pike are Unicode internally. The charset of the source can be changed at any line with the "#charset" preprocessor directive. The default charset is ISO-8859-1, and any of the ~400 charsets supported in the Charset module can be used in source code. It is also possible to implement your own charset and load it with #charset directives.

Regardless of source charset it's always possible to enter strings as literals.

All I/O is untouched byte streams, but since the terminal probably does not want a stream of Unicode we manually encode it as UTF-8 before writing it out.
 
<syntaxhighlight lang="pike">
#charset utf8
void main()
{
    string nånsense = "\u03bb \0344 \n";
    string hello = "你好";
    string 水果 = "pineapple";
    string 真相 = sprintf("%s, %s goes really well on pizza\n", hello, 水果);
    write( string_to_utf8(真相) );
    write( string_to_utf8(nånsense) );
}
</syntaxhighlight>
{{Out}}
<pre>
你好, pineapple goes really well on pizza
λ ä
</pre>
 
=={{header|PowerShell}}==
 
Unicode escape sequence (added in PowerShell 6<ref>https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-7.3</ref>):
 
<syntaxhighlight lang="powershell">
# `u{x}
"I`u{0307}" # => İ
</syntaxhighlight>
 
=={{header|Python}}==
Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
<syntaxhighlight lang="python">#!/usr/bin/env python
# -*- coding: latin-1 -*-
 
u = 'abcdé'
print(ord(u[-1]))</syntaxhighlight>
In Python 3, the default encoding is UTF-8. Before that it was ASCII.
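
Unicode handling beyond literals lives mostly in the standard library; a minimal sketch of normalization with unicodedata:
<syntaxhighlight lang="python">import unicodedata

s = "voil\u00e0"    # "voilà" with a precomposed à
t = "voila\u0300"   # "voilà" built with a combining grave accent
print(s == t)                                 # False: different code points
print(unicodedata.normalize("NFC", t) == s)   # True after normalization
print(len(s), len(t))                         # 5 6 — lengths in code points</syntaxhighlight>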
 
=={{header|Racket}}==
 
<syntaxhighlight lang="racket">
#lang racket
 
Line 958 ⟶ 1,254:
;; and in fact the standard language makes use of some of these
(λ(x) x) ; -> an identity function
</syntaxhighlight>
 
Further points:
 
* Racket includes additional related functionality, like some Unicode functions (normalization etc), and IO encoding based on iconv to do IO of many other encodings.
 
=={{header|Raku}}==
(formerly Perl 6)
 
Raku programs and strings are all in Unicode and operate at a grapheme abstraction level, which is agnostic to underlying encodings. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:
 
<syntaxhighlight lang="raku" line>sub prefix:<∛> (\𝐕) { 𝐕 ** ⅓ }
say ∛27; # prints 3</syntaxhighlight>
 
Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.
 
Raku tracks the Unicode consortium standards releases and is generally up to the latest standard within a few months or so of its release. (currently at 15.0 as of February 2023)
 
* Supports the normalized forms NFC, NFD, NFKC, and NFKD, and character equivalence as specified in [http://unicode.org/reports/tr15/ Unicode technical report #15].
* Built-in routines provide access to character classifications (Letter, Numeric. White-space, etc) and sub-classifications: (Letter-lowercase, Letter-uppercase, Numeric-digit, etc.)
* Allows users to use any Unicode character that has a Numeric property '''as''' that numeric value.
* Provides Unicode aware upper-case, lower-case and fold-case routines.
* Implements the [https://unicode.org/reports/tr10/ Unicode technical standard #10] collation algorithm, (though not all optional mappings are supported yet).
* Provides built-in routines to access character names, do name-to-character character-to-ordinal and ordinal-to-character conversions.
* Works seamlessly with upper plane and private use plane character codepoints.
* Provides tools to deal with strings that contain invalid Unicode characters.
 
In general, it tries to make dealing with Unicode "just work".
 
Raku intends to support Unicode even better than Perl 5, which already does a great job in recent versions of accessing large swaths of Unicode spec. functionality. Raku improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.
 
A very important distinctive characteristic of Raku to keep in mind is that it applies normalization (Unicode NFC form (Normalization Form Canonical)) automatically by default to all strings as showcased and explained on the [[String comparison#Unicode_normalization_by_default|String comparison page]].
 
=={{header|REXX}}==
(Modeled after the AWK entry.)
 
 
''How well prepared is the programming language for Unicode?'' &nbsp; &nbsp; ─── &nbsp; Not really prepared, &nbsp; REXX is a language for manipulating ASCII or EBCDIC eight-bit characters.
 
''How easy is it to present Unicode strings in source code?'' &nbsp; &nbsp; ─── &nbsp; Somewhat easy, &nbsp; they can be represented in hexadecimal.
 
''Can Unicode literals be written directly?'' &nbsp; &nbsp; ─── &nbsp; No.
 
''Can Unicode glyphs be part of identifiers/keywords/etc?'' &nbsp; &nbsp; ─── &nbsp; No.
 
''How well can the language communicate with the rest of the world?'' &nbsp; &nbsp; ─── &nbsp; The language is not good at external communications via the internet, but can utilize external tools.
 
''Is it good at input/output with Unicode?''&nbsp; &nbsp; ─── &nbsp; No.
 
''Is it convenient to manipulate Unicode strings in the language?'' &nbsp; &nbsp; ─── &nbsp; No.
 
''How broad/deep does the language support Unicode?'' &nbsp; &nbsp; ─── &nbsp; The language doesn't support Unicode.
 
''What encodings (e.g. UTF-8, UTF-16, etc) can be used?'' &nbsp; &nbsp; ─── &nbsp; There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
<br><br>
 
=={{header|Ring}}==
<syntaxhighlight lang="ring">
see "Hello, World!"
 
func ringvm_see cText
    ring_see("I can play with the string before displaying it" + nl)
    ring_see("I can convert it :D" + nl)
    ring_see("Original Text: " + cText + nl)
    if cText = "Hello, World!"
        # Convert it from English to Hindi
        cText = "नमस्ते दुनिया!"
    ok
    ring_see("Converted To (Hindi): " + cText + nl)
</syntaxhighlight>
{{out}}
<pre>
I can play with the string before displaying it
I can convert it :D
Original Text: Hello, World!
Converted To (Hindi): नमस्ते दुनिया!
</pre>
 
=={{header|Ruby}}==
Ruby has hardly any specific support for Unicode; however, since it focuses on encodings (exactly 100 encodings are supported in Ruby 2.1.0), it includes pretty much all known Unicode Transformation Formats, including UTF-8, which is the default encoding since 2.1.0.
 
Most Unicode support is to be found in the Regexp engine, for instance /\p{Sc}/ matches everything from the Symbol: Currency category; \p{} matches a character’s Unicode script, like /\p{Linear_B}/.
 
Unicode strings are no problem:
 
<langsyntaxhighlight lang="ruby">str = "你好"
str.include?("好") # => true</langsyntaxhighlight>
 
Unicode code is no problem either:
 
<langsyntaxhighlight lang="ruby">def Σ(array)
array.inject(:+)
end
 
puts Σ([4,5,6]) #=>15
</syntaxhighlight>
Ruby 2.2 introduced a method to normalize unicode strings:
<langsyntaxhighlight lang="ruby">
p bad = "¿como\u0301 esta\u0301s?" # => "¿comó estás?"
p bad.unicode_normalized? # => false
p bad.unicode_normalize! # => "¿comó estás?"
p bad.unicode_normalized? # => true
</syntaxhighlight>
 
Since Ruby 2.4 Ruby strings have full Unicode case mapping.
The unicode gem (an external library) is for difficult things like lowercase/uppercase outside the ASCII region.
 
=={{header|Scala}}==
{{libheader|Scala}}
<langsyntaxhighlight lang="scala">object UTF8 extends App {
 
def charToInt(s: String) = {
val a = "$abcde¢£¤¥©ÇßçIJijŁłʒλπ•₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₵←→⇒∙⌘☺☻ア字文𪚥".
map(c => "%s\t\\u%04X".format(c, c.toInt)).foreach(println)
}</syntaxhighlight>
{{out}}
<pre style="height:20ex;overflow:scroll">true true
𪚥
𝄞
$ \u0024
a \u0061
b \u0062
c \u0063
d \u0064
e \u0065
¢ \u00A2
£ \u00A3
¤ \u00A4
¥ \u00A5
© \u00A9
Ç \u00C7
ß \u00DF
ç \u00E7
IJ \u0132
ij \u0133
Ł \u0141
ł \u0142
ʒ \u0292
λ \u03BB
π \u03C0
\u2022
\u20A0
\u20A1
\u20A2
\u20A3
\u20A4
\u20A5
\u20A6
\u20A7
\u20A8
\u20A9
\u20AA
\u20AB
\u20AC
\u20AD
\u20AE
\u20AF
\u20B0
\u20B1
\u20B2
\u20B3
\u20B4
\u20B5
\u20B5
\u2190
\u2192
\u21D2
\u2219
\u2318
\u263A
\u263B
\u30A2
\u5B57
\u6587
? \uD869
? \uDEA5
\uF8FF
</pre>
 
=={{header|Swift}}==
{{libheader|Swift}}
Swift has an [https://swiftdoc.org/v5.1/type/string/ advanced string type] that defaults to i18n operations and exposes encoding through views:
 
<syntaxhighlight lang="swift">let flag = "🇵🇷"
print(flag.count)
// Prints "1"
print(flag.unicodeScalars.count)
// Prints "2"
print(flag.utf16.count)
// Prints "4"
print(flag.utf8.count)
// Prints "8"
 
let nfc = "\u{01FA}"//Ǻ LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
let nfd = "\u{0041}\u{030A}\u{0301}"//Latin Capital Letter A + ◌̊ COMBINING RING ABOVE + ◌́ COMBINING ACUTE ACCENT
let nfkx = "\u{FF21}\u{030A}\u{0301}"//Fullwidth Latin Capital Letter A + ◌̊ COMBINING RING ABOVE + ◌́ COMBINING ACUTE ACCENT
print(nfc == nfd) //NFx: true
print(nfc == nfkx) //NFKx: false
</syntaxhighlight>
 
Swift [https://forums.swift.org/t/string-s-abi-and-utf-8/17676 apparently uses a null-terminated char array] for storage to provide compatibility with C, but does a lot of work under the covers to make things more ergonomic:
 
<blockquote>Although strings in Swift have value semantics, strings use a copy-on-write strategy to store their data in a buffer. This buffer can then be shared by different copies of a string. A string’s data is only copied lazily, upon mutation, when more than one string instance is using the same buffer. Therefore, the first in any sequence of mutating operations may cost O(n) time and space.
</blockquote>
 
See also:
 
* 'smol': [https://swift.org/blog/utf8-string/ a stack allocated string type] which can "store up to" 10 (on 32 bit systems) or 15 (on 64 bit systems) UTF-8 "code units" (code points?).
* [https://forums.swift.org/t/string-s-abi-and-utf-8/17676 String’s ABI and UTF-8] mentions data-structures that can be used to share the backing UTF-8 data.
 
=={{header|Rust}}==
 
Source code must be encoded in UTF-8.
Non-ASCII characters are, however, acceptable only in character and string literals.
Literals may specify Unicode characters in the form of escape sequences as well.
An escape sequence has the form of <code>\u{X}</code> where <code>X</code> is the hexadecimal code of the character (up to 6 digits).
 
Unicode characters can be represented by built-in type <code>char</code>.
Unicode strings can be represented by several types, respecting the ownership of the string.
The most primitive string type is built-in type <code>str</code>, called also string slice.
Other string types allow usually borrowing a string slice for accessing the actual string data.
 
String slices are stored as UTF-8 encoded byte sequences.
This kind of representation does not allow string indexing in constant time, which is usual in many other languages.
As the result, string handling requires usually a slightly different approach that leverages general concepts like slices and iterators.
 
Rust is very strict in correct Unicode representation.
It has even distinct types for filesystem paths, because these may contain byte sequences that do not form valid Unicode characters.
Conversions between different string representations may result in errors rather than producing an invalid or inexact result.
Some inexact (lossy) conversions can be requested explicitly.
 
Besides UTF-8, the standard library provides functions for handling UTF-16.
More advanced Unicode operations (e.g., segmentation into graphemes) are available in third-party libraries (crates).
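
A minimal sketch of the byte/code-point distinction described above:
<syntaxhighlight lang="rust">fn main() {
    let s = "voilà";                    // a &str is always valid UTF-8
    println!("{}", s.len());            // 6 — length in bytes
    println!("{}", s.chars().count());  // 5 — length in code points
    // No constant-time indexing; iterate with byte offsets instead:
    for (i, c) in s.char_indices() {
        println!("byte {}: {}", i, c);
    }
    println!("{}", '\u{2708}');         // ✈ — escape with up to 6 hex digits
}</syntaxhighlight>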
 
 
=={{header|Seed7}}==
=={{header|Sidef}}==
Sidef uses UTF-8 encoding for pretty much everything, such as source files, chars, strings, stdout, stderr and stdin.
<langsyntaxhighlight lang="ruby"># International class; name and street
class 国際( なまえ, Straße ) {
 
Line 1,124 ⟶ 1,548:
民族.each { |garçon|
garçon.言え;
}</syntaxhighlight>
{{out}}
<pre>
Line 1,131 ⟶ 1,555:
I am Stanisław Lec from południow
</pre>
 
=={{header|Stata}}==
 
See ''[https://www.stata.com/features/overview/unicode/ Unicode support]'' on Stata web site. See also the help on [https://www.stata.com/help.cgi?unicode Unicode utilities], and the section 12.4.2 "Handling Unicode strings" of the PDF [https://www.stata.com/manuals/u.pdf User's guide]. Unicode support was added in Stata 14.
 
# How easy is it to present Unicode strings in source code?
#:One can include any Unicode character in the source code. Code is stored as UTF-8 text files with extension .do, .ado or .mata. The ''Output window'' can print Unicode characters as well.
# Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
#:Yes. Unicode literals can be part of variable names (in all places : datasets, scalar and matrix variables, and Mata variables).
# How well can the language communicate with the rest of the world?
#:Stata datasets (extension .dta) are stored in UTF-8. I/O with CSV files can use any encoding supported by Java (see the list [https://docs.oracle.com/en/java/javase/11/intl/supported-encodings.html here]). There are also commands to convert legacy .dta files and text files to Unicode, see the link to Unicode utilities above.
# Is it good at input/output with Unicode?
#:Yes.
# Is it convenient to manipulate Unicode strings in the language?
#:Stata has string functions to manipulate Unicode strings (see the example after this list). It also has legacy functions to manipulate strings as byte sequences: the Unicode flavor is prefixed by "u". For instance, ''strtrim'' for the ASCII function and ''ustrtrim'' for the Unicode function.
# How broad/deep does the language support Unicode?
#:Unicode support is good. There is one missing function: while it's easy to get the character from the numeric value of a Unicode code point, with the [https://www.stata.com/help.cgi?uchar() uchar] function, the converse is not easy. However, it's possible to convert a Unicode string to ''escaped'' hex values, e.g. <code>ustrtohex("Ж")</code> returns "\u0416" and <code>ustrtohex("🖖")</code> returns "\U0001f596". The converse operation is done with [https://www.stata.com/help.cgi?ustrunescape() ustrunescape]. Depending on the font used, some characters may not print correctly: for instance the [https://en.wikipedia.org/wiki/Vulcan_salute Vulcan salute] character is not rendered by Courier New, even though it's stored correctly, as shown by the result of ustrtohex.
# What encodings (e.g. UTF-8, UTF-16, etc) can be used?
#:Data and code are stored in UTF-8. I/O with CSV data files can be done in any encoding supported by Java, which includes UTF-8, UTF-16 and UTF-32 with and without BOM.
# Does it support normalization?
#:Yes. See the help for the [https://www.stata.com/help.cgi?ustrnormalize() ustrnormalize] function. It supports the NFC, NFD, NFKC and NFKD [https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization forms].
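
A minimal sketch contrasting the legacy byte-oriented functions with their u-prefixed Unicode counterparts (function names as documented above):
<syntaxhighlight lang="stata">display strlen("café")    // 5 — length in bytes
display ustrlen("café")   // 4 — length in Unicode characters
display ustrtohex("Ж")    // \u0416
display uchar(1046)       // Ж</syntaxhighlight>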
 
=={{header|Tcl}}==

=={{header|TXR}}==
 
Japanese test case:
<syntaxhighlight lang="txr">@{TITLE /[あ-ん一-耙]+/} (@ROMAJI/@ENGLISH)
@(freeform)
@(coll)@{STANZA /[^\n\x3000 ]+/}@(end)@/.*/
</syntaxhighlight>
 
Test data: Japanese traditional song:
 
* How broad/deep does the language support Unicode? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.
 
=={{header|Vala}}==
 
Vala strings are UTF-8 encoded by default. In order to print them correctly on the screen, use stdout.printf instead of print.
<syntaxhighlight lang="vala">stdout.printf ("UTF-8 encoded string. Let's go to a café!");</syntaxhighlight>
 
=={{header|Visual Basic .NET}}==
See the C# for some general information about the .NET runtime.
Below is an example of certain parts based of the information in the D entry.
<syntaxhighlight lang="text">Module Module1
 
Sub Main()
Console.OutputEncoding = Text.Encoding.Unicode
 
' normal identifiers allowed
Dim a = 0
' unicode characters allowed
Dim δ = 1
 
' ascii strings
Console.WriteLine("some text")
' unicode strings strings
Console.WriteLine("こんにちは")
Console.WriteLine("Здравствуйте")
Console.WriteLine("שלום")
' escape sequences
Console.WriteLine(vbTab + "text" + vbTab + ChrW(&H2708) + """blue")
Console.ReadLine()
End Sub
 
End Module</syntaxhighlight>
{{out}}
<pre>some text
こんにちは
Здравствуйте
שלום
text ✈"blue</pre>
 
=={{header|WDTE}}==
 
WDTE supports Unicode in both identifiers and strings. WDTE is very loose about identifier rules: if it doesn't conflict with a syntactic structure, such as a keyword, literal, or operator, then it's allowed as an identifier.
 
<syntaxhighlight lang="wdte">let プリント t => io.writeln io.stdout t;
 
プリント 'これは実験です。';</syntaxhighlight>
 
=={{header|Wren}}==
{{libheader|Wren-upc}}
Wren source code files are interpreted as UTF-8 encoded and so it is easy to include Unicode strings within scripts.
 
Although Unicode literals can be written directly, identifiers or keywords are limited to ASCII letters/digits or underscores.
 
The String type represents an immutable sequence of bytes and is usually interpreted as UTF-8 but doesn't have to be. It has methods to represent a string as either a list of bytes or a list of Unicode code-points.
 
If strings are iterated directly, they are considered to be a list of Unicode 'characters'. Likewise, the 'count' property returns the number of such characters, not the number of bytes.
 
However, string indexing (including methods which use or return an index) is done by byte offset as the use of code-point indexing is relatively inefficient.
 
The standard library does not support normalization but the above module does allow one to split a string into ''user perceived characters'' (or ''graphemes'').
<syntaxhighlight lang="wren">import "./upc" for Graphemes
 
var w = "voilà"
for (c in w) {
    System.write("%(c) ")  // prints the 5 Unicode 'characters'
}
System.print("\nThe length of %(w) is %(w.count)")
 
 
System.print("\nIts code-points are:")
for (cp in w.codePoints) {
    System.write("%(cp) ")  // prints the code-points as numbers
}
 
System.print("\n\nIts bytes are: ")
for (b in w.bytes) {
    System.write("%(b) ")  // prints the bytes as numbers
}
 
var zwe = "👨‍👩‍👧"
System.print("\n\n%(zwe) has:")
System.print(" %(zwe.bytes.count) bytes: %(zwe.bytes.toList.join(" "))")
System.print(" %(zwe.codePoints.count) code-points: %(zwe.codePoints.toList.join(" "))")
System.print(" %(Graphemes.clusterCount(zwe)) grapheme")</syntaxhighlight>
 
{{out}}
<pre>
v o i l à
The length of voilà is 5
 
Its code-points are:
118 111 105 108 224
 
Its bytes are:
118 111 105 108 195 160
 
👨‍👩‍👧 has:
18 bytes: 240 159 145 168 226 128 141 240 159 145 169 226 128 141 240 159 145 167
5 code-points: 128104 8205 128105 8205 128103
1 grapheme
</pre>
 
=={{header|zkl}}==

=={{header|ZX Spectrum Basic}}==
 
* How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings. A decoder and output routine would need to be written, but this is easy to do on the Spectrum.
 
=={{header|Zig}}==
 
The encoding of a string in Zig is de-facto assumed to be UTF-8. Because Zig source code is UTF-8 encoded, any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string in the Zig program; the bytes are not modified by the compiler. However, it is possible to embed non-UTF-8 bytes into a string literal using \xNN notation.<ref>[https://ziglang.org/documentation/master/#String-Literals-and-Unicode-Code-Point-Literals Zig Documentation - String Literals and Unicode Code Point Literals]</ref>
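
A minimal sketch, assuming a recent Zig toolchain and the std.unicode helpers:
<syntaxhighlight lang="zig">const std = @import("std");

pub fn main() !void {
    const s = "voilà"; // string literals are UTF-8 byte sequences
    std.debug.print("{d} bytes\n", .{s.len}); // 6
    const n = try std.unicode.utf8CountCodepoints(s);
    std.debug.print("{d} code points\n", .{n}); // 5
}</syntaxhighlight>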
 
 
 
 
{{omit from|GUISS}}