Unicode strings

As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, [[Unicode]] is your best friend.

It is a very capable and [https://www.youtube.com/watch?v=MijmeoH9LT4 remarkable] tool, but also quite complex compared to older single- and double-byte character encodings.
 
How well prepared is your programming language for Unicode?
How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc.) can be used? - There is no built-in support for Unicode, but any encoding can be represented through hexadecimal strings.
 
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
* How easy is it to present Unicode strings in source code?
ELENA supports both UTF-8 and UTF-16 strings; Unicode identifiers are also supported:
 
ELENA 6.x:
<syntaxhighlight lang="elena">public program()
{
    var 四十二 := 42;         // a number bound to a Unicode identifier
    var строка := "Привет"w;  // UTF16 string
    console.writeLine(строка);
    console.writeLine(四十二);
}</syntaxhighlight>
{{out}}
=={{header|Erlang}}==
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
 
=={{header|FreeBASIC}}==
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
 
* How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE. These files can be freely mixed with other source or header files in the same project.
 
* Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet; you just need a text editor that saves in one of the Unicode formats mentioned above.
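
A minimal sketch (untested): <code>\u</code> escapes are interpreted inside escaped (<code>!"..."</code>) string literals, so Unicode text can also be written without leaving ASCII:

<syntaxhighlight lang="freebasic">' "café" built from a \u escape in an escaped string literal
Dim w As WString * 16
w = !"caf\u00E9"
Print w
Sleep</syntaxhighlight>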
 
* How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings can be more complicated, since many of the usual string functions become more complex once wide characters are involved.
 
* Is it good at input/output with Unicode?
The <code>Open</code> function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier.
The <code>Input#</code> and <code>Line Input#</code> functions, as well as <code>Print#</code> and <code>Write#</code>, can be used normally, and any conversion between Unicode and ASCII is done automatically if necessary. The <code>Print</code> function also supports Unicode output.
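
A small sketch of round-tripping text through a UTF-8 file (untested; it assumes the "utf8" spelling of the encoding specifier):

<syntaxhighlight lang="freebasic">Dim w As WString * 32 = !"caf\u00E9 \u4E16\u754C"

' Write a UTF-8 file; the encoding specifier handles the conversion.
Open "unicode.txt" For Output Encoding "utf8" As #1
Print #1, w
Close #1

' Read it back the same way.
Dim r As WString * 32
Open "unicode.txt" For Input Encoding "utf8" As #1
Line Input #1, r
Close #1

Print r
Sleep</syntaxhighlight>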
 
* Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in a string (<code>WString</code>), it does not support dynamic wide strings. However, some libraries included with FreeBASIC can decode UTF-8 into a <code>WString</code>.
 
* How broad/deep is the language's support for Unicode?
Unicode support in FreeBASIC is quite extensive, but not as deep as in some other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
 
* What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
 
* Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
 
For example, <syntaxhighlight lang="freebasic">' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
 
' Print the Unicode string
Print unicodeString
 
' Wait for the user to press a key before closing the console
Sleep</syntaxhighlight>
 
=={{header|Go}}==
1</syntaxhighlight>
 
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8, utf-16 or utf-32, and in each case the resulting literal strings would match. (8 u: string produces a utf-8 result.)
 
Output uses characters in whatever format they happen to be in.
 
=={{header|langur}}==
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
 
Identifiers are ASCII only. Comments and string literals may use Unicode.
 
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
A string or regex literal using an "any" modifier may include any code point (without using an escape sequence). Otherwise, they are restricted to Graphic, Space, and Private Use Area code points, and a select set of invisible spaces. The idea around the "allowed" characters is to keep source code from having hidden text or codes and to allay confusion and deception.
 
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with the b2s() and s2b() functions.
The following is an example of using the "any" modifier on a string literal.
 
<syntaxhighlight lang="langur">q:any"any code points here"</syntaxhighlight>
 
The len() function returns the number of code points in a string.
 
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
 
Interpolation modifiers allow limiting a string by code points or by graphemes.
 
See langurlang.org for more details.
 
[[File:Unicode print locomotive basic.png]]
 
=={{header|Lua}}==
 
By default, Lua doesn't support Unicode. Most string methods work properly only on the ASCII range, such as [[String case#Lua|case transformation]]. There is, however, a [https://www.lua.org/manual/5.4/manual.html#6.5 <code>utf8</code>] module that adds some very basic support with a very limited number of functions. For example, this module brings a new [[String length#Lua|length method]] adapted for UTF-8. But there is no method to transform the case of a Unicode string correctly. Overall, Unicode support is very limited and not available by default.
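
A small illustration (Lua 5.4) of the gap between the byte-based string functions and the <code>utf8</code> module:

<syntaxhighlight lang="lua">local s = "héllo"

print(#s)           --> 6, the length in bytes
print(utf8.len(s))  --> 5, the length in code points

-- Iterate code points rather than bytes.
for _, cp in utf8.codes(s) do
  io.write(utf8.char(cp))
end
print()

-- Case transformation remains ASCII-only.
print(s:upper())    --> HéLLO ("é" is left unchanged)</syntaxhighlight>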
 
=={{header|M2000 Interpreter}}==
λ ä
</pre>
 
=={{header|PowerShell}}==
 
Unicode escape sequence (added in PowerShell 6<ref>https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-7.3</ref>):
 
<syntaxhighlight lang="powershell">
# `u{x}
"I`u{0307}" # => İ
</syntaxhighlight>
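
Since PowerShell strings are .NET strings (sequences of UTF-16 code units), the .NET helpers are also available; a small sketch:

<syntaxhighlight lang="powershell">
# Build an astral-plane character from its code point.
[char]::ConvertFromUtf32(0x1F30D)   # => 🌍

# Length counts UTF-16 code units, so a surrogate pair counts as 2.
"🌍".Length                         # => 2
</syntaxhighlight>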
 
=={{header|Python}}==
(formerly Perl 6)
 
Raku programs and strings are all in Unicode and operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:
 
<syntaxhighlight lang="raku" line>sub prefix:<∛> (\𝐕) { 𝐕 ** (1/3) }
say ∛27; # prints 3</syntaxhighlight>
 
Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.
 
Raku tracks the Unicode consortium standards releases and is generally up to the latest standard within a few months or so of its release (currently at 15.0 as of February 2023).
 
* Supports the normalized forms NFC, NFD, NFKC, and NFKD, and character equivalence as specified in [http://unicode.org/reports/tr15/ Unicode technical report #15].
* Works seamlessly with upper plane and private use plane character codepoints.
* Provides tools to deal with strings that contain invalid Unicode characters.
 
In general, it tries to make dealing with Unicode "just work".
 
Raku intends to support Unicode even better than Perl 5, which already does a great job in recent versions of exposing large swaths of the Unicode specification's functionality. Raku improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensible and which are not.
 
A very important distinctive characteristic of Raku to keep in mind is that it applies normalization (Unicode NFC form (Normalization Form Canonical)) automatically by default to all strings as showcased and explained on the [[String comparison#Unicode_normalization_by_default|String comparison page]].
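
For example, grapheme-level semantics and the default NFC normalization can be observed directly:

<syntaxhighlight lang="raku">
# One grapheme but two code points: no precomposed "x with acute" exists.
say "x\x[0301]".chars; # 1
say "x\x[0301]".codes; # 2

# Precomposed and decomposed forms compare equal under default NFC.
say "é" eq "e\x[0301]"; # True
</syntaxhighlight>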
 
=={{header|REXX}}==
 
The standard library does not support normalization but the above module does allow one to split a string into ''user perceived characters'' (or ''graphemes'').
<syntaxhighlight lang="ecmascriptwren">import "./upc" for Graphemes
 
var w = "voilà"