Unicode strings

From Rosetta Code
Revision as of 16:44, 1 July 2011 by rosettacode>TimToady (→‎{{header|Perl 6}}: answer more of the questions :))
Unicode strings is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Demonstrate how one is expected to handle Unicode strings. Some example considerations: can a Unicode string be directly written in the source code? How does one do I/O with Unicode strings? Can these strings be manipulated easily? Can non-ASCII characters be used for keywords, identifiers, etc.? What encodings (UTF-8, UTF-16, etc.) can your language accept without much trouble?

J

Unicode characters can be represented directly in J strings:

<lang j>   '♥♦♣♠'
♥♦♣♠</lang>

By default, they are represented as utf-8:

<lang j>   #'♥♦♣♠'
12</lang>

However, they can be represented as utf-16 instead:

<lang j>   7 u:'♥♦♣♠'
♥♦♣♠
   #7 u:'♥♦♣♠'
4</lang>

These forms are not treated as equivalent:

<lang j>   '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</lang>

unless the character literals themselves are equivalent:

<lang j>   'abcd'-:7 u:'abcd'
1</lang>
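The lengths and comparisons above follow from the encodings themselves. As a rough cross-check outside J, the following Python sketch lists the 8-bit units and the 16-bit units of a string. The helper names `utf8_units` and `utf16_units` are made up for this illustration, and the sketch does not model J's `u:` itself.

```python
def utf8_units(s):
    """Return the UTF-8 encoding of s as a list of 8-bit values."""
    return list(s.encode("utf-8"))


def utf16_units(s):
    """Return the UTF-16 encoding of s as a list of 16-bit code units."""
    raw = s.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


suits = "\u2665\u2666\u2663\u2660"  # the four card suits used above

print(len(utf8_units(suits)))                      # 12: three bytes per character
print(len(utf16_units(suits)))                     # 4: one 16-bit unit per character
print(utf8_units(suits) == utf16_units(suits))     # False: the unit sequences differ
print(utf8_units("abcd") == utf16_units("abcd"))   # True: ASCII has the same values in both
```

For ASCII text each character is a single unit with the same numeric value in both encodings, which is why only the non-ASCII comparison fails.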

Output uses characters in whatever format they happen to be in. Character input assumes 8-bit characters but places no additional interpretation on them.

See also: http://www.jsoftware.com/help/dictionary/duco.htm

Unicode characters are not legal tokens or names in current versions of J.

Perl

In Perl, "Unicode" means "UTF-8". If you want to include UTF-8 characters in your source file, you should do<lang Perl>use utf8;</lang> or the parser will treat the file as raw bytes. The PERL_UNICODE environment variable does not change how the source is parsed; it only affects I/O defaults.

Inside the script, UTF-8 characters can be used both in identifiers and in literal strings, and built-in string functions will respect them:<lang Perl>$四十二 = "voilà";
print "$四十二";      # voilà
print uc($四十二);    # VOILÀ</lang>
or you can specify Unicode characters by name or ordinal:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>

Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>
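The same selection can be expressed without a regular expression by asking for each character's bidirectional class directly. This is a minimal Python sketch of that idea, not Perl's mechanism; the function name `right_to_left_chars` is hypothetical, and the sample string is written with escapes so that the vowel points are visible as separate code points.

```python
import unicodedata


def right_to_left_chars(text):
    """Keep only the characters whose Unicode bidirectional class is 'R'."""
    return "".join(ch for ch in text if unicodedata.bidirectional(ch) == "R")


# "Say " followed by five Hebrew letters with three vowel points mixed in.
sample = "Say \u05e2\u05b4\u05d1\u05b0\u05e8\u05b4\u05d9\u05ea"

# The Latin letters (class L), the space (class WS) and the vowel points
# (class NSM) are dropped; only the five Hebrew letters remain.
print(right_to_left_chars(sample))
```

The vowel points disappear because they are non-spacing marks rather than right-to-left letters, which matches the unpointed output shown in the Perl comment.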

When it comes to I/O, one should specify whether a file is to be opened in UTF-8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</lang> The default I/O behavior can also be set with PERL_UNICODE.
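The difference between the two modes is easiest to see by writing a string through an encoding layer and reading the same file back as bytes. The following Python sketch illustrates the general idea only; it is not Perl's I/O layer system, and the scratch file is a temporary file created just for the demonstration.

```python
import os
import tempfile

text = "voil\u00e0"  # "voilà", the sample string used earlier

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text mode with an explicit encoding: characters are encoded on write.
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)

    # Binary mode: the same file read back as uninterpreted bytes.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()

    # Text mode again: the bytes are decoded back into characters.
    with open(path, "r", encoding="utf-8") as inp:
        decoded = inp.read()
finally:
    os.remove(path)

print(len(raw_bytes))  # 6: the accented letter takes two bytes in UTF-8
print(len(decoded))    # 5: five characters
```

Reading in the wrong mode does not fail loudly; it just yields a different length and different contents, which is why the mode should be stated explicitly.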

However, when your program interacts with the environment, you may still run into tricky spots if you have incompatible locale settings or your OS is not using Unicode; unfortunately, that is not something Perl has control over.

Perl 6

Perl 6 programs and strings are all in Unicode, specced (but not yet entirely implemented) to operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:

<lang perl6>sub prefix:<∛> ($𝕏) { $𝕏 ** (1/3) }
say ∛27; # prints 3</lang>

Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.
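A loosely similar split between byte buffers and Unicode strings exists in other languages. The Python sketch below is offered only as an analogy and makes no claim about how Perl 6's Buf and Str behave in detail; the function name `shout` is made up for the illustration.

```python
def shout(buffer, encoding="utf-8"):
    """Decode a byte buffer explicitly, then apply a Unicode-aware operation."""
    return buffer.decode(encoding).upper()


data = "voil\u00e0".encode("utf-8")  # a byte buffer holding UTF-8 data

# On the raw buffer only ASCII letters are affected; the two bytes of the
# accented letter are left untouched.
print(data.upper())   # b'VOIL\xc3\xa0'

# After an explicit conversion the operation sees whole characters.
print(shout(data))    # VOILÀ

# Mixing the two kinds of value without a conversion is rejected.
try:
    data + "!"
except TypeError as err:
    print("refused:", type(err).__name__)
```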

As things develop, Perl 6 intends to support Unicode even better than Perl 5, which already does a great job in recent versions of accessing nearly all Unicode 6.0 functionality. Perl 6 improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.

PicoLisp

PicoLisp can directly handle _only_ Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: they must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in ISO-8859-15 encoding: <lang PicoLisp>(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</lang>
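What the external converter does to the data can be sketched in a few lines of Python. This is an illustration of the transcoding step only, not of PicoLisp or of iconv itself; the function name `latin9_to_utf8` and the sample bytes are made up for the example.

```python
def latin9_to_utf8(raw):
    """Re-encode ISO-8859-15 bytes as UTF-8 bytes."""
    return raw.decode("iso-8859-15").encode("utf-8")


# In ISO-8859-15 the single byte 0xA4 is the euro sign.
legacy = b"100 \xa4"

converted = latin9_to_utf8(legacy)
print(converted)                  # b'100 \xe2\x82\xac'
print(converted.decode("utf-8"))  # 100 €
```

The one-byte euro sign becomes three bytes in UTF-8, so the conversion has to happen before the data reaches code that assumes UTF-8.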

ZX Spectrum Basic

The ZX Spectrum does not have native Unicode support. It does support user-defined graphics, which makes it possible to create custom characters in the UDG area. However, there is only 48k of memory available on a traditional rubber-key ZX Spectrum (or 128k on some of the plus versions), and 510k would be needed to store the Unicode characters for display, so Unicode is not really viable on this platform.