Unicode strings

From Rosetta Code
Revision as of 18:14, 22 July 2011 by Sonia (talk | contribs) (Go explanation)
You are encouraged to solve this task according to the task description, using any language you may know.

The task is to demonstrate how one is expected to handle Unicode strings. The provided solution should optionally show:

  • How Unicode strings are represented in source code
  • How to perform input and output using Unicode strings
  • Examples of string manipulation
  • Encodings supported by the language (e.g. UTF-8, UTF-16, etc.)

See also: Unicode variable names

C

C is not the most Unicode-friendly language, to put it mildly. Generally, using Unicode in C requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source might be a bad idea, too; it's safer to use their hex values. Here's a short example of the simplest string handling: printing it.<lang C>#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

/* wchar_t is the standard type for wide chars; what it is internally
 * depends on the compiler.
 */
wchar_t poker[] = L"♥♦♣♠";
wchar_t four_two[] = L"\x56db\x5341\x4e8c";

int main(void)
{
	/* Set the locale to alert C's multibyte output routines */
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Locale failure, check your env vars\n");
		return 1;
	}

#ifdef __STDC_ISO_10646__
	/* C99 compilers should understand these */
	printf("%lc\n", (wint_t)0x2708);	/* ✈ (%lc expects a wint_t) */
	printf("%ls\n", poker);			/* ♥♦♣♠ */
	printf("%ls\n", four_two);		/* 四十二 */
#else
	/* oh well */
	printf("airplane\n");
	printf("heart diamond club spade\n");
	printf("for ty two\n");
#endif

	return 0;
}</lang>

Go

Go source code is specified to be UTF-8 encoded. Identifiers such as variable and field names can contain non-ASCII characters. String literals containing non-ASCII characters are naturally represented as UTF-8. In addition, certain built-in features interpret strings as UTF-8. For example, <lang go>for i, rune := range "voilà" {
    fmt.Println(i, rune)
}</lang>

outputs

0 118
1 111
2 105
3 108
4 224

224 being the Unicode code point for the à character.

In contrast, <lang go>w := "voilà"
for i := 0; i < len(w); i++ {
    fmt.Println(i, w[i])
}</lang> outputs

0 118
1 111
2 105
3 108
4 195
5 160

Bytes 4 and 5 show the UTF-8 encoding of à. The expression w[i] in this case has type byte rather than int.
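
The difference between the two loops above can also be seen without looping. The following is a minimal sketch, assuming a current Go toolchain in which the UTF-8 helpers are imported as unicode/utf8; the helper name runeAndByteCounts is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAndByteCounts (hypothetical helper) reports the length of s
// in bytes and in Unicode code points.
func runeAndByteCounts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	w := "voilà"
	b, r := runeAndByteCounts(w)
	fmt.Println(b, r) // 6 5

	// Decode the code point that starts at byte offset 4.
	c, size := utf8.DecodeRuneInString(w[4:])
	fmt.Println(c, size) // 224 2
}
```

len counts bytes, so it agrees with the byte-indexed loop, while utf8.RuneCountInString agrees with the number of iterations of the range loop.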

The built-in string data type is not limited to UTF-8 and is simply a byte string that can hold arbitrary data and be interpreted as needed. In general, I/O is done on byte strings without any annotation of the encoding used. Proper interpretation of the byte strings must be understood, specified, or negotiated separately. In particular, there is no built-in or automatic handling of byte order marks.
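
As a rough illustration of the paragraph above, the sketch below stores a non-UTF-8 byte sequence in a string and strips a UTF-8 byte order mark by hand. It assumes a current Go toolchain; the names utf8BOM and stripBOM are invented for this example and are not part of any standard package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// utf8BOM is the UTF-8 encoding of U+FEFF, the byte order mark.
const utf8BOM = "\xef\xbb\xbf"

// stripBOM (hypothetical helper) removes a leading UTF-8 byte order
// mark if one is present; the caller has to ask for this explicitly.
func stripBOM(s string) string {
	return strings.TrimPrefix(s, utf8BOM)
}

func main() {
	// à stored as the single ISO-8859-1 byte 0xE0: a legal Go string,
	// but not valid UTF-8.
	latin1 := "voil\xe0"
	fmt.Println(len(latin1), utf8.ValidString(latin1)) // 5 false

	// Input as it might arrive from a file that starts with a BOM.
	input := utf8BOM + "voilà"
	fmt.Println(len(input), len(stripBOM(input))) // 9 6
}
```

Whether to strip, keep, or reject a byte order mark is left to the program, which matches the point that interpretation of the bytes is decided separately from the string type.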

The heavily used standard packages bytes and strings both have functions for working with strings both as UTF-8 and as encoding-unspecified bytes. The standard packages utf8, utf16, and unicode have additional functions.
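
A short sketch of a few such functions follows. It assumes a current Go toolchain, where the encoding packages are imported as unicode/utf8 and unicode/utf16; the helper utf16Units is a name chosen for this example only.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf16"
)

// utf16Units (hypothetical helper) returns the UTF-16 code units
// that encode the code points of s.
func utf16Units(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	s := "voilà ♥"

	// UTF-8 aware operations from package strings.
	fmt.Println(strings.ToUpper(s))        // VOILÀ ♥
	fmt.Println(strings.IndexRune(s, '♥')) // 7 (a byte offset)

	// Encoding-agnostic byte search from package bytes.
	fmt.Println(bytes.Count([]byte(s), []byte{0xc3})) // 1

	// Character classification from package unicode.
	fmt.Println(unicode.IsLetter('à'), unicode.IsLetter('♥')) // true false

	// U+1D54F lies outside the BMP and needs a surrogate pair.
	fmt.Println(len(utf16Units(s)), len(utf16Units("𝕏"))) // 7 2
}
```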

Currently there is no standard support for normalization.

J

Unicode characters can be represented directly in J strings:

<lang j>   '♥♦♣♠'
♥♦♣♠</lang>

By default, they are represented as utf-8:

<lang j>   #'♥♦♣♠'
12</lang>

However, they can be represented as utf-16 instead:

<lang j>   7 u:'♥♦♣♠'
♥♦♣♠
   #7 u:'♥♦♣♠'
4</lang>

These forms are not treated as equivalent:

<lang j>   '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</lang>

unless the character literals themselves are equivalent:

<lang j>   'abcd'-:7 u:'abcd'
1</lang>

Output uses characters in whatever format they happen to be in. Character input assumes 8 bit characters but places no additional interpretation on them.

See also: http://www.jsoftware.com/help/dictionary/duco.htm

Unicode characters are not legal tokens or names in current versions of J.

Locomotive Basic

The Amstrad CPC464 does not have native Unicode support. It is possible to represent Unicode by using ASCII-based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. However, there is only 48k of memory available and 510k would be needed to store the Unicode characters for display, so Unicode is not really viable on this platform.

Perl

In Perl, "Unicode" means "UTF-8". If you want to include UTF-8 characters in your source file, then unless you have set the PERL_UNICODE environment variable correctly, you should do<lang Perl>use utf8;</lang> or you risk the parser treating the file as raw bytes.

Inside the script, UTF-8 characters can be used both in identifiers and in literal strings, and built-in string functions will respect them:<lang Perl>$四十二 = "voilà";
print "$四十二";     # voilà
print uc($四十二);   # VOILÀ</lang> or you can specify Unicode characters by name or ordinal:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>

Regular expressions also support Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>

When it comes to I/O, one should specify whether a file is to be opened in UTF-8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</lang> The default I/O behavior can also be set in PERL_UNICODE.

However, when your program interacts with the environment, you may still run into tricky spots if you have incompatible locale settings or your OS is not using Unicode; unfortunately, that is not something Perl has control over.

Perl 6

Perl 6 programs and strings are all in Unicode, specced (but not yet entirely implemented) to operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:

<lang perl6>sub prefix:<∛> ($𝕏) { $𝕏 ** (1/3) }
say ∛27; # prints 3</lang>

Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.

As things develop, Perl 6 intends to support Unicode even better than Perl 5, which already does a great job in recent versions of accessing nearly all Unicode 6.0 functionality. Perl 6 improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.

PicoLisp

PicoLisp can directly handle _only_ Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: they must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in ISO-8859-15 encoding: <lang PicoLisp>(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</lang>

Tcl

All characters in Tcl are always Unicode characters, with ordinary string operations (as listed elsewhere on Rosetta Code) always performed on Unicode. Input and output characters are translated from and to the system's native encoding automatically; this can be overridden on a per file-handle basis via fconfigure -encoding. Source files can be written in encodings other than the native encoding (from Tcl 8.5 onwards, the encoding to use for a file can be controlled by the -encoding option to tclsh, wish and source), though it is usually recommended that programmers maximize their portability by writing in the ASCII subset and using the \uXXXX escape sequence for all other characters. Tcl does not handle byte-order marks by default, because that requires deeper understanding of the application level (and sometimes the encoding information is available in metadata anyway, such as when handling HTTP connections).

The way in which characters are encoded in memory is not defined by the Tcl language (the implementation uses byte arrays, UTF-16 arrays and UCS-2 strings as appropriate) and the only characters with any restriction on use as command or variable names are the ASCII parenthesis and colon characters. However, the $var shorthand syntax is much more restricted (to ASCII alphanumeric plus underline only); other cases have to use the more verbose form: [set funny–var–name].

UNIX Shell

The Bourne shell does not have any inbuilt Unicode functionality. However, Unicode can be represented as ASCII-based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. The shell will produce its output in ASCII, but can call other programs to produce the Unicode output. The shell does not have any inbuilt string manipulation utilities, so it uses external tools such as cut, expr, grep, sed and awk. These would typically manipulate the hexadecimal sequences to provide string manipulation, or dedicated Unicode-based tools could be used.

ZX Spectrum Basic

The ZX Spectrum does not have native Unicode support. However, it does support user defined graphics, which makes it possible to create custom characters in the UDG area. It is possible to represent Unicode by using ASCII-based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. However, there is only 48k of memory available on a traditional rubber key ZX Spectrum (or 128k on some of the plus versions), and 510k would be needed to store the Unicode characters for display, so Unicode is not really viable on this platform.