Unicode strings

You are encouraged to solve this task according to the task description, using any language you may know.

As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, Unicode is your best friend. It is a very capable tool, but also quite complex compared to older single- and double-byte character encodings. How well prepared is your programming language for Unicode? Discuss and demonstrate its Unicode awareness and capabilities. Some suggested topics:

  • How easy is it to present Unicode strings in source code? Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
  • How well can the language communicate with the rest of the world? Is it good at input/output with Unicode?
  • Is it convenient to manipulate Unicode strings in the language?
  • How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization?

Note This task is a bit unusual in that it encourages general discussion rather than clever coding.


80386 Assembly

  • How well prepared is the programming language for Unicode? - Prepared, in terms of handling: Assembly language can do anything the computer can do. However, it has no Unicode facilities as part of the language.
  • How easy is it to present Unicode strings in source code? - Easy, they are in hexadecimal.
  • Can Unicode literals be written directly - Depends on the compiler. MASM does not allow this. All data in assembly language is created from a series of bytes. Literal characters are not part of the language. They are number crunched down into a byte sequence by the compiler. If the compiler can read Unicode, then you are onto a winner.
  • or be part of identifiers/keywords/etc? - Depends on the compiler. Intel notation does not use Unicode identifiers or mnemonics. Assembly language converts to numeric machine code, and everything is represented by mnemonics. You can use your own mnemonics, but you need to be able to compile them. One way to do this is to use a wrapper (which you would create) that converts your Unicode mnemonic notation into the notation the compiler expects.
  • How well can the language communicate with the rest of the world? - Difficult. This is a low level language, so all communication can be done, but you have to set up data structures, and produce many lines of code for just basic tasks.
  • Is it good at input/output with Unicode? - Yes and No. The Unicode bit is easy, but for input/output, we have to set up data structures and produce many lines of code, or link to code libraries.
  • Is it convenient to manipulate Unicode strings in the language? - No. String manipulation requires lots of code. We can link to code libraries, though it is not as straightforward as it would be in a higher-level language.
  • How broad/deep does the language support Unicode? We can do anything in assembly language, so support is 100%, but nothing is convenient with respect to Unicode. Strings are just a series of bytes; treating a series of bytes as a string is down to the compiler, if it provides string support as an extension. You need to be prepared to define data structures containing the values that you want.
  • What encodings (e.g. UTF-8, UTF-16, etc) can be used? All encodings are supported, but again, nothing is convenient with respect to encodings, although hexadecimal notation is good to use in assembly language. Normalization is not supported unless you write lots of code.

ALGOL 68

How well prepared is the programming language for Unicode? - ALGOL 68 is character set agnostic and the standard explicitly permits the use of a universal character set. The standard includes routines like "make conv" to permit the opening of files and devices using alternate characters sets and converting character sets on the fly.

How easy is it to present Unicode strings in source code? - Easy.

Can Unicode literals be written directly - No, a REPR operator must be used to encode the string in UTF8.

Can Unicode literals be part of identifiers/keywords/etc? - Yes... ALGOL 68 is character set agnostic and the standard explicitly permits the use of a universal character set. Implementations for English, German, Polish and Cyrillic have been created. However, ALGOL 68G supports only "Worthy" character sets.

How well can the language communicate with the rest of the world? - Acceptable.

Is it good at input/output with Unicode? - No, although the "make conv" routine is in the standard it is rarely fully implemented.

Is it convenient to manipulate Unicode strings in the language? - Yes

How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? The attached set of utility routines currently covers only UTF8. The Unicode routines like is_digit, is_letter, is_lower_case etc. are not yet implemented.

Works with: ALGOL 68 version Revision 1 - no extensions to language used.
Works with: ALGOL 68G version Any - tested with release 1.18.0-9h.tiny.

<lang algol68>#!/usr/local/bin/a68g --script #

# -*- coding: utf-8 -*- #
# UNICHAR/UNICODE must be printed using REPR to convert to UTF8 #

MODE UNICHAR = STRUCT(BITS #31# bits); # assuming bits width >=31 #
MODE UNICODE = FLEX[0]UNICHAR;

OP INITUNICHAR = (BITS bits)UNICHAR: (UNICHAR out; bits OF out := #ABS# bits; out);
OP INITUNICHAR = (CHAR char)UNICHAR: (UNICHAR out; bits OF out := BIN ABS char; out);
OP INITBITS = (UNICHAR unichar)BITS: #BIN# bits OF unichar;

PROC raise value error = ([]UNION(FORMAT,BITS,STRING)argv )VOID: (

 putf(stand error, argv); stop

);

MODE YIELDCHAR = PROC(CHAR)VOID;
MODE GENCHAR = PROC(YIELDCHAR)VOID;
MODE YIELDUNICHAR = PROC(UNICHAR)VOID;
MODE GENUNICHAR = PROC(YIELDUNICHAR)VOID;

PRIO DOCONV = 1;

# Convert a stream of UNICHAR into a stream of UTFCHAR #

OP DOCONV = (GENUNICHAR gen unichar, YIELDCHAR yield)VOID:(

 BITS non ascii = NOT 2r1111111;
 # FOR UNICHAR unichar IN # gen unichar( # ) DO ( #
 ##   (UNICHAR unichar)VOID: (
   BITS bits := INITBITS unichar;
   IF (bits AND non ascii) = 2r0 THEN # ascii #
     yield(REPR ABS bits)
   ELSE
     FLEX[6]CHAR buf := "?"*6; # initialise work around #
     INT bytes := 0;
     BITS byte lead bits = 2r10000000;
     FOR ofs FROM UPB buf BY -1 WHILE
       bytes +:= 1;
       buf[ofs]:= REPR ABS (byte lead bits OR bits AND 2r111111);
       bits := bits SHR 6;
   # WHILE # bits NE 2r0 DO
       SKIP 
     OD;
     BITS first byte lead bits = BIN (ABS(2r1 SHL bytes)-2) SHL (UPB buf - bytes + 1);
     buf := buf[UPB buf-bytes+1:];
     buf[1] := REPR ABS(BIN ABS buf[1] OR first byte lead bits);
     FOR i TO UPB buf DO yield(buf[i]) OD
   FI
 # OD # ))

);

# Convert a STRING into a stream of UNICHAR #

OP DOCONV = (STRING string, YIELDUNICHAR yield)VOID: (

 PROC gen char = (YIELDCHAR yield)VOID:
   FOR i FROM LWB string TO UPB string DO yield(string[i]) OD;
 gen char DOCONV yield

);

CO Prosser/Thompson UTF8 encoding scheme
Bits Last code point Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
  7  U+007F          0xxxxxxx
 11  U+07FF          110xxxxx 10xxxxxx
 16  U+FFFF          1110xxxx 10xxxxxx 10xxxxxx
 21  U+1FFFFF        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 26  U+3FFFFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 31  U+7FFFFFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
END CO

# Quickly calculate the length of the UTF8 encoded string #

PROC upb utf8 = (STRING utf8 string)INT:(

 INT bytes to go := 0;
 INT upb := 0;
 FOR i FROM LWB utf8 string TO UPB utf8 string DO
   CHAR byte := utf8 string[i];
   IF bytes to go = 0 THEN # start new utf char #
     bytes to go := 
       IF   ABS byte <= ABS 2r01111111 THEN 1 #  7 bits #
       ELIF ABS byte <= ABS 2r11011111 THEN 2 # 11 bits #
       ELIF ABS byte <= ABS 2r11101111 THEN 3 # 16 bits #
       ELIF ABS byte <= ABS 2r11110111 THEN 4 # 21 bits #
       ELIF ABS byte <= ABS 2r11111011 THEN 5 # 26 bits #
       ELIF ABS byte <= ABS 2r11111101 THEN 6 # 31 bits #
       ELSE raise value error(("Invalid UTF-8 bytes", BIN ABS byte)); ~ FI
   FI;
   bytes to go -:= 1; # skip over trailing bytes #
   IF bytes to go = 0 THEN upb +:= 1 FI
 OD;
 upb

);

# Convert a stream of CHAR into a stream of UNICHAR #

OP DOCONV = (GENCHAR gen char, YIELDUNICHAR yield)VOID: (

 INT bytes to go := 0;
 INT lshift;
 BITS mask, out;
 # FOR CHAR byte IN # gen char( # ) DO ( #
 ##   (CHAR byte)VOID: (
     INT bits := ABS byte;
     IF bytes to go = 0 THEN # start new unichar #
       bytes to go := 
         IF   bits <= ABS 2r01111111 THEN 1 #  7 bits #
         ELIF bits <= ABS 2r11011111 THEN 2 # 11 bits #
         ELIF bits <= ABS 2r11101111 THEN 3 # 16 bits #
         ELIF bits <= ABS 2r11110111 THEN 4 # 21 bits #
         ELIF bits <= ABS 2r11111011 THEN 5 # 26 bits #
         ELIF bits <= ABS 2r11111101 THEN 6 # 31 bits #
         ELSE raise value error(("Invalid UTF-8 bytes", BIN bits)); ~ FI;
       IF bytes to go = 1 THEN 
         lshift := 7; mask := 2r1111111
       ELSE 
         lshift := 7 - bytes to go; mask :=  BIN(ABS(2r1 SHL lshift)-1) 
       FI;
       out := mask AND BIN bits;
       lshift := 6; mask := 2r111111 # subsequently pic 6 bits at a time #
     ELSE
       out := (out SHL lshift) OR ( mask AND BIN bits)
     FI;
     bytes to go -:= 1;
     IF bytes to go = 0 THEN yield(INITUNICHAR out) FI
 # OD # ))

);

# Convert a string of UNICHAR into a stream of UTFCHAR #

OP DOCONV = (UNICODE unicode, YIELDCHAR yield)VOID:(

 PROC gen unichar = (YIELDUNICHAR yield)VOID:
   FOR i FROM LWB unicode TO UPB unicode DO yield(unicode[i]) OD;
 gen unichar DOCONV yield

);

# Some convenience/shorthand U operators #
# Convert a BITS into a UNICODE char #

OP U = (BITS bits)UNICHAR:

 INITUNICHAR bits;
# Convert a []BITS into a UNICODE char #

OP U = ([]BITS array bits)[]UNICHAR:(

 [LWB array bits:UPB array bits]UNICHAR out;
 FOR i FROM LWB array bits TO UPB array bits DO bits OF out[i]:=array bits[i] OD;
 out

);

# Convert a CHAR into a UNICODE char #

OP U = (CHAR char)UNICHAR:

 INITUNICHAR char;
# Convert a STRING into a UNICODE string #

OP U = (STRING utf8 string)UNICODE: (

 FLEX[upb utf8(utf8 string)]UNICHAR out;
 INT i := 0; 
 # FOR UNICHAR char IN # utf8 string DOCONV (
 ##   (UNICHAR char)VOID:
        out[i+:=1] := char
 # OD #);
 out

);

# Convert a UNICODE string into a UTF8 STRING #

OP REPR = (UNICODE string)STRING: (

 STRING out;
 # FOR CHAR char IN # string DOCONV (
 ##   (CHAR char)VOID: (
        out +:= char
 # OD #));
 out

);

# define the most useful OPerators on UNICODE CHARacter arrays #
# Note: LWB, UPB and slicing works as per normal #

OP + = (UNICODE a,b)UNICODE: (

 [UPB a + UPB b]UNICHAR out; 
 out[:UPB a]:= a; out[UPB a+1:]:= b;
 out

);

OP + = (UNICODE a, UNICHAR b)UNICODE: a+UNICODE(b);
OP + = (UNICHAR a, UNICODE b)UNICODE: UNICODE(a)+b;
OP + = (UNICHAR a,b)UNICODE: UNICODE(a)+b;

# Suffix a character to the end of a UNICODE string #

OP +:= = (REF UNICODE a, UNICODE b)VOID: a := a + b;
OP +:= = (REF UNICODE a, UNICHAR b)VOID: a := a + b;

# Prefix a character to the beginning of a UNICODE string #

OP +=: = (UNICODE b, REF UNICODE a)VOID: a := b + a;
OP +=: = (UNICHAR b, REF UNICODE a)VOID: a := b + a;

OP * = (UNICODE a, INT n)UNICODE: (

 UNICODE out := a;
 FOR i FROM 2 TO n DO out +:= a OD;
 out

);
OP * = (INT n, UNICODE a)UNICODE: a * n;

OP * = (UNICHAR a, INT n)UNICODE: UNICODE(a)*n;
OP * = (INT n, UNICHAR a)UNICODE: n*UNICODE(a);

OP *:= = (REF UNICODE a, INT b)VOID: a := a * b;

# Wirthy Operators #

OP LT = (UNICHAR a,b)BOOL: ABS bits OF a LT ABS bits OF b,

  LE = (UNICHAR a,b)BOOL: ABS bits OF a LE ABS bits OF b,
  EQ = (UNICHAR a,b)BOOL: ABS bits OF a EQ ABS bits OF b,
  NE = (UNICHAR a,b)BOOL: ABS bits OF a NE ABS bits OF b,
  GE = (UNICHAR a,b)BOOL: ABS bits OF a GE ABS bits OF b,
  GT = (UNICHAR a,b)BOOL: ABS bits OF a GT ABS bits OF b;
# ASCII OPerators #

OP < = (UNICHAR a,b)BOOL: a LT b,

  <= = (UNICHAR a,b)BOOL: a LE b,
   = = (UNICHAR a,b)BOOL: a EQ b,
  /= = (UNICHAR a,b)BOOL: a NE b,
  >= = (UNICHAR a,b)BOOL: a GE b,
  >  = (UNICHAR a,b)BOOL: a GT b;
# Non ASCII OPerators #

OP ≤ = (UNICHAR a,b)BOOL: a LE b,

  ≠ = (UNICHAR a,b)BOOL: a NE b,
  ≥ = (UNICHAR a,b)BOOL: a GE b;
# Compare two UNICODE strings for equality #

PROC unicode cmp = (UNICODE str a,str b)INT: (

 IF LWB str a > LWB str b THEN exit lt ELIF LWB str a < LWB str b THEN exit gt FI;
 INT min upb = UPB(UPB str a < UPB str b | str a | str b );
 FOR i FROM LWB str a TO min upb DO
   UNICHAR a := str a[i], UNICHAR b := str b[i];
   IF a < b THEN exit lt ELIF a > b THEN exit gt FI
 OD;
 IF UPB str a > UPB str b THEN exit gt ELIF UPB str a < UPB str b THEN exit lt FI;
 exit eq:  0 EXIT
 exit lt: -1 EXIT
 exit gt:  1

);

OP LT = (UNICODE a,b)BOOL: unicode cmp(a,b)< 0,

  LE = (UNICODE a,b)BOOL: unicode cmp(a,b)<=0,
  EQ = (UNICODE a,b)BOOL: unicode cmp(a,b) =0,
  NE = (UNICODE a,b)BOOL: unicode cmp(a,b)/=0,
  GE = (UNICODE a,b)BOOL: unicode cmp(a,b)>=0,
  GT = (UNICODE a,b)BOOL: unicode cmp(a,b)> 0;
# ASCII OPerators #

OP < = (UNICODE a,b)BOOL: a LT b,

  <= = (UNICODE a,b)BOOL: a LE b,
   = = (UNICODE a,b)BOOL: a EQ b,
  /= = (UNICODE a,b)BOOL: a NE b,
  >= = (UNICODE a,b)BOOL: a GE b,
  >  = (UNICODE a,b)BOOL: a GT b;
# Non ASCII OPerators #

OP ≤ = (UNICODE a,b)BOOL: a LE b,

  ≠ = (UNICODE a,b)BOOL: a NE b,
  ≥ = (UNICODE a,b)BOOL: a GE b;

COMMENT - Todo: for all UNICODE and UNICHAR

 Add NonASCII OPerators: ×, ×:=,
 Add ASCII Operators: &, &:=, &=:
 Add Wirthy OPerators: PLUSTO, PLUSAB, TIMESAB for UNICODE/UNICHAR,
 Add UNICODE against UNICHAR comparison OPerators,
 Add char_in_string and string_in_string PROCedures,
 Add standard Unicode functions:
   to_upper_case, to_lower_case, unicode_block, char_count,
   get_directionality, get_numeric_value, get_type, is_defined,
   is_digit, is_identifier_ignorable, is_iso_control,
   is_letter, is_letter_or_digit, is_lower_case, is_mirrored,
   is_space_char, is_supplementary_code_point, is_title_case,
   is_unicode_identifier_part, is_unicode_identifier_start,
   is_upper_case, is_valid_code_point, is_whitespace

END COMMENT

test:(

 UNICHAR aircraft := U 16r2708;
 printf(($"aircraft: "$, $"16r"16rdddd$, UNICODE(aircraft), $g$, " => ", REPR UNICODE(aircraft), $l$));
 UNICODE chinese forty two = U 16r56db + U 16r5341 + U 16r4e8c;
 printf(($"chinese forty two: "$, $g$, REPR chinese forty two, ", length string = ", UPB chinese forty two, $l$));
 UNICODE poker = U "A123456789♥♦♣♠JQK";
 printf(($"poker: "$, $g$, REPR poker, ", length string = ", UPB poker, $l$));
 UNICODE selectric := U"×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢";
 printf(($"selectric: "$, $g$, REPR selectric, $l$));
 printf(($"selectric*4: "$, $g$, REPR(selectric*4), $l$));
 print((
   "1 < 2 is ",  U"1" < U"2", ", ",
   "111 < 11 is ",U"111" < U"11", ", ",
   "111 < 12 is ",U"111" < U"12", ", ",
   "♥ < ♦ is ",  U"♥" < U"♦", ", ",
   "♥Q < ♥K is ",U"♥Q" < U"♥K", " & ",
   "♥J < ♥K is ",U"♥J" < U"♥K", new line
 ))

)</lang>

Output:

aircraft: 16r2708 => ✈
chinese forty two: 四十二, length string =          +3
poker: A123456789♥♦♣♠JQK, length string =         +17
selectric: ×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢
selectric*4: ×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢
1 < 2 is T, 111 < 11 is F, 111 < 12 is T, ♥ < ♦ is T, ♥Q < ♥K is F & ♥J < ♥K is T

AutoHotkey

How easy is it to present Unicode strings in source code? - Simple, as long as the script is saved as Unicode and you're using a Unicode build

Can Unicode literals be written directly, or be part of identifiers/keywords/etc? - Yes, see above

How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? - It can create GUIs and send Unicode characters.

Is it convenient to manipulate Unicode strings in the language? - Yes: they act like any other string, apart from low-level functions such as NumPut which deal with bytes.

How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? UTF-8 is most often used. StrPut/StrGet and FileRead/FileAppend allow Unicode in AutoHotkey_L (the current build).

AWK

How well prepared is the programming language for Unicode? - Not really prepared. AWK is a tool for manipulating ASCII input.

How easy is it to present Unicode strings in source code? - Easy. They can be represented in hexadecimal.

Can Unicode literals be written directly - No

or be part of identifiers/keywords/etc? - No

How well can the language communicate with the rest of the world? - The language is not good at communications, but can utilize external tools.

Is it good at input/output with Unicode? - No

Is it convenient to manipulate Unicode strings in the language? - No

How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.

C

C is not the most Unicode-friendly language, to put it mildly. Generally, using Unicode in C requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source might be a bad idea, too; it's safer to use their hex values. Here's a short example of doing the simplest string handling: print it.<lang C>#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

/* wchar_t is the standard type for wide chars; what it is internally
 * depends on the compiler.
 */
wchar_t poker[] = L"♥♦♣♠";
wchar_t four_two[] = L"\x56db\x5341\x4e8c";

int main() {
	/* Set the locale to alert C's multibyte output routines */
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Locale failure, check your env vars\n");
		return 1;
	}

#ifdef __STDC_ISO_10646__
	/* C99 compilers should understand these */
	printf("%lc\n", 0x2708);	/* ✈ */
	printf("%ls\n", poker);		/* ♥♦♣♠ */
	printf("%ls\n", four_two);	/* 四十二 */
#else
	/* oh well */
	printf("airplane\n");
	printf("club diamond club spade\n");
	printf("for ty two\n");
#endif
	return 0;
}</lang>

Go

Go source code is specified to be UTF-8 encoded. Identifiers like variables and fields names can contain non-ASCII characters. String literals containing non-ASCII characters are naturally represented as UTF-8. In addition, certain built-in features interpret strings as UTF-8. For example, <lang go> for i, rune := range "voilà" {

       fmt.Println(i, rune)
   }</lang>

outputs

0 118
1 111
2 105
3 108
4 224

224 being the Unicode code point for the à character.

In contrast, <lang go> w := "voilà"

   for i := 0; i < len(w); i++ {
       fmt.Println(i, w[i])
   }

</lang> outputs

0 118
1 111
2 105
3 108
4 195
5 160

bytes 4 and 5 showing the UTF-8 encoding of à. The expression w[i] in this case has type byte rather than int.

The built-in string data type is not limited to UTF-8 and is simply a byte string that can hold arbitrary data and be interpreted as needed. In general, I/O is done on byte strings without any annotation of the encoding used. Proper interpretation of the byte strings must be understood, specified, or negotiated separately. In particular, there is no built-in or automatic handling of byte order marks.

The heavily used standard packages bytes and strings both have functions for working with strings both as UTF-8 and as encoding-unspecified bytes. The standard packages utf8, utf16, and unicode have additional functions.

Currently there is no standard support for normalization.

J

Unicode characters can be represented directly in J strings:

<lang j>   '♥♦♣♠'
♥♦♣♠</lang>

By default, they are represented as utf-8:

<lang j>   #'♥♦♣♠'
12</lang>

However, they can be represented as utf-16 instead:

<lang j>   7 u:'♥♦♣♠'
♥♦♣♠
   #7 u:'♥♦♣♠'
4</lang>

These forms are not treated as equivalent:

<lang j>   '♥♦♣♠' -: 7 u:'♥♦♣♠'
0</lang>

unless the character literals themselves are equivalent:

<lang j>   'abcd'-:7 u:'abcd'
1</lang>

Output uses characters in whatever format they happen to be in. Character input assumes 8 bit characters but places no additional interpretation on them.

See also: http://www.jsoftware.com/help/dictionary/duco.htm

Unicode characters are not legal tokens or names within current versions of J.

Locomotive Basic

The Amstrad CPC464 does not have native Unicode support. It is possible to represent Unicode by using ASCII based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. However, there is only 48k of memory available and 510k would be needed to store the Unicode characters for display, so Unicode is not really viable on this platform.

  • How well prepared is the programming language for Unicode? - Not good. There are no Unicode symbols in the ROM.
  • How easy is it to present Unicode strings in source code? - Easy, they are in hexadecimal.
  • Can Unicode literals be written directly - No
  • or be part of identifiers/keywords/etc? - No
  • How well can the language communicate with the rest of the world? - Not good. There is no TCP/IP stack, and the computer does not have an Ethernet port.
  • Is it good at input/output with Unicode? - Not good. There are no Unicode symbols in ROM, or on the keyboard.
  • Is it convenient to manipulate Unicode strings in the language? - Moderate. The language is not designed for Unicode, so has no inbuilt Unicode functions. However, it is possible to write manipulation routines, and the language is good at arithmetic, so no problem.
  • How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.

Perl

In Perl, "Unicode" means "UTF-8". If you want to include utf8 characters in your source file, unless you have set the PERL_UNICODE environment variable correctly, you should do<lang Perl>use utf8;</lang> or you risk the parser treating the file as raw bytes.

Inside the script, utf8 characters can be used both as identifiers and literal strings, and built-in string functions will respect them:<lang Perl>$四十二 = "voilà";
print "$四十二";   # voilà
print uc($四十二); # VOILÀ</lang> or you can specify unicode characters by name or ordinal:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>

Regular expressions also have support for unicode based on properties; for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>

When it comes to IO, one should specify whether a file is to be opened in utf8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";
open OUT, ">:raw", "file_byte";</lang> The default IO behavior can also be set via PERL_UNICODE.

However, when your program interacts with the environment, you may still run into tricky spots if you have incompatible locale settings or your OS is not using unicode; that's not what Perl has control over, unfortunately.

Perl 6

Perl 6 programs and strings are all in Unicode, specced (but not yet entirely implemented) to operate at a grapheme abstraction level, which is agnostic to underlying encodings or normalizations. (These are generally handled at program boundaries.) Opened files default to UTF-8 encoding. All Unicode character properties are in play, so any appropriate characters may be used as parts of identifiers, whitespace, or user-defined operators. For instance:

<lang perl6>sub prefix:<∛> ($𝕏) { $𝕏 ** (1/3) }
say ∛27; # prints 3</lang>

Non-Unicode strings are represented as Buf types rather than Str types, and Unicode operations may not be applied to Buf types without some kind of explicit conversion. Only ASCIIish operations are allowed on buffers.

As things develop, Perl 6 intends to support Unicode even better than Perl 5, which already does a great job in recent versions of accessing nearly all Unicode 6.0 functionality. Perl 6 improves on Perl 5 primarily by offering explicitly typed strings that always know which operations are sensical and which are not.

PicoLisp

PicoLisp can directly handle _only_ Unicode (UTF-8) strings. So the problem is rather how to handle non-Unicode strings: They must be pre- or post-processed by external tools, typically with pipes during I/O. For example, to read a line from a file in 8859 encoding: <lang PicoLisp>(in '(iconv "-f" "ISO-8859-15" "file.txt") (line))</lang>

Seed7

The Unicode encoding of Seed7 characters and strings is UTF-32. Seed7 source files use UTF-8 encoding. Character literals and string literals are therefore written with UTF-8 encoding. Unicode characters are allowed in comments, but not in identifiers and keywords. Functions which send strings to the operating system convert them to the encoding used by the OS. Strings received from the operating system are converted to UTF-32. Seed7 supports reading and writing Latin-1, UTF-8 and UTF-16 files. Because of UTF-32 there is no distinction between byte and character position.

Tcl

All characters in Tcl are always Unicode characters, with ordinary string operations (as listed elsewhere on Rosetta Code) always performed on Unicode. Input and output characters are translated from and to the system's native encoding automatically (with this being able to be overridden on a per file-handle basis via fconfigure -encoding). Source files can be written in encodings other than the native encoding — from Tcl 8.5 onwards, the encoding to use for a file can be controlled by the -encoding option to tclsh, wish and source — though it is usually recommended that programmers maximize their portability by writing in the ASCII subset and using the \uXXXX escape sequence for all other characters. Tcl does not handle byte-order marks by default, because that requires deeper understanding of the application level (and sometimes the encoding information is available in metadata anyway, such as when handling HTTP connections).

The way in which characters are encoded in memory is not defined by the Tcl language (the implementation uses byte arrays, UTF-16 arrays and UCS-2 strings as appropriate) and the only characters with any restriction on use as command or variable names are the ASCII parenthesis and colon characters. However, the $var shorthand syntax is much more restricted (to ASCII alphanumeric plus underline only); other cases have to use the more verbose form: [set funny–var–name].

UNIX Shell

The Bourne shell does not have any inbuilt Unicode functionality. However, Unicode can be represented as ASCII-based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. The shell will produce its output in ASCII, but can call other programs to produce the Unicode output. The shell does not have any inbuilt string manipulation utilities, so it uses external tools such as cut, expr, grep, sed and awk. These would typically manipulate the hexadecimal sequences to provide string manipulation, or dedicated Unicode-based tools could be used.

  • How well prepared is the programming language for Unicode? - Fine. All Unicode strings can be represented as hexadecimal sequences.
  • How easy is it to present Unicode strings in source code? - Easy, they are in hexadecimal.
  • Can Unicode literals be written directly - No
  • or be part of identifiers/keywords/etc? - No
  • How well can the language communicate with the rest of the world? - Extremely well. The shell makes use of all of the tools that a Unix box has to offer.
  • Is it good at input/output with Unicode? - This language is weak on input/output anyway, so its Unicode input/output is also weak. However, the shell makes use of all installed tools, so this is not a problem in real terms.
  • Is it convenient to manipulate Unicode strings in the language? - Not really; the shell is not good at string manipulation. However, it makes good use of external programs, so Unicode string manipulation should not be a problem.
  • How broad/deep does the language support Unicode? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.

ZX Spectrum Basic

The ZX Spectrum does not have native Unicode support. However, it does support user defined graphics, which makes it possible to create custom characters in the UDG area. It is possible to represent Unicode by using ASCII-based hexadecimal number sequences, or by using a form of escape sequence encoding, such as \uXXXX. However, there is only 48k of memory available on a traditional rubber key ZX Spectrum (or 128k on some of the plus versions), and 510k would be needed to store the Unicode characters for display, so Unicode is not really viable on this platform.

  • How well prepared is the programming language for Unicode? - Not good. There are no Unicode symbols in the ROM.
  • How easy is it to present Unicode strings in source code? - Easy, they are in hexadecimal.
  • Can Unicode literals be written directly - No
  • or be part of identifiers/keywords/etc? - No
  • How well can the language communicate with the rest of the world? - Not good. There is no TCP/IP stack, and the computer does not have an Ethernet port.
  • Is it good at input/output with Unicode? - Not good. There are no Unicode symbols in ROM, or on the keyboard.
  • Is it convenient to manipulate Unicode strings in the language? - Moderate. The language is not designed for Unicode, so has no inbuilt Unicode functions. However, it is possible to write manipulation routines, and the language is good at arithmetic, so no problem.
  • How broad/deep does the language support Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? There is no inbuilt support for Unicode, but all encodings can be represented through hexadecimal strings.