Unicode strings: Difference between revisions

==Task==
Demonstrate how one is expected to handle Unicode strings. Some example considerations: Can a Unicode string be written directly in the source code? How does one do I/O with Unicode strings? Can these strings be manipulated easily? Can non-ASCII characters be used for keywords, identifiers, etc.? Which encodings (UTF-8, UTF-16, etc.) can your language accept without much trouble?
=={{header|Perl}}==
Perl strings are sequences of Unicode code points, and UTF-8 is the encoding Perl handles most directly. If your source file contains UTF-8 characters, you should do<lang Perl>use utf8;</lang> or you risk the parser treating the file as raw bytes. Note that the <code>PERL_UNICODE</code> environment variable affects I/O defaults, not how the source file is parsed.
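
The effect of the pragma can be seen by measuring the same literal with and without it. This is only a small sketch: it assumes the script file itself is saved as UTF-8, and the variable names are made up for illustration.

<lang Perl>use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;                    # literal is decoded into characters
    $chars = length "voilà";     # 5 characters
}
{
    no utf8;                     # literal is kept as the raw UTF-8 bytes
    $bytes = length "voilà";     # 6 bytes, because "à" takes two bytes in UTF-8
}
print "characters: $chars, bytes: $bytes\n";</lang>

Because <code>use utf8</code> is lexically scoped, both behaviors can coexist in one file, as above.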

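Inside the script, Unicode characters can then be used both in identifiers and in string literals, and built-in string functions will respect them:<lang Perl>use utf8;                  # the source file itself is UTF-8
binmode STDOUT, ":utf8";   # encode output as UTF-8
$四十二 = "voilà";
print "$四十二"; # voilà
print uc($四十二); # VOILÀ</lang>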
You can also specify Unicode characters by name or by code point:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>
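
Going the other way, from a character to its code point or name, can be sketched with the built-in <code>ord</code> and <code>charnames::viacode</code>. The variable names here are illustrative only, and the full character names are used instead of the script-restricted short form shown above.

<lang Perl>use strict;
use warnings;
use charnames ':full';                            # enables \N{FULL NAME} and charnames::viacode

my $sigma = "\N{GREEK SMALL LETTER SIGMA}";
my $code  = sprintf "U+%04X", ord $sigma;         # code point of the character: U+03C3
my $name  = charnames::viacode(0x2708);           # official name of U+2708: AIRPLANE
my $plane = chr 0x2708;                           # same character as "\x{2708}"

binmode STDOUT, ":encoding(UTF-8)";
print "$sigma is $code; $plane is $name\n";</lang>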

Regular expressions also support matching on Unicode properties, for example, finding characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>
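
To show why the vowel points drop out of that match, the sketch below counts several properties on the same text. The string is written with <code>\x{...}</code> escapes so the example does not depend on the source encoding; the variable names are assumptions made for illustration.

<lang Perl>use strict;
use warnings;

# "Say " followed by the pointed Hebrew word from the example above
my $text = "Say \x{5E2}\x{5B4}\x{5D1}\x{5B0}\x{5E8}\x{5B4}\x{5D9}\x{5EA}";

my @rtl   = $text =~ /\p{Bidi_Class=R}/g;   # right-to-left letters
my @marks = $text =~ /\p{Mn}/g;             # non-spacing marks (the vowel points)
my @upper = $text =~ /\p{Lu}/g;             # uppercase letters

printf "%d right-to-left letters, %d marks, %d uppercase\n",
    scalar @rtl, scalar @marks, scalar @upper;   # 5, 3, 1</lang>

The points have bidirectional class NSM rather than R, so only the five base letters match the right-to-left property.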

When it comes to I/O, one should specify whether a file is to be opened in UTF-8 or raw byte mode:<lang Perl>open my $in, "<:utf8", "file_utf" or die "Cannot open file_utf: $!";
open my $out, ">:raw", "file_byte" or die "Cannot open file_byte: $!";</lang>
The default I/O behavior can also be set with the <code>PERL_UNICODE</code> environment variable.
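
The difference between the two modes can be observed by writing a string through an encoding layer and reading the file back both ways. This is a sketch only: it uses a temporary file from the core File::Temp module and the stricter <code>:encoding(UTF-8)</code> layer in place of <code>:utf8</code>, and all variable names are illustrative.

<lang Perl>use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);   # scratch file, removed at exit
close $tmp_fh;

# Write five characters; the layer encodes them as UTF-8 on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} "voil\x{E0}";
close $out or die "Cannot close $path: $!";

# Read the same file once as raw bytes and once as decoded characters.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "raw: %d bytes, decoded: %d characters\n",
    length $bytes, length $chars;              # 6 bytes, 5 characters</lang>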

However, when your program interacts with its environment, you may still run into tricky spots if your locale settings are incompatible or your OS is not using Unicode; unfortunately, Perl has no control over that.
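
One common way to cope with such mismatches is to treat everything arriving from outside the program as bytes and to decode it explicitly. The following is a minimal sketch using the core Encode module; it assumes the incoming bytes are UTF-8, which in practice depends on the locale, and the sample data and variable names are made up.

<lang Perl>use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "voil\xC3\xA0";             # stand-in for bytes received from the environment
my $text  = decode('UTF-8', $bytes);    # now a 5-character string
my $again = encode('UTF-8', $text);     # back to the original 6 bytes for output

printf "%d bytes in, %d characters, %d bytes out\n",
    length $bytes, length $text, length $again;</lang>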

Revision as of 10:54, 30 June 2011

Unicode strings is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
