Unicode strings: Difference between revisions
Revision as of 10:54, 30 June 2011
Task
Demonstrate how one is expected to handle Unicode strings. Some example considerations: can a Unicode string be directly written in the source code? How does one do IO with unicode strings? Can these strings be manipulated easily? Can non-ASCII characters be used for keywords/identifiers/etc? What encodings (UTF-8, UTF-16, etc) can your language accept without much trouble?
Perl
Perl's Unicode support is built around UTF-8. If you want to include UTF-8 characters in your source file, you should declare<lang Perl>use utf8;</lang> or you risk the parser treating the file as raw bytes. (The <code>PERL_UNICODE</code> environment variable affects IO defaults, not how the source file is parsed.)
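As a minimal sketch of what the pragma changes, the same literal can be measured with and without it. This assumes the script itself is saved as UTF-8; the variable names are only illustrative.<lang Perl>use strict;
use warnings;

# "use utf8" is lexically scoped, so it only affects this block:
# the literal is decoded into characters.
my $chars = do { use utf8; "voilà" };

# Without the pragma the same literal is a string of raw bytes.
my $bytes = do { no utf8; "voilà" };

printf "%d characters, %d bytes\n", length($chars), length($bytes);
# 5 characters, 6 bytes</lang>
The byte count is larger because "à" occupies two bytes in UTF-8.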
Inside the script, non-ASCII characters can be used both in identifiers and in string literals, and built-in string functions respect them:<lang Perl>$四十二 = "voilà";
print "$四十二";     # voilà
print uc($四十二);   # VOILÀ</lang>
You can also specify Unicode characters by name or by code point:<lang Perl>use charnames qw(greek);
$x = "\N{sigma} \U\N{sigma}";
$y = "\x{2708}";
print scalar reverse("$x $y"); # ✈ Σ σ</lang>
Regular expressions also support Unicode properties, for example to find characters that are normally written from right to left:<lang Perl>print "Say עִבְרִית" =~ /(\p{BidiClass:R})/g; # עברית</lang>
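Other properties work the same way. The following sketch, assuming a UTF-8 source file and a UTF-8 capable terminal, picks out runs of text by script and counts uppercase letters; the sample string is made up for illustration.<lang Perl>use strict;
use warnings;
use utf8;                              # source is assumed to be saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output

my $text = "Say עברית and αλφα";

# Capture runs of characters belonging to one script.
my @hebrew = $text =~ /(\p{Script=Hebrew}+)/g;
my @greek  = $text =~ /(\p{Script=Greek}+)/g;

# Count uppercase letters (general category Lu).
my $upper = () = $text =~ /\p{Lu}/g;

print "Hebrew: @hebrew\n";
print "Greek: @greek\n";
print "Uppercase letters: $upper\n";   # Uppercase letters: 1</lang>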
When it comes to IO, you should specify whether a file is to be opened in UTF-8 or raw byte mode:<lang Perl>open IN, "<:utf8", "file_utf";    # decode UTF-8 into characters on read
open OUT, ">:raw", "file_byte";   # write bytes unchanged</lang>
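The effect of the two modes can be seen with a small round trip. This is a sketch only: the file name is a hypothetical scratch file, and it uses the <code>:encoding(UTF-8)</code> layer, which validates the data, where the example above uses the laxer <code>:utf8</code>.<lang Perl>use strict;
use warnings;
use utf8;                          # source is assumed to be saved as UTF-8

my $file = "unicode_demo.txt";     # hypothetical scratch file name
my $text = "voilà ✈";

# Write characters; the layer encodes them as UTF-8 bytes.
open my $out, ">:encoding(UTF-8)", $file or die "Cannot write $file: $!";
print {$out} $text;
close $out or die "Cannot close $file: $!";

# Read back through the same layer: we get characters again.
open my $in, "<:encoding(UTF-8)", $file or die "Cannot read $file: $!";
my $chars = do { local $/; <$in> };
close $in;

# Read in raw mode: we get the undecoded bytes.
open my $raw, "<:raw", $file or die "Cannot read $file: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

printf "%d characters, %d bytes\n", length($chars), length($bytes);
# 7 characters, 10 bytes

unlink $file;                      # clean up the scratch file</lang>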
The default IO behavior can also be set through the <code>PERL_UNICODE</code> environment variable.
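The same defaults can also be requested from inside the script with the <code>open</code> pragma. The sketch below is roughly what <code>PERL_UNICODE=SD</code> (or <code>perl -CSD</code>) asks for from outside; it assumes a UTF-8 source file and a UTF-8 capable terminal.<lang Perl>use strict;
use warnings;
use utf8;                              # source is assumed to be saved as UTF-8

# Make UTF-8 the default layer for open() in this scope, and apply it
# to STDIN, STDOUT and STDERR as well (":std").
use open qw(:std :encoding(UTF-8));

print "voilà ✈\n";                     # no "Wide character" warning</lang>
Unlike the environment variable, the pragma travels with the script, so the behavior does not depend on how the script is launched.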
However, when your program interacts with the environment, you may still run into trouble if your locale settings are incompatible or your OS does not use Unicode; unfortunately, that is outside Perl's control.