Monday, December 31, 2007

Some Perl Unicode Idioms

Q: I want to represent a Unicode code-point as a Perl character
A:
"\x{05d0}"
will represent the Hebrew letter א as a Perl character
pack("U0U*",0x05d0)
will do the same
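
A quick sanity check, as a minimal sketch: the built-in chr builds the same character as the two forms above, and ord gives the code point back. The variable names here are only for illustration.

```perl
use strict;
use warnings;

# Three ways to build the same one-character string (Hebrew letter Alef).
my $from_escape = "\x{05d0}";
my $from_pack   = pack("U0U*", 0x05d0);
my $from_chr    = chr(0x05d0);

# All three compare equal as character strings.
my $all_equal = ($from_escape eq $from_pack && $from_pack eq $from_chr) ? 1 : 0;

# length counts characters (not bytes), and ord recovers the code point.
my $char_count = length $from_escape;
my $code_point = sprintf "U+%04X", ord $from_escape;

print "equal=$all_equal length=$char_count code_point=$code_point\n";
```

If all you need is a single character from a number, chr is probably the simplest of the three.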

Q: I want to get the list of bytes in UTF-8 encoding for a Unicode code-point
A:
use Encode;
my @bytes = unpack "C*", Encode::encode("utf8", pack("U0U", 0x05d0));


Q: I want to see the bytes of a code-point's UTF-8 encoding in hex notation
A: For the Hebrew letter Alef (code point 0x05d0)
use Encode;
my $octets = Encode::encode("utf8", "\x{05d0}");
my @bytes = unpack "C*", $octets;
my @hex = map { sprintf "%#x", $_ } @bytes;   # ("0xd7", "0x90")
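
Here is the same idiom as a small standalone script, plus the reverse direction (bytes back to a character), as a sketch. It uses the encoding name "UTF-8", which Encode treats as the strict form, where "utf8" is Perl's laxer variant; for Alef both produce the same two bytes. The variable names are my own.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char   = "\x{05d0}";                 # Hebrew letter Alef
my $octets = encode("UTF-8", $char);     # character string -> octet string

# One hex string per byte, joined for display.
my @hex      = map { sprintf "%#x", $_ } unpack "C*", $octets;
my $hex_list = join " ", @hex;

# The other way round: pack the byte values into octets, then decode.
my $round_trip = decode("UTF-8", pack("C*", 0xd7, 0x90));
my $same       = ($round_trip eq $char) ? 1 : 0;

print "$hex_list\n";    # 0xd7 0x90
```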




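Here's a function that bundles these steps. Given a code point and an encoding name, it returns a hash blessed into the class Char, holding the character, its encoded octets, the byte values, their hex forms, and the byte count: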

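use Encode;

sub new_char {
    my ($code_point, $encoding) = @_;

    # Character string holding the single code point
    my $perl_char  = pack "U0U", $code_point;
    # The same character as octets in the requested encoding
    my $octets_str = Encode::encode($encoding, $perl_char);
    my @bytes      = unpack "C*", $octets_str;
    # Array ref of hex strings such as "0xd7", one per byte
    my $hex_str    = [map { sprintf "%#x", $_ } @bytes];
    my $length     = @bytes;    # number of bytes, not characters

    return bless {
        code_point => $code_point,
        encoding   => $encoding,
        perl_char  => $perl_char,
        octets_str => $octets_str,
        bytes      => [@bytes],
        hex_str    => $hex_str,
        length     => $length,
    }, 'Char';
}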
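
A possible way to use it, as a sketch: the snippet below repeats a compact copy of new_char so that it runs on its own, and adds a describe method to Char. That method is hypothetical (it is not part of the function above) and only shows how the stored fields can be put to work.

```perl
use strict;
use warnings;
use Encode ();

# Compact copy of new_char, so this snippet is self-contained.
sub new_char {
    my ($code_point, $encoding) = @_;
    my $perl_char  = pack "U0U", $code_point;
    my $octets_str = Encode::encode($encoding, $perl_char);
    my @bytes      = unpack "C*", $octets_str;
    return bless {
        code_point => $code_point,
        encoding   => $encoding,
        perl_char  => $perl_char,
        octets_str => $octets_str,
        bytes      => [@bytes],
        hex_str    => [map { sprintf "%#x", $_ } @bytes],
        length     => scalar @bytes,
    }, 'Char';
}

# Hypothetical method: a one-line summary built from the stored fields.
sub Char::describe {
    my ($self) = @_;
    return sprintf "U+%04X in %s: %d byte(s): %s",
        $self->{code_point}, $self->{encoding}, $self->{length},
        join(" ", @{ $self->{hex_str} });
}

# The same code point in two encodings.
my $alef_utf8  = new_char(0x05d0, "utf8");
my $alef_utf16 = new_char(0x05d0, "UTF-16BE");

my $utf8_summary  = $alef_utf8->describe;
my $utf16_summary = $alef_utf16->describe;
print "$utf8_summary\n$utf16_summary\n";
```

Note that "%#x" does not pad to two digits, so the byte 0x05 in the UTF-16BE case comes out as 0x5.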