Unicode

Unicode support in Raku

Raku has a high level of support for Unicode, with the latest version supporting Unicode 15.0. This document aims to be both an overview and a description of Unicode features that don't belong in the documentation for individual routines and methods.

Although it is specific to MoarVM, the VM used by Rakudo, the overview of MoarVM's internal representation of strings provides some interesting details.

For security reasons, browsers (e.g., Firefox) may restrict the display of Unicode glyphs that are not in a trusted font, even if the OS has a font installed that displays the glyph. To overcome this problem in Firefox, set the `privacy.fingerprintingProtection` config option to `false`.

Filehandles and I/O

Normalization

Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8. Graphemes, which are the user-visible forms of characters, use a normalized representation. For example, the grapheme á can be represented in two ways, either using one codepoint:

á (U+E1 "LATIN SMALL LETTER A WITH ACUTE")

Or two codepoints:

a +  ́ (U+61 "LATIN SMALL LETTER A" + U+301 "COMBINING ACUTE ACCENT")

Raku will turn both of these inputs into one codepoint, as specified for Normalization Form C (NFC). In most cases this is useful, because it means that two equivalent inputs are treated the same. Unicode has a concept of canonical equivalence, which allows us to determine the canonical form of a string and so compare and manipulate strings without worrying about the text losing these properties. By default, any text you process or output from Raku will be in this “canonical” form, even after modifying or concatenating strings (see below for how to avoid this). For more detailed information about Normalization Form C and canonical equivalence, see the Unicode Consortium's page on Normalization and Canonical Equivalence.
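As a rough illustration of the behavior described above, here is a minimal sketch (the variable names are only for this example). Both spellings of á end up as the same one-codepoint string, and the decomposed form can still be requested explicitly with the NFD method:

```raku
# Two ways of writing the grapheme á
my $precomposed = "\x[E1]";    # U+00E1 LATIN SMALL LETTER A WITH ACUTE
my $decomposed  = "a\x[301]";  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

say $precomposed eq $decomposed; # OUTPUT: «True␤»
say $decomposed.chars;           # OUTPUT: «1␤»
say $decomposed.ords;            # OUTPUT: «(225)␤»
say $decomposed.NFD.list;        # OUTPUT: «(97 769)␤»
```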

One case where we don't default to this is for the names of files. This is because the names of files must be accessed exactly as the bytes are written on the disk.

To avoid normalization you can use a special encoding format called UTF8-C8. Using this encoding with any filehandle allows you to read the exact bytes as they are on disk, without normalization. The resulting text may look odd when printed to a UTF8 handle. If you print it to a handle whose output encoding is UTF8-C8, it will render as you would normally expect, as a byte-for-byte exact copy. More technical details on UTF8-C8 on MoarVM are described below.

UTF8-C8

UTF-8 Clean-8 is an encoder/decoder that works mostly like the UTF-8 one. However, upon encountering a byte sequence that either does not decode as valid UTF-8 or would not round-trip due to normalization, it uses NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 can recreate the bytes as they originally existed. The synthetics contain four codepoints:

  • The codepoint 0x10FFFD (which is a private use codepoint)

  • The codepoint 'x'

  • The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)

  • The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)

Under normal UTF-8 encoding, this means the unrepresentable characters will come out as something like ?xFF.

UTF-8 Clean-8 is used in places where MoarVM receives strings from the environment, command line arguments, and filesystem queries; for instance when decoding buffers:

    say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8');
    # OUTPUT: «A􏿽xFEZ␤»

You can see how the two initial codepoints used by UTF8-C8 show up in the output above, right before the 'FE'. You can use this type of encoding to read files with unknown encoding:

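    my $test-file = "/tmp/test";
    # Write bytes that are not valid UTF-8
    given open($test-file, :w, :bin) {
        .write: Buf.new(ord('A'), 0xFA, ord('B'), 0xFB, 0xFC, ord('C'), 0xFD);
        .close;
    }
    # Decode as UTF8-C8, then encode back to recover the original bytes
    say slurp($test-file, enc => 'utf8-c8').encode('utf8-c8').list;
    # OUTPUT: «(65 250 66 251 252 67 253)␤»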

Reading with this type of encoding and encoding them back to UTF8-C8 will give you back the original bytes; this would not have been possible with the default UTF-8 encoding.
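A minimal sketch of that round trip, using an in-memory buffer instead of a file (the variable names are illustrative). It also tries the strict utf8 decoder on the same bytes; the sketch assumes that the failure surfaces as an exception that try can catch:

```raku
# 0xFE can never appear in valid UTF-8
my $bytes = Buf.new(ord('A'), 0xFE, ord('Z'));

# Decoding with utf8-c8 keeps the invalid byte as a synthetic,
# so encoding back yields the original bytes
my $text = $bytes.decode('utf8-c8');
my $back = $text.encode('utf8-c8');
say $back.list;                 # OUTPUT: «(65 254 90)␤»
say $back.list eq $bytes.list;  # OUTPUT: «True␤»

# The strict decoder is expected to reject the same input;
# try turns the failure into an undefined value
my $strict = try $bytes.decode('utf8');
say $strict.defined ?? 'decoded' !! 'not valid UTF-8';
```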

Please note that this encoding is not yet supported in the JVM implementation of Rakudo.

Entering Unicode codepoints and codepoint sequences

You can enter Unicode codepoints by number (decimal as well as hexadecimal). For example, the character named "latin capital letter ae with macron" has decimal codepoint 482 and hexadecimal codepoint 0x1E2:

    say "\c[482]"; # OUTPUT: «Ǣ␤»
    say "\x1E2";   # OUTPUT: «Ǣ␤»
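Going the other way, from a character back to its number, can be sketched with the ord, chr, and base routines (the variable name is only for illustration):

```raku
my $char = "\x1E2";

say $char.ord;          # OUTPUT: «482␤»
say $char.ord.base(16); # OUTPUT: «1E2␤»
say 482.chr eq $char;   # OUTPUT: «True␤»
say 0x1E2 == 482;       # OUTPUT: «True␤»
```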

You can also access Unicode codepoints by name: Raku supports all Unicode names.

    say "\c[PENGUIN]"; # OUTPUT: «🐧␤»
    say "\c[BELL]";    # OUTPUT: «🔔␤» (U+1F514 BELL)

Starting in Rakudo 2017.02, all Unicode codepoint names, named sequences, and emoji sequences are case-insensitive:

    say "\c[latin capital letter ae with macron]"; # OUTPUT: «Ǣ␤»
    say "\c[latin capital letter E]";              # OUTPUT: «E␤» (U+0045)

You can specify multiple characters by using a comma-separated list with \c[]. You can combine numeric and named styles as well:

    say "\c[482,PENGUIN]"; # OUTPUT: «Ǣ🐧␤»

In addition to using \c[] inside interpolated strings, you can also use the uniparse routine:

    say "DIGIT ONE".uniparse;  # OUTPUT: «1␤»
    say uniparse("DIGIT ONE"); # OUTPUT: «1␤»

See uniname and uninames for routines that work in the opposite direction with a single codepoint and multiple codepoints, respectively.
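A short sketch of how these routines mirror each other (the variable names are illustrative). It relies on uniparse accepting a comma-separated list of names, in the same way as \c[]:

```raku
my $name = "LATIN CAPITAL LETTER AE WITH MACRON";
my $char = $name.uniparse;

say $char;                  # OUTPUT: «Ǣ␤»
say $char.uniname eq $name; # OUTPUT: «True␤»

# Several names at once, and back again
say "DIGIT ONE, DIGIT TWO".uniparse; # OUTPUT: «12␤»
say "12".uninames;                   # OUTPUT: «(DIGIT ONE DIGIT TWO)␤»
```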

Name aliases

Name Aliases are used mainly for codepoints without an official name, for abbreviations, or for corrections (Unicode names never change). For a full list of them, see the NameAliases.txt file of the Unicode Character Database.

Control codes without any official name:

    say "\c[ALERT]";     # Not visible (U+0007 control code (also accessible as \a))
    say "\c[LINE FEED]"; # Not visible (U+000A same as "\n")

Corrections:

    # Correct name as input:
    say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ␤»
    # Original, erroneous name as output:
    say "Ƣ".uniname; # OUTPUT: «LATIN CAPITAL LETTER OI␤»

    # This one is a spelling mistake that was corrected in a Name Alias:
    # Correct name as input:
    say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
    # Original, erroneous name as output:
    # OUTPUT: «PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET␤»

Abbreviations:

    say "\c[ZWJ]".uniname;   # OUTPUT: «ZERO WIDTH JOINER␤»
    say "\c[NBSP]".uniname;  # OUTPUT: «NO-BREAK SPACE␤»
    say "\c[NNBSP]".uniname; # OUTPUT: «NARROW NO-BREAK SPACE␤»
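Since the aliases above are just alternative names, each one resolves to a single ordinary codepoint. A small sketch that prints those codepoints, using only the aliases already shown in this section:

```raku
# Control-code aliases
say "\c[ALERT]".ord;          # OUTPUT: «7␤»
say "\c[LINE FEED]".ord;      # OUTPUT: «10␤»

# Abbreviation aliases, shown as hexadecimal codepoints
say "\c[ZWJ]".ord.base(16);   # OUTPUT: «200D␤»
say "\c[NBSP]".ord.base(16);  # OUTPUT: «A0␤»
say "\c[NNBSP]".ord.base(16); # OUTPUT: «202F␤»
```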

Named sequences

Starting in Rakudo 2017.02, you can also use any of the Named Sequences. These are not single codepoints, but sequences of them.

    say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]";      # OUTPUT: «É̩␤»
    say "\c[LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW AND ACUTE]".ords; # OUTPUT: «(201 809)␤»

Emoji sequences

Raku supports Emoji sequences. For all of them see: Emoji ZWJ Sequences and Emoji Sequences. Note that any names with commas should have their commas removed since Raku uses commas to separate different codepoints/sequences inside the same \c sequence.

    say "\c[woman gesturing OK]";         # OUTPUT: «🙆‍♀️␤»
    say "\c[family: man woman girl boy]"; # OUTPUT: «👨‍👩‍👧‍👦␤»
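To see what such a sequence is made of, the following sketch builds the family emoji from its individual codepoints instead of its name (the variable name is illustrative). The sequence contains seven codepoints. On backends that apply the emoji ZWJ clustering rules, chars is expected to report a single grapheme, so that line is left without a fixed output:

```raku
# man, woman, girl, boy, each joined by ZERO WIDTH JOINER
my $family = "\x[1F468]\c[ZWJ]\x[1F469]\c[ZWJ]\x[1F467]\c[ZWJ]\x[1F466]";

say $family.codes;                # OUTPUT: «7␤»
say $family.ords.map(*.base(16)); # OUTPUT: «(1F468 200D 1F469 200D 1F467 200D 1F466)␤»
say $family.chars;                # number of graphemes
```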

Confusability

Unicode contains a very large number of glyphs, and some of them are easily confused with one another. For general tips on how to avoid issues, see Confusability at the unicode.org site.
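As a small illustration of the problem, the Latin letter a and the Cyrillic letter а look alike in most fonts but are different codepoints, so they never compare equal. This is a sketch only; the variable names are illustrative:

```raku
my $latin    = "a";       # U+0061
my $cyrillic = "\x[430]"; # U+0430, usually rendered like a Latin a

say $latin eq $cyrillic;  # OUTPUT: «False␤»
say $latin.uniname;       # OUTPUT: «LATIN SMALL LETTER A␤»
say $cyrillic.uniname;    # OUTPUT: «CYRILLIC SMALL LETTER A␤»
```

Checking uniname, as done here, is one simple way to find out which of two lookalike characters a string actually contains.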

