Grammars

Parsing and interpreting text

Grammars are a powerful tool used to destructure text, and often to return data structures that have been created by interpreting that text.

For example, Raku is parsed and executed using a Raku-style grammar.

An example that's more practical for the common Raku user is the JSON::Tiny module, which can deserialize any valid JSON file, yet its deserializing code is written in less than 100 lines of simple, extensible code.

If you didn't like grammar in school, don't let that scare you off grammars. Grammars allow you to group regexes, just as classes allow you to group methods of regular code.

Named Regexes

The main ingredient of grammars is named regexes. While the syntax of Raku Regexes is outside the scope of this document, named regexes have a special syntax, similar to subroutine definitions: [1]

my regex number { \d+ [ \. \d+ ]? }

In this case, we have to specify that the regex is lexically scoped using the my keyword, because named regexes are normally used within grammars.

Being named gives us the advantage of being able to easily reuse the regex elsewhere:

say so "32.51" ~~ &number;                         # OUTPUT: «True␤»
say so "15 + 4.5" ~~ /<number>\s* '+' \s*<number>/ # OUTPUT: «True␤»

regex isn't the only declarator for named regexes. In fact, it's the least common. Most of the time, the token or rule declarators are used. These are both ratcheting, which means that the match engine won't back up and try again if it fails to match something. This will usually do what you want, but isn't appropriate for all cases:

my regex works-but-slow { .+ q }
my token fails-but-fast { .+ q }
my $s = 'Tokens won\'t backtrack, which makes them fail quicker!';
say so $s ~~ &works-but-slow; # OUTPUT: «True␤»
say so $s ~~ &fails-but-fast; # OUTPUT: «False␤»
                              # the entire string is taken by the .+

Note that non-backtracking applies to terms; that is, as in the example below, once something has matched, the engine will never back up over it. But when a term fails to match, and there is another candidate introduced by | or ||, the engine will try to match that candidate.

my token tok-a { .* d  };
my token tok-b { .* d | bd };
say so "bd" ~~ &tok-a;        # OUTPUT: «False␤»
say so "bd" ~~ &tok-b;        # OUTPUT: «True␤»

Rules

The only difference between the token and rule declarators is that the rule declarator causes :sigspace to go into effect for the Regex:

my token token-match { 'once' 'upon' 'a' 'time' }
my rule  rule-match  { 'once' 'upon' 'a' 'time' }
say so 'onceuponatime'    ~~ &token-match; # OUTPUT: «True␤»
say so 'once upon a time' ~~ &token-match; # OUTPUT: «False␤»
say so 'onceuponatime'    ~~ &rule-match;  # OUTPUT: «False␤»
say so 'once upon a time' ~~ &rule-match;  # OUTPUT: «True␤»

Creating grammars

Grammar is the superclass that classes automatically get when they are declared with the grammar keyword instead of class. Grammars should only be used to parse text; if you wish to extract complex data, you can add actions within the grammar or use an action object in conjunction with the grammar.

Proto regexes

Grammars are composed of rules, tokens and regexes; these are actually methods, since grammars are classes. [2]

These methods can share a name and functionality in common, and thus can use proto.

For instance, if you have a lot of alternations, it may become difficult to produce readable code or subclass your grammar. In the Calculations class below, the ternary in method TOP is less than ideal and it becomes even worse the more operations we add:

grammar Calculator {
    token TOP { [ <add> | <sub> ] }
    rule  add { <num> '+' <num> }
    rule  sub { <num> '-' <num> }
    token num { \d+ }
}
class Calculations {
    method TOP ($/) { make $<add> ?? $<add>.made !! $<sub>.made; }
    method add ($/) { make [+] $<num>; }
    method sub ($/) { make [-] $<num>; }
}
say Calculator.parse('2 + 3', actions => Calculations).made;
# OUTPUT: «5␤»

To make things better, we can use proto regexes that look like :sym<...> adverbs on tokens:

grammar Calculator {
    token TOP { <calc-op> }

    proto rule calc-op          {*}
          rule calc-op:sym<add> { <num> '+' <num> }
          rule calc-op:sym<sub> { <num> '-' <num> }

    token num { \d+ }
}
class Calculations {
    method TOP              ($/) { make $<calc-op>.made; }
    method calc-op:sym<add> ($/) { make [+] $<num>; }
    method calc-op:sym<sub> ($/) { make [-] $<num>; }
}
say Calculator.parse('2 + 3', actions => Calculations).made;
# OUTPUT: «5␤»

In this grammar the alternation has now been replaced with <calc-op>, which is essentially the name of a group of values we'll create. We do so by defining a rule prototype with proto rule calc-op. Each of our previous alternations has been replaced by a new rule calc-op definition, and the name of the alternation is attached with the :sym<> adverb.

In the class that declares actions, we have now gotten rid of the ternary operator and simply take the .made value from the $<calc-op> match object. The actions for individual alternations now follow the same naming pattern as in the grammar: method calc-op:sym<add> and method calc-op:sym<sub>.

The real beauty of this method can be seen when you subclass the grammar and action classes. Let's say we want to add a multiplication feature to the calculator:

grammar BetterCalculator is Calculator {
    rule calc-op:sym<mult> { <num> '*' <num> }
}
class BetterCalculations is Calculations {
    method calc-op:sym<mult> ($/) { make [*] $<num> }
}
say BetterCalculator.parse('2 * 3', actions => BetterCalculations).made;
# OUTPUT: «6␤»

All we had to add was an additional rule and action to the calc-op group, and everything works, all thanks to proto regexes.

Special tokens

TOP

grammar Foo {
    token TOP { \d+ }
}

The TOP token is the default first token attempted to match when parsing with a grammar. Note that if you're parsing with the .parse method, token TOP is automatically anchored to the start and end of the string. If you don't want to parse the whole string, look up .subparse.

Using rule TOP or regex TOP is also acceptable.

A different token can be chosen to be matched first using the :rule named argument to .parse, .subparse, or .parsefile. These are all Grammar methods.
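
For example, a small sketch (the grammar Foo and its token word are made up for illustration):

grammar Foo {
    token TOP  { \d+ }
    token word { \w+ }
}
say Foo.parse('42');                 # OUTPUT: «「42」␤»
say Foo.parse('42 and more');        # OUTPUT: «Nil␤» (TOP must cover the whole string)
say Foo.subparse('42 and more');     # OUTPUT: «「42」␤» (a prefix match is enough)
say Foo.parse('hello', :rule<word>); # OUTPUT: «「hello」␤»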

ws

The default ws matches zero or more whitespace characters, as long as that point is not within a word (in code form, that's token ws { <!ww> \s* }):

# First <.ws> matches word boundary at the start of the line
# and second <.ws> matches the whitespace between 'b' and 'c'
say 'ab   c' ~~ /<.ws> ab <.ws> c /; # OUTPUT: «「ab   c」␤»

# Failed match: there is neither any whitespace nor a word
# boundary between 'a' and 'b'
say 'ab' ~~ /. <.ws> b/;             # OUTPUT: «Nil␤»

# Successful match: there is a word boundary between ')' and 'b'
say ')b' ~~ /. <.ws> b/;             # OUTPUT: «「)b」␤»

Bear in mind that we precede ws with a dot to make the call non-capturing, since we are not interested in capturing the whitespace. Because whitespace is in general a separator, this non-capturing form is how ws is mostly used.

When rule is used instead of token, :sigspace is enabled by default and any whitespace after terms and closing parentheses/brackets is turned into a non-capturing call to ws, written as <.ws>, where . means non-capturing. That is to say:

rule entry { <key> '=' <value> }

Is the same as:

token entry { <key> <.ws> '=' <.ws> <value> <.ws> }
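
To see this equivalence in action, here is a small runnable sketch; the grammar Conf and its tokens are made up for illustration:

grammar Conf {
    token TOP   { <entry> }
    rule  entry { <key> '=' <value> }
    token key   { \w+ }
    token value { \w+ }
}
say so Conf.parse('name = raku'); # OUTPUT: «True␤» (spaces matched by the implicit <.ws>)
say so Conf.parse('name=raku');   # OUTPUT: «True␤» (<.ws> also matches the empty string)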

You can also redefine the default ws token:

grammar Foo {
    rule TOP { \d \d }
}.parse: "4   \n\n 5"; # Succeeds

grammar Bar {
    rule TOP { \d \d }
    token ws { \h*   }
}.parse: "4   \n\n 5"; # Fails

You can even capture ws, but then you need to call it explicitly. Notice that in the next example we use token instead of rule, as the latter would cause the whitespace to be consumed by the implicit non-capturing <.ws>.

grammar Foo { token TOP {\d <ws> \d} };
my $parsed = Foo.parse: "3 3";
say $parsed<ws>; # OUTPUT: «「 」␤»

sym

The <sym> token can be used inside proto regexes to match the string value of the :sym adverb for that particular regex:

grammar Foo {
    token TOP { <letter>+ }

    proto token letter {*}
          token letter:sym<R> { <sym> }
          token letter:sym<a> { <sym> }
          token letter:sym<k> { <sym> }
          token letter:sym<u> { <sym> }
          token letter:sym<*> {   .   }
}.parse("I ♥ Raku", actions => class {
    method TOP($/) { make $<letter>.grep(*.<sym>).join }
}).made.say; # OUTPUT: «Raku␤»

This comes in handy when you're already differentiating the proto regexes with the strings you're going to match, as using the <sym> token prevents repetition of those strings.

"Always succeed" assertion|Regexes,<?>

The <?> is the always succeed assertion. When used as a grammar token, it can be used to trigger an Action class method. In the following grammar we look for Arabic digits and define a succ token with the always succeed assertion.

In the action class, we use calls to the succ method to do setup (in this case, we prepare a new element in @!numbers). In the digit method, we use the Arabic digit as an index into a list of Devanagari digits and add it to the last element of @!numbers. Thanks to succ, the last element will always be the number for the digits currently being parsed. The final iteration of the loop in TOP calls succ once more before failing to find another digit, leaving a trailing empty element, which is why the TOP action makes all but the last element of @!numbers.

grammar Digifier {
    rule TOP {
        [ <.succ> <digit>+ ]+
    }
    token succ  { <?> }
    token digit { <[0..9]> }
}
class Devanagari {
    has @!numbers;
    method digit ($/) { @!numbers.tail ~= <०  १  २  ३  ४  ५  ६  ७  ८  ९>[$/] }
    method succ  ($)  { @!numbers.push: ''     }
    method TOP   ($/) { make @!numbers[^(*-1)] }
}
say Digifier.parse('255 435 777', actions => Devanagari.new).made;
# OUTPUT: «(२५५ ४३५ ७७७)␤»

Methods in grammars

It's fine to use methods instead of rules or tokens in a grammar, as long as they return a Match:

grammar DigitMatcher {
    method TOP (:$full-unicode) {
        $full-unicode ?? self.num-full !! self.num-basic;
    }
    token num-full  { \d+ }
    token num-basic { <[0..9]>+ }
}

The grammar above will attempt different matches depending on the arguments provided to the subparse method:

say +DigitMatcher.subparse: '12७१७९०९', args => \(:full-unicode);
# OUTPUT: «12717909␤»
say +DigitMatcher.subparse: '12७१७९०९', args => \(:!full-unicode);
# OUTPUT: «12␤»

Dynamic variables in grammars

Variables can be defined in tokens by prefixing the lines of code defining them with :. Arbitrary code can be embedded anywhere in a token by surrounding it with curly braces. This is useful for keeping state between tokens, which can be used to alter how the grammar will parse text. Using dynamic variables (variables with $*, @*, &*, %* twigils) in tokens cascades down through all tokens defined thereafter within the one where it's defined, avoiding having to pass them from token to token as arguments.

One use for dynamic variables is guards for matches. This example uses guards to explain which regex classes parse whitespace literally:

grammar GrammarAdvice {
    rule TOP {
        :my Int $*USE-WS;
        "use" <type> "for" <significance> "whitespace by default"
    }
    token type {
        | "rules"   { $*USE-WS = 1 }
        | "tokens"  { $*USE-WS = 0 }
        | "regexes" { $*USE-WS = 0 }
    }
    token significance {
        | <?{ $*USE-WS == 1 }> "significant"
        | <?{ $*USE-WS == 0 }> "insignificant"
    }
}

Here, text such as "use rules for significant whitespace by default" will only match if the state set while matching "rules", "tokens", or "regexes" satisfies the guard in front of "significant" or "insignificant":

say GrammarAdvice.subparse("use rules for significant whitespace by default");
# OUTPUT: «use rules for significant whitespace by default␤»

say GrammarAdvice.subparse("use tokens for insignificant whitespace by default");
# OUTPUT: «use tokens for insignificant whitespace by default␤»

say GrammarAdvice.subparse("use regexes for insignificant whitespace by default");
# OUTPUT: «use regexes for insignificant whitespace by default␤»

say GrammarAdvice.subparse("use regexes for significant whitespace by default");
# OUTPUT: #<failed match>

Attributes in grammars

Attributes may be defined in grammars. However, they can only be accessed by methods. Attempting to use them from within a token will throw an exception because tokens are methods of Match, not of the grammar itself. Note that mutating an attribute from within a method called in a token will only modify the attribute for that token's own match object! Grammar attributes can be accessed in the match returned after parsing if made public:

grammar HTTPRequest {
    has Bool $.invalid;

    token TOP {
        <type> <.ns> <path> <.ns> 'HTTP/1.1' <.crlf>
        [ <field> <.crlf> ]+
        <.crlf>
        $<body>=.*
    }

    token type {
        | [ GET | POST | OPTIONS | HEAD | PUT | DELETE | TRACE | CONNECT ] <.accept>
        | <-[\/]>+ <.error>
    }

    token path {
        | '/' [[\w+]+ % \/] [\.\w+]? <.accept>
        | '*' <.accept>
        | \S+ <.error>
    }

    token field {
        | $<name>=\w+ <.ns> ':' <.ns> $<value>=<-crlf>* <.accept>
        | <-crlf>+ <.error>
    }

    method error(--> ::?CLASS:D) {
        $!invalid = True;
        self;
    }

    method accept(--> ::?CLASS:D) {
        $!invalid = False;
        self;
    }

    token crlf { # network new line (usually seen as "\r\n")
        # Several internet protocols (such as HTTP, RFC 2616) mandate
        # the use of ASCII CR+LF (0x0D 0x0A) to terminate lines at
        # the protocol level (even though, in practice, some applications
        # tolerate a single LF).
        # Raku, Raku grammars and strings (Str) adhere to Unicode
        # conformance. Thus, CR+LF cannot be expressed unambiguously
        # as \r\n in Raku grammars or strings (Str), as Unicode
        # conformance requires \r\n to be interpreted as \n alone.
        \x[0d] \x[0a]
    }
    token ns { # network space
        # <ws> would consume, e.g., newlines, and \h (and \s) would accept
        # more codepoints than just ASCII single space and the tab character.
        [ ' ' | <[\t]> ]*
    }
}

my $crlf = "\x[0d]\x[0a]";
my $header = "GOT /index.html HTTP/1.1{$crlf}Host: docs.raku.org{$crlf}{$crlf}body";
my $m = HTTPRequest.parse($header);
say "type(\"$m.<type>\")={$m.<type>.invalid}";
# OUTPUT: type("GOT ")=True
say "path(\"$m.<path>\")={$m.<path>.invalid}";
# OUTPUT: path("/index.html")=False
say "field(\"$m.<field>[0]\")={$m.<field>[0].invalid}";
# OUTPUT: field("Host: docs.raku.org")=False

Notes: $crlf and the token <.crlf> are required if we want to (within the context of this incomplete example) strictly adhere to HTTP/1.1 (RFC 2616). The reason is that Raku, in contrast to RFC 2616, is Unicode conformant, so \r\n in a string is interpreted as a sole \n, which would prevent the grammar from properly parsing a string containing \r\n in the sense expected by the HTTP protocol. Notice how the attribute invalid is local to each component (e.g., its value for <type> is True, but for <path> it is False). Notice also that we need the accept method: without it, the attribute invalid would remain uninitialized (even though present) for components that matched successfully.

Passing arguments into grammars

To pass arguments into a grammar, you can use the named argument of :args on any of the parsing methods of grammar. The arguments passed should be in a list.

grammar demonstrate-arguments {
    rule TOP ($word) {
    "I like" $word
    }
}

# Notice the comma after "sweets" when passed to :args to coerce it to a list
say demonstrate-arguments.parse("I like sweets", :args(("sweets",)));
# OUTPUT: «「I like sweets」␤»

Once the arguments are passed in, they can be used in a call to a named regex inside the grammar.

grammar demonstrate-arguments-again {
    rule TOP ($word) {
    <phrase-stem><added-word($word)>
    }

    rule phrase-stem {
       "I like"
    }

    rule added-word($passed-word) {
       $passed-word
    }
}

say demonstrate-arguments-again.parse("I like vegetables", :args(("vegetables",)));
# OUTPUT: «「I like vegetables」␤»
# OUTPUT:  «phrase-stem => 「I like 」␤»
# OUTPUT:  «added-word => 「vegetables」␤»

Alternatively, you can initialize dynamic variables and use any arguments that way within the grammar.

grammar demonstrate-arguments-dynamic {
   rule TOP ($*word, $*extra) {
      <phrase-stem><added-words>
   }
   rule phrase-stem {
      "I like"
   }
   rule added-words {
      $*word $*extra
   }
}

say demonstrate-arguments-dynamic.parse("I like everything else",
  :args(("everything", "else")));
# OUTPUT: «「I like everything else」␤»
# OUTPUT:  «phrase-stem => 「I like 」␤»
# OUTPUT:  «added-words => 「everything else」␤»

Action objects

A successful grammar match gives you a parse tree of Match objects, and the deeper that match tree gets and the more branches the grammar has, the harder it becomes to navigate the match tree to get the information you are actually interested in.

To avoid the need for diving deep into a match tree, you can supply an actions object. After each successful parse of a named rule in your grammar, it tries to call a method of the same name as the grammar rule, giving it the newly created Match object as a positional argument. If no such method exists, it is skipped.

Here is a contrived example of a grammar and actions in action:

grammar TestGrammar {
    token TOP { \d+ }
}

class TestActions {
    method TOP($/) {
        make(2 + $/);
    }
}

my $match = TestGrammar.parse('40', actions => TestActions.new);
say $match;         # OUTPUT: «「40」␤»
say $match.made;    # OUTPUT: «42␤»

An instance of TestActions is passed as named argument actions to the parse call, and when token TOP has matched successfully, it automatically calls method TOP, passing the match object as an argument.

To make it clear that the argument is a match object, the example uses $/ as the parameter name of the action method, though that's just a handy convention, nothing intrinsic; $match would have worked too. Using $/ does give the advantage of providing $<capture> as a shortcut for $/<capture>; we use a differently named parameter in the action for TOP in the next example, anyway.

A slightly more involved example follows:

grammar KeyValuePairs {
    token TOP {
        [<pair> \v+]*
    }

    token pair {
        <key=.identifier> '=' <value=.identifier>
    }

    token identifier {
        \w+
    }
}

class KeyValuePairsActions {
    method pair      ($/) {
        make $/<key>.made => $/<value>.made
    }
    method identifier($/) {
        # subroutine `make` is the same as calling .make on $/
        make ~$/
    }

    method TOP ($match) {
        # can use any variable name for parameter, not just $/
        $match.make: $match<pair>».made
    }
}


my $actions = KeyValuePairsActions;
my @res = KeyValuePairs.parse(q:to/EOI/, :$actions).made;
second=b
hits=42
raku=d
EOI

for @res -> $p {
    say "Key: $p.key()\tValue: $p.value()";
}

This produces the following output:

Key: second     Value: b
Key: hits       Value: 42
Key: raku       Value: d

The token pair, which parses a key-value pair separated by an equals sign, aliases the two calls to token identifier to separate capture names, so that they are available more easily and intuitively, as they will be in the corresponding action. The corresponding action method constructs a Pair object and uses the .made property of the submatch objects. It (like the action method TOP) exploits the fact that action methods for submatches are called before those of the calling/outer regex; that is, action methods are called in post-order.
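
You can observe this post-order invocation by printing from each action method; a minimal sketch with made-up grammar and class names:

grammar G {
    token TOP { <a> <b> }
    token a   { 'x' }
    token b   { 'y' }
}
class Trace {
    method a   ($/) { say 'a done' }
    method b   ($/) { say 'b done' }
    method TOP ($/) { say 'TOP done' }
}
G.parse('xy', actions => Trace);
# OUTPUT: «a done␤b done␤TOP done␤»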

The action method TOP simply collects all the objects that were .made by the multiple matches of the pair rule, and returns them in a list. Note that, in this case, we need to use the method form of make, since the routine form can only be used if the argument to the action method is $/. Conversely, if the parameter of the method is $/, we can simply use make, which is equivalent to $/.make.

Also note that KeyValuePairsActions was passed as a type object to method parse, which was possible because none of the action methods use attributes (which would only be available in an instance).

We can extend the above example by using inheritance.

use KeyValuePairs;

unit grammar ConfigurationSets is KeyValuePairs;

token TOP {
    <configuration-element>+ %% \v
}

token configuration-element {
    <pair>+ %% \v
}

token comment {
    \s* '#' .+? $$
}

token pair {
    <key=.identifier> '=' <value=.identifier> <comment>?
}

We are sub-classing (actually, sub-grammaring) the previous example; we have overridden the definition of pair by adding a comment; the previous TOP rule has been demoted to configuration-element, and there's a new TOP which now considers sets of configuration elements separated by vertical space. We can also reuse actions by subclassing the action class:

use KeyValuePairs;

unit class ConfigurationSetsActions is KeyValuePairsActions;

method configuration-element($match) {
    $match.make: $match<pair>».made
}

method TOP ($match) {
    my @made-elements = gather for $match<configuration-element> {
        take $_.made
    };
    $match.make( @made-elements );
}

All existing actions are reused, although obviously new ones have to be written for the new elements in the grammar, including TOP. These can be used together from this script:

use ConfigurationSets;
use ConfigurationSetsActions;

my $actions = ConfigurationSetsActions;
my $sets = ConfigurationSets.parse(q:to/EOI/, :$actions).made;
second=b # Just a thing
hits=42
raku=d

third=c # New one
hits=33
EOI

for @$sets -> $set {
    say "Element→ $set";
}

Which will print

Element→ second b hits 42 raku d
Element→ third c hits 33

In other cases, action methods might want to keep state in attributes. Then of course you must pass an instance to method parse.
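
For instance, a minimal sketch (the grammar Words and the class WordCounter are made up for illustration) of an actions class that keeps a count in an attribute, which therefore requires an instance:

grammar Words {
    token TOP  { [ <word> \s* ]+ }
    token word { \w+ }
}
class WordCounter {
    has Int $.count = 0;
    # The state lives in an attribute, so a type object would not work here
    method word ($/) { $!count++ }
}
my $actions = WordCounter.new; # an instance, not the type object
Words.parse('lorem ipsum dolor', :$actions);
say $actions.count; # OUTPUT: «3␤»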

Note that token ws is special: when :sigspace is enabled (and it is when we are using rule), it replaces certain whitespace sequences in the regex with calls to ws. This is why the spaces around the equals sign in rule entry above work just fine; it is also why pair is declared as a token here, since the default ws also matches vertical whitespace and, in a rule, would gobble up the newlines looked for in token TOP.

[1] In fact, named regexes can even take extra arguments, using the same syntax as subroutine parameter lists.

[2] They are actually a special kind of class, but for the rest of the section, they behave in the same way as a normal class would.

