Regexes: best practices and gotchas

Some tips on regexes and grammars

To help with robust regexes and grammars, here are some best practices for code layout and readability, what to actually match, and avoiding common pitfalls.

Code layout

Without the :sigspace adverb, whitespace is not significant in Raku regexes. Use that to your own advantage and insert whitespace where it increases readability. Also, insert comments where necessary.

Compare the very compact

my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }

to the more readable

my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

As a rule of thumb, use whitespace around atoms and inside groups; put quantifiers directly after the atom; and vertically align opening and closing square brackets and parentheses.

When you use a list of alternations inside parentheses or square brackets, align the vertical bars:

my regex example {
        <preamble>
        [
        || <choice_1>
        || <choice_2>
        || <choice_3>
        ]+
        <postamble>
    }

Keep it small

Regexes are often more compact than regular code. Because they do so much with so little, keep regexes short.

When you can name a part of a regex, it's usually best to put it into a separate, named regex.

For example, you could take the float regex from earlier:

my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

And decompose it into parts:

my token sign { <[+-]> }
    my token decimal { \d+ }
    my token exponent { 'e' <sign>? <decimal> }
    my regex float {
        <sign>?
        <decimal>?
        '.'
        <decimal>
        <exponent>?
    }

That helps, especially when the regex becomes more complicated. For example, you might want to make the decimal point optional in the presence of an exponent.

my regex float {
        <sign>?
        [
        || <decimal>?  '.' <decimal> <exponent>?
        || <decimal> <exponent>
        ]
    }

What to match

Often the input data format has no clear-cut specification, or the specification is not known to the programmer. Then, it's good to be liberal in what you expect, but only so long as there are no possible ambiguities.

For example, in ini files:

[section]
key=value

What can be inside the section header? Allowing only a word might be too restrictive. Somebody might write [two words], or use dashes, etc. Instead of asking what's allowed on the inside, it might be worth asking instead: what's not allowed?

Clearly, closing square brackets are not allowed, because [a]b] would be ambiguous. By the same argument, opening square brackets should be forbidden. This leaves us with

token header { '[' <-[ \[\] ]>+ ']' }

which is fine if you are only processing one line. But if you're processing a whole file, suddenly the regex parses

[with a
newline in between]

which might not be a good idea. A compromise would be

token header { '[' <-[ \[\] \n ]>+ ']' }

and then, in the post-processing, strip leading and trailing spaces and tabs from the section header.

Matching whitespace

The :sigspace adverb (or using the rule declarator instead of token or regex) is very handy for implicitly parsing whitespace that can appear in many places.

Going back to the example of parsing ini files, we have

my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }

which is probably not as liberal as we want it to be, since the user might put spaces around the equals sign. So, then we may try this:

my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }

But that's looking unwieldy, so we try something else:

my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }

But wait! The implicit whitespace matching after the value uses up all whitespace, including newline characters, so the \n+ doesn't have anything left to match (and rule also disables backtracking, so no luck there).

Therefore, it's important to redefine your definition of implicit whitespace to whitespace that is not significant in the input format.

This works by redefining the token ws; however, it only works for grammars:

grammar IniFormat {
        token ws { <!ww> \h* }
        rule header { \s* '[' (\w+) ']' \n+ }
        token identifier  { \w+ }
        rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
        token section {
            <header>
            <kvpair>*
        }
token TOP {
            <section>*
        }
    }
my $contents = q:to/EOI/;
        [passwords]
            jack = password1
            joy = muchmoresecure123
        [quotas]
            jack = 123
            joy = 42
    EOI
    say so IniFormat.parse($contents);

Besides putting all regexes into a grammar and turning them into tokens (because they don't need to backtrack anyway), the interesting new bit is

token ws { <!ww> \h* }

which gets called for implicit whitespace parsing. It matches when it's not between two word characters (<!ww>, negated "within word" assertion), and zero or more horizontal space characters. The limitation to horizontal whitespace is important, because newlines (which are vertical whitespace) delimit records and shouldn't be matched implicitly.

Still, there's some whitespace-related trouble lurking. The regex \n+ won't match a string like "\n \n", because there's a blank between the two newlines. To allow such input strings, replace \n+ with \n\s*.

See Also

Classes and objects

A tutorial about creating and using classes in Raku

CompUnits and where to find them

How and when Raku modules are compiled, where they are stored, and how to access them in compiled form.

Concurrency

Concurrency and asynchronous programming

Command line interface

Creating your own CLI in Raku

Grammar tutorial

An introduction to grammars

Input/Output

File-related operations

Inter-process communication

Programs running other programs and communicating with them

Iterating

Functionalities available for visiting all items in a complex data structure

Doing math with Raku

Different mathematical paradigms and how they are implemented in this language

Module packages

Creating module packages for code reuse

Core modules

Core modules that may be useful to module authors

Module development utilities

What can help you write/test/improve your module(s)

Modules

How to create, use, and distribute Raku modules

Creating operators

A short tutorial on how to declare operators and create new ones.

REPL

Read-eval-print loop

Entering unicode characters

Input methods for unicode characters in terminals, the shell, and editors

The Camelia image is copyright 2009 by Larry Wall. "Raku" is trademark of the Yet Another Society. All rights reserved.