
N.B.: This was posted for Day 15 of the 2019 Raku Advent Calendar on 15 Dec 2019.

Santa had too much eggnog

We’re just over a week from Christmas and Santa is sending his elves the final present lists. Unfortunately, Santa had a bit too much eggnog and so the list that he sent to his elves was … not the greatest. Take a look at some of it:

Johnny
 - 4 bsaeball gluvs
 - 2 batts
 - 2 ballz
Mary
 - 3 fancee dols
 - 1 dressss
 - 1 bbaskebtall

Santa somehow managed to keep a nice format that we could mostly process with regexen, so the elves started hammering away at a nice grammar:

grammar Santa'sList {
  rule TOP        {       <kid's-list>+    }
  rule kid's-list {     <name>     <gift>+ }
  rule gift       { '-' <quantity> <item>  }
  token name      { <-[\n]>+  }
  token quantity  { <.digit>+ }
  token item      { <.alpha>+ % \h+ }
}

While the elves figured they could try to work out what he meant in an action object, they decided it would be more interesting to create a token that they could reuse not just in the grammar, but in any random regex — these elves are crafty!

They wanted to make a new token that they'd call <fuzzy> that could somehow capture Santa's drunken scribblings (can we call his typed list a scribbling?). Regex syntax doesn't actually allow for any kind of fuzzy matching, but here Raku's engine comes to the rescue. So first they created a code block inside of the token. Code blocks are normally defined with just { 🦋 }, but because they needed to define the success of a match, they opted instead for the <?{ 🦋 }> block, which will not only run the code, but will also fail the match if the block returns a false-y value.
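For any elf who hadn't used these assertions before, the difference is easy to see in a throwaway regex (purely an illustration, nothing to do with Santa's list yet):

say so 'raku' ~~ / ra {  True  } ku /;    # True:  a plain block runs code but never affects the match
say so 'raku' ~~ / ra <?{ True  }> ku /;  # True:  the block is truthy, so the assertion passes
say so 'raku' ~~ / ra <?{ False }> ku /;  # False: the block is falsy, so the whole match fails

With that in hand, the skeleton of their token looked like this: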

  token fuzzy {
    (<.alpha>+ % \h+)
    <?{
      # 🦋 code here
    }>
  }

Before they started writing their code, they did two other things. First, they named the capture to make the token a bit easier to maintain down the road. Second, they realized they needed to actually get the list of possible toys into the token somehow, so they added a signature to the token to pass it in.

  token fuzzy(**@toys) {
    $<santa's-text>=(<.alpha>+ % \h+)
    <?{
      # 🦋 code here
    }>
  }

Now they could begin the code itself. They would take Santa’s text, and compare it to each of the possible toys, and decide which one was the closest match:

  token fuzzy(**@toys) {
    $<santa's-text>=(<.alpha>+ % \h+)
    <?{
      my $best = @toys
                   .map({ $^toy, qgram($toy,$<santa's-text>.Str)})
                   .sort( *.tail )
                   .tail;
      say "Santa meant to write {$best[0]}";
    }>
  }

The Q-gram function they used creates N-grams for each word, and compares them to see how many they have in common. With testing they found that the best value for N (the length of each substring) was about half the average length of the two words. Thanks to the way Raku works, writing the Q-gram function was super easy:

  #| Generate space-padded N-grams of length n for string t.
  sub ngrams (\t, \n) {
    my \s = (' ' x (n - 1)) ~ t ~ (' ' x (n - 1));
    do for ^(t.chars + n - 1) { s.substr: $_, n }
  }

  #| Calculate Q-gram score using bag operations
  sub qgram (\a, \b) {
    my \q  = (a.chars + b.chars) div 4;
    my \aₙ = ngrams(a,q).BagHash;
    my \bₙ = ngrams(b,q).BagHash;

    (aₙ ∩ bₙ) / (aₙ ∪ bₙ)      # Jaccard similarity coefficient
  }

Raku let the elves calculate N-grams in just two clean lines of code, and then use those to calculate the Jaccard index between the two strings in just four more easy-to-read lines.
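To get a feel for the two helpers, here is the sort of quick check the elves might run (illustrative output only; the exact grams and scores depend on the computed q):

say ngrams('batts', 2);       # something like ( b ba at tt ts s ), bigrams padded with a space at each end
say qgram('batts', 'bats');   # a comparatively high score: the two words share most of their grams
say qgram('batts', 'dress');  # a much lower score: almost no grams in common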

Putting this back into their grammar, they ended up with the following:

grammar Santa'sList {
  rule TOP        {       <kid's-list>+    }
  rule kid's-list {     <name>     <gift>+ }
  rule gift       { '-' <quantity> <item>  }
  token name      { <-[\n]>+  }
  token quantity  { <.digit>+ }
  token item      { <fuzzy(@gifts)> }
  token fuzzy     { … }
  sub ngrams      { … }
  sub qgram       { … }
}
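
Assuming the elided tokens and subs get their real bodies, using the grammar might look roughly like this (a sketch; $santas-text is a made-up variable holding Santa's raw list):

# @gifts (the known-toys list) is declared before the grammar so token item can see it:
my @gifts = <baseball gloves bats balls dolls dress basketball>;

# … grammar Santa'sList { … } as above …

my $parsed = Santa'sList.parse($santas-text);
say $parsed<kid's-list>[0]<gift>[0]<item>;   # for now, still Santa's original scribble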

That's a pretty handy format, but an important problem remains: how do they get access to the best-matched text? If they were to match and request, say, $<kid's-list>[0]<gift>[0]<item>, they would only get Santa's original illegible mess. They could use an actions class, but that requires doing a parse with actions, which ties the fuzzy token to the vagaries of grammar parsing. That works fine here, but… it's less reusable.

But elves are good at packing and wrapping. They decide to make a package that wraps the fuzzy token so that both Santa's original and the corrected version are easily accessible in a DWIM manner. This ‘package’ can't be declared with package or module, though, because the wrapping process requires using the special sub EXPORT. Their basic process looks like the following:

sub EXPORT {
  # Make the fuzzy token in the elves' factory
  my token fuzzy (**@toys) { … }

  # Wrap it in wrapping paper (apply a role) so it's prettier (easier to use)
  &fuzzy.wrap( … );

  # Ship out (export) the wrapped version
  %( '&fuzzy' => &fuzzy )
}
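
As a quick refresher on the mechanism itself (a toy example, unrelated to the fuzzy token): whatever pairs the sub EXPORT returns are the symbols that get installed into the user's scope.

# Greetings.rakumod, a hypothetical compunit consisting only of a sub EXPORT
sub EXPORT {
  my sub greet ($name) { say "Happy Christmas, $name!" }
  %( '&greet' => &greet )
}

# and in the code that uses it:
#   use Greetings;
#   greet 'Rudolph';    # Happy Christmas, Rudolph!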

Any other special tools the elves need can be included in the EXPORT block, for example the Q- and N-gram functions. So how will they actually do the wrapping? First, they design the paper, that is, a role that overrides .Str to give the clean/corrected value (also available as a .clear method), while providing a .fuzz method for access to Santa's original text:

  role Fuzzy[$clear, $fuzz] {
    method Str   { $clear }
    method clear { $clear }
    method fuzz  { $fuzz  }
  }
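
The wrapping paper can be tried out on its own by mixing the role into any plain string with but (the values here are made up for illustration):

  my $item = 'bsaeball gluvs' but Fuzzy['baseball gloves', 'bsaeball gluvs'];
  say ~$item;        # baseball gloves   (the corrected text, via .Str)
  say $item.clear;   # baseball gloves
  say $item.fuzz;    # bsaeball gluvs    (Santa's original)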

Now, the wrapped function could look something like the following:

  &fuzzy.wrap(
    sub (|) {
      my $match = callsame;

      # Failed match evals to false, and is just passed along
      # Successful match gets Fuzzy role mixed in.
      $match
        ?? $match but Fuzzy[$match.??, $match.??]
        !! $match
    }
  );

There's a small problem: the results of the calculations they ran inside of the token aren't available. One solution they thought of involved adding new parameters to the fuzzy token with the is raw trait so that the values could be passed back, but that felt like something the old C++ elves would do. No, Santa's Raku elves had a better idea: dynamic variables. They made two of them, and refactored the original fuzzy token to assign to them:

  my token fuzzy(**@toys) {
    $<santa's-text>=(<.alpha>+ % \h+)
    <?{
      my $best = @toys
                  .map({ $^toy, qgram($toy,$<santa's-text>.Str)})
                  .sort( *.tail )
                  .tail;
      $*clear = $best[0];
      $*fuzz  = ~$<santa's-text>;
    }>
  }

  &fuzzy.wrap(
    sub (|) {
      my $*fuzz;
      my $*clear;

      my $match = callsame;   # sets $match to result of the original

      $match
        ?? $match but Fuzzy[$*clear, $*fuzz]
        !! $match
    }
  );
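
For elves a bit rusty on dynamic variables, the trick is that a $*-variable declared with my in a caller is visible, via dynamic lookup, to everything further down the call chain, which is exactly how the wrapper and the token communicate here. A minimal sketch with made-up names:

sub stamp-letter { $*status = 'stamped' }   # assigns to whoever declared $*status up the call chain

sub mail-room {
  my $*status;        # declared in the caller
  stamp-letter;       # the callee finds it by dynamic lookup and assigns to it
  say $*status;       # OUTPUT: stamped
}

mail-room;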

They did a test with some values and all went well, until an item wasn't found:

"I like the Norht Pole" ~~ /I like the $<dir>=<fuzzy: <North South> Pole>/;
say $<dir>.clear;   # --> "North"
say $<dir>.fuzz;    # --> "Norht"

"I like the East Pole" ~~ /I like the $<dir>=<fuzzy: <North South> Pole>/;
say $<dir>.clear;   # --> "North"
say $<dir>.fuzz;    # --> "East"

What happened? The elves realized that their token was matching no matter what. This is because the <?{ 🦋 }> block will only fail if it returns a false-y value, and the last statement, being an assignment of a string, will virtually always be truthy. To fix this, they added a simple conditional to the end of the block to fail if the Q-gram score wasn't sufficiently high.

  my token fuzzy(**@toys) {
    $<santa's-text>=(<.alpha>+ % \h+)
    <?{
      my $best = @toys
                   .map({ $^toy, qgram($toy,$<santa's-text>.Str)})
                   .sort( *.tail )
                   .tail;

      $*clear = $best[0];
      $*fuzz  = ~$<santa's-text>;

      # Arbitrary but effective score cut off.
      $best[1] > 0.33
    }>
  }

With that, they were done, and able to process Santa's horrible typing.

Of course, there were a lot of improvements that the elves could still make to make their fuzzy token more useful. After they had made use of it (and taken the eggnog away from Santa so they wouldn't need it), they polished it up so that it could bring joy to everyone.

With that, I can also announce the release of Regex::FuzzyToken. To use it, just do like the elves: in a grammar or any other code, say use Regex::FuzzyToken, and the token fuzzy will be imported into your current scope. It has a few extra features, so take a look at its README for information on some of its options.

While not everyone will use or have need of a fuzzy token, I hope that this shows off some interesting possibilities when creating tokens that might be better defined programmatically. For example, the CLDR includes rule-based number formats (RBNF) which, if implemented in a similar way as a token, could match the number "three" (or also "3") and produce "three" in string contexts but 3 in numeric contexts.
