ML::TriesWithFrequencies
Raku ML::TriesWithFrequencies
This Raku package has functions for creation and manipulation of Tries (Prefix trees) with frequencies.
The package provides Machine Learning (ML) functionalities, not "just" a Trie data structure.
This Raku implementation closely follows the Java implementation [AAp3].
The system of function names follows the one used in the Mathematica package [AAp2].
Remark: Below Mathematica and Wolfram Language (WL) are used as synonyms.
Remark: There is a Raku package with an alternative implementation, [AAp6],
made mostly for comparison studies. (See the implementation notes below.)
The package in this repository, ML::TriesWithFrequencies, is my primary Tries-with-frequencies package.
Usage
Consider a trie (prefix tree) created over a list of words:
use ML::TriesWithFrequencies;
my $tr = trie-create-by-split( <bar bark bars balm cert cell> );
trie-say($tr);
# TRIEROOT => 6
# ├─b => 4
# │ └─a => 4
# │   ├─l => 1
# │   │ └─m => 1
# │   └─r => 3
# │     ├─k => 1
# │     └─s => 1
# └─c => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─r => 1
#       └─t => 1
Here we convert the trie with frequencies above into a trie with probabilities:
my $ptr = trie-node-probabilities( $tr );
trie-say($ptr);
# TRIEROOT => 1
# ├─b => 0.6666666666666666
# │ └─a => 1
# │   ├─l => 0.25
# │   │ └─m => 1
# │   └─r => 0.75
# │     ├─k => 0.3333333333333333
# │     └─s => 0.3333333333333333
# └─c => 0.3333333333333333
#   └─e => 1
#     ├─l => 0.5
#     │ └─l => 1
#     └─r => 0.5
#       └─t => 1
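As a rough illustration of the conversion above: each node's probability is its frequency divided by the frequency of its parent, and the root gets 1 (so b gets 4/6 and r gets 3/4). The sketch below applies that rule to a plain nested hash that mirrors the trie above. The sub node-probabilities-sketch and the nested-hash layout are assumptions made for illustration only; this is not the package's implementation.

```raku
# Hypothetical helper (not the package's code): divides each node's frequency
# by its parent's frequency; the root is divided by itself and so gets 1.
sub node-probabilities-sketch(%node, $parent-value = %node<TRIEVALUE>) {
    my %result = TRIEVALUE => %node<TRIEVALUE> / $parent-value;
    for %node.keys.grep(* ne 'TRIEVALUE') -> $key {
        %result{$key} = node-probabilities-sketch(%node{$key}, %node<TRIEVALUE>);
    }
    return %result;
}

# Frequencies of the trie over <bar bark bars balm cert cell>, written by hand.
my %trie =
    TRIEVALUE => 6,
    b => { TRIEVALUE => 4,
           a => { TRIEVALUE => 4,
                  l => { TRIEVALUE => 1, m => { TRIEVALUE => 1 } },
                  r => { TRIEVALUE => 3, k => { TRIEVALUE => 1 }, s => { TRIEVALUE => 1 } } } },
    c => { TRIEVALUE => 2,
           e => { TRIEVALUE => 2,
                  l => { TRIEVALUE => 1, l => { TRIEVALUE => 1 } },
                  r => { TRIEVALUE => 1, t => { TRIEVALUE => 1 } } } };

my %p = node-probabilities-sketch(%trie);

# Integer division yields exact rationals (Rat) in this sketch.
say %p<TRIEVALUE>;                # 1
say %p<b><a><r><TRIEVALUE>;       # 0.75
say %p<c><e><l><TRIEVALUE>;       # 0.5
```

The values agree with the trie-say output above; the only difference is that the sketch keeps exact rationals where the package prints floating-point numbers.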
Here we shrink the trie with probabilities above:
trie-say(trie-shrink($ptr));
# TRIEROOT => 1
# ├─ba => 0.6666666666666666
# │ ├─lm => 0.25
# │ └─r => 0.75
# │   ├─k => 0.3333333333333333
# │   └─s => 0.3333333333333333
# └─ce => 0.3333333333333333
#   ├─ll => 0.5
#   └─rt => 0.5
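Judging only from the output above, shrinking a trie with probabilities fuses a node with its single child when that child's probability is 1: the keys are concatenated and the upper node's value is kept (b and a become ba, l and m become lm). The following is a minimal sketch of that reading over a nested hash. The helper shrink-node-sketch is hypothetical, and the package's own criterion may differ, for instance for tries with frequencies.

```raku
# Hypothetical helper (not the package's code): fuses a node with its only
# child when that child's probability is 1, concatenating the two keys.
# Returns a Pair of the (possibly fused) key and the shrunk node.
sub shrink-node-sketch(Str $key, %node, $value = %node<TRIEVALUE>) {
    my @child-keys = %node.keys.grep(* ne 'TRIEVALUE');
    if @child-keys.elems == 1 && %node{@child-keys[0]}<TRIEVALUE> == 1 {
        my $only = @child-keys[0];
        # Keep the value of the upper node and continue with the child.
        return shrink-node-sketch($key ~ $only, %node{$only}, $value);
    }
    my %result = TRIEVALUE => $value;
    for @child-keys -> $child-key {
        my $shrunk = shrink-node-sketch($child-key, %node{$child-key});
        %result{$shrunk.key} = $shrunk.value;
    }
    return Pair.new($key, %result);
}

# The "b" branch of the trie with probabilities shown earlier, written by hand.
my %b =
    TRIEVALUE => 2/3,
    a => { TRIEVALUE => 1,
           l => { TRIEVALUE => 0.25, m => { TRIEVALUE => 1 } },
           r => { TRIEVALUE => 0.75, k => { TRIEVALUE => 1/3 }, s => { TRIEVALUE => 1/3 } } };

my $shrunk = shrink-node-sketch('b', %b);
say $shrunk.key;                      # ba
say $shrunk.value.keys.sort;          # (TRIEVALUE lm r)
say $shrunk.value<lm><TRIEVALUE>;     # 0.25
```

The sketch is applied to a branch below the root, since the root itself stays a separate node in the output above.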
Here we retrieve a sub-trie with a key:
trie-say(trie-retrieve($ptr, 'bar'.comb))
# r => 0.75
# ├─k => 0.3333333333333333
# └─s => 0.3333333333333333
Representation
Each trie is a tree of objects of the class ML::TriesWithFrequencies::Trie.
Such trees can be nicely represented as hash-maps. For example:
my $tr = trie-shrink(trie-create-by-split(<core cort>));
say $tr.gist;
# {TRIEROOT => {TRIEVALUE => 2, cor => {TRIEVALUE => 2, e => {TRIEVALUE => 1}, t => {TRIEVALUE => 1}}}}
The function trie-say uses that Hash-representation:
trie-say($tr)
# TRIEROOT => 2
# └─cor => 2
#   ├─e => 1
#   └─t => 1
JSON
The JSON-representation follows the inherent object-tree representation with ML::TriesWithFrequencies::Trie:
say $tr.JSON;
# {"key":"TRIEROOT", "value":2, "children":[{"key":"cor", "value":2, "children":[{"key":"t", "value":1, "children":[]}, {"key":"e", "value":1, "children":[]}]}]}
XML
The XML-representation follows (resembles) the Hash-representation (and the output of trie-say):
say $tr.XML;
# <TRIEROOT>
# <TRIEVALUE>2</TRIEVALUE>
# <cor>
# <TRIEVALUE>2</TRIEVALUE>
# <t>
# <TRIEVALUE>1</TRIEVALUE>
# </t>
# <e>
# <TRIEVALUE>1</TRIEVALUE>
# </e>
# </cor>
# </TRIEROOT>
Using the XML representation allows for XPath searches, say, using the package XML::XPath. Here is an example:
use XML::XPath;
my $tr0 = trie-create-by-split(<bell best>);
trie-say($tr0);
# TRIEROOT => 2
# └─b => 2
#   └─e => 2
#     ├─l => 1
#     │ └─l => 1
#     └─s => 1
#       └─t => 1
Convert to XML:
say $tr0.XML;
# <TRIEROOT>
# <TRIEVALUE>2</TRIEVALUE>
# <b>
# <TRIEVALUE>2</TRIEVALUE>
# <e>
# <TRIEVALUE>2</TRIEVALUE>
# <s>
# <TRIEVALUE>1</TRIEVALUE>
# <t>
# <TRIEVALUE>1</TRIEVALUE>
# </t>
# </s>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# </l>
# </l>
# </e>
# </b>
# </TRIEROOT>
Search for <b e l>:
say XML::XPath.new(xml=>$tr0.XML).find('//b/e/l');
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# <l>
# <TRIEVALUE>1</TRIEVALUE>
# </l>
# </l>
WL
The Hash-representation is used in the Mathematica package [AAp2]. Hence, a corresponding WL format is provided by the Raku package:
say $tr.WL;
# <|$TrieRoot -> <|$TrieValue -> 2, "cor" -> <|$TrieValue -> 2, "t" -> <|$TrieValue -> 1|>, "e" -> <|$TrieValue -> 1|>|>|>|>
Cloning
All trie-* functions and ML::TriesWithFrequencies::Trie methods that manipulate tries produce trie clones.
For performance reasons I considered having in-place trie manipulations, but that, of course, confuses reasoning in development, testing, and usage. Hence, ubiquitous cloning.
Two styles of pipelining
As mentioned above, the package was initially developed with the functional programming design of the Mathematica package [AAp2]. With that design and the feed operator ==> we can construct pipelines like this one:
my @words2 = <bar barman bask bell belly>;
my @words3 = <call car cast>;
trie-create-by-split(@words2)==>
trie-merge(trie-create-by-split(@words3))==>
trie-node-probabilities==>
trie-shrink==>
trie-say
# TRIEROOT => 1
# ├─b => 0.625
# │ ├─a => 0.6
# │ │ ├─r => 0.6666666666666666
# │ │ │ └─man => 0.5
# │ │ └─sk => 0.3333333333333333
# │ └─ell => 0.4
# │   └─y => 0.5
# └─ca => 0.375
#   ├─ll => 0.3333333333333333
#   ├─r => 0.3333333333333333
#   └─st => 0.3333333333333333
The package also supports "dot pipelining" through chaining of methods:
@words2.&trie-create-by-split
    .merge(@words3.&trie-create-by-split)
    .node-probabilities
    .shrink
    .form
# TRIEROOT => 1
# ├─b => 0.625
# │ ├─a => 0.6
# │ │ ├─r => 0.6666666666666666
# │ │ │ └─man => 0.5
# │ │ └─sk => 0.3333333333333333
# │ └─ell => 0.4
# │   └─y => 0.5
# └─ca => 0.375
#   ├─ll => 0.3333333333333333
#   ├─r => 0.3333333333333333
#   └─st => 0.3333333333333333
Remark: The trie-* functions are implemented through the methods of ML::TriesWithFrequencies::Trie.
Given a method, the corresponding function is derived by adding the prefix trie-.
(For example, $tr.shrink vs trie-shrink($tr).)
Here is the previous pipeline re-written to use only methods of ML::TriesWithFrequencies::Trie:
ML::TriesWithFrequencies::Trie.create-by-split(@words2)
    .merge(ML::TriesWithFrequencies::Trie.create-by-split(@words3))
    .node-probabilities
    .shrink
    .form
Implementation notes
Performance
This package is a Raku re-implementation of the Java Trie package [AAp3].
The initial implementation was:
≈ 5-6 times slower than the Mathematica implementation [AAp2]
≈ 100 times slower than the Java implementation [AAp3]
The initial implementation used:
General types for Trie nodes, i.e. Str for the key and Numeric for the value
Argument type verification with where statements in the signatures of the trie-* functions
After reading [RAC1] I refactored the code to use native types (num, str) and moved the where verifications inside the functions.
I also refactored the function trie-merge to copy less data and to take into account which of the two tries has the smaller number of children.
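The difference between the two signature styles can be illustrated with a minimal sketch. The subs node-general and node-native below are hypothetical and are not taken from the package; they only contrast a signature that uses general types with a where clause against one that uses native types and performs the same check inside the body.

```raku
# Hypothetical "before" version: general types and a `where` clause
# that is checked as part of binding the signature.
sub node-general(Str $key, Numeric $value where * >= 0) {
    return "$key => $value";
}

# Hypothetical "after" version: native str/num parameters, with the
# same verification moved into the body of the sub.
sub node-native(str $key, num $value) {
    die "The value is expected to be non-negative." if $value < 0e0;
    return "$key => $value";
}

say node-general('bar', 3);      # bar => 3
say node-native('bar', 3e0);     # bar => 3
```

Note that a num parameter expects a floating-point argument, which is why the second call passes 3e0 rather than the integer 3.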
After those changes the current Raku implementation is:
≈ 2.5 times slower than the Mathematica implementation [AAp2]
≈ 40 times slower than the Java implementation [AAp3]
After the (monumental) work on the new MoarVM dispatch mechanism, [JW1], was incorporated into the standard Rakudo releases (September/October 2021), an additional 20% speed-up was obtained. Currently this package is:
≈ 2.0 times slower than the Mathematica implementation [AAp2]
≈ 30 times slower than the Java implementation [AAp3]
These speed improvements are definitely not satisfactory. I am strongly considering:
1. Re-implementing in Raku the Mathematica package [AAp2], i.e. moving to tries that are hashes.
(It turned out that option 1 does not produce better results; see [AAp6].)
2. Re-implementing in C or C++ the Java package [AAp3] and hooking it up to Raku.
Moving from FP design to OOP design
The initial versions of the package -- up to version 0.5.0 -- had only exported functions in the namespace ML::TriesWithFrequencies, all with the prefix trie-.
Those functions came from a purely Functional Programming (FP) design.
In order to get the chains of Object Oriented Programming (OOP) method applications that are typical in Raku programming, the package versions after version 0.6.0 have trie manipulation and transformation methods in the class ML::TriesWithFrequencies::Trie.
In order to get trie-class methods a fairly fundamental code refactoring was required. Here are the steps:
1. The old class ML::TriesWithFrequencies::Trie was made into the role ML::TriesWithFrequencies::Trieish.
2. The traversal and remover classes were made to use the ML::TriesWithFrequencies::Trieish type instead of ML::TriesWithFrequencies::Trie.
3. The trie function implementations -- with the prefix "trie-" -- of ML::TriesWithFrequencies were moved as method implementations into ML::TriesWithFrequencies::Trie.
4. The trie functions in ML::TriesWithFrequencies were reimplemented using the methods of ML::TriesWithFrequencies::Trie.
Remark: See the section "Two styles of pipelining" above for illustrations of the two approaches.
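The layering described above (a role, a class that does the role, and trie- functions that delegate to methods) can be sketched in miniature as follows. The names TrieishSketch, TrieSketch, count-nodes, and trie-count-nodes are hypothetical and chosen only for this illustration; the package's actual role and class have more attributes and methods.

```raku
# Hypothetical role holding the node data; helper code such as
# traversal classes can be typed against the role.
role TrieishSketch {
    has Str $.key;
    has Numeric $.value;
    has %.children;
}

# The class does the role and carries the method implementations.
class TrieSketch does TrieishSketch {
    method count-nodes() {
        return 1 + %.children.values.map(*.count-nodes).sum;
    }
}

# The function is a thin wrapper around the method:
# the method name with the prefix "trie-".
sub trie-count-nodes(TrieishSketch $tr) {
    return $tr.count-nodes;
}

my $tr = TrieSketch.new(
    key      => 'cor',
    value    => 2,
    children => %(
        e => TrieSketch.new(key => 'e', value => 1),
        t => TrieSketch.new(key => 't', value => 1)
    )
);

say $tr.count-nodes;          # 3
say trie-count-nodes($tr);    # 3
```

With this arrangement the function and the method always agree, because the function contains no logic of its own.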
TODO
In the following list the most important items are placed first.
DONE Implement "get words" and "get root-to-leaf paths" functions.
    See trie-words and trie-root-to-leaf-paths.
DONE Convert most of the WL unit tests in [AAp5] into Raku tests.
DONE Implement Trie traversal functions.
    The general trie-map function is in a separate role.
    A concrete traversal functionality is a class that does the role and provides additional context.
DONE Implement (sub-)trie removal functions.
    DONE By threshold (below and above)
    DONE By Pareto principle adherence (top and bottom)
    DONE By regex over the keys
TODO Implement optional ULP spec argument for relevant functions:
    DONE trie-root-to-leaf-paths
    DONE trie-words
    TODO Membership test functions?
DONE Design and code refactoring so that trie objects have an OOP interface.
    Instead of just having trie-words($tr, <c>) we should also be able to say $tr.trie-words(<c>).
TODO Implement trie-prune function.
TODO Implement Trie-based classification.
TODO Investigate faster implementations.
    DONE Re-implement the Trie functionalities using hash representation (instead of a tree of Trie-node objects.)
        See [AAp6].
    TODO Make a C or C++ implementation and hook it up to Raku.
TODO Document examples of doing Trie-based text mining or data-mining.
TODO Program a trie-form visualization that is "wide", i.e. places the children nodes horizontally.
References
Articles
[AA1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, "Removal of sub-trees in tries", (2013), MathematicaForPrediction at WordPress.
[AA3] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress. GitHub Markdown.
[JW1] Jonathan Worthington, "The new MoarVM dispatch mechanism is here!", (2021), 6guts at WordPress.
[RAC1] Tib, "Day 10: My 10 commandments for Raku performances", (2020), Raku Advent Calendar.
[WK1] Wikipedia entry, Trie.
Packages
[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.
[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.
[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.
[AAp6] Anton Antonov, ML::HashTriesWithFrequencies Raku package, (2021), GitHub/antononcube.
Videos
[AAv1] Anton Antonov, "Prefix Trees with Frequencies for Data Analysis and Machine Learning", (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.