parse-cldr
To use this script, simply execute it. By default, it will process the whole of the CLDR.
Because it must load files in a particular order, it is not easily parallelizable. Thus,
it is recommended to run the command on just a few letters at a time. To run on languages
whose codes begin with a, i, and x, use
raku parse-cldr.raku a i x
You must ensure that a copy of the CLDR data is included in the resources folder, renamed to
exclude the version information, that is, under cldr-common (such that the path to English's
data is resources/cldr-common/common/main/en.xml). This is excluded from distribution to
reduce file size, although technically Unicode's license would permit it.
The basic process for generating the data files is as follows:
1. Read the base XML files (en, es, etc.)
2. For each sub-file (en-US, es-ES, etc.), deep copy the base data using C<from-json to-json %hash.raku>,
and apply the sub-file's new data on top of the copy using C<parse>
3. Then, using C<encode>, generate the two data files -- a binary tree file and a strings file.
4. The strings file is interpreted as a giant array, so that the binary file may easily reference
the strings (many strings are repeated, so this saves space).
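
The copy-then-overlay step and the shared strings array can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the subs deep-copy, overlay, and intern are hypothetical names, the data is a toy stand-in for parsed CLDR content, and a recursive copy is used in place of the JSON round trip so that the example needs no modules.

```raku
# Illustrative only: these subs are hypothetical, not the script's own routines.

# Recursively copy nested hashes so edits to the copy leave the base intact.
sub deep-copy($node) {
    $node ~~ Map
        ?? %( $node.map: -> $pair { $pair.key => deep-copy($pair.value) } )
        !! $node
}

# Apply %new on top of %base, descending into nested hashes.
sub overlay(%base, %new) {
    for %new.kv -> $key, $value {
        if $value ~~ Map and %base{$key} ~~ Map {
            overlay %base{$key}, $value;
        }
        else {
            %base{$key} = deep-copy $value;
        }
    }
}

# Strings table: each unique string is stored once and referenced by index.
my @strings;
my %index;
sub intern(Str $s) {
    %index{$s} //= do {
        @strings.push: $s;
        @strings.end;      # index of the string just added
    };
}

# Toy data standing in for a base file and a sub-file.
my %en    = months => { jan => 'January', feb => 'February' }, era => 'AD';
my %en-US = deep-copy %en;
overlay %en-US, %( era => 'CE', months => { feb => 'Feb.' } );

# Repeated strings map to the same index, so they are stored only once.
my @refs = ('AD', 'CE', 'AD').map(&intern);
say @refs;       # [0 1 0]
say @strings;    # [AD CE]
```

Because the base hash is copied before the overlay, %en is left unchanged and can be reused for the next sub-file.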
The <alias> tags only exist in root, and not in any other language. They are ignored by parse, and the
fallback interpretations are generally handled in the encode methods. This means data may be duplicated
(but duplicate strings are practically free). The slight increase in memory is well worth the speed improvements.
There are a number of subs available to parse via Intl::CLDR::Util::XML-Helper to keep things simple:
- elem $xml, $tag OR $xml.&elem($tag)
Returns a single child element matching the tag when we know there will be only one element. Dies if more than one.
- elems $xml, $tag OR $xml.&elems($tag)
Returns all child elements matching the tag
- contents $xml
Returns the text content of the tag
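
To show the two calling styles side by side, here is a self-contained sketch. It does not use Intl::CLDR::Util::XML-Helper: the Node class and the three subs below are simplified stand-ins written for this example, and returning Nil when no child matches is an assumption rather than documented behaviour of the real elem.

```raku
# Stand-in node type; the real helpers operate on parsed XML elements.
class Node {
    has Str  $.tag;
    has Str  $.text = '';
    has Node @.children;
}

# All child elements with the given tag.
sub elems(Node $xml, Str $tag) { $xml.children.grep: *.tag eq $tag }

# The single child element with the given tag; dies if more than one matches.
sub elem(Node $xml, Str $tag) {
    my @found = elems $xml, $tag;
    die "Expected at most one '$tag' element, found {@found.elems}" if @found > 1;
    @found.head
}

# Text content of the element.
sub contents(Node $xml) { $xml.text }

my $months = Node.new: tag => 'months', children => [
    Node.new(tag => 'month', text => 'January'),
    Node.new(tag => 'month', text => 'February'),
];
my $dates = Node.new: tag => 'dates', children => [ $months ];

# Sub-call style and method-like .& style are interchangeable.
say elem($dates, 'months').tag;
say $dates.&elem('months').tag;
say $months.&elems('month').map(&contents);
```

The .& form lets calls be chained left to right, which tends to read more naturally when walking several levels down a tree.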