Lingua::Stem::Russian
Lingua::Stem::Russian Raku package
Introduction
This Raku package is for stemming Russian words. It implements the Snowball algorithm presented in [SNa1].
Usage examples
The RussianStem
function is used to find stems:
use Lingua::Stem::Russian;
say RussianStem('всходы')
# всход
RussianStem
also works with lists of words:
say RussianStem('Всходы урожая ожидаются с терпением, питьем и беконом.'.words)
# (Всход урож ожида с терпением, пит и беконом.)
The function russian-word-stem
can be used as a synonym of RussianStem
.
Command Line Interface (CLI)
The package provides the CLI function RussianStem
. Here is its usage message:
RussianStem --help
# Usage:
# RussianStem <text> [--splitter=<Str>] [--format=<Str>] -- Finds stems of Russian words in text.
# RussianStem [<words> ...] [--format=<Str>] -- Finds stems of Russian words.
# RussianStem [--format=<Str>] -- Finds stems of Russian words in (pipeline) input.
#
# <text> Text to spilt and its words stemmed.
# --splitter=<Str> String to make a split regex with. [default: '\W+']
# --format=<Str> Output format one of 'text', 'lines', or 'raku'. [default: 'text']
# [<words> ...] Words to be stemmed.
Here are example shell commands of using the CLI function RussianStem
:
RussianStem Какие
# Как
RussianStem --format=raku "Модуль Raku, предоставляющий процедуру для русского языка."
# ["Модул", "Raku", "предоставля", "процедур", "для", "русск", "язык", ""]
RussianStem Проверить корректность подбора по словарям и правилам
# Провер корректност подбор по словар и правил
Here is a pipeline example using the CLI function get-tokens
of the package
"Grammar::TokenProcessing",
[AAp1]:
get-tokens ./DataQueryPhrases-template | RussianStem --format=raku
# ("ассоциац", "ассоциирован", "ассоциирова", "безопасн", "восходя", "выбер", "заказа", "комбайн", "крестообразн",
# "поверхност", "мутирова", "обзор", "обобщ", "переименова", "пол", "просмотрет", "разгруппирова", "разделител",
# "распла", "расстав", "символ", "слит", "слиян", "сплит", "табулирова", "тольк", "убыва", "уверен", "форм",
# "формат", "формирова", "формул", "широк")
Remark: These kind of tokens (literals) transformations are used in the packages "DSL::Bulgarian", [AAp2], and "DSL::Russian", [AAp3],
Implementation notes
Reprogrammed to Raku from : https://github.com/neilb/Lingua-Stem-Ru/blob/master/lib/Lingua/Stem/Ru.pm .
TODO
DONE Respect the word case in the returned result.
RussianStem('ТАБЛА')
should return'ТАБЛ'
.(Not
'табл'
as it currently does.)
DONE CLI that can be inserted in UNIX pipelines.
TODO Performance statistics.
TODO More detailed documentation.
References
Articles
[SNa1] Snowball Team, Russian stemming algorithm, (2002), snowball.tartarus.org.
Packages
[AAp1] Anton Antonov, Grammar::TokenProcessing Raku package, (2022), GitHub/antononcube.
[AAp2] Anton Antonov, DSL::Bulgarian Raku package, (2022), GitHub/antononcube.
[AAp3] Anton Antonov, DSL::Russian Raku package, (2023), GitHub/antononcube.