Lingua::Stem::Russian

Package for stemming Russian words.

Lingua::Stem::Russian Raku package

Introduction

This Raku package is for stemming Russian words. It implements the Snowball algorithm presented in [SNa1].

Usage examples

The RussianStem function is used to find stems:

use Lingua::Stem::Russian;
say RussianStem('всходы')

# всход

RussianStem also works with lists of words:

say RussianStem('Всходы урожая ожидаются с терпением, питьем и беконом.'.words)

# (Всход урож ожида с терпением, пит и беконом.)

The function russian-word-stem can be used as a synonym of RussianStem.
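
As a minimal sketch, assuming the package is installed and exports both names as described above, the two calls below should print the same stem:

```raku
use Lingua::Stem::Russian;

# Both names are documented above as synonyms, so the two results should agree.
say RussianStem('всходы');
say russian-word-stem('всходы');
```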

Command Line Interface (CLI)

The package provides the CLI function RussianStem. Here is its usage message:

RussianStem --help

# Usage:
#   RussianStem <text> [--splitter=<Str>] [--format=<Str>] -- Finds stems of Russian words in text.
#   RussianStem [<words> ...] [--format=<Str>] -- Finds stems of Russian words.
#   RussianStem [--format=<Str>] -- Finds stems of Russian words in (pipeline) input.
#
#     <text>              Text to spilt and its words stemmed.
#     --splitter=<Str>    String to make a split regex with. [default: '\W+']
#     --format=<Str>      Output format one of 'text', 'lines', or 'raku'. [default: 'text']
#     [<words> ...]       Words to be stemmed.

Here are example shell commands that use the CLI function RussianStem:

RussianStem Какие

# Как

RussianStem --format=raku "Модуль Raku, предоставляющий процедуру для русского языка."

# ["Модул", "Raku", "предоставля", "процедур", "для", "русск", "язык", ""]

RussianStem Проверить корректность подбора по словарям и правилам

# Провер корректност подбор по словар и правил
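
The empty last element in the --format=raku output above is consistent with the input text ending in a period: splitting on the default splitter '\W+' leaves an empty string after the final delimiter. The sketch below shows that behavior in plain Raku, without the package. How the CLI actually builds its regex from --splitter is an assumption here, and the :skip-empty variant is only an illustration, not a documented CLI option.

```raku
# Plain Raku illustration; it does not call the package or the CLI.
my $text = 'Модуль Raku, предоставляющий процедуру для русского языка.';

# Split on runs of non-word characters (the CLI default splitter is '\W+').
my @parts = $text.split(/\W+/);
say @parts.elems;          # 8: seven words plus one trailing empty string
say @parts.tail eq '';     # True

# Dropping empty pieces at split time avoids the trailing "".
my @words = $text.split(/\W+/, :skip-empty);
say @words.elems;          # 7
```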

Here is a pipeline example using the CLI function get-tokens of the package "Grammar::TokenProcessing", [AAp1]:

get-tokens ./DataQueryPhrases-template | RussianStem --format=raku

# ("ассоциац", "ассоциирован", "ассоциирова", "безопасн", "восходя", "выбер", "заказа", "комбайн", "крестообразн",
#  "поверхност", "мутирова", "обзор", "обобщ", "переименова", "пол", "просмотрет", "разгруппирова", "разделител",
#  "распла", "расстав", "символ", "слит", "слиян", "сплит", "табулирова", "тольк", "убыва", "уверен", "форм",
#  "формат", "формирова", "формул", "широк")

Remark: This kind of token (literal) transformation is used in the packages "DSL::Bulgarian", [AAp2], and "DSL::Russian", [AAp3].

Implementation notes

Reprogrammed in Raku from the Perl module Lingua::Stem::Ru: https://github.com/neilb/Lingua-Stem-Ru/blob/master/lib/Lingua/Stem/Ru.pm .

TODO

DONE Respect the word case in the returned result.
- RussianStem('ТАБЛА') should return 'ТАБЛ'.
- (Not 'табл', as it did before this item was done.)
DONE CLI that can be inserted in UNIX pipelines.
TODO Performance statistics.
TODO More detailed documentation.
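
For the case-preservation item above, one possible approach can be sketched as follows. The helper restore-case is hypothetical and is not part of the package; the sketch assumes the stemmer only strips trailing characters, so that the stem lines up with the start of the original word.

```raku
# Hypothetical helper, not part of Lingua::Stem::Russian.
# Assumes the lower-case stem matches the start of the lower-cased original,
# so the original's leading characters carry the wanted case.
sub restore-case(Str $original, Str $lower-stem --> Str) {
    $original.substr(0, $lower-stem.chars)
}

say restore-case('ТАБЛА', 'табл');    # ТАБЛ
say restore-case('Всходы', 'всход');  # Всход
```

Taking the prefix of the original word avoids any per-character case logic, but it would not be correct for a stemmer that rewrites characters inside the stem.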