Lingua::Stem::Portuguese

Package for stemming Portuguese words.

Lingua::Stem::Portuguese Raku package

Introduction

This Raku package is for stemming Portuguese words. It implements the Snowball algorithm presented in [SNa1].

Usage examples

The PortugueseStem function is used to find stems:

use Lingua::Stem::Portuguese;
say PortugueseStem('brotação')

# brot

PortugueseStem also works with lists of words:

say PortugueseStem('Os brotos são aguardados com paciência, bebida e bacon.'.words)

# (Os brot sao aguard com paciencia, beb e bacon.)
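
A common use of stemming is grouping word forms that share a stem. Here is a minimal sketch of that idea, not something the package provides: it assumes only the exported PortugueseStem function shown above, and the variable name %by-stem is illustrative.

```raku
use Lingua::Stem::Portuguese;

# Word forms taken from the examples above.
my @words = <brotação brotos aguardados bebida>;

# Group the words by their stem: each key is a stem,
# each value is the list of words that reduce to it.
my %by-stem = @words.classify({ PortugueseStem($_) });

# Print one line per stem, sorted by stem for stable output.
for %by-stem.sort(*.key) -> $pair {
    say "{$pair.key}: {$pair.value.join(', ')}";
}
```

Given the stems shown above, brotação and brotos should both end up under the key brot.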

The function portuguese-word-stem can be used as a synonym of PortugueseStem.
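
A short sketch of the synonym in use, assuming portuguese-word-stem is exported by default in the same way as PortugueseStem:

```raku
use Lingua::Stem::Portuguese;

my $word = 'brotação';

# Stem the word through the synonym.
say portuguese-word-stem($word);

# Compare the results of the two names for the same input.
say portuguese-word-stem($word) eq PortugueseStem($word);
```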

Command Line Interface (CLI)

The package provides the CLI function PortugueseStem. Here is its usage message:

PortugueseStem --help

# Usage:
#   PortugueseStem <text> [--splitter=<Str>] [--format=<Str>] -- Finds stems of Portuguese words in text.
#   PortugueseStem [<words> ...] [--format=<Str>] -- Finds stems of Portuguese words.
#   PortugueseStem [--format=<Str>] -- Finds stems of Portuguese words in (pipeline) input.
#
#     <text>              Text to spilt and its words stemmed.
#     --splitter=<Str>    String to make a split regex with. [default: '\W+']
#     --format=<Str>      Output format one of 'text', 'lines', or 'raku'. [default: 'text']
#     [<words> ...]       Words to be stemmed.
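
The --splitter option describes a regex used to split the text into words before stemming. The following Raku sketch illustrates that general idea with the default pattern '\W+'. It is an assumption about the approach, not the CLI's actual code; in particular, it drops empty pieces with :skip-empty, which the CLI may not do.

```raku
use Lingua::Stem::Portuguese;

my $text = 'Módulo Raku que fornece um procedimento para a língua portuguesa.';

# Split on runs of non-word characters, mirroring the default splitter '\W+';
# :skip-empty drops the empty piece left after the final period.
my @words = $text.split(/\W+/, :skip-empty);

# Stem each word individually and print the stems on one line.
my @stems = @words.map({ PortugueseStem($_) });
say @stems.join(' ');
```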

Here are example shell commands that use the CLI function PortugueseStem:

PortugueseStem Boataria

# Boat

PortugueseStem --format=raku "Módulo Raku que fornece um procedimento para a língua portuguesa."

# ["Modul", "Raku", "que", "fornec", "um", "proced", "par", "a", "lingu", "portugu", ""]

PortugueseStem Verificar a exatidão da seleção usando dicionários e regras

# Verific a exatid da selec us dicion e regr

Here is a pipeline example using the CLI function get-tokens of the package "Grammar::TokenProcessing", [AAp1]:

get-tokens ./DataQueryPhrases-template | PortugueseStem --format=raku

Remark: This kind of token (literal) transformation is used in the packages "DSL::Bulgarian", [AAp2], "DSL::Portuguese", [AAp3], and "DSL::Russian", [AAp4].

Implementation notes

Ported to Raku from the Perl module Lingua::PT::Stemmer: https://github.com/neilb/Lingua-PT-Stemmer/blob/master/lib/Lingua/PT/Stemmer.pm

TODO

TODO Respect the word case in the returned result.
- PortugueseStem('TABLADO') should return 'TABL', not 'tabl' as it currently does.
DONE CLI that can be inserted in UNIX pipelines.
TODO Galician stemmer.
TODO Performance statistics.
TODO More detailed documentation.