README-work

Raku ML::Clustering

This repository contains a Raku package that provides Machine Learning (ML) clustering (cluster analysis) functions, [Wk1].

The clustering framework includes the K-means and K-medoids algorithms; the Euclidean, Cosine, Hamming, and Manhattan distance functions, among others; and their corresponding similarity functions.
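To make the distance functions concrete, here is a minimal sketch of three of them in plain Raku. The sub names below are hypothetical and chosen only for illustration; they are not the package's API, and the package's own definitions may differ in details such as normalization.

```raku
# Illustrative sketch only; these sub names are hypothetical,
# not the functions exported by ML::Clustering.

# Euclidean distance: square root of the sum of squared coordinate differences.
sub euclidean-distance(@a, @b) { (@a Z- @b).map(* ** 2).sum.sqrt }

# Manhattan distance: sum of absolute coordinate differences.
sub manhattan-distance(@a, @b) { (@a Z- @b).map(*.abs).sum }

# Cosine distance: one minus the cosine of the angle between the vectors.
# Assumes neither vector is all zeros.
sub cosine-distance(@a, @b) {
    my $dot    = (@a Z* @b).sum;
    my $norm-a = @a.map(* ** 2).sum.sqrt;
    my $norm-b = @b.map(* ** 2).sum.sqrt;
    1 - $dot / ($norm-a * $norm-b)
}

say euclidean-distance((0, 0), (3, 4));   # 5
say manhattan-distance((0, 0), (3, 4));   # 7
say cosine-distance((1, 0), (0, 1));      # 1
```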

The data in the examples below is generated and manipulated with the packages "Data::Generators", "Data::Reshapers", and "Data::Summarizers", described in the article "Introduction to data wrangling with Raku", [AA1].

The plots are made with the package "Text::Plot", [AAp6].

Installation

Via zef-ecosystem:

zef install ML::Clustering

From GitHub:

zef install https://github.com/antononcube/Raku-ML-Clustering

Cluster finding

Here we generate a set of random points and summarize it:

use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)

Here we plot the points:

use Text::Plot;
text-list-plot(@data3)

Problem: Group the points in such a way that each group has close (or similar) points.

Here is how we use the function find-clusters to give an answer:

use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems

Remark: The function find-clusters can return results of different types, controlled by the named argument "prop". Using prop => 'All' returns a hash with all properties of the cluster finding result.

Here are sample points from each found cluster:

.say for %res<Clusters>>>.pick(3);

Here are the centers of the clusters (the mean points):

%res<MeanPoints>

We can verify the result by looking at the plot of the found clusters:

text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')

Remark: By default, find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method=>'K-means' and method=>'K-medoids', respectively.
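For readers unfamiliar with K-means, the following is a rough sketch of the standard (Lloyd's) iteration in plain Raku. It is not the package's implementation: the sub names are hypothetical, the initial centers are simply the first k points (an assumption made to keep the example deterministic), and only the result keys Clusters and MeanPoints mirror the ones used above.

```raku
# Illustrative sketch only; all names are hypothetical, not ML::Clustering's API.

# Squared Euclidean distance between two points.
sub sq-dist(@a, @b) { (@a Z- @b).map(* ** 2).sum }

# Coordinate-wise mean of a non-empty list of points.
sub mean-point(@points) {
    my $n = @points.elems;
    (^@points[0].elems).map(-> $j { @points.map({ .[$j] }).sum / $n }).List
}

sub k-means-sketch(@points, Int $k, Int :$max-steps = 100) {
    # Assumption: the first $k points serve as the initial centers.
    my @centers = @points.head($k);
    my @labels;
    for ^$max-steps {
        # Assignment step: label each point with the index of its nearest center.
        my @new-labels = @points.map(-> $p {
            @centers.keys.min({ sq-dist($p, @centers[$_]) })
        });
        # Stop when the assignment no longer changes.
        last if @new-labels eqv @labels;
        @labels = @new-labels;
        # Update step: move each center to the mean of its assigned points.
        for ^$k -> $c {
            my @members = @points.keys.grep({ @labels[$_] == $c }).map({ @points[$_] });
            @centers[$c] = mean-point(@members) if @members;
        }
    }
    # Collect the points of each cluster.
    my @clusters = (^$k).map(-> $c {
        @points.keys.grep({ @labels[$_] == $c }).map({ @points[$_] }).List
    });
    return %( Clusters => @clusters, MeanPoints => @centers );
}

my @pts = (1, 1), (9, 9), (1, 2), (10, 10), (2, 1);
my %sketch = k-means-sketch(@pts, 2);
say %sketch<Clusters>.map(*.elems);   # cluster sizes: 3 and 2
say %sketch<MeanPoints>;              # one center near (1.33, 1.33), the other at (9.5, 9.5)
```

K-medoids follows the same assign-and-update pattern, except that each center is restricted to be one of the data points.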

Implementation considerations

UML diagram

Here is a UML diagram that shows the package's structure:

(Image: class diagram, ./resources/class-diagram.png)

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp6].

Here we get the PlantUML spec:

to-uml-spec ML::Clustering > ./resources/class-diagram.puml

Here we get the diagram:

to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Remark: It may be a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found it to be necessary at this point of development.