Raku ML::Clustering
This repository contains a Raku package with Machine Learning (ML) clustering (cluster analysis) functions, [Wk1].
The clustering framework includes the K-means and K-medoids algorithms; the Euclidean, Cosine, Hamming, Manhattan, and other distance functions; and their corresponding similarity functions.
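To make the distance functions named above concrete, here is a minimal plain-Raku sketch of four of them for equal-length vectors. The sub names are hypothetical and chosen only for illustration; they are not the package's API, and the package's own definitions may differ in detail.

```raku
# Plain-Raku sketches of common distance functions over equal-length vectors.
# All sub names are hypothetical; they are NOT ML::Clustering's API.

# Euclidean: square root of the sum of squared coordinate differences.
sub euclidean-distance(@a, @b) { (@a Z- @b).map({ $_ ** 2 }).sum.sqrt }

# Manhattan: sum of absolute coordinate differences.
sub manhattan-distance(@a, @b) { (@a Z- @b).map({ .abs }).sum }

# Hamming: number of positions at which the (numeric) entries differ.
sub hamming-distance(@a, @b) { (@a Z @b).grep(-> ($x, $y) { $x != $y }).elems }

# Cosine: one minus the cosine of the angle between the two vectors.
sub cosine-distance(@a, @b) {
    my $dot   = (@a Z* @b).sum;
    my $norms = @a.map({ $_ ** 2 }).sum.sqrt * @b.map({ $_ ** 2 }).sum.sqrt;
    1 - $dot / $norms
}

say euclidean-distance((0, 0), (3, 4));            # 5
say manhattan-distance((0, 0), (3, 4));            # 7
say hamming-distance((1, 0, 1, 1), (1, 1, 0, 1));  # 2
say cosine-distance((1, 0), (0, 1));               # 1
```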
The data in the examples below is generated and manipulated with the packages "Data::Generators" [AAp2], "Data::Reshapers" [AAp3], and "Data::Summarizers" [AAp4], described in the article "Introduction to data wrangling with Raku", [AA1].
The plots are made with the package "Text::Plot", [AAp6].
Installation
Via zef-ecosystem:
zef install ML::Clustering
From GitHub:
zef install https://github.com/antononcube/Raku-ML-Clustering
Cluster finding
Here we generate a set of random points and summarize it:
use Data::Generators;
use Data::Summarizers;
use Text::Plot;
my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
# +------------------------------+------------------------------+
# | 1 | 0 |
# +------------------------------+------------------------------+
# | Min => 3.0083171759052374 | Min => 2.99152954550737 |
# | 1st-Qu => 5.3582006670159945 | 1st-Qu => 5.691479144933329 |
# | Mean => 7.988502686038186 | Mean => 8.33540004626669 |
# | Median => 9.128772007411778 | Median => 9.57282202992134 |
# | 3rd-Qu => 10.097184372569952 | 3rd-Qu => 10.270801369302994 |
# | Max => 11.966371289590775 | Max => 12.406483118865026 |
# +------------------------------+------------------------------+
Here we plot the points:
use Text::Plot;
text-list-plot(@data3)
# +--------+----------+----------+----------+----------+-----+
# | |
# + * *** *** + 12.00
# | * ** * * * * |
# + * ********** * + 10.00
# | * *** *** |
# + * * * + 8.00
# | * * |
# | * * |
# + * * * * ** + 6.00
# | * * * * |
# + * * * *** ** * * * + 4.00
# | ** * |
# | |
# +--------+----------+----------+----------+----------+-----+
# 4.00 6.00 8.00 10.00 12.00
Problem: Group the points in such a way that each group has close (or similar) points.
Here is how we use the function find-clusters to give an answer:
use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
# (50 30)
Remark: The function find-clusters can return results of different types, controlled with the named argument "prop". Using prop => 'All' returns a hash with all properties of the cluster finding result.
Here are sample points from each found cluster:
.say for %res<Clusters>>>.pick(3);
# ((10.197782234303773 9.782034329953607) (12.406483118865026 10.81078915584907) (9.86594364573218 9.055658551283518))
# ((6.793895295914484 5.994720108035477) (6.864989150334569 5.941695089848203) (3.678466387156798 6.503232710889149))
Here are the centers of the clusters (the mean points):
%res<MeanPoints>
# [(9.795425221191257 9.464239286042913) (4.888221803779357 5.039431353925738)]
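As an illustration of what a "mean point" is, the following sketch computes the coordinate-wise average of a small hand-made cluster. The helper mean-point and the sample points are assumptions made for this example only; they are not part of ML::Clustering.

```raku
# Coordinate-wise mean of a list of points (each point is a list of numbers).
# `mean-point` is a hypothetical helper, not part of ML::Clustering.
sub mean-point(@points) {
    my $dim = @points[0].elems;
    (^$dim).map(-> $i { @points.map(*[$i]).sum / @points.elems }).List
}

# A made-up cluster of three 2D points.
my @cluster = (1, 2), (3, 4), (5, 9);
say mean-point(@cluster);   # (3 5)
```

Assuming %res<Clusters> holds one list of points per cluster, as the output above suggests, applying such a helper to each cluster should give values close to %res<MeanPoints> for a converged K-means run.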
We can verify the result by looking at the plot of the found clusters:
text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
# ▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers
# +-------+-----------+----------+-----------+----------+----+
# + ▽▽ ▽ + 12.00
# | ▽▽▽ ▽ |
# | ▽ ▽ ▽ ▽▽ ▽ ▽ ▽ |
# + ▽ ▽▽▽●▽▽▽▽▽▽ ▽ + 10.00
# | ▽ ▽▽▽▽ ▽▽ |
# + ▽ ▽ ▽ + 8.00
# | ▽ ▽ |
# | ☐ ☐ |
# + ☐ ☐ ☐ ☐ ☐☐ + 6.00
# | ☐ ☐ ☐☐ ☐ ☐ |
# | ☐ ☐ ☐☐☐☐☐ ☐ ☐ |
# + ☐ ☐ ☐ ☐ + 4.00
# | ☐ |
# +-------+-----------+----------+-----------+----------+----+
# 4.00 6.00 8.00 10.00 12.00
Remark: By default find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method => 'K-means' and method => 'K-medoids', respectively.
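For readers unfamiliar with K-means, here is a minimal plain-Raku sketch of one common formulation (Lloyd's iteration): assign each point to its nearest center, move each center to the mean of its points, and repeat until the assignment stops changing. All names below (simple-k-means, squared-distance, mean-point, :$max-steps) are hypothetical, and the initial centers are passed in explicitly to keep the example deterministic; this is not the package's implementation.

```raku
# A minimal sketch of Lloyd's K-means iteration in plain Raku.
# All names here are hypothetical; this is NOT ML::Clustering's implementation.

sub squared-distance(@a, @b) { (@a Z- @b).map({ $_ ** 2 }).sum }

sub mean-point(@points) {
    my $dim = @points[0].elems;
    (^$dim).map(-> $i { @points.map(*[$i]).sum / @points.elems }).List
}

sub simple-k-means(@points, @initial-centers, Int :$max-steps = 100) {
    my @centers = @initial-centers;
    my @labels;
    for ^$max-steps {
        # Assignment step: label each point with the index of its nearest center.
        my @new-labels = @points.map(-> @p {
            @centers.keys.min({ squared-distance(@p, @centers[$_]) })
        });
        # Stop when the assignment no longer changes.
        last if @new-labels eqv @labels;
        @labels = @new-labels;
        # Update step: move each center to the mean of its assigned points.
        for @centers.keys -> $c {
            my @idx     = @labels.grep(* == $c, :k);
            my @members = @points[@idx];
            @centers[$c] = mean-point(@members) if @members;
        }
    }
    %( centers => @centers, labels => @labels )
}

# Two well-separated made-up groups; the first point of each group seeds a center.
my @points = (1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9);
my %sketch = simple-k-means(@points, @points[0, 3]);
say %sketch<labels>;    # [0 0 0 1 1 1]
say %sketch<centers>;
```

Seeding the centers by hand avoids randomness in the example; a practical implementation would typically choose the initial centers randomly or with a seeding heuristic.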
Implementation considerations
UML diagram
Here is a UML diagram that shows the package's structure:
[Figure: UML class diagram, ./resources/class-diagram.png]
The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp5].
Here we get the PlantUML spec:
to-uml-spec ML::Clustering > ./resources/class-diagram.puml
Here we get the diagram:
to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png
Remark: It might be a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found it to be necessary at this point of development.
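One way such a design could look in Raku is a role with a stubbed method, which forces every composing class to provide its own implementation. The sketch below uses short hypothetical names (AbstractFinder, KMeansFinder, KMedoidsFinder) and placeholder method bodies; it only illustrates the structure and does not reflect the package's actual classes.

```raku
# Hypothetical sketch of an abstract parent for cluster finders.
# Names and method bodies are placeholders, NOT the package's code.

role AbstractFinder {
    # Stubbed method: any class composing this role must implement it.
    method find-clusters(@points, Int $k) { ... }

    # Shared helper available to every finder.
    method squared-distance(@a, @b) { (@a Z- @b).map({ $_ ** 2 }).sum }
}

class KMeansFinder does AbstractFinder {
    method find-clusters(@points, Int $k) {
        "K-means: {@points.elems} points, $k clusters"
    }
}

class KMedoidsFinder does AbstractFinder {
    method find-clusters(@points, Int $k) {
        "K-medoids: {@points.elems} points, $k clusters"
    }
}

# Each finder is used through the same interface.
for KMeansFinder, KMedoidsFinder -> $finder {
    say $finder.new.find-clusters([(1, 2), (3, 4)], 2);
}
```

A class that composes the role without implementing find-clusters fails at compile time, which is the main benefit of this arrangement over a plain parent class.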
TODO
Implement Bi-sectional K-means algorithm, [AAp1].
Implement K-medoids algorithm.
Automatic determination of the number of clusters.
Implement Agglomerate algorithm.
References
Articles
[Wk1] Wikipedia entry, "Cluster Analysis".
[AA1] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.
Packages
[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.
[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.
[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.
[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.
[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.
[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.