ML::Clustering

Package for clustering algorithms

Raku ML::Clustering

This repository contains a Raku package with functions for Machine Learning (ML) clustering (also known as cluster analysis), [Wk1].

The clustering framework includes the K-means and K-medoids algorithms; the Euclidean, Cosine, Hamming, Manhattan, and other distance functions; and their corresponding similarity functions.
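For orientation, here is a minimal plain-Raku sketch of four of the distance functions named above. The sub names are hypothetical and chosen for this illustration only; they are not the API of ML::Clustering::DistanceFunctions, and the package's own definitions may differ (for example in how Cosine distance is normalized).

```raku
# Illustrative definitions only; these names are assumptions, not the package's API.

# Euclidean distance: square root of the sum of squared coordinate differences
sub euclidean-distance(@a, @b) { sqrt( (@a Z- @b).map(* ** 2).sum ) }

# Manhattan distance: sum of absolute coordinate differences
sub manhattan-distance(@a, @b) { (@a Z- @b).map(*.abs).sum }

# Hamming distance: number of positions at which the (numeric) vectors differ
sub hamming-distance(@a, @b) { (@a Z @b).grep({ $_[0] != $_[1] }).elems }

# Cosine distance, taken here as 1 minus the cosine of the angle between the vectors
sub cosine-distance(@a, @b) {
    my $dot = (@a Z* @b).sum;
    1 - $dot / ( sqrt(@a.map(* ** 2).sum) * sqrt(@b.map(* ** 2).sum) )
}

my @p = 1, 2, 3;
my @q = 4, 6, 3;
say euclidean-distance(@p, @q);       # 5
say manhattan-distance(@p, @q);       # 7
say hamming-distance(@p, @q);         # 2
say cosine-distance((1, 0), (0, 1));  # 1
```

A similarity function can be derived from a distance function in more than one way; which convention the package uses is not stated here.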

The data in the examples below is generated and manipulated with the packages "Data::Generators", [AAp2], "Data::Reshapers", [AAp3], and "Data::Summarizers", [AAp4], described in the article "Introduction to data wrangling with Raku", [AA1].

The plots are made with the package "Text::Plot", [AAp6].

Installation

Via zef-ecosystem:

zef install ML::Clustering

From GitHub:

zef install https://github.com/antononcube/Raku-ML-Clustering

Cluster finding

Here we generate a set of random points and summarize it:

use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
# +------------------------------+------------------------------+
# | 1                            | 0                            |
# +------------------------------+------------------------------+
# | Min    => 3.0083171759052374 | Min    => 2.99152954550737   |
# | 1st-Qu => 5.3582006670159945 | 1st-Qu => 5.691479144933329  |
# | Mean   => 7.988502686038186  | Mean   => 8.33540004626669   |
# | Median => 9.128772007411778  | Median => 9.57282202992134   |
# | 3rd-Qu => 10.097184372569952 | 3rd-Qu => 10.270801369302994 |
# | Max    => 11.966371289590775 | Max    => 12.406483118865026 |
# +------------------------------+------------------------------+

Here we plot the points:

use Text::Plot;
text-list-plot(@data3)
# +--------+----------+----------+----------+----------+-----+
# |                                                          |
# +                                     * *** ***            +  12.00
# |                                   *     ** *   *  *  *   |
# +                               *     **********       *   +  10.00
# |                               *        *** ***           |
# +                                   *      * *             +   8.00
# |                                     *          *         |
# |      *       *                                           |
# +   *    * *      *     **                                 +   6.00
# |      *     *  * *                                        |
# +   *  *   *  *** ** * *  *                                +   4.00
# |                **         *                              |
# |                                                          |
# +--------+----------+----------+----------+----------+-----+
#          4.00       6.00       8.00       10.00      12.00

Problem: Group the points in such a way that each group has close (or similar) points.

Here is how we use the function find-clusters to give an answer:

use ML::Clustering;
my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
# (50 30)

Remark: The function find-clusters can return results of different types, controlled by the named argument "prop". Using prop => 'All' returns a hash with all properties of the cluster finding result.

Here are sample points from each found cluster:

.say for %res<Clusters>>>.pick(3);
# ((10.197782234303773 9.782034329953607) (12.406483118865026 10.81078915584907) (9.86594364573218 9.055658551283518))
# ((6.793895295914484 5.994720108035477) (6.864989150334569 5.941695089848203) (3.678466387156798 6.503232710889149))

Here are the centers of the clusters (the mean points):

%res<MeanPoints>
# [(9.795425221191257 9.464239286042913) (4.888221803779357 5.039431353925738)]

We can verify the result by looking at the plot of the found clusters:

text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
# ▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers
# +-------+-----------+----------+-----------+----------+----+
# +                                      ▽▽     ▽            +  12.00
# |                                         ▽▽▽  ▽           |
# |                                   ▽  ▽  ▽ ▽▽    ▽  ▽   ▽ |
# +                               ▽      ▽▽▽●▽▽▽▽▽▽        ▽ +  10.00
# |                               ▽         ▽▽▽▽ ▽▽          |
# +                                    ▽     ▽  ▽            +   8.00
# |                                      ▽          ▽        |
# |     ☐       ☐                                            |
# +  ☐    ☐ ☐       ☐     ☐☐                                 +   6.00
# |     ☐   ☐ ☐●   ☐      ☐                                  |
# | ☐       ☐  ☐☐☐☐☐   ☐    ☐                                |
# +     ☐         ☐ ☐    ☐                                   +   4.00
# |                           ☐                              |
# +-------+-----------+----------+-----------+----------+----+
#         4.00        6.00       8.00        10.00      12.00

Remark: By default find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method=>'K-means' and method=>'K-medoids' respectively.
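To make the K-means idea concrete, here is a minimal plain-Raku sketch of the standard Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its points, and repeat until the centers stop moving. This is an illustration under stated assumptions, not the code of ML::Clustering::KMeans: the sub names, the explicit initial centers, the squared Euclidean distance, and the stopping tolerance are all choices made for this example. Only the result keys "Clusters" and "MeanPoints" mirror the ones shown above.

```raku
# Illustrative K-means (Lloyd's algorithm); hypothetical names, not the package's code.

# Squared Euclidean distance between two points
sub sq-dist(@a, @b) { (@a Z- @b).map(* ** 2).sum }

# Coordinate-wise mean of a non-empty list of points
sub centroid(@points) {
    my $dim = @points[0].elems;
    (^$dim).map(-> $j { @points.map(*[$j]).sum / @points.elems }).List
}

sub k-means-sketch(@points, @initial-centers, :$max-steps = 100) {
    my @centers = @initial-centers;
    my @clusters;
    for ^$max-steps {
        # Assignment step: group points by the index of their nearest center
        my %groups = @points.classify(-> @p {
            @centers.keys.min({ sq-dist(@p, @centers[$_]) })
        });

        # Update step: each center moves to the mean of its group;
        # a center with no points keeps its previous position
        my @new-centers = @centers.keys.map(-> $i {
            (%groups{$i}:exists) ?? centroid(%groups{$i}) !! @centers[$i]
        });

        # Largest squared movement of any center in this step
        my $shift = (@centers Z @new-centers).map({ sq-dist($_[0], $_[1]) }).max;
        @centers  = @new-centers;
        @clusters = @centers.keys.map(-> $i { %groups{$i} // [] });
        last if $shift < 1e-12;
    }
    %( Clusters => @clusters, MeanPoints => @centers )
}

my @points = (1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9);
my %res = k-means-sketch(@points, [(1, 1), (9, 9)]);
say %res<Clusters>.map(*.elems);  # (3 3)
say %res<MeanPoints>;
```

K-medoids follows the same assign-and-update pattern, except that each center is restricted to be one of the data points rather than a computed mean.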

Implementation considerations

UML diagram

Here is a UML diagram that shows the package's structure:

(Image: ./resources/class-diagram.png)

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package "UML::Translators", [AAp5].

Here we get the PlantUML spec:

to-uml-spec ML::Clustering > ./resources/class-diagram.puml

Here we get the diagram:

to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Remark: It may be a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found it necessary at this point of development.

TODO

  • Implement Bi-sectional K-means algorithm, [AAp1].

  • Implement K-medoids algorithm.

  • Automatic determination of the number of clusters.

  • Implement Agglomerate algorithm.

References

Articles

[Wk1] Wikipedia entry, "Cluster Analysis".

[AA1] Anton Antonov, "Introduction to data wrangling with Raku", (2021), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.

ML::Clustering v0.1.0

Authors

  • Anton Antonov

License

Artistic-2.0

Dependencies

Data::Reshapers:auth<zef:antononcube>:ver<0.1.9+>

Provides

  • ML::Clustering
  • ML::Clustering::DistanceFunctions
  • ML::Clustering::KMeans
