Data::Summarizers

Data summarizing functions for different data structures (arrays, lists of hashes, Text::CSV tables).

Raku Data::Summarizers

This Raku package provides data summarizing functions for different data structures that are coercible to full arrays.

The supported data structures (so far) are:

  • 1D Arrays

  • 1D Lists

  • Positional-of-hashes

  • Positional-of-arrays

Usage examples

Setup

Here we load the Raku modules Data::Generators, Data::Reshapers and this module, Data::Summarizers:

use Data::Generators;
use Data::Reshapers;
use Data::Summarizers;
# (Any)

Summarize vectors

Here we generate a numerical vector, append NaN, Whatever, and Nil values to it, and shuffle the result:

my @vec = [^1001].roll(12);
@vec = @vec.append( [NaN, Whatever, Nil]);
@vec .= pick(@vec.elems);
@vec
# [740 311 434 300 (Whatever) 192 705 202 576 561 544 NaN (Any) 744 133]

Here we summarize the vector generated above:

records-summary(@vec)
# O────────────────────────────────────O
# │ numerical                          │
# O────────────────────────────────────O
# │ 1st-Qu                    => 251   │
# │ Max                       => 744   │
# │ Median                    => 489   │
# │ (Any-Nan-Nil-or-Whatever) => 3     │
# │ Mean                      => 453.5 │
# │ Min                       => 133   │
# │ 3rd-Qu                    => 640.5 │
# O────────────────────────────────────O
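To make the numbers in the summary above easier to check by hand, here is a minimal sketch in plain Raku. It is not the package's implementation: the names `clean-numeric`, `median-of-sorted`, and `numeric-summary-sketch` are hypothetical, and the quartile rule (median of the lower and upper halves of the sorted values) is an assumption that happens to reproduce the figures shown above. The sketch also assumes the vector holds at least two numeric values.

```raku
# Keep only defined, non-NaN numeric values (NaN is the only value not equal to itself).
sub clean-numeric(@xs) {
    @xs.grep({ .defined && $_ ~~ Numeric && $_ == $_ }).List
}

# Median of an already sorted list; Nil for an empty list.
sub median-of-sorted(@s) {
    my $n = @s.elems;
    return Nil unless $n;
    $n %% 2 ?? (@s[$n div 2 - 1] + @s[$n div 2]) / 2 !! @s[$n div 2]
}

# Summary statistics as an ordered list of pairs.
# Assumption: quartiles are the medians of the lower and upper halves.
sub numeric-summary-sketch(@xs) {
    my @s = clean-numeric(@xs).sort;
    my $n = @s.elems;
    my @lower = @s[^($n div 2)];
    my @upper = @s[($n + 1) div 2 .. *];
    (
        'Min'     => @s.min,
        '1st-Qu'  => median-of-sorted(@lower),
        'Median'  => median-of-sorted(@s),
        'Mean'    => @s.sum / $n,
        '3rd-Qu'  => median-of-sorted(@upper),
        'Max'     => @s.max,
        'Missing' => @xs.elems - $n,
    )
}

# The vector printed above, with Any standing in for the Nil element.
my @vec = 740, 311, 434, 300, Whatever, 192, 705, 202, 576, 561, 544, NaN, Any, 744, 133;
.say for numeric-summary-sketch(@vec);
# Min => 133
# 1st-Qu => 251
# Median => 489
# Mean => 453.5
# 3rd-Qu => 640.5
# Max => 744
# Missing => 3
```

The three non-numeric elements are counted separately rather than dropped silently, which mirrors the `(Any-Nan-Nil-or-Whatever)` row in the output of `records-summary`.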

Summarize tabular datasets

Here we generate a random tabular dataset with 16 rows and 3 columns and display it:

srand(32);
my $tbl = random-tabular-dataset(16,
                                 <Pet Ref Code>,
                                 generators=>[random-pet-name(4), -> $n { ((^20).rand xx $n).List }, random-string(6)]);
to-pretty-table($tbl)
# O────────────────O───────────O──────────O
# │      Code      │    Ref    │   Pet    │
# O────────────────O───────────O──────────O
# │ A2Ue69EWAMtJCi │  0.050176 │ Guinness │
# │ KNwmt0QmoqABwR │  0.731900 │ Truffle  │
# │ A2Ue69EWAMtJCi │  0.739763 │  Jumba   │
# │       aY       │  7.342107 │ Guinness │
# │ xgZjtSP6VrKbH  │ 19.868591 │  Jumba   │
# │    20CO9FGD    │ 12.956172 │  Jumba   │
# │    20CO9FGD    │ 15.854088 │ Guinness │
# │ A2Ue69EWAMtJCi │  4.774780 │ Guinness │
# │ A2Ue69EWAMtJCi │ 18.729798 │ Guinness │
# │ xgZjtSP6VrKbH  │ 13.383997 │ Guinness │
# │       aY       │  9.837488 │  Jumba   │
# │    20CO9FGD    │  2.912506 │ Truffle  │
# │ xgZjtSP6VrKbH  │ 11.782221 │ Truffle  │
# │ KNwmt0QmoqABwR │  9.825102 │ Truffle  │
# │ xgZjtSP6VrKbH  │ 16.277717 │  Jumba   │
# │ CQmrQcQ4YkXvaD │  1.740695 │ Guinness │
# O────────────────O───────────O──────────O

Remark: The values of the column "Pet" are sampled from a set of four pet names, and the values of the column "Code" are sampled from a set of six strings.

Here we summarize the tabular dataset generated above:

records-summary($tbl)
# O───────────────O──────────────────────────────O─────────────────────O
# │ Pet           │ Ref                          │ Code                │
# O───────────────O──────────────────────────────O─────────────────────O
# │ Guinness => 7 │ Min    => 0.0501758995572299 │ xgZjtSP6VrKbH  => 4 │
# │ Jumba    => 5 │ 1st-Qu => 2.3266005718178704 │ A2Ue69EWAMtJCi => 4 │
# │ Truffle  => 4 │ Mean   => 9.175443804770861  │ 20CO9FGD       => 3 │
# │               │ Median => 9.831294839627123  │ KNwmt0QmoqABwR => 2 │
# │               │ 3rd-Qu => 14.619042446877677 │ aY             => 2 │
# │               │ Max    => 19.868590809216744 │ CQmrQcQ4YkXvaD => 1 │
# O───────────────O──────────────────────────────O─────────────────────O
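The table summary above treats each column according to its content: numeric columns get order statistics, and categorical columns get value counts in descending order. A rough sketch of that per-column dispatch in plain Raku follows. The function name `column-summary-sketch` and the four-row dataset are hypothetical, and the rule "a column is numeric when every value is Numeric" is an assumption for illustration, not a description of how `records-summary` decides.

```raku
# Hypothetical miniature dataset in the same array-of-hashes shape.
my @rows =
    { Pet => 'Guinness', Ref => 0.5 },
    { Pet => 'Jumba',    Ref => 2.5 },
    { Pet => 'Guinness', Ref => 4.5 },
    { Pet => 'Truffle',  Ref => 8.5 };

# Summarize one column: order statistics if all values are numeric,
# otherwise value counts sorted by descending count, then by name.
sub column-summary-sketch(@rows, Str $col) {
    my @values = @rows.map(-> %row { %row{$col} });
    if @values.grep(Numeric).elems == @values.elems {
        ( 'Min' => @values.min, 'Mean' => @values.sum / @values.elems, 'Max' => @values.max )
    }
    else {
        @values.Bag.pairs.sort({ -.value, .key }).List
    }
}

.say for column-summary-sketch(@rows, 'Ref');
# Min => 0.5
# Mean => 4
# Max => 8.5

.say for column-summary-sketch(@rows, 'Pet');
# Guinness => 2
# Jumba => 1
# Truffle => 1
```

Counting with a `Bag` keeps the categorical branch to one line; the secondary sort by key only makes the order of ties deterministic.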

Summarize collections of tabular datasets

Here we group the dataset by the column "Pet", which gives a hash of tabular datasets:

my %group = group-by($tbl, 'Pet');

%group.pairs.map({ say("{$_.key} =>"); say to-pretty-table($_.value) });
# Guinness =>
# O────────────────O───────────O──────────O
# │      Code      │    Ref    │   Pet    │
# O────────────────O───────────O──────────O
# │ A2Ue69EWAMtJCi │  0.050176 │ Guinness │
# │       aY       │  7.342107 │ Guinness │
# │    20CO9FGD    │ 15.854088 │ Guinness │
# │ A2Ue69EWAMtJCi │  4.774780 │ Guinness │
# │ A2Ue69EWAMtJCi │ 18.729798 │ Guinness │
# │ xgZjtSP6VrKbH  │ 13.383997 │ Guinness │
# │ CQmrQcQ4YkXvaD │  1.740695 │ Guinness │
# O────────────────O───────────O──────────O
# Truffle =>
# O─────────O───────────O────────────────O
# │   Pet   │    Ref    │      Code      │
# O─────────O───────────O────────────────O
# │ Truffle │  0.731900 │ KNwmt0QmoqABwR │
# │ Truffle │  2.912506 │    20CO9FGD    │
# │ Truffle │ 11.782221 │ xgZjtSP6VrKbH  │
# │ Truffle │  9.825102 │ KNwmt0QmoqABwR │
# O─────────O───────────O────────────────O
# Jumba =>
# O───────────O────────────────O───────O
# │    Ref    │      Code      │  Pet  │
# O───────────O────────────────O───────O
# │  0.739763 │ A2Ue69EWAMtJCi │ Jumba │
# │ 19.868591 │ xgZjtSP6VrKbH  │ Jumba │
# │ 12.956172 │    20CO9FGD    │ Jumba │
# │  9.837488 │       aY       │ Jumba │
# │ 16.277717 │ xgZjtSP6VrKbH  │ Jumba │
# O───────────O────────────────O───────O

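For readers who want to see the shape of such a hash without the helper modules, the grouping step can be approximated with Raku's built-in `classify`. This is only an analogy: it is not claimed to be how `group-by` from Data::Reshapers is implemented, and the four-row dataset below is hypothetical.

```raku
# Hypothetical miniature dataset in the same array-of-hashes shape.
my @rows =
    { Pet => 'Guinness', Ref => 0.5 },
    { Pet => 'Jumba',    Ref => 2.5 },
    { Pet => 'Guinness', Ref => 4.5 },
    { Pet => 'Truffle',  Ref => 8.5 };

# classify returns a hash: each distinct "Pet" value maps to the rows that have it.
my %by-pet = @rows.classify(-> %row { %row<Pet> });

# Hash iteration order is not fixed, so sort the pairs by key before printing.
for %by-pet.sort(*.key) -> $group {
    say "{$group.key} => {$group.value.elems} row(s)";
}
# Guinness => 2 row(s)
# Jumba => 1 row(s)
# Truffle => 1 row(s)
```

Each value of the resulting hash is itself an array of hashes, that is, a smaller tabular dataset of the same shape as the input.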
Here is the summary of that collection of datasets:

records-summary(%group)
# summary of Guinness =>
# O──────────────────────────────O─────────────────────O───────────────O
# │ Ref                          │ Code                │ Pet           │
# O──────────────────────────────O─────────────────────O───────────────O
# │ Min    => 0.0501758995572299 │ A2Ue69EWAMtJCi => 3 │ Guinness => 7 │
# │ 1st-Qu => 1.7406953436440742 │ CQmrQcQ4YkXvaD => 1 │               │
# │ Mean   => 8.839377375678543  │ 20CO9FGD       => 1 │               │
# │ Median => 7.34210706081909   │ xgZjtSP6VrKbH  => 1 │               │
# │ 3rd-Qu => 15.854088005472917 │ aY             => 1 │               │
# │ Max    => 18.72979803423013  │                     │               │
# O──────────────────────────────O─────────────────────O───────────────O
# summary of Truffle =>
# O──────────────O──────────────────────────────O─────────────────────O
# │ Pet          │ Ref                          │ Code                │
# O──────────────O──────────────────────────────O─────────────────────O
# │ Truffle => 4 │ Min    => 0.7318998724597869 │ KNwmt0QmoqABwR => 2 │
# │              │ 1st-Qu => 1.822202836225727  │ 20CO9FGD       => 1 │
# │              │ Mean   => 6.312932174017679  │ xgZjtSP6VrKbH  => 1 │
# │              │ Median => 6.368803873269801  │                     │
# │              │ 3rd-Qu => 10.803661511809633 │                     │
# │              │ Max    => 11.782221077071329 │                     │
# O──────────────O──────────────────────────────O─────────────────────O
# summary of Jumba =>
# O──────────────────────────────O────────────O─────────────────────O
# │ Ref                          │ Pet        │ Code                │
# O──────────────────────────────O────────────O─────────────────────O
# │ Min    => 0.7397628145038704 │ Jumba => 5 │ xgZjtSP6VrKbH  => 2 │
# │ 1st-Qu => 5.28862527360509   │            │ 20CO9FGD       => 1 │
# │ Mean   => 11.935946110102654 │            │ A2Ue69EWAMtJCi => 1 │
# │ Median => 12.956171789492936 │            │ aY             => 1 │
# │ 3rd-Qu => 18.073154106905072 │            │                     │
# │ Max    => 19.868590809216744 │            │                     │
# O──────────────────────────────O────────────O─────────────────────O

Skim

TBD...

TODO

  • User-specified NA marker

  • Tabular dataset summarization tests

  • Skimmer

  • Peek-er

References

Functions, repositories

[AAf1] Anton Antonov, RecordsSummary, (2019), Wolfram Function Repository.

Data::Summarizers v0.2.1


Authors

  • Anton Antonov

License

Artistic-2.0

Dependencies

  • Stats
  • Data::Reshapers

Provides

  • Data::Summarizers
  • Data::Summarizers::Predicates
  • Data::Summarizers::RecordsSummary
