Introduction to Pattern Matching

AUTHOR

L. Grondin

Problem

Given a collection of strings, their trie (often pronounced "try" to avoid ambiguity with the general term tree) is a rooted tree formed as follows. For every unique first symbol in the strings, an edge is formed connecting the root to a new vertex. This symbol is then used to label the edge.

We may then iterate the process by moving down one level as follows. Say that an edge connecting the root to a node v is labeled with 'A'; then we delete the first symbol from every string in the collection beginning with 'A' and then treat v as our root. We apply this process to all nodes that are adjacent to the root, and then we move down another level and continue. See Figure 1 for an example of a trie.

As a result of this method of construction, the symbols along the edges of any path in the trie from the root to a leaf will spell out a unique string from the collection, as long as no string is a prefix of another in the collection (this would cause the first string to be encoded as a path terminating at an internal node).
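The prefix condition above can be checked before building the trie. The following is a minimal sketch, not part of the original script; the helper name `prefix-pairs` is hypothetical and the pairwise comparison is only one simple way to do it.

```raku
use v6;

# Hypothetical helper (not from the original script): return every
# [prefix, longer-string] pair found in the collection.
sub prefix-pairs(@strings) {
    my @pairs;
    for @strings.combinations(2) -> ($a, $b) {
        # Arrays are pushed as single elements, one per conflicting pair.
        @pairs.push: [$a, $b] if $b.starts-with($a);
        @pairs.push: [$b, $a] if $a.starts-with($b) and $a ne $b;
    }
    return @pairs;
}

# The sample dataset has no conflicts, so the list is empty.
say prefix-pairs(<ATAGA ATC GAT>).elems;

# Here AT is a prefix of ATC, so one pair is reported.
say prefix-pairs(<AT ATC GAT>).elems;
```

An empty result means every string ends at a leaf, which is what the problem statement assumes of its input.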

Given: A list of at most 100 DNA strings of length at most 100 bp, none of which is a prefix of another.

Return: The adjacency list corresponding to the trie T for these patterns, in the following format. If T has n nodes, first label the root with 1 and then label the remaining nodes with the integers 2 through n in any order you like. Each edge of the adjacency list of T will be encoded by a triple containing the integer representing the edge's parent node, followed by the integer representing the edge's child node, and finally the symbol labeling the edge.
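The output format described above can also be produced by inserting the strings one symbol at a time. This is a sketch of an alternative approach, not the method used in the script on this page; the helper name `trie-edges` and the string key scheme for the edge lookup are assumptions made for illustration.

```raku
use v6;

# Hypothetical helper (not from the original script): thread each string
# from the root and record an edge whenever a new node is needed.
sub trie-edges(@strings) {
    my %child;          # "parent symbol" => child node number
    my $last-node = 1;  # the root is node 1
    my @edges;          # one "parent child symbol" line per edge
    for @strings -> $string {
        my $node = 1;   # every string starts at the root
        for $string.comb -> $symbol {
            my $key = "$node $symbol";
            unless %child{$key}:exists {
                my $next = ++$last-node;
                %child{$key} = $next;
                @edges.push: "$node $next $symbol";
            }
            $node = %child{$key};
        }
    }
    return @edges;
}

.say for trie-edges(<ATAGA ATC GAT>);
```

Nodes are numbered in the order they are created, which is allowed because the problem accepts any labeling of the non-root nodes. The number of edges is always one less than the number of nodes.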

http://rosalind.info/problems/trie/

Sample dataset:

ATAGA
ATC
GAT

Sample output:

1 2 A
2 3 T
3 4 A
4 5 G
5 6 A
3 7 C
1 8 G
8 9 A
9 10 T

Usage:

$ perl6 trie-grondilu.pl

or

$ perl6 trie-grondilu.pl --data="GAT ATC"

use v6;

# Global node counter; the root is node 1.
my Int $node = 1;

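# Print the edges of the trie for @string below node $root
# and return the trie as a nested hash.
sub trie(@string is copy, $root = $node) {
    # Ignore strings that are already exhausted.
    @string .= grep: *.chars;
    return {} if not @string;
    # Group the strings by first symbol; each group becomes one edge.
    hash gather for @string.classify(*.substr: 0, 1).sort(*.key)>>.kv -> ($k, $v) {
        # Strip the first symbol and drop strings that end at this edge.
        my @value = map *.substr(1), grep *.chars > 1, $v[];
        say "$root {++$node} $k";
        if (@value) {
            # Recurse with the new node as the root of the subtrie.
            take $k => &?ROUTINE( @value, $node );
        }
    }
}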

sub MAIN(:$data = "ATAGA ATC GAT") {
    my @input = $data.split(/\s+/);
    trie @input;
}

# vim: expandtab shiftwidth=4 ft=perl6

See Also

afrq-grondilu.pl

Counting Disease Carriers

aspc-grondilu.pl

Introduction to Alternative Splicing

cons-grondilu.pl

Consensus and Profile

conv-grondilu.pl

Comparing Spectra with the Spectral Convolution

cstr-grondilu.pl

Creating a Character Table from Genetic Strings

ctbl-grondilu.pl

Creating a Character Table

dbpr-grondilu.pl

Introduction to Protein Databases

dna-gerdr.pl

Counting DNA Nucleotides

dna-grondilu.pl

Counting DNA Nucleotides

eubt-grondilu.pl

Enumerating Unrooted Binary Trees

eval-grondilu.pl

Expected Number of Restriction Sites

fib-grondilu.pl

Rabbits and Recurrence Relations

fibd-grondilu.pl

Mortal Fibonacci Rabbits

gc-gerdr.pl

Computing GC Content

grph-grondilu.pl

Overlap Graphs

hamm-grondilu.pl

Counting Point Mutations

iev-grondilu.pl

Calculating Expected Offspring

indc-grondilu.pl

Independent Segregation of Chromosomes

iprb-grondilu.pl

Mendel's First Law

itwv-grondilu.pl

Finding Disjoint Motifs in a Gene

lcsq-grondilu.pl

Finding a Shared Spliced Motif

lia-grondilu.pl

Independent Alleles

lrep-grondilu-p5.pl

mmch-grondilu.pl

Maximum Matchings and RNA Secondary Structures

mprt-grondilu.pl

Finding a Protein Motif

mrna-grondilu.pl

Inferring mRNA from Protein

nwck-grondilu.pl

Distances in Trees

orf-grondilu.pl

Open Reading Frames

pmch-grondilu.pl

Perfect Matchings and RNA Secondary Structures

pper-grondilu.pl

Partial Permutations

prob-grondilu.pl

Introduction to Random Strings

qrt-grondilu.pl

Quartets

README.md

revc-gerdr.pl

Complementing a Strand of DNA

rna-gerdr.pl

Transcribing DNA into RNA

rstr-grondilu.pl

Matching Random Motifs

sexl-grondilu.pl

Sex-Linked Inheritance

sgra-grondilu.pl

Using the Spectrum Graph to Infer Peptides

spec-grondilu.pl

Inferring Protein from Spectrum

sseq-grondilu.pl

Finding a Spliced Motif

subs-grondilu.pl

Finding a Motif in DNA

suff-grondilu.pl

Encoding Suffix Trees

tran-grondilu.pl

Transitions and Transversions
