PDF-Tags-raku

A small DOM-like API for the creation of tagged PDF files.

This module supports manipulation of tagged PDF content, including simple construction, XPath queries and basic XML serialization.

Synopsis

use PDF::Tags;
use PDF::Tags::Elem;

# PDF::API6
use PDF::API6;
use PDF::Annot;
use PDF::Destination :Fit; # exports FitWindow, used below
use PDF::XObject::Image;
use PDF::XObject::Form;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
# create the document root
my PDF::Tags::Elem $root = $tags.Document;

my $page = $pdf.add-page;
my $header-font = $page.core-font: :family<Helvetica>, :weight<bold>;
my $body-font = $page.core-font: :family<Helvetica>;

$page.graphics: -> $gfx {

    $root.Header1: $gfx, {
        .say('Marked Level 1 Header',
             :font($header-font),
             :font-size(15),
             :position[50, 120]);
    };

    $root.Paragraph: $gfx, {
        .say('Marked paragraph text', :position[50, 100], :font($body-font), :font-size(12));
    };

    # add a marked image
    my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
    $root.Figure: $gfx, $img, :Alt('Incandescent apparatus');

    # add a marked link annotation
    my $destination = $pdf.destination( :page(2), :fit(FitWindow) );
    my PDF::Annot $annot = $pdf.annotation: :$page, :$destination, :rect[71, 717, 190, 734];

    $root.Link: $gfx, $annot;

    # tagged XObject Form
    my PDF::XObject::Form $form = $page.xobject-form: :BBox[0, 0, 200, 50];
    my $form-elem = $root.Form;
    $form.text: {
        my $font-size = 12;
        .text-position = [10, 38];

        $form-elem.Header2: $_, {
            .say: "Tagged XObject header", :font($header-font), :$font-size;
        };

        $form-elem.Paragraph: $_, {
            .say: "Some sample tagged text", :font($body-font), :$font-size;
        };
    }

    # render the form contained in $form-elem
    $form-elem.do: $gfx, :position[150, 70];
}

$pdf.save-as: "/tmp/marked.pdf"

Description

A tagged PDF contains additional markup that describes the logical structure of the document.

PDF tagging may assist PDF readers and other automated tools in reading PDF documents and locating content such as text and images.

This module provides a DOM-like interface for creating and traversing PDF structure and content via tags. It also provides an XPath-like search capability. It is designed for use in conjunction with PDF::Class or PDF::API6.
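The traversal and search side is not shown in the Synopsis, so here is a minimal sketch of how it might look. It builds a two-element document in the same way as the Synopsis, saves it, then re-reads the structure tree. The `read`, `kids`, `name`, `find` and `xml` methods and the `$tags[0]` indexing are assumptions about this module's API rather than anything shown above, and the output path is a placeholder.

```raku
use PDF::API6;
use PDF::Tags;
use PDF::Tags::Elem;

# Build a small tagged PDF, following the Synopsis
my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $root = $tags.Document;

my $page = $pdf.add-page;
my $font = $page.core-font: :family<Helvetica>;

$page.graphics: -> $gfx {
    $root.Header1: $gfx, {
        .say('A heading', :$font, :font-size(15), :position[50, 120]);
    };
    $root.Paragraph: $gfx, {
        .say('Some body text', :$font, :font-size(12), :position[50, 100]);
    };
}

my $path = "/tmp/tags-demo.pdf"; # placeholder output path
$pdf.save-as: $path;

# Re-open the file and read its structure tree
# (assumed API: read, [0], name, kids, find, xml)
my PDF::API6 $pdf2 .= open: $path;
my PDF::Tags $tags2 .= read: :pdf($pdf2);
my PDF::Tags::Elem $doc = $tags2[0];

# DOM-like traversal: the root element, then its children
say $doc.name;
say .name for $doc.kids;

# XPath-like search: paragraphs directly under the Document element
say .name for $tags2.find('Document/P');

# Basic XML serialization of the structure
say $doc.xml;
```

Saving and re-reading keeps the sketch close to how an existing tagged PDF would be inspected; whether the same calls also work on a freshly built, unsaved tree is not covered here.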

Standard Tags

Elements may be constructed using their Tag name or Mnemonic, as listed below. For example:

$root.P: $gfx, { .say('Marked paragraph text') };

Can also be written as:

$root.Paragraph: $gfx, { .say('Marked paragraph text') };

Or as:

$root.add-kid(:name<P>).mark: $gfx, { .say('Marked paragraph text') };

Documentation in this section adapted from pdfkit.

"Grouping" elements:

Tag | Mnemonic | Description
Document | | whole document; must be used if there are multiple parts or articles
Part | | part of a document
Art | Article |
Sect | Section | may nest
Div | Division | generic division
BlockQuote | | block quotation
Caption | | describing a figure or table
TOC | TableOfContents | may be nested, and may be used for lists of figures, tables, etc.
TOCI | TableOfContentsItem | table of contents (leaf) item
Index | | index (text with accompanying Reference content)
NonStruct | NonStructural | non-structural grouping element (element itself not intended to be exported to other formats like HTML, but 'transparent' to its content which is processed normally)
Private | | content only meaningful to the creator (element and its content not intended to be exported to other formats like HTML)

"Block" elements:

Tag | Mnemonic | Description
H | Heading | heading (first element in a section, etc.)
H1 - H6 | Heading1 - Heading6 | heading of a particular level, intended for use only if nesting sections is not possible for some reason
P | Paragraph |
L | List | should include optional Caption, and list items
LI | ListItem | should contain Lbl and/or LBody
Lbl | Label | bullet, number, or "dictionary headword"
LBody | ListBody | (item text, or "dictionary definition"); may have nested lists or other blocks
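
The list tags nest as L, then LI, then Lbl and LBody. The following is a rough sketch of how such a structure might be built, not a definitive recipe. It assumes that calling a tag method without a graphics argument creates an empty child element, as `$root.Form` does in the Synopsis, and that the `xml` method is available for inspecting the result. The item texts and positions are made up for illustration.

```raku
use PDF::API6;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $root = $tags.Document;

my $page = $pdf.add-page;
my $font = $page.core-font: :family<Helvetica>;

$page.graphics: -> $gfx {
    # Assumption: with no graphics argument, this only creates the L element
    my PDF::Tags::Elem $list = $root.List;

    my @items = 'First item', 'Second item'; # illustrative content
    for @items.kv -> $i, $text {
        my $y = 120 - 20 * $i;               # one line per item
        my PDF::Tags::Elem $item = $list.ListItem;

        # Lbl: the item's number
        $item.Label: $gfx, {
            .say("{$i + 1}.", :$font, :font-size(12), :position[50, $y]);
        };

        # LBody: the item's text
        $item.ListBody: $gfx, {
            .say($text, :$font, :font-size(12), :position[70, $y]);
        };
    }
}

# Inspect the resulting structure (assumed serialization method)
say $root.xml;
```

Creating the L and LI elements first and marking content only at the Lbl and LBody level keeps the marked content attached to the leaf elements, which mirrors the nesting described in the table.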

"Table" elements:

Tag | Mnemonic | Description
Table | | table; should either contain TR, or THead, TBody and/or TFoot
TR | TableRow |
TH | TableHeader | table heading cell
TD | TableData | table data cell
THead | TableHead | table header row group
TBody | TableBody | table body row group; may have more than one per table
TFoot | TableFoot | table footer row group

"Inline" elements:

Tag | Mnemonic | Description
Span | | generic inline content
Quote | | inline quotation
Note | | e.g. footnote; may have a Lbl (see "block" elements)
Reference | | content in a document that refers to other content (e.g. page number in an index)
BibEntry | BibliographyEntry | may have a Lbl (see "block" elements)
Code | | code
Link | | hyperlink; should contain a link annotation
Annot | Annotation | annotation (other than a link)
Ruby | | Chinese/Japanese pronunciation/explanation
RB | RubyBaseText | Ruby base text
RT | RubyText | Ruby annotation text
RP | RubyPunctuation |
Warichu | | Japanese/Chinese longer description
WT | WarichuText |
WP | WarichuPunctuation |

"Illustration" elements (should have Alt and/or ActualText set):

Tag | Mnemonic | Description
Figure | |
Formula | |
Form | | form widget

Non-structure tags:

Tag | Mnemonic | Description
Artifact | | used to mark all content not part of the logical structure
ReversedChars | | every string of text has characters in reverse order for technical reasons (due to how fonts work for right-to-left languages); strings may have spaces at the beginning or end to separate words, but may not have spaces in the middle

Classes in this Distribution

See Also

Further Work

  • Type-casting of PDF::StructElem.A to roles; as per 14.8.5. Possibly belongs in PDF::Class, however slightly complicated by the need to apply role-mapping.

  • Develop a tag/accessibility checker: a low-level sanity check that a tagged PDF meets PDF Association recommendations, for example run as pdf-tag-checker.raku --ua. See https://www.pdfa.org/wp-content/uploads/2014/06/MatterhornProtocol_1-02.pdf and the Clause 7 guidelines as summarized on Wikipedia:

    • Complete tagging of "real content" in logical reading order

    • Tags must correctly represent the document's semantic structures (headings, lists, tables, etc.)

    • Problematic content is prohibited, including illogical headings, the use of color/contrast to convey information, inaccessible JavaScript, and more

    • Meaningful graphics must include alternative text descriptions

    • Security settings must allow assistive technology access to the content

    • Fonts must be embedded, and text mapped to Unicode

The PDF accessibility standard ISO 14289-1 cannot be distributed and needs to be purchased from ISO.

  • Editing. Currently the API doesn't readily support editing tags into existing content. More work is also needed in the PDF::Content module to support content editing.

PDF::Tags v0.0.15

Reads and manipulates tagged PDF files

Authors

  • David Warring

License

Artistic-2.0

Dependencies

  • PDF:ver<0.4.17+>
  • PDF::Content:ver<0.5.14+>
  • PDF::Class:ver<0.4.16+>
  • Method::Also

Test Dependencies

Provides

  • PDF::Tags
  • PDF::Tags::Attr
  • PDF::Tags::Elem
  • PDF::Tags::Mark
  • PDF::Tags::Node
  • PDF::Tags::Node::Parent
  • PDF::Tags::Node::Root
  • PDF::Tags::ObjRef
  • PDF::Tags::Text
  • PDF::Tags::XML-Writer
  • PDF::Tags::XPath
  • PDF::Tags::XPath::Actions
  • PDF::Tags::XPath::Axes
  • PDF::Tags::XPath::Grammar
