PDF::Tags
[Raku PDF Project] / PDF::Tags
PDF-Tags-raku
A small DOM-like API for the creation of tagged PDF files.
This module enables PDF tagged content manipulation, with simple construction, XPath queries and basic XML serialization.
Synopsis
use PDF::Tags;
use PDF::Tags::Elem;
# PDF::API6
use PDF::API6;
use PDF::Annot;
use PDF::XObject::Image;
use PDF::XObject::Form;
my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
# create the document root
my PDF::Tags::Elem $root = $tags.Document;
my $page = $pdf.add-page;
my $header-font = $page.core-font: :family<Helvetica>, :weight<bold>;
my $body-font = $page.core-font: :family<Helvetica>;
$page.graphics: -> $gfx {
$root.Header1: $gfx, {
.say('Marked Level 1 Header',
:font($header-font),
:font-size(15),
:position[50, 120]);
};
$root.Paragraph: $gfx, {
.say('Marked paragraph text', :position[50, 100], :font($body-font), :font-size(12));
};
# add a marked image
my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
$root.Figure: $gfx, $img, :Alt('Incandescent apparatus');
# add a marked link annotation
my $destination = $pdf.destination( :page(2), :fit(FitWindow) );
my PDF::Annot $annot = $pdf.annotation: :$page, :$destination, :rect[71, 717, 190, 734];
$root.Link: $gfx, $annot;
# tagged XObject Form
my PDF::XObject::Form $form = $page.xobject-form: :BBox[0, 0, 200, 50];
my $form-elem = $root.Form;
$form.text: {
my $font-size = 12;
.text-position = [10, 38];
$form-elem.Header2: $_, {
.say: "Tagged XObject header", :font($header-font), :$font-size;
};
$form-elem.Paragraph: $_, {
.say: "Some sample tagged text", :font($body-font), :$font-size;
};
}
# render the form contained in $form-elem
$form-elem.do: $gfx, :position[150, 70];
}
$pdf.save-as: "/tmp/marked.pdf"
Description
A tagged PDF contains additional markup information describing the logical document structure of PDF documents.
PDF tagging may assist PDF readers and other automated tools in reading PDF documents and locating content such as text and images.
This module provides a DOM like interface for creating and traversing PDF structure and content via tags. It also an XPath like search capability. It is designed for use in conjunction with PDF::Class or PDF::API6.
Standard Tags
Elements may be constructed using their Tag
name or Mnemonic
, as listed below. For example:
$root.P: $gfx, { .say('Marked paragraph text') };
Can also be written as:
$root.Paragraph: $gfx, { .say('Marked paragraph text') };
Or as:
$root.add-kid(:name<P>).mark: $gfx, { .say('Marked paragraph text') };
Documentation in this section adapted from pdfkit.
"Grouping" elements:
Tag | Mnemonic | Description |
Document | whole document; must be used if there are multiple parts or articles | |
Part | part of a document | |
Art | Article | |
Sect | Section | may nest |
Div | Division | generic division |
BlockQuote | block quotation | |
Caption | describing a figure or table | |
TOC | TableOfContents | may be nested, and may be used for lists of figures, tables, etc. |
TOCI | TableOfContentsItem | table of contents (leaf) item |
Index | index (text with accompanying Reference content) | |
NonStruct | NonStructural | non-structural grouping element (element itself not intended to be exported to other formats like HTML, but 'transparent' to its content which is processed normally) |
Private | content only meaningful to the creator (element and its content not intended to be exported to other formats like HTML) |
"Block" elements:
Mmemonic | Tag | Description
Tag | Mnemonic | Description |
H | Heading | heading (first element in a section, etc.) |
H1 - H6 | Heading1 - Heading6 | heading of a particular level intended for use only if nesting sections is not possible for some reason |
P | Paragraph | |
L | List | should include optional Caption, and list items |
LI | ListItem | should contain Lbl and/or LBody |
Lbl | Label | bullet, number, or "dictionary headword" |
LBody | ListBody | (item text, or "dictionary definition"); may have nested lists or other blocks |
"Table" elements:
Tag | Mnemonic | Description |
Table | table; should either contain TR, or THead, TBody and/or TFoot | |
TR | TableRow | |
TH | TableHeader | table heading cell |
TD | TableData | table data cell |
THead | TableHead | table header row group |
TBody | TableBody | table body row group; may have more than one per table |
TFoot | TableFoot | table footer row group |
"Inline" elements:
Tag | Mnemonic | Description |
Span | generic inline content | |
Quote | inline quotation | |
Note | e.g. footnote; may have a Lbl (see "block" elements) | |
Reference | content in a document that refers to other content (e.g. page number in an index) | |
BibEntry | BibliographyEntry | may have a Lbl (see "block" elements) |
Code | code | |
Link | hyperlink; should contain a link annotation | |
Annot | Annotation | annotation (other than a link) |
Ruby | Chinese/Japanese pronunciation/explanation | |
RB | RubyBaseText | Ruby base text |
RT | RubyText | Ruby annotation text |
RP | RubyPunctuation | |
Warichu | Japanese/Chinese longer description | |
WT | WarichuText | |
WP | WarichuPunctuation |
"Illustration" elements (should have Alt and/or ActualText set):
Tag | Mnemonic | Description |
Figure | ||
Formula | ||
Form | form widget |
Non-structure tags:
Tag | Mnemonic | Description |
Artifact | used to mark all content not part of the logical structure | |
ReversedChars | every string of text has characters in reverse order for technical reasons (due to how fonts work for right-to-left languages); strings may have spaces at the beginning or end to separate words, but may not have spaces in the middle |
Classes in this Distribution
PDF::Tags - Tagged PDF root node
PDF::Tags::Attr - A single node attribute
PDF::Tags::Elem - Structure Tree descendant node
PDF::Tags::Node - Abstract node
PDF::Tags::Node::Parent - Abstract parent node
PDF::Tags::Mark - Leaf content marker node
PDF::Tags::Text - Text content node
PDF::Tags::ObjRef - A reference to a PDF object (PDF::Annot, PDF::Field or PDF::XObject)
PDF::Tags::XML-Writer - XML Serializer
PDF::Tags::XPath - XPath evaluation context
See Also
PDF::Tags::Reader - for reading PDF files with existing tagged content
Further Work
Type-casting of PDF::StructElem.A to roles; as per 14.8.5. Possibly belongs in PDF::Class, however slightly complicated by the need to apply role-mapping.
Develop a tag/accessibility checker. A low-level sanity checker that a tagged PDF meets PDF association recommendations
pdf-tag-checker.raku --ua
. See https://www.pdfa.org/wp-content/uploads/2014/06/MatterhornProtocol_1-02.pdf and Wikipedia Clause 7 guidelines:Complete tagging of "real content" in logical reading order
Tags must correctly represent the document's semantic structures (headings, lists, tables, etc.)
Problematic content is prohibited, including illogical headings, the use of color/contrast to convey information, inaccessible JavaScript, and more
Meaningful graphics must include alternative text descriptions
Security settings must allow assistive technology access to the content
Fonts must be embedded, and text mapped to Unicode
The PDF accessibility standard ISO 14289-1 cannot be distributed and needs to be purchased from ISO.
Editing. Currently the API doesn't readily support editing tags into existing content. More work is also needed in the PDF::Content module to support content editing.