PDF::Tags::Reader

Tagged PDF reader

Synopsis

use PDF::Class;
use PDF::Tags::Reader;
# read tags
my PDF::Class $pdf .= open: "t/pdf/tagged.pdf";
my PDF::Tags::Reader $tags .= read: :$pdf;
my PDF::Tags::Elem $doc = $tags[0];
say "document root {$doc.name}";
say " - child {.name}" for $doc.kids;
say $doc.xml; # dump tags and text content as XML

Description

This module implements reading of tagged PDF content from PDF files.

Methods

This class inherits from PDF::Tags, so all of that class's methods are available.

method read

method read(PDF::Class :$pdf!, Bool :$create, Bool :$marks) returns PDF::Tags

Reads the tagged PDF structure from an existing PDF file that has previously been tagged.

The :create option creates a new struct-tree root, if one does not already exist.

The :marks option causes PDF::Tags::Reader to descend into content and build a more detailed structure that includes the actual marks in the content stream as PDF::Tags::Mark objects. Otherwise, only the content text is inserted, as a child of type Str.
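As a minimal sketch of the :marks option, the following reuses only the calls shown in the Synopsis and assumes the same sample file, t/pdf/tagged.pdf, is present; substitute any tagged PDF.

```raku
use PDF::Class;
use PDF::Tags::Reader;

# Assumption: the sample file from the Synopsis exists at this path.
my PDF::Class $pdf .= open: "t/pdf/tagged.pdf";

# With :marks, marked content in the content stream is represented by
# PDF::Tags::Mark objects rather than by plain Str children.
my PDF::Tags::Reader $tags .= read: :$pdf, :marks;

# Dump the more detailed structure as XML.
say $tags[0].xml;
```

Comparing this output with a dump made without :marks should show the extra level of detail.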

method canvas-tags

method canvas-tags(PDF::Content::Canvas) returns Hash

Renders a canvas object (Page or XObject form) and caches marked content as a hash of PDF::Content::Tag objects, indexed by MCID (Marked Content ID).
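A hedged sketch of how canvas-tags might be used on a single page. It assumes the sample file from the Synopsis, that `$pdf.page(1)` returns the first page as a canvas object, and that each PDF::Content::Tag exposes a `name` method; check these against the PDF::Class and PDF::Content documentation.

```raku
use PDF::Class;
use PDF::Tags::Reader;

# Assumption: the sample file from the Synopsis exists at this path.
my PDF::Class $pdf .= open: "t/pdf/tagged.pdf";
my PDF::Tags::Reader $tags .= read: :$pdf;

# Assumption: page(1) returns the first page, which acts as a canvas.
my $page = $pdf.page(1);

# Render the page and collect its marked content, indexed by MCID.
my %marked = $tags.canvas-tags($page);

# List entries in numeric MCID order.
for %marked.sort(+*.key) {
    say "MCID {.key}: {.value.name}";
}
```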

Scripts in this Distribution

pdf-tag-dump.raku

pdf-tag-dump.raku --select=<xpath-expr> --omit=tag --password=Xxxx --max-depth=n --marks --/atts --/style --debug t/pdf/tagged.pdf

Options:

  • --password=**** - password for the input PDF, if encrypted with a user password

  • --max-depth=n - depth to ascend/descend struct tree

  • --/atts - disable tag attributes

  • --debug - write extra debugging information to XML

  • --marks - descend into marked content

  • --strict - warn about unknown tags, etc

  • --/style - omit stylesheet

  • --select=xpath-expr - twigs to include (relative to root)

This script reads tagged PDF content from PDF files and dumps it as XML.

PDF::Tags::Reader v0.0.4

Authors

  • David Warring

License

Artistic-2.0

Dependencies

  • PDF::Tags:ver<0.0.14+>

  • PDF::Font::Loader:ver<0.5.12+>

  • Method::Also

Test Dependencies

Provides

  • PDF::Tags::Reader
