% !Mode:: "TeX:DE:UTF-8:Main"
\PassOptionsToPackage{check-declarations,enable-debug}{expl3}
% Note on the compilation of the documentation:
% The documentation uses for the tagging sometimes code
% that is under development and/or not public yet.
% To compile an *untagged* documentation, comment the line with
% the testphase keys in the following \DocumentMetadata command.
\DocumentMetadata
{
% comment the following line to compile an untagged documentation:
testphase={phase-III,title,table},
pdfversion=2.0,lang=en-UK,pdfstandard=a-4,pdfstandard=ua-2
%uncompress
}
\DebugBlocksOff
\makeatletter
\def\UlrikeFischer@package@version{0.99i}
\def\UlrikeFischer@package@date{2024-11-19}
\makeatother
\documentclass[bibliography=totoc,a4paper]{article}
\usepackage{geometry}
\usepackage[english]{babel}
\usepackage{unicode-math}
\setmainfont{Heuristica}
\usepackage[nopatch]{microtype}
\usepackage[autostyle]{csquotes}
\usepackage[style=numeric]{biblatex}
\addbibresource{tagpdf.bib}
\reversemarginpar
\NewDocumentCommand\sidenote{m}{\marginpar{#1}}
\usepackage{booktabs}
\setlength\belowcaptionskip{10pt}
\usepackage{tcolorbox}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usetikzlibrary{fit,tikzmark}
\usetikzlibrary{arrows.meta}
\tikzset{arg/.style = {font=\footnotesize\ttfamily, anchor=base,draw, rounded corners,node distance=2mm and 2mm}}
\tikzset{operator/.style = {font=\footnotesize\ttfamily, anchor=base,draw, rounded corners,node distance=4mm and 4mm}}
\usepackage{listings}
\lstset{basicstyle=\ttfamily, columns=fullflexible,language=[LaTeX]TeX,
escapechar=*,
commentstyle=\color{green!50!black}\bfseries}
% this allow to get real spaces in the code parts.
% This should perhaps be combined in a new listings key
\lstset{showspaces}
\makeatletter \def\lst@visiblespace{\lst@ttfamily{\char32}{\char32}}\makeatother
\tagpdfsetup{tabsorder=structure}
\usepackage[pdfdisplaydoctitle=true]{hyperref}
\hypersetup{
pdftitle={The tagpdf package, v\csname UlrikeFischer@package@version\endcsname},
pdfauthor=Ulrike Fischer,
colorlinks}
\tcbuselibrary{documentation}
\definecolor{Definition}{rgb}{0,0.2,0.6}
\newcommand\PrintKeyName[1]{\textsf{#1}}
\newcommand\pkg[1]{\texttt{#1}}
\newcommand\DescribeKey[1]{\texttt{#1}}
%tagging patches:
\usepackage{tagpdfdocu-patches}
\newcommand\PDF{PDF}
\title{The \pkg{tagpdf} package, v\csname UlrikeFischer@package@version\endcsname}
\date{\csname UlrikeFischer@package@date\endcsname}
\author{Ulrike Fischer\thanks{fischer@troubleshooting-tex.de}}
\usepackage{shortvrb}
\MakeShortVerb|
\begin{document}
\maketitle
\begin{tcolorbox}[colframe=red]
This package is not meant for direct use in (normal) documents. It started in 2018 as
a support tool to \emph{research} tagging. It is now the base of the code developed
in the \pkg{latex-lab} bundle for the Tagged PDF project (i.e., loaded by that code)
\url{https://www.latex-project.org/publications/indexbytopic/pdf/}.
The package is developed and improved in parallel with the code in the \pkg{latex-lab}
bundle (part of the core \LaTeX{} distribution), the \pkg{pdfmanagement-testphase}
package (the \LaTeX{} PDF management bundle) and the L3 programming layer (part of the \LaTeX{} format).
That means you must ensure that all these components are up-to-date and in
sync with each other.
This package quite probably still contains some bugs. It is in some parts quite slow because
the code currently prefers readability over speed. At some point in the future its code will
be integrated into the \LaTeX{} format and then this package will disappear.
Because of its function as a research and development tool it is
important to understand that this package can still change in
incompatible ways from one version to the next.
You need some knowledge about \TeX, \PDF{} and perhaps even lua to use it.
\medskip
Issues, comments and suggestions can be reported in either of these two GitHub trackers:
\medskip
\centering \url{https://github.com/latex3/tagging-project}\par
\leavevmode\llap{or\qquad\qquad} \url{https://github.com/latex3/tagpdf}
\end{tcolorbox}
\tagtool{sec-add-grouping=false}
\tableofcontents
\tagtool{sec-add-grouping}
\section{Introduction}
For many years the creation of accessible, tagged \PDF{}-files with \LaTeX\
that conform to the PDF/UA standard has been on the agenda of \TeX-meetings.
Many people agree that this is important and Ross Moore has done quite some
work on it. There is also a TUG-mailing list and a web page
\parencite{tugaccess} dedicated to this topic.
What was missing, in my opinion, were means to \emph{experiment} with tagging and
accessibility: means to try out how difficult it is to tag some structures,
means to try out how much tagging is really needed (standards and validators
don't need to be right \ldots), means to test what else is needed so that a
\PDF{} works e.g. with a screen reader, and means to try out how core \LaTeX\
commands behave if tagging is used. Without such experiments it is in my
opinion quite difficult to get a feeling for what has to be done, which
kernel changes are needed, and how packages should be adapted.
This package was developed to close this gap by offering \emph{core} commands
to tag a \PDF{}\footnote{In case you don't know what this means: there will
be some explanations later on.}. My hope was that the knowledge gained by the
use of this package would in the end make it possible to decide if and how code to do
tagging should become part of the \LaTeX\ kernel.
The code has been written so that it can be added as a module to the \LaTeX{}
kernel itself if it turns out to be usable. It therefore avoids patching
commands from other packages. It was also not an aim of the package to
develop patches to directly enable tagging in other packages. While in the end changes to various commands in many
classes and packages will be needed to automatically get tagged \PDF{} files, these changes
should be made by class, package and document writers themselves using a
sensible API provided by the kernel and not by some external package that
adds patches everywhere and would need constant maintenance; one only needs
to look at packages like \pkg{tex4ht} or \pkg{bidi} or \pkg{hyperref} to see how difficult and
sometimes fragile this is.
The package is now a part of the Tagged PDF project and has already triggered
various changes in the \LaTeX\ kernel and the engines: there is a new PDF
management, the new para hooks allow paragraphs to be tagged automatically, after
changes in the output routine page breaks and headers and footers are handled
correctly, and the engines now support structure destinations. More changes are
in the \pkg{latex-lab} bundle and can be loaded through \texttt{testphase} keys.
I'm sure that \pkg{tagpdf} still has bugs. Bug reports, suggestions and comments
can be added to the issue tracker on GitHub at either
\url{https://github.com/latex3/tagpdf} or
\url{https://github.com/latex3/tagging-project}.
Please also check the GitHub site and \pkg{latex-lab} for new examples and improvements.
\subsection{Tagging and accessibility}
While the package is named \pkg{tagpdf} the goal is also \emph{accessible}
\PDF{}-files. Tagging is \emph{one} (the most difficult) requirement for
accessibility but there are others. I will mention some later on in this
documentation, and -- if sensible -- I will also try to add code, keys or
tips for them.
So the name of the package is a bit wrong. As an excuse I can only say that it
is short and easy to pronounce (and of course, it was always meant to be temporary).
\subsection{Engines and modes}
Theoretically, the package works with all engines, but the xelatex and the
latex-dvips routes are basically untested and they also don't support real
space glyphs, so I don't recommend them. lualatex is the most powerful and
safest choice and should be used for new documents; it is slower than pdflatex
but requires fewer compilations. pdflatex works ok and can be used for legacy
documents; it needs more compilations to resolve all cross references needed
for the tagging.
The package has two modes: the \emph{generic mode} which should work in
theory with every engine and the \emph{lua mode} which works only with
lualatex and (since version 0.98k) with dvilualatex.
I implemented the generic mode first, mostly because my \TeX\ skills are much
better than my lua skills and I wanted to get the \TeX\ side right before
starting to fight with attributes and node traversing.
While the generic mode is not bad and I spent quite some time to get it
working I nevertheless think that the lua mode is the future and the only one
that will be usable for larger documents. \PDF{} is a page orientated format
and so the ability of luatex to manipulate pages and nodes after the
\TeX-processing has finished is really useful here. Also with luatex characters are
normally already given as Unicode.
The package uses quite a lot of labels (in generic mode more than in lua mode).
It is now based on the property module of the \LaTeX{} kernel. This module
provides expandable references but the drawback is that (right now) they don't always give
good rerun messages if they have changed. I advise using the
\pkg{rerunfilecheck} package as an intermediate work-around and, when using
pdflatex, compiling at least once or twice more often than normal.
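A minimal sketch of this advice follows. It assumes that the \texttt{aux}
option of \pkg{rerunfilecheck} is the one that enables the checks of the
auxiliary files; please verify this against the documentation of that package.
\begin{taglstlisting}
\DocumentMetadata{testphase=phase-III}
\documentclass{article}
% assumption: the aux option enables the checks of the auxiliary files,
% so that a rerun warning is issued if they changed during the run
\usepackage[aux]{rerunfilecheck}
\begin{document}
some text
\end{document}
\end{taglstlisting}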
\subsection{References and target PDF version}
My main reference for the first versions of this package was the free
reference for \PDF{} 1.7 \parencite{pdfreference}, and so these versions implemented
only support for \PDF{} 1.7.
In 2018 \PDF{} 2.0 was released. The reference can now be obtained at no
cost through the PDF Association.
\PDF{} 2.0 has a number of features that are really needed for good tagging:
it knows more structure types; it allows associated files to be added to
structures (these are small, embedded files that can, for example, contain
the MathML or source code of an equation); and it knows structure destinations,
which allow linking to a structure.
\PDF{}~2.0 features are currently (end of 2023) not well supported by
\PDF{} consumers. No PDF viewer (including Acrobat), for example, can handle name
spaces and associated files. The PDF Accessibility Checker (PAC) even crashes
if one tries to load a \PDF{} 2.0 file, and pdftk will create a \PDF{}~1.0
from it.
Nevertheless \LaTeX{} targets \PDF{} 2.0, and \pkg{tagpdf} has added support for
associated files, for name spaces and other \PDF{} 2.0 features. We recommend
using \PDF{} 2.0 if possible and then complaining to the \PDF{} consumer if
something doesn't work.
The package doesn't try to suppress all 2.0 features if an older \PDF{}
version is produced. It normally does no harm if a \PDF{} contains keys
unknown in its version and it makes the code faster and easier to maintain if
there aren't too many tests and code paths; so for example associated files
will always be added. But tests could be added in case this leads to
incompatibilities.
\subsection{Validation}
\PDF{}s created with the commands of this package must be validated:
\begin{itemize}
\item
One must check that the \PDF{} is \emph{syntactically} correct.
It is rather easy to create a broken \PDF{}:
e.g. if a chunk is opened on one page but closed
on the next page or if the document isn't compiled often enough.
\item One must check how well the \PDF{} follows the requirements of standards
like PDF/UA \emph{formally}\footnote{The PDF/UA-2 standard for \PDF~2.0
will hopefully be released at the beginning of 2024.}.
\item
One must check how good the accessibility is \emph{practically}.
\end{itemize}
Syntax validation and formal standard validation can be done for example with
the preflight tool of the (non-free) Adobe Acrobat. It can also be done
(only for PDF 1.7 and older) with the
free \PDF{} Accessibility Checker (PAC~2024) \parencite{pac2024}. There is also
the validator veraPDF \parencite{verapdf} which can also handle PDF 2.0 files.
A quite useful tool
is \enquote{Next Generation PDF} \parencite{ngpdf}, a browser application
which converts a tagged PDF to html and allows one to inspect and edit
its structure. For PDF~2.0 files there is also a checker based on the
Arlington model from veraPDF.
Practical validation is naturally the more complicated part.
It needs screen readers and users who actually know how to handle them,
can test documents and can report where a \PDF{} has real accessibility problems.
\minisec{Preflight woes}
Sadly validators cannot always be trusted.
As an example, for a reason that I don't understand the Adobe preflight
doesn't like the list structure \texttt{L}.
It is also possible that validators contradict each other: one says everything is okay,
while the other complains.
\subsection{Examples wanted!}
To make the package usable examples are needed: examples that demonstrate how
various structures can be tagged and which patches are needed, examples for
the test suite, examples that demonstrate problems.
\begin{tcolorbox}
Feedback, contributions and corrections are welcome!
\end{tcolorbox}
All examples should use the \cs{DocumentMetadata} key \PrintKeyName{uncompress}
so that uncompressed \PDF{} are created and the internal objects and
structures can be inspected and be compared by the l3build checks.%
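
A minimal sketch of the head of such an example file; the choice of
\texttt{phase-III} and of \PDF{}~2.0 is only an assumption, the relevant part
is the \PrintKeyName{uncompress} key:
\begin{taglstlisting}
\DocumentMetadata
 {
  testphase  = phase-III,
  pdfversion = 2.0,
  uncompress, % write an uncompressed PDF that can be inspected
 }
\documentclass{article}
\begin{document}
some text
\end{document}
\end{taglstlisting}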
\subsection{Proof of concept: the tagging of the documentation itself}
Starting with version 0.6 the documentation itself has been tagged. The
tagging wasn't (and isn't) in any way perfect. The validator from Adobe didn't
complain, but PAC~3 wanted alternative text for all links (no idea why) and
so I put everywhere simple text like \enquote{link} and \enquote{ref}. The
links to footnotes gave warnings, so I disabled them. I used types from the
\PDF{} version 1.7, mostly as I had no idea what should be used for code in
2.0. Margin notes were simply wrong and there were tagging commands
everywhere \ldots
The tagging has been improved and automated over time in sync with
improvements and new features in the \LaTeX\ kernel, the latex-lab bundle and
the \PDF\ management code and is now much better. Only a few
structures---mostly some from currently unsupported packages--- still need
manual tagging. But sadly the output of the validators doesn't quite reflect
the improvements. The documentation now uses \PDF~2.0 and while the newest
PAC~2024 can at least open the file it cannot validate the file properly. For example
it complains about the tabular header cells as it doesn't follow attribute classes.
The Adobe validator has a bug and
doesn't like the (valid) use of the \texttt{Lbl} tag for the section numbers
(see figure~\ref{fig:adobe}).
But even if the documentation passed all the tests of the validators: as
mentioned above passing a formal test doesn't mean that the content is really
good and usable. The user commands used for the tagging and also some of the
patches used are still rather crude. So there is a lot of room for improvement.
\begin{tcolorbox}[]
Be aware that to create the tagged version a current lualatex-dev and a
current version of the \pkg{pdfmanagement-testphase} package are needed.
\end{tcolorbox}
\includegraphics[alt=PAC 2024 complains about PDF version]{pac2024-version}
\includegraphics[alt=PAC 2024 complains about table header cells]{pac2024-report}
\begin{figure}
\includegraphics[alt={Screenshot of Adobe report}]{acrobat}
\caption{Adobe Acrobat complaining
about the \texttt{Lbl} use}\label{fig:adobe}\par
\end{figure}
\section{Loading}
The package requires the new PDF management. With a current \LaTeX{} (2022-06-01 or newer)
the PDF management is loaded if you use the \cs{DocumentMetadata} command before \cs{documentclass}.
The \pkg{tagpdf} package can then be loaded and activated by using the \texttt{testphase} key. The exact behavior of
the \texttt{testphase} key is documented in \texttt{documentmetadata-support-doc.pdf} which
is part of the \pkg{latex-lab} bundle.
Various parts of the code differentiate between \PDF{} version 2.0 and lower versions. If
\PDF{} 2.0 is wanted it is required to set the version early in the \cs{DocumentMetadata}
command so that \pkg{tagpdf} can pick up the correct code path.
\begin{taglstlisting}
\DocumentMetadata
{
% testphase = phase-I, % tagging without paragraph tagging
% testphase = phase-II, % tagging with paragraph tagging
testphase = phase-III, % tagging with paragraph sec, toc, blocks and more
pdfversion = 2.0, % pdfversion must be set here.
pdfstandard=ua-2, % pdfstandard can be set too
}
\documentclass{article}
\begin{document}
some text
\end{document}
\end{taglstlisting}
\minisec{Deactivation}
When loading \pkg{tagpdf} through the \texttt{testphase} keys, it is automatically activated.
To deactivate it while still retaining all the other new code from the latex-lab testphase files,
use |\tagpdfsetup{activate/all=false}| in the preamble. You can additionally deactivate the
paratagging and the interword space code.
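A sketch of a preamble setting that deactivates the tagging and, additionally,
the paratagging and the space code. It only uses keys described in
section~\ref{ssec:setup}; whether all three settings are needed depends on
the document:
\begin{taglstlisting}
\tagpdfsetup
 {
  activate/all    = false, % deactivate the tagging code
  para/tagging    = false, % no automatic tagging of paragraphs
  activate/spaces = false  % no insertion of space glyphs
 }
\end{taglstlisting}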
To suppress the loading of the package altogether you can try
\begin{taglstlisting}
\makeatletter
\disable@package@load{tagpdf}{}
\makeatother
\DocumentMetadata{...}
\end{taglstlisting}
\minisec{Loading as package needs activation!}
It is not recommended anymore, but the package can also be loaded
normally with |\usepackage| (it is still required to
use \cs{DocumentMetadata} to load the \PDF\ management). It will
then, apart from loading more packages and defining a lot of things,
not do much. You will have to \emph{activate} it
with \verb+\tagpdfsetup+.
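A minimal sketch of this (no longer recommended) way of loading. It assumes
that the keys \PrintKeyName{activate} and \PrintKeyName{para/tagging}
described in section~\ref{ssec:setup} give the wanted activation; other
combinations of keys are possible:
\begin{taglstlisting}
\DocumentMetadata{} % still needed, loads the PDF management
\documentclass{article}
\usepackage{tagpdf}
% without the activation the package does not do much
\tagpdfsetup{activate, para/tagging=true}
\begin{document}
some text
\end{document}
\end{taglstlisting}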
The \PDF\ management loaded with \cs{DocumentMetadata} will in any
case load \pkg{tagpdf-base}, a small package that provides no-op
versions of the main tagging commands.
Most commands do nothing if tagging is not activated, but in case a
test is needed a command (with the usual p,T,F variants) is provided:
\begin{docCommand}{tag_if_active:TF}{}\end{docCommand}
The check is true only if \emph{everything} is activated. In all other
cases (including if tagging has been stopped locally) it will be
false.
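A usage sketch of the test in expl3 code. The function name
|\mypkg_show_status:| is hypothetical and only serves as an illustration:
\begin{taglstlisting}
\ExplSyntaxOn
% hypothetical helper: reports the tagging state on the terminal
\cs_new_protected:Npn \mypkg_show_status:
  {
    \tag_if_active:TF
      { \iow_term:n { tagging~is~active } }
      { \iow_term:n { tagging~is~not~active~or~locally~stopped } }
  }
\ExplSyntaxOff
\end{taglstlisting}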
\subsection{Modes and package options}
%TODO think about tagging of the keys. Aside? Header?
The package has two different modes: The \textbf{generic mode} works
(in theory, currently only fully tested with pdflatex) probably with
all engines; the \textbf{lua mode} works only with lualatex. The differences
between both modes will be described later. The mode can be set with
package options:
\DescribeKey{luamode}
This is the default mode. It will use the generic mode if the document
is processed with pdflatex and the lua mode with lualatex.
\DescribeKey{genericmode}
This will force the generic mode for all engines.
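A sketch of how the generic mode could be forced, assuming that the package
is loaded explicitly with |\usepackage| and therefore has to be activated by hand:
\begin{taglstlisting}
\DocumentMetadata{}
\documentclass{article}
\usepackage[genericmode]{tagpdf} % generic mode also with lualatex
\tagpdfsetup{activate}
\end{taglstlisting}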
\subsection{Setup and activation}\label{ssec:setup}
\begin{docCommand}{tagpdfsetup}{\marg{key-val-list}}\end{docCommand}
This command sets up the general behavior of the package.
The command should normally be used only in the preamble
(for a few keys it could also make sense to change them in the document).
The key-val list understands at least the following keys. More keys are defined in some of the \pkg{latex-lab} modules; see table~\ref{tab:setupkey} for an overview which also includes older, now deprecated names.
\begin{table}
\caption{Overview of keys for \cs{tagpdfsetup}}\label{tab:setupkey}
\input{tagpdfsetup-keys}
\end{table}
\begin{description}
\item[\PrintKeyName{activate/all}] Boolean, initially false. Activates
everything; that's normally the sensible thing to do.
\item [\PrintKeyName{activate}] Like |activate/all|, but
\emph{additionally} it opens a structure at begin document with
|\tagstructbegin| and closes it at end document. The key accepts as
value a tag name which is used as the tag of the structure. The
default value is |Document|.
\item[\PrintKeyName{activate/mc}] Boolean, initially false. Activates
the code related to marked content.
\item[\PrintKeyName{activate/struct}] Boolean, initially
false. Activates the code related to structures. Should be used only
if \PrintKeyName{activate/mc} has been used too.
\item[\PrintKeyName{activate/struct-dest}] Boolean, initially true.
Starting with version 0.93
\pkg{tagpdf} will automatically create structure destinations (see
section~\ref{sec:struct-dest}) if \pkg{hyperref} is used and if the
engine supports it. With this key this
can be suppressed.
\item[\PrintKeyName{activate/tree}] Boolean, initially
false. Activates the code related to trees. Should be used only if
the two other keys have been used too.
\item[\PrintKeyName{activate/spaces}] Boolean. The key
activates/deactivates the insertion of space glyphs, see
section~\ref{sec:spacechars}. In the luamode it only works if at
least \PrintKeyName{activate/mc} has been used.
The old name of the key, |interwordspace|, is still supported but deprecated.
\item[\PrintKeyName{activate/softhyphen}] Boolean. luamode only.
The key activates/deactivates the replacing of hard hyphens from hyphenation
by soft hyphens. By default this is activated.
\item[\PrintKeyName{role/new-tag}] Allows new tag names to be defined, see
section~\ref{sec:new-tag} for a description.
\item[\PrintKeyName{role/new-attribute}] This key takes two arguments and
declares an attribute. See section~\ref{sec:attributes}.
\item[\PrintKeyName{role/map-tags}] This key allows to remap the structure
tags. Currently it supports only two values: |false| (the default) and |pdf| which
maps all tags to their standard PDF role, e.g. |itemize| will be mapped to |L|.
\item[\PrintKeyName{para/tagging}] Boolean. This activates/deactivates
the automatic tagging of paragraphs, see section~\ref{sec:paratagging} for
more background. It uses the \texttt{para/begin} and
\texttt{para/end} hooks.
As tagging support grows, conditions will be added; that means the
code is bound to change! Paragraphs can appear in many unexpected
places and the code can easily break, so there is also an option to
see where such paragraphs are (see \PrintKeyName{debug/show} below).
\item[\PrintKeyName{para/tag}] String. This key changes the second tag
used by the paratagging code. The default tag is \texttt{text}, a
\LaTeX{} specific tag that is role mapped to \texttt{P}. A useful
local setting here can be \texttt{NonStruct}, which creates a
structure \enquote{without meaning}. For local changes it is
recommended to use the newer \cs{tagtool} command described below
instead of \cs{tagpdfsetup}.
\item[\PrintKeyName{para/maintag}] String. This key changes the first tag
used by the paratagging code. The default tag is \texttt{text-unit}, a
\LaTeX{} specific tag that is role mapped to \texttt{Part}.
For local changes it is
recommended to use the newer \cs{tagtool} command described below
instead of \cs{tagpdfsetup}.
\item[\PrintKeyName{page/tabsorder}] Choice key, possible values are
\PrintKeyName{row}, \PrintKeyName{column}, \PrintKeyName{structure},
\PrintKeyName{none}. This decides if a \verb+/Tabs+ value is
written to the dictionary of the page objects. Not really needed for
tagging itself, but one of the things you probably need for
accessibility checks. So I added it. Currently the tabsorder is the
same for all pages. Perhaps this should be changed \ldots.
\item[\PrintKeyName{activate/tagunmarked}] Boolean,\sidenote{luamode} initially
true. When this boolean is true, the lua code will try to mark
everything that has not been marked yet as an artifact. The benefit
is that one doesn't have to mark up every deco rule oneself. The
danger is that it perhaps marks things that shouldn't be marked --
it hasn't been tested yet with complicated documents containing
annotations etc. See also section~\ref{sec:lazy} for a discussion
about automatic tagging.
\item[\PrintKeyName{viewer/startstructure}] A structure number. If an \texttt{OpenAction}
is set in the PDF Catalog (which is normally the case if \pkg{hyperref} is used)
a structure destination pointing to the structure is added. The initial value is structure 1 (the \texttt{Document} structure); the default value is the current structure. The
key can be used more than once; the last setting will win.
\item[\PrintKeyName{debug/uncompress}] Sets both the \PDF{} compresslevel
and the \PDF{} objcompresslevel to 0 and so allows to inspect the
\PDF{}. Not really useful anymore as this can also
be set in \cs{DocumentMetadata}.
\item[\PrintKeyName{debug}] This key knows a number of sub-keys to
set various debug options.
\begin{description}
\item[\PrintKeyName{debug/show}] This takes a comma list of keywords:
\texttt{spaces}/\texttt{spacesOff}: \sidenote{luamode}
This helps in lua mode to see where space glyphs will be
inserted if \PrintKeyName{activate/spaces} is activated.
This can also be activated with the now deprecated key |show-spaces|.
\texttt{para}/\texttt{paraOff}: This (locally)
activates/deactivates small red and green numbers in the places where
the paratagging hook code is used.
\item[\PrintKeyName{debug/log}] Choice key, possible values
\PrintKeyName{none}, \PrintKeyName{v}, \PrintKeyName{vv},
\PrintKeyName{vvv}, \PrintKeyName{all}. Sets the log level.
Changing the value currently affects mostly the lua mode:
\enquote{higher} values give more messages in the log. The current
levels and messages have been set up in a quite ad-hoc manner and
will need improvement.
\end{description}
\end{description}
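A sketch of a preamble setting that combines some of the keys described
above; the chosen values are only examples:
\begin{taglstlisting}
\tagpdfsetup
 {
  page/tabsorder = structure, % tab order follows the structure
  debug/log      = v,         % a few more messages in the log
  debug/show     = para       % show where the para hook code is used
 }
\end{taglstlisting}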
\begin{docCommands}
{
{doc name=tagtool,doc parameter=\marg{key-val}},
{doc name=tag_tool:n,doc parameter=\marg{key-val}}
}
\end{docCommands}
The tagging of document elements requires a variety of small
commands. This command will unify them under a common interface. This
is work-in-progress and syntax and implementation can change! While
the argument looks like a key-val \emph{list} (and currently is
actually one), this should not be relied on. Instead only one key
should be used per call as the implementation will change to improve the
speed. Currently the following arguments are supported:
\begin{description}
\item[\PrintKeyName{para/tagging}] Boolean. It will replace the
\cs{tagpdfparaOn} and \cs{tagpdfparaOff} commands.
\item[\PrintKeyName{para/maintag}] String. It allows to change the outer
tag used in the following automatically tagged paragraphs. The
setting is local.
\item[\PrintKeyName{para/tag}] String. It allows to change the inner
tag used in the following automatically tagged paragraphs. The
setting is local.
\item[\PrintKeyName{para/flattened}] Boolean. If set it will suppress
the outer structure in the automatic paratagging. This should be
applied to the start and end hook in the same way! The setting is
local.
\end{description}
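A sketch of the local use of \cs{tagtool}, with one key per call as
recommended above; the grouping and the text are only an illustration:
\begin{taglstlisting}
\begingroup
 \tagtool{para/tag=NonStruct}% inner tag of the following paragraphs
 A paragraph whose inner structure has no meaning.\par
\endgroup
\tagtool{para/tagging=false}% stop the automatic paragraph tagging
... manually tagged material ...
\tagtool{para/tagging=true}% restart it
\end{taglstlisting}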
\section{Tagging}
PDF is a page orientated graphic format. It simply puts ink and glyphs
at various coordinates on a page. A simple stream of a page can look
like this\footnote{The appendix contains some remarks about the syntax
of a \PDF{} file.}:
\begin{taglstlisting}[columns=fixed]
stream
BT
/F27 14.3462 Tf %select font
89.291 746.742 Td %move point
[(1)-574(Intro)-32(duction)]TJ %print text
/F24 10.9091 Tf %select font
0 -24.35 Td %move point
[(Let's)-331(start)]TJ %print text
205.635 -605.688 Td %move point
[(1)]TJ %print text
ET
endstream
\end{taglstlisting}
From this stream one can extract the characters and their placement on the page
but not their semantic meaning (the first line is actually a section heading,
the last the page number). And while in the example the order is correct
there is actually no guarantee that the stream contains the text in the order
it should be read.
Tagging means to enrich the \PDF{} with information about the \emph{semantic}
meaning and the \emph{reading order}. (Tagging can do more, one can also
store all sorts of layout information like font properties and indentation
with tags. But as I already wrote this package concentrates on the part of
tagging that is needed to improve accessibility.)
\subsection{Three tasks}
To tag a \PDF{} three tasks must be carried out:
\begin{enumerate}
\item
\textbf{The mark-content-task}:\sidenote{mc-task} The document must add
\enquote{labels} to the page stream which allow one to identify and reference
the various chunks of text and other content.
This is the most difficult part of tagging, both for the document writer
and for the package code. First, there can be quite a lot of
chunks, as every one is a leaf node of the structure and so often a rather
small unit. Second, the chunks must be defined page-wise, and
this is not easy when you don't know where the page breaks are.
Also in a standard document a lot of text is created automatically, e.g.
the toc, references, citations, list numbers etc., and it is not always
easy to mark them correctly.
\item \textbf{The structure-task}:\sidenote{struct-task} The document must
declare the structure. This means marking the start and end of
semantically connected portions of the document (correctly nested as a
tree). This too means some work for the document writer, but less than
for the mc-task: first, quite often the mc-task and the
structure-task can be combined, e.g. when you mark up a list number or
a tabular cell or a section header; second, one doesn't have to worry
about page breaks so quite often one can patch standard environments to
declare the structure. On the other hand a number of structures end in
\LaTeX\ only implicitly (e.g. an item ends at the next item), so
getting the \PDF{} structure right still means that additional markup
must be added.
\item \textbf{The tree management}:\sidenote{tree-task} Finally the
structure must be written into the \PDF{}. For every structure an
object of type \texttt{StructElem} must be created and flushed with
keys for the parents and the kids. A parent tree must be created to get
a reference from the mc-chunks to the parent structure. A role map and
a number of dictionary entries must be written. All this is hopefully
done automatically and correctly by the package \ldots.
\end{enumerate}
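The interplay of the first two tasks can be sketched with the document
commands of the package. The sketch assumes the commands |\tagstructbegin|,
|\tagstructend|, |\tagmcbegin| and |\tagmcend| with a \texttt{tag} key and
ignores page breaks; the tree management needs no code in the document:
\begin{taglstlisting}
\tagstructbegin{tag=Sect}   % struct-task: open the section
 \tagstructbegin{tag=H1}    % struct-task: open the heading
  \tagmcbegin{tag=H1}       % mc-task: start a chunk
   Introduction
  \tagmcend                 % mc-task: end the chunk
 \tagstructend              % close the heading
 \tagstructbegin{tag=P}     % struct-task: open the paragraph
  \tagmcbegin{tag=P}
   Let's start
  \tagmcend
 \tagstructend              % close the paragraph
\tagstructend               % close the section
\end{taglstlisting}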
\begin{figure}[t!]
\begin{tcolorbox}[]
\minisec{Page stream with marked content}
\begin{tikzpicture}[baseline=(a.north),node distance=2pt,remember picture,
alt={Illustration of page stream with marked content}]
\node(start){\ldots~\ldots~\ldots};
\node[draw,base right = of start](a) {mc-chunk 1};
\node[draw,base right = of a](b) {mc-chunk 2};
\node[draw,base right = of b](c) {mc-chunk 3};
\node[draw,base right = of c](d) {mc-chunk 4};
\node[base right = of d] {\ldots~\ldots};
\end{tikzpicture}
\minisec{Structure}
\newlength\ydistance\setlength\ydistance{-0.8cm}
\begin{tikzpicture}[remember picture,baseline=(root.north),alt={Illustration of structure}]
\node[draw,anchor=base west] (root) at (0,0) {Sect (start section)};
\node[draw,anchor=base west] at (0.3,\ydistance) {H (header section)};
\node[draw,anchor=base west](aref) at (0.6,2\ydistance){mc-chunk 1};
\node[draw,anchor=base west](bref) at (0.6,3\ydistance){mc-chunk 2};
\node[draw,anchor=base west] at (0.3,4\ydistance){/H (end header)};
\node[draw,anchor=base west] at (0.3,5\ydistance){P (start paragraph)};
\node[draw,anchor=base west](cref) at (0.6,6\ydistance){mc-chunk 3};
\node[draw,anchor=base west](dref) at (0.6,7\ydistance){mc-chunk 4};
\node[draw,anchor=base west] at (0.3,8\ydistance){/P (end paragraph)};
\node[draw,anchor=base west] at (0,9\ydistance){/Sect (end section)};
\end{tikzpicture}
\begin{tikzpicture}[remember picture, overlay]
\draw[->,red](aref)-|(a);
\draw[->,red](bref)-|(b);
\draw[->,red](cref)-|(c);
\draw[->,red](dref)-|(d);
\end{tikzpicture}
\end{tcolorbox}
\caption{Schematic
description of the relation between marked content in the page stream and the
structure}
\end{figure}
\subsection{Task 1: Marking the chunks: the mark-content-step}
To be able to refer to parts of the text in the structure, the text in the
page stream must get \enquote{labels}. In the \PDF{} reference they are
called \enquote{marked content}. The three main variants needed here are:
\begin{description}
\item[Artifacts] They are marked with a pair of keywords, \texttt{BMC}
and \texttt{EMC}, which surround the text. \texttt{BMC} has a single
prefix argument, the fixed tag name \texttt{/Artifact}. Artifacts should
be used for irrelevant text and page content that should be ignored in
the structure. Sadly it is often not possible to leave such text simply
unmarked: the accessibility tests in Acrobat and other validators
complain.
\begin{taglstlisting}
/Artifact BMC
text to be marked
/EMC
\end{taglstlisting}
\item[Artifacts with a type] They are marked with a pair of keywords,
\texttt{BDC} and \texttt{EMC}, which surround the text. \texttt{BDC}
has two arguments: again the tag name \texttt{/Artifact} and a
following dictionary which allows one to specify the type of the
suppressed content. Text in header and footer can e.g. be declared as
pagination like this:
\begin{taglstlisting}
/Artifact <</Type /Pagination>> BDC
text to be marked
/EMC
\end{taglstlisting}
\item[Content] Content is also marked with a pair of keywords,
\texttt{BDC} and \texttt{EMC}. The first argument of \texttt{BDC} is a
tag name which describes the structural type of the text.\footnote{There
is quite some redundancy in the specification here. The structural type
is also set in the structure tree. One wonders whether it wouldn't be
enough to always use \texttt{/Span} here.} Examples are \texttt{/P} (paragraph),
\texttt{/H2} (heading), \texttt{/TD} (table cell). The reference
mentions a number of standard types but it is possible to add more or
to use different names.
In the second argument of \texttt{BDC} -- in the property dictionary -- more
data can be stored. \emph{Required} is an \texttt{/MCID}-key which takes an
integer as a value:
\begin{taglstlisting}
/H1 <</MCID 3>> BDC
text to be marked
/EMC
\end{taglstlisting}
This integer is used to identify the chunk when building the structure
tree. The chunks are numbered by page starting with 0. As the numbers are
also used as an index in an array there shouldn't be \enquote{holes} in the
numbering system. (It is perhaps possible to handle a numbering scheme not
starting at 0 and having holes, but it would enlarge the \PDF{} as one would
need dummy objects.)
It is possible to add more entries to the property dictionary, e.g. a
title, alternative text or a local language setting.
\end{description}
The needed markers can be added with low level code e.g. like this (in pdftex syntax):
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC}%
text to be marked
\pdfliteral page {EMC}%
\end{taglstlisting}
This sounds easy. But there are quite a number of traps, mostly with pdfLaTeX:
\begin{enumerate}[beginpenalty=10000]
\item \PDF{} is a page oriented format. And this means that the start
\texttt{BDC}/\texttt{BMC} and the corresponding end \texttt{EMC}
must be on the same page. So marking e.g. a section title like in the
following example won't always work as the literal before the
section could end on the previous page:
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC} %problem: possible pagebreak here
\section{mysection}
\pdfliteral page {EMC}%
\end{taglstlisting}
Using the literals \emph{inside} the section argument is better, but then
one has to take care that they don't wander into the header and the toc.
\item Literals are \enquote{whatsit} nodes and can change spacing, page
and line breaking. The literal \emph{after} the section in the
previous example could e.g. lead to a lonely section title at the end
of the page.
\item The \texttt{/MCID} numbers must be unique on a page. So you can't
use the literal in a saved box that you reuse in various places. This
is e.\,g. a problem with \texttt{longtable} as it saves the table
header and footer in a box.
\item The \texttt{/MCID}-chunks are leaf nodes in the structure tree, so
they shouldn't be nested.
\item Often text in a document is created automatically or moved around:
entries in the table of contents, index, bibliography and more. To
mark these text chunks correctly one has to analyze the code creating
such content to find suitable places to inject the literals.
\item The literals are inserted directly and not at shipout. This means
that due to the asynchronous page breaking of \TeX\ the MCID-number
can be wrong even if the counter is reset at every page. This package
uses in generic mode a label-ref-system to get around this problem.
This sadly means that often at least three compilations are needed
until everything has settled down.
It can actually be worse: If the text is changed after the MCID-numbers
have been assigned, and a new mc-chunk is inserted in the middle of the
page, then all the numbers have to be recalculated and that requires
again a number of compilations until it really settles down again.
Internal references are especially problematic here, as the first
compilation typically creates a non-link |??|, and only the second
inserts the structure and the new mc. When the reference system in \LaTeX\
is extended, care will be taken to ensure that the dummy
text already builds a chunk. Until then the advice is to first compile the
document and resolve all cross-references and to activate tagging only at
the end.
\item There exist environments which process their content more than once
-- examples are \texttt{align} and \texttt{tabularx}.
So one has to check for duplicates and holes in the counting system.
\item \PDF{} is a page oriented format. And this means that the start and
the end marker must be on the same page \ldots\ \emph{so what to do
with normal paragraphs that split over pages??}. This question will
be discussed in subsection~\ref{sec:splitpara}.
\end{enumerate}
\subsubsection{Generic mode versus lua mode in the mc-task}
While in generic mode the commands insert the literals directly and so have
all the problems described above, the lua mode works quite differently: The
tagging commands don't insert literals but set some (global)
\emph{attributes} which are attached to all the following nodes. When the
page is shipped out some lua code is called which wanders through the shipout
box and injects the literals at the places where the attributes change.
This means that quite a number of problems mentioned above are not relevant
for the lua mode:
\begin{enumerate}
\item Page breaks between start and end of the marker are
\emph{not} a problem. So you can mark a complete paragraph. If a page break
occurs directly after a start marker or before an end marker this can lead to
empty chunks in the \PDF{} and so bloat the \PDF{} a bit, but this is imho not
really a problem (compared to the size increase by the rest of the tagging).
\item The commands don't insert literals directly and so affect line and page
breaking much less.
\item The numbering of the MCIDs is done at shipout, so no label/ref system
is needed.
\item The code can do some marking automatically. Currently everything that
has not been marked up by the document is marked as artifact.
\end{enumerate}
\subsubsection{Commands to mark content and chunks}
In generic mode\sidenote{Generic mode only} it is vital that the end command is
executed on the same page as the begin command. So think carefully about how to
place them. For strategies on how to handle paragraphs that split over pages see
subsection~\ref{sec:splitpara}.
\begin{docCommands}
{
{doc name=tagmcbegin,doc parameter={\marg{key-val-list}}},
{doc name=tag_mc_begin:n,doc parameter={\marg{key-val-list}}}
}
\end{docCommands}
These commands insert the begin of the marked content code in the \PDF{}.
They don't start a paragraph. \emph{They don't start a group}. Such markers
should not be nested. The command will warn you if this happens.
In the generic mode the commands insert literals. These are whatsits and so
can affect spacing. In lua mode they set an attribute \emph{globally}.
The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This key is optional. By default the tag name
of the surrounding structure is used, which normally should be fine.
But if needed the name can be set explicitly with this key. The value
of the key is typically one of the standard types listed in section
\ref{sec:new-tag} (without a slash at the beginning, this is added by the
code). It is possible to set up new tags, see the same section. The
value of the key is expanded, so it can be a command. The expansion
is passed unchanged to the \PDF{}, so together with the starting slash it
should give a valid \PDF{} name (some ascii with numbers like \texttt{H4}
is fine).
\item[\PrintKeyName{artifact}] This will set up the marked content as an
artifact. The key should be used for content that should be ignored.
The key can take one of the values \PrintKeyName{pagination},
\PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer},
\PrintKeyName{layout}, \PrintKeyName{page},
\PrintKeyName{background} and \PrintKeyName{notype} (this is the
default). Text in the header and footer should normally be marked
with \PrintKeyName{artifact=pagination} or
\PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer}
but simply artifact (as it is now done automatically) should be ok
too.
It is not quite clear if rules and other decorative graphical objects
need to be marked up as artifacts. Acrobat seems not to mind if they are not, but
PAC~3 complained.
The validators complain if some text is not marked up, but it is not
quite clear if this is a serious problem.
The\sidenote{lua mode} lua mode will mark up everything unmarked as
\texttt{artifact=notype}. You can suppress this behavior by setting the
tagpdfsetup key \texttt{activate/tagunmarked} to false. See section
\ref{ssec:setup}.
\item[\PrintKeyName{stash}] Normally marked content will be stored in the
\enquote{current} structure. This may not be what you want. As an
example you may perhaps want to put a marginnote after or before the
paragraph in which it appears in the tex-code. With this boolean key the content is
marked but not stored in the kid-key of the current structure.
\item[\PrintKeyName{label}] This key sets a label by which you can call
the marked content \emph{later} in another structure (if it has been stashed
with the previous key). Internally the label name will start with
\texttt{tagpdf-}.
\item[\PrintKeyName{alt}]
This key inserts an \texttt{/Alt} value in the property dictionary
of the BDC operator. See section~\ref{sec:alt}.
The value is handled as a verbatim string: commands are
not expanded, but the value is first expanded once (so it works like
the key \texttt{alttext-o} in previous versions which has been
removed). If the value is empty, nothing will happen.
That means that you can do something like in the following listing
and it will insert \verb+\frac{a}{b}+ (hex encoded) in the \PDF{}.
\begin{taglstlisting}
\newcommand\myalttext{\frac{a}{b}}
\tagmcbegin{tag=P,alt=\myalttext}
\end{taglstlisting}
\item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText}
value in the property dictionary of the BDC operator. See
section~\ref{sec:alt}. The value is handled as a verbatim string:
commands are not expanded, but the value is first expanded once
(so it works like the key \texttt{actualtext-o} in previous versions
which has been removed). If the value is empty, nothing will happen.
That means that you can do something like in the following listing and
it will insert \verb+X+ (hex encoded) in the \PDF{}.
\begin{taglstlisting}
\newcommand\myactualtext{X}
\tagmcbegin{tag=Span,actualtext=\myactualtext}
\end{taglstlisting}
According to the PDF reference, \texttt{/ActualText} should only be used
on marked content sequences of type Span. This is currently not enforced by the
code. There is also some discussion going on whether
\texttt{/ActualText} can actually be used in an MC dictionary or whether it
should be in a separate BDC-operator.
\item[\PrintKeyName{raw}] This key allows you to add more entries to the
properties dictionary. The value must be correct, low-level \PDF{}.
E.g. \verb+raw=/Alt (Hello)+ will insert an alternative text.
\end{description}
\begin{docCommands}
{
{doc name=tagmcend},
{doc name=tag_mc_end:}
}
\end{docCommands}
These commands insert the end code of the marked content. They don't end a
group and it doesn't matter if they are in a different group than the starting
commands. In generic mode both commands check if there has been a begin
marker and issue a warning if not. In lua mode it is often possible to omit
the command, as the effect of the begin command ends with a new
\verb+\tagmcbegin+ anyway.
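As a minimal sketch, the begin and end commands and some of the keys
described above could be combined as follows. The text content is made up,
and the fragment assumes that tagging is active and that no other mc-chunk
is open at this point:
\begin{taglstlisting}
% content chunk with an explicitly set tag name
\tagmcbegin{tag=P}Some paragraph text\tagmcend
% page number marked as an artifact with a type
\tagmcbegin{artifact=pagination/footer}\thepage\tagmcend
% additional low-level entry in the property dictionary
\tagmcbegin{tag=Span,raw=/Alt (Hello)}Hi\tagmcend
\end{taglstlisting}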
\begin{docCommands}
{
{doc name=tagmcuse,doc parameter=\marg{label}},
{doc name=tag_mc_use:n,doc parameter=\marg{label}}
}
\end{docCommands}
These commands allow you to record a marked content that you stashed away
into the current structure. Be aware that a marked content can be used only
once -- the command will warn you if you try to use it a second time.
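The interplay of the keys \PrintKeyName{stash} and \PrintKeyName{label}
with the use command can be sketched like this. The label name
\texttt{mynote} and the text are hypothetical, and the fragment assumes that
tagging is active and that paragraphs are tagged manually. In generic mode an
additional compilation is probably needed until the label is resolved.
\begin{taglstlisting}
\tagstructbegin{tag=P}
\tagmcbegin{tag=P}Main text, first part. \tagmcend
% marked, but not recorded as kid of the current structure
\tagmcbegin{tag=P,stash,label=mynote}Text of the note. \tagmcend
\tagmcbegin{tag=P}Main text, second part.\tagmcend
\tagstructend
% record the stashed chunk in a structure of its own
\tagstructbegin{tag=P}
\tagmcuse{mynote}
\tagstructend
\end{taglstlisting}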
\begin{docCommands}
{
{doc name=tag_mc_end_push:},
{doc name=tag_mc_begin_pop:n,doc parameter=\marg{key-val-list}}
}\end{docCommands}
If there is an open mc chunk,
the first command ends it and pushes its tag on a stack. If there is no
open chunk, it puts $-1$ on the stack (for debugging).
The second command removes a value from the stack. If it is different from
$-1$ it opens a tag with it. The command is mainly meant to be used inside hooks and command
definitions so there is only an expl3 version. Perhaps other content of the mc-dictionary (for example the Lang) needs to be saved on the stack too.
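A command definition using the two commands could, as a sketch, look like
this. The command name \verb+\taggedemph+ is hypothetical and only serves
as illustration; the fragment assumes that the command is used inside an open
structure:
\begin{taglstlisting}
\ExplSyntaxOn
% hypothetical command, not provided by the package
\NewDocumentCommand\taggedemph{m}
 {
   \tag_mc_end_push:      % interrupt the open chunk (if any)
   \tag_struct_begin:n {tag=Em}
   \tag_mc_begin:n {tag=Em}
   \emph{#1}
   \tag_mc_end:
   \tag_struct_end:
   \tag_mc_begin_pop:n {} % reopen the interrupted chunk
 }
\ExplSyntaxOff
\end{taglstlisting}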
\begin{docCommands}
{
{doc name=tagmcifinTF,doc parameter=\marg{true code}\marg{false code}},
{doc name=tag_mc_if_in:TF,doc parameter=\marg{true code}\marg{false code}}
}\end{docCommands}
These commands check if a marked content is currently open and allow you to e.g. add the end marker if yes.
In \emph{generic mode}, where marked content commands shouldn't be nested, it works with a global boolean.
In \emph{lua mode} it tests if the mc-attribute is currently unset. You can't test the nesting level with it!
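A typical use, sketched here with made-up content, is to close a chunk that
may or may not be open before a new structure is started:
\begin{taglstlisting}
% end the current chunk only if one is open
\tagmcifinTF{\tagmcend}{}
\tagstructbegin{tag=P}
\tagmcbegin{tag=P}new content\tagmcend
\tagstructend
\end{taglstlisting}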
\begin{docCommand}{tag_mc_reset_box:N}{\marg{box}}\end{docCommand}
In lua mode this command will process the given box and reset all mc
related attributes in the box to the current values. This means that
if the box is used all its contents will be a kid of the current
structure. This should (probably) only be used on boxes which don't
contain tagging commands. See below section~\ref{sec:savebox} for
more details.
\subsubsection{Retrieving data} \label{sec:retrieve}
With more elaborate tagging the need arises to retrieve and store current data.
\begin{docCommand}{tag_get:n}{\marg{key word}}\end{docCommand}
This (expandable) command returns the values of some variables. Currently, the working key words are
\begin{itemize}
\item \verb+mc_tag+: the tag name of the current mc-chunk
\item \verb+struct_tag+: the tag name of the current structure
\item \verb+struct_id+: The ID of the current structure. This is a
string and is returned including parentheses.
\item \verb+struct_num+: This returns a number and works also if only
\pkg{tagpdf-base} has been loaded, but then doesn't give the same
output: if \pkg{tagpdf} is loaded and tagging is active,
\verb+struct_num+ gives the number of the currently active structure, so
it reverts to the parent number if a structure is closed. If only
\pkg{tagpdf-base} is loaded nesting of structure is not tracked and
so the command gives back the number of the last structure that has
been created.
\item \verb+struct_counter+: This returns a number and works also if
only \pkg{tagpdf-base} has been loaded. It gives back the state of
the absolute structure counter and so the number of the last structure
that has been created. This can be used to detect if in a piece of
code there are structure commands. Be aware that this is a \LaTeX{}
counter and so is reset in some places.
\item \verb+mc_counter+: This returns a number and works also if only
\pkg{tagpdf-base} has been loaded. It gives back the state of the
absolute mc-counter and so the number of the last mc-chunk that has been
created. This can be used to detect if in a piece of code there are
mc-commands.
\end{itemize}
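The detection of mc-commands mentioned for \verb+mc_counter+ could be
sketched as follows. The names \verb+\l__mydemo_mc_int+ and
\verb+\mydemo_check_mc:n+ are hypothetical and not part of the package; the
sketch only compares the counter before and after the code has been executed:
\begin{taglstlisting}
\ExplSyntaxOn
\int_new:N \l__mydemo_mc_int % hypothetical helper variable
\cs_new_protected:Npn \mydemo_check_mc:n #1
 {
   % remember the counter, then run the code
   \int_set:Nn \l__mydemo_mc_int { \tag_get:n {mc_counter} }
   #1
   % a larger value means that a new mc-chunk has been created
   \int_compare:nNnTF { \tag_get:n {mc_counter} } > { \l__mydemo_mc_int }
     { \iow_term:n { code~contains~mc-commands } }
     { \iow_term:n { no~mc-commands~found } }
 }
\ExplSyntaxOff
\end{taglstlisting}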
\subsubsection{Luamode: global or not global -- that is the question}\label{sec:global-local}
In\sidenote{lua mode} luamode the mc-commands set and unset an
attribute to mark the nodes. One can view such an attribute like a font
change or a color: they affect all following chars and glue nodes until
stopped.
From version 0.6 to 0.82 the attributes were set locally. This had the
advantage that the attributes didn't spill over into areas where they are not
wanted like the header and footer or the background pictures. But it had the
disadvantage that it was difficult for an inner structure to correctly
interrupt the outer mc-chunk if it can't control the group level. For example
this didn't work due to the grouping inserted by the user:
\begin{taglstlisting}
\tagstructbegin{tag=P}
\tagmcbegin{tag=P}
Start paragraph
{% user grouping
\tag_mc_end_push:
\tagstructbegin{tag=Em}
\tagmcbegin{tag=Em}
\emph{Emphasized test}
\tagmcend
\tagstructend
\tag_mc_begin_pop:n{}
}% user grouping
Continuation of paragraph
\tagmcend
\tagstructend
\end{taglstlisting}
The reading order was then wrong, and the \emph{emphasized text} was moved to the end of the structure.
So starting with version 0.9 this has been reverted. The attribute is now global again.
This solves the \enquote{interruption} problem, but has its price: Material inserted by the output routine
must be properly guarded. For example
\begin{taglstlisting}
\DocumentMetadata{uncompress}
\documentclass{article}
\pagestyle{headings}
\begin{document}
\sectionmark{HEADER}
\AddToHook{shipout/background}{\put(5cm,-5cm){BACKGROUND}}
\tagmcbegin{tag=P}Page 1\newpage Page 2\tagmcend
\end{document}
\end{taglstlisting}
Here the header and the background code on the \emph{first} page will be marked up as paragraph
and added as chunk to the document structure. The header and the background code on
the \emph{second} page will be marked as artifact. The following figure shows what the tags look
like.
\includegraphics[alt=Show tags of examples]{global-ex}
It is therefore from now on important to correctly mark up such code. Header
and footer are now marked as artifacts (see below). If they contain code
which needs a different markup it still must be added explicitly. With
packages like \pkg{fancyhdr} or \pkg{scrlayer-scrpage} it is quite easy to
add the needed code.
\subsubsection{Tips}
\begin{itemize}
\item Mark commands inside floats should work fine (but need perhaps some compilation rounds in generic mode).
\item In case you want to use it inside a \verb+\savebox+ (or some
command that saves the text internally in a box): If the box is used
directly, there is probably no problem. If the use is later, stash
the marked content and add the needed \verb+\tagmcuse+ directly
before or after the box when you use it.
\item Don't use a saved box with markers twice.
\item If boxes are unboxed you will have to analyze the \PDF{} to
check if everything is ok.
\item If you use complicated structures and commands (breakable boxes
like the one from \pkg{tcolorbox}, \pkg{multicol}, many footnotes)
you will have to check the \PDF{}.
\end{itemize}
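The advice for saved boxes that are used later could be sketched like this.
The box name \verb+\mybox+ and the label \texttt{boxtext} are hypothetical,
and the fragment assumes manual tagging and that no mc-chunk is open when the
box is saved. As mentioned above, the result should be checked in the \PDF{}.
\begin{taglstlisting}
\newsavebox\mybox
\savebox\mybox{%
  \tagmcbegin{tag=P,stash,label=boxtext}Boxed text\tagmcend}
% later, where the box is actually used:
\tagstructbegin{tag=P}
\tagmcuse{boxtext}\usebox\mybox
\tagstructend
\end{taglstlisting}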
\begin{figure}
\input{link-figure-input}
\caption{Structure needed for a link annotation}\label{fig:linkannot}
\end{figure}
\subsubsection{Header and Footer}\label{sec:header-footer}
Tagging header and footer is not trivial. First, on the technical side, header and footer are
typeset and attached to the page during the output routine and the exact timing is not really under
the control of the user. That means that when adding tagging there one has to be careful not to disturb
the tagging of the main text; this is mostly important in lua mode where the attributes are global
and can easily spill over.
Second, one has to decide how to tag: in many cases header and footer can simply be ignored,
as they only contain information which is meant to visually guide the reader and so is not relevant for
the structure. This means that normally they should be tagged as artifacts. The PDF reference offers
a rather large number of options here to describe different versions of \enquote{ignore this}.
Typically the header and footer should get the type \texttt{Pagination} and this type has a number of subtypes like
Header, Footer, PageNum. It is not yet known if any technology actually makes use of this info.
But they can also contain meaningful content, for example an address. In such cases the content
should be added to the structure (where?), but at best only once, even if this address is
repeated on every page. All this needs some thought both from the users and from the packages and code
providing support for header and footers.
For now \pkg{tagpdf} has added some first support for automatic tagging:
Starting with version 0.92 header and footer are by default automatically marked up as (simple) artifacts.
With the key \PrintKeyName{exclude-header-footer} the behavior can be
changed: The value \texttt{false} disables the automatic tagging, the
value \texttt{pagination} additionally adds an \texttt{/Artifact}
structure with the attribute \texttt{/Pagination}.
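As a sketch, and assuming that the key is set with \verb+\tagpdfsetup+ in the
preamble under the name given above, the typed variant would be requested
like this:
\begin{taglstlisting}
\tagpdfsetup{exclude-header-footer=pagination}
\end{taglstlisting}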
If some additional markup (or even a structure) is wanted, something like this should be used (here with
the syntax of the \pkg{fancyhdr} package) to close the open mc-chunk and restart it after the content:
\begin{taglstlisting}
\ExplSyntaxOn
\cfoot{\leavevmode
\tag_mc_end_push:
\tagmcbegin{artifact=pagination/footer}
\thepage
\tagmcend
\tag_mc_begin_pop:n{artifact}}
\ExplSyntaxOff
\end{taglstlisting}
\subsubsection{Links and other annotations}\label{sec:link+annot}
Annotations (like links or form field annotations) are objects
associated with a geometric region of the page rather than with a
particular object in its content stream. Any connection between a link
or a form field and the text is based solely on visual appearance (the
link text is in the same region, or there is empty space for the form
field annotation) rather than on an explicitly specified association.
To connect such an annotation with the structure and so with
surrounding or underlying text a specific structure has to be added,
see figure~\ref{fig:linkannot}: The annotation is added to a structure
element as an object reference. It is not referenced directly but
through an intermediate object of type OBJR. To the dictionary of the
annotation a \texttt{/StructParent} entry must be added, the value is
a number which is then used in the ParentTree to define a relationship
between the annotation and the parent structure element.
To support this, \pkg{tagpdf} currently offers two commands:
\begin{docCommand}{tag_struct_parent_int:}{}\end{docCommand}
This inserts the current value of a global counter used to track such
objects. It can be used to add the \texttt{/StructParent} value to the
annotation dictionary.
\begin{docCommand}{tag_struct_insert_annot:nn}{\marg{object reference}\marg{struct parent number}}\end{docCommand}
This will insert the annotation described by the object reference into
the current structure by creating the OBJR object. It will also add
the necessary entry to the parent tree and increase the global counter
referred to by |\tag_struct_parent_int:|. It does nothing if
(structure) tagging is not activated.
Attention! As the second command increases the global counter at the
end it changes the value given back by the first. That means that if
nesting is involved care must be taken that the correct number is
used. This should be easy to fulfill for most annotations, as they
are boxes. There the second command should at best be used directly
after the annotation and it can make use of
|\tag_struct_parent_int:|. For links nesting is theoretically
possible, and it could be that future versions need more sophisticated
handling here.
In environments which process their content twice like tabularx or
align it would be the best to exclude the second command from the
trial step, but this will need better support from these environments.
Typically these commands are rarely needed: Since version 0.81
\pkg{tagpdf} already handles (unnested) links, and form fields created
with the \pkg{l3pdffield-testphase} package are handled by that
package.
The following listing shows low-level code to create a link where the two
commands are used:
\begin{taglstlisting}
\pdfextension startlink
attr
{
/StructParent \tag_struct_parent_int: %<----
}
user {
/Subtype/Link
/A
<<
/Type/Action
/S/URI
/URI(http://www.dante.de)
>>
}
This is a link.
\pdfextension endlink
\tag_struct_insert_annot:xx {\pdfannot_link_ref_last:}{\tag_struct_parent_int:}
\end{taglstlisting}
\subsubsection{Math}
Math is still a problem but some progress has been made.
To tag math you have to surround it with a \texttt{Formula} structure. But the content of such a structure is handled by readers as a black box so additional data is needed for accessibility.
There are a number of theoretical options here:
\begin{enumerate}
\item One can add an alternative text (\texttt{/Alt}) or an \texttt{/ActualText}
to the structure element: either some text manually provided by the author or (with
the math module in the latex-lab bundle) the \LaTeX{} source.
\item One can add an alternative text (\texttt{/Alt} or \texttt{/ActualText})
to the MC-chunks.
\item One can build inside the \texttt{Formula} structure element a tree with MathML structure elements. With PDF 2.0 this does not require declaring new tags, as the MathML name space is built in.
\item One can in PDF 2.0 attach a MathML file and/or the \LaTeX-source as associated file to the \texttt{Formula} structure (or to one or more MC-chunks).
\end{enumerate}
The question is how these work in reality.
Option 1 and 2 give not too bad results
with a screen reader, but can
require manual work and if you are unlucky the reader drops
important parts of the math (like punctuation symbols).
Exploring the equation is not possible.
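With manual tagging, option 1 could as a rough sketch look like this. The
alternative text is made up, and the fragment assumes that it is used inside
a paragraph and that neither the automatic paragraph tagging nor the math
module interferes:
\begin{taglstlisting}
\tagstructbegin{tag=Formula,
  alt={a squared plus b squared equals c squared}}
\tagmcbegin{tag=Formula}
$a^2+b^2=c^2$
\tagmcend
\tagstructend
\end{taglstlisting}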
Option 3 creates many structure elements.
E.g. I have seen an example where \emph{every single
symbol} has been marked up with tags from MathML along with an
\texttt{/ActualText} entry and an entry with alternate text which
describes how to read the symbol. The \PDF{} then looked like this
\begin{taglstlisting}
/mn <</Alt( : open bracket: four )>>BDC
...
/mn <</Alt( third s )>>BDC
...
/mo <</Alt( times )>>BDC
\end{taglstlisting}
If this is really the way to go one would need some script to add the
mark-up as doing it manually is too much work and would make the
source unreadable -- at least with pdflatex and the generic mode. In
lua mode it is possible to hook into the \texttt{mlist\_to\_hlist}
callback and add markers automatically. Some first implementation in this direction
has been done by Marcel Krüger in the luamml project. But up to now it was not possible
to test the usability of this approach: With the exception of the html derivation
with ngpdf no PDF-viewer/screen reader combination
seems to make use of such structures.
I'm not sure anyway that this is the best way to do math. It looks rather
odd that a document should have to tell a screen reader in such detail
how to read an equation.
The last option 4 has been implemented in the math module in the \texttt{latex-lab}
bundle. Here happily a proof of
concept was possible: With development versions of Foxit and the NVDA reader
it was possible to access an attached MathML and get speech output from it \cite{todasoifferdeims2024,mittelbachfischerdeims2024}. See also \cite{mathexamples} for some
examples and section~\ref{sec:alt} for some more remarks and tests.
\subsubsection{Split paragraphs}\label{sec:splitpara}
%TODO: think about marginnote! Aside?
A\sidenote{Generic mode only} problem in generic mode are paragraphs
with page breaks. As already mentioned the end marker \texttt{EMC}
must be added on the same page as the begin marker. But it is in
pdflatex \emph{very} difficult to inject something at the page break
automatically. One can manipulate the shipout box to some extent in
the output routine, but this is not easy and it gets even more
difficult if inserts like footnotes and floats are involved: the end
of the paragraph is then somewhere in the middle of the box.
So with pdflatex in generic mode one until now had to do the splitting manually.
The example \texttt{mc-manual-para-split} demonstrates how this can be
done. The general idea was to use \verb+\vadjust+ in the right place:
\begin{taglstlisting}
\tagmcbegin{tag=P}
...
fringilla, ligula wisi commodo felis, ut adipiscing felis dui in
enim. Suspendisse malesuada ultrices ante.% page break
\vadjust{\tagmcend\pagebreak\tagmcbegin{tag=P}}
Pellentesque scelerisque
...
sit amet, lacus.\tagmcend
\end{taglstlisting}
Starting with version 0.92 there is code which resolves this
problem. Basically it works like this: every mc-command issues a mark
command (actually two slightly different). When the page is built in
the output routine these mark commands are inspected and from them
\LaTeX{} can deduce if there is a mc-chunk which must be closed or
reopened. The method is described in Frank Mittelbach's talk at
TUG~2021 \enquote{Taming the beast — Advances in paragraph tagging
with pdfTeX and XeTeX} \url{https://youtu.be/SZHIeevyo3U?t=19551}.
Please note
\begin{itemize}
\item Typically you will need more compilations than previously, don't
rely on the rerun messages, but if something looks wrong rerun.
\item The code relies on related |\tagmcbegin| and |\tagmcend|
being in the same boxing level. If one is in a box (which hides the
marks) and the other in the main galley, things will go wrong (\texttt{longtable}
is for example problematic).
\end{itemize}
\subsubsection{Automatic tagging of paragraphs}\label{sec:paratagging}
Another feature that emerged from the \LaTeX{} tagged PDF project are hooks
at the begin and end of paragraphs. \pkg{tagpdf} makes use of these hooks to
tag paragraphs. In the first version it added only one structure, but this
proved to be not adequate:
Paragraphs in \LaTeX{} can be nested, e.g., you can have a paragraph
containing a display quote, which in turn consists of more than one
(sub)paragraph, followed by some more text which all belongs to the
same outer paragraph.
In the \PDF{} model and in the HTML model that is not supported: the rules in
\PDF{} specification do not allow \texttt{P}-structures to be nested --- a
limitation that conflicts with real life, given that such constructs are
quite normal in spoken and written language.
The approach we take (starting with March 2023, version 0.98e) to resolve
this is to model such \enquote{big} paragraphs with a structure named
\texttt{text-unit} and use \texttt{P} (under the name \texttt{text}) only for
(portions of) the actual paragraph text in a way that the \texttt{P}s are not
nested. As a result we have for a simple paragraph two structures:
\begin{taglstlisting}
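<text-unit>
 <text>
  The paragraph text ...
 </text>
</text-unit>

<text-unit>
 <text>
  The paragraph text before the display element ...
 </text>
 <display-structure>
  Content of the display structure possibly involving inner <text-unit> tags
 </display-structure>
 <text>
  ... continuing the outer paragraph text
 </text>
</text-unit>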
\end{taglstlisting}
In other words such a display block is always embedded in a |<text-unit>|
structure, possibly preceded by a |<text>|\ldots|</text>| block and possibly
followed by one, though both such blocks are optional. More information about
this can be found in the documentation of \texttt{latex-lab-block-tagging}.
As a consequence \pkg{tagpdf} now adds two structures if paratagging is
activated. The new code to tag display blocks extends this code to handle the
nesting of lists and other display structures.
The automatic tagging requires that for every begin of a paragraph with the
begin hook code there is a corresponding end with the closing hook code. This
can fail, e.g. if a |vbox| doesn't correctly issue a |\par| at the end. If
this happens the tagging structure can get very confused. At the end of the
document \pkg{tagpdf} checks if the number of outer and inner start and end
paragraph structures created with the automatic paratagging code are equal
and it will error if not.
The automatic tagging of paragraphs can be deactivated completely or only the
outer level with the |\tagtool| keys |para| and |para-flattened| or with the
(now deprecated) commands |\tagpdfparaOn| and |\tagpdfparaOff|.
Nesting the activation and deactivation of the tagging of paragraphs can be
quite difficult. For example if it is unclear if the inner code issues a
|\par| or not it is not trivial to exclude an end hook for every excluded
begin hook. In such cases it can be easier to use the |paratag| key with the
value |NonStruct| to convert some |P|-structures into |NonStruct|-structures
without real meaning.
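A sketch of this approach, assuming that |paratag| can be set with |\tagtool|
and that the setting is local to the group; \cs{somecommand} is a hypothetical
command for which it is unclear whether it issues a |\par|:
\begin{taglstlisting}
\begingroup
 \tagtool{paratag=NonStruct}
 \somecommand{some text} % hypothetical command
\endgroup
\end{taglstlisting}
The begin and end hooks then still run in pairs, only the type of the created
structures changes.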
\subsection{Task 2: Marking the structure}
The structure is represented in the \PDF{} with a number of objects of type
\texttt{StructElem} which build a tree: each of these objects points back to
its parent and normally has a number of kid elements, which are either again
structure elements or -- as leaves of the tree -- the marked content chunks
marked up with the \verb+tagmc+-commands. The root of the tree is the
\texttt{StructTreeRoot}.
\subsubsection{Structure types}
The tree should reflect the \emph{semantic} meaning of the text. That means
that the text should be marked as section, list, table head, table cell and
so on. A number of standard structure types are predefined, see section
\ref{sec:new-tag}, but it is allowed to create more. If you want to use types
of your own you must declare them. E.g. this declares two new types
\texttt{TAB} and \texttt{FIG} and bases them on \texttt{P}:
\begin{taglstlisting}
\tagpdfsetup{
role/new-tag = TAB/P,
role/new-tag = FIG/P,
}
\end{taglstlisting}
\subsubsection{Sectioning}
The sectioning units can be structured in two ways: a flat, html-like version
and a more xml-like version (the latter is basically deprecated in pdf/UA2).
The flat version creates a structure like this:
\begin{taglstlisting}
<H1>section header</H1>
<P> text</P>
<H2>subsection header</H2>
...
\end{taglstlisting}
So here the headings are marked according to their level with \texttt{H1}, \texttt{H2}, etc.
In the xml-like tree the complete text of a sectioning unit is surrounded
with the \texttt{Sect} tag, and all headers with the tag \texttt{H}. Here the
nesting defines the level of a sectioning heading.
\begin{taglstlisting}
<Sect>
 <H>section heading</H>
 <P> text</P>
 <Sect>
  <H>subsection heading</H>
  ...
 </Sect>
</Sect>
\end{taglstlisting}
The flat version is more \LaTeX-like and it is rather straightforward to
patch \verb+\chapter+, \verb+\section+ and so on to insert the appropriate
\texttt{H\ldots} start and end markers. The xml-like tree is more difficult
to automate. It has been implemented in the sec module in latex-lab, but can break
if sectioning commands are hidden inside boxes.
\subsubsection{Commands to define the structure}
The following commands can be used to define the tree structure:
\begin{docCommands}
{
{doc name=tagstructbegin,doc parameter=\marg{key-val-list}},
{doc name=tag_struct_begin:n,doc parameter=\marg{key-val-list}}
}\end{docCommands}
These commands start a new structure. They don't start a group. They set all their values globally.
The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This is required. The value of the key is
normally one of the standard types listed in section
\ref{sec:new-tag}. It is possible to setup new tags/types, see the
same section. The value can also be of the form |type/NS|, where
|NS| is the shorthand of a declared name space. Currently the
name spaces |pdf|, |pdf2|, |mathml| and |user| are defined. This
allows to use a different name space than the one connected by
default to the tag. But normally this should not be needed.
\item[\PrintKeyName{stash}] Normally a new structure inserts itself
as a kid into the currently active structure. This key prohibits
this. The structure is nevertheless from now on \enquote{the
current active structure} and parent for following marked
content and structures.
\item[\PrintKeyName{label}] This key sets a label by which one can
refer to the structure. Currently the key writes a property whose
name starts with \texttt{tagpdfstruct-} to the aux-file with the two
attributes \texttt{tagstruct} (the structure number) and
\texttt{tagstructobj} (the object reference) but also stores the
name and the structure number into a prop for use in the current compilation.
The label is e.g. used by \cs{tag\_struct\_use:n} and by the |ref|
key (which can refer to future structures).
\item[\PrintKeyName{parent}] With the parent key one can choose another
parent. The value is a structure number which must refer to an
already existing, previously created structure. Such a structure
number can have been stored previously with \cs{tag\_get:n}, but one
can also use a label on the parent structure and then use
\cs{property\_ref:nn}\verb+{tagpdfstruct-label}{tagstruct}+ to retrieve
it.
\item[\PrintKeyName{firstkid}] If this key is used the structure is
added at the left of the kids of the parent structure (if the structure is not stashed). This means that it will be the first kid of the structure (unless some
later structure uses the key too). This can be needed e.g. for a caption as
the PDF reference requires it to be the first or last kid of its structure.
\item[\PrintKeyName{alt}] This key inserts an \texttt{/Alt} value in the
dictionary of the structure object, see section~\ref{sec:alt}. The value
is handled as verbatim string and hex encoded. The value will be
expanded first once (so works like the key \texttt{alttext-o} in
previous versions which has been removed). If the value is empty,
nothing will happen.
That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myalttext{\frac{a}{b}}
\tagstructbegin{tag=P,alt=\myalttext}
\end{taglstlisting}
and it will insert \verb+\frac{a}{b}+ (hex encoded) in the
\PDF{}. In case that the text begins with a command that should not
be expanded protect it e.g. with a \verb+\empty+.
\item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText}
value in the dictionary of the structure object, see
section~\ref{sec:alt}. The value is handled as verbatim string. The
value will be expanded first once (so works like the key
\texttt{actualtext-o} in previous versions which has been removed). If
the value is empty, nothing will happen.
That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myactualtext{X}
\tagstructbegin{tag=P,actualtext=\myactualtext}
\end{taglstlisting}
and it will insert \verb+X+ (hex encoded) in the \PDF{}. In case
that the text begins with a command that should not be expanded
protect it e.g. with a \verb+\empty+.
\item[\PrintKeyName{attribute}] This key takes as argument a comma
list of attribute names (use braces to protect the commas from
the external key-val parser) and allows to add one or more
attribute dictionary entries in the structure object. As an
example
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
\item[\PrintKeyName{attribute-class}] This key takes as argument a
comma list of attribute names (use braces to protect the commas
from the external key-val parser) and allows to add them as
attribute classes to the structure object. As an example
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute-class= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
\item[\PrintKeyName{title}] This key allows to set the dictionary
entry \texttt{/T} (for a title) in the structure object. The value
is handled as verbatim string and hex encoded. Commands are not
expanded.
\item[\PrintKeyName{title-o}] This key allows to set the dictionary
entry \texttt{/T} in the structure object. The value is expanded
once and then handled as verbatim string like the
\PrintKeyName{title} key.
\item[\PrintKeyName{AF}] This key allows to reference an associated
file in the structure element. The value should be the name of
an object pointing to the \texttt{/Filespec} dictionary as
expected by \verb+\pdf_object_ref:n+ from a current
\texttt{l3kernel}. For example:
\begin{taglstlisting}
\group_begin:
\pdfdict_put:nnn {l_pdffile/Filespec} {AFRelationship}{/Supplement}
\pdffile_embed_file:nnn{example-input-file.tex}{}{tag/AFtest}
\group_end:
\tagstructbegin{tag=P,AF=tag/AFtest}
\end{taglstlisting}
As shown, the wanted AFRelationship can be set by filling the dictionary
with the value. The mime type is here detected automatically, but for
unknown types it can be set too. See the \texttt{l3pdffile}
documentation for details. Associated files are a concept new in PDF
2.0, but the code currently doesn't check the pdf version, it is your
responsibility to set it (this can be done with the \texttt{pdfversion}
key in \verb+\DocumentMetadata+).
\item[\PrintKeyName{root-AF}] This key allows to reference an
associated file in the root structure element. Using the root can
be e.g. useful to add a css-file. When converting the pdf to
html with e.g. ngpdf this css-file is then referenced in the head
of the html.
\item[\PrintKeyName{AFinline}] This key allows to embed an
associated file with inline content. The value is some text,
which is embedded in the PDF as a text file with mime type
text/plain.
\begin{taglstlisting}
\tagstructbegin{tag=P,AFinline=Some extra text}
\end{taglstlisting}
\item[\PrintKeyName{AFinline-o}]
This is like \verb+AFinline+, but it expands the value once.
\item[\PrintKeyName{texsource}] This is like \verb+AFinline-o+, but
it creates a tex-file, with mime type \texttt{application/x-tex}
and the AFRelationship \texttt{Source}. It also sets the /Desc key to a
(currently) fixed text to satisfy some validators.
\item[\PrintKeyName{mathml}] This is like \verb+AFinline-o+, but it
creates a xml-file, with mime type \texttt{application/xml} and
the AFRelationship \texttt{Supplement}. It also sets the /Desc key to a
(currently) fixed text to satisfy some validators.
\item[\PrintKeyName{lang}] This key allows to set the language for
a structure element. The value should be a BCP~47 language identifier,
e.g. |de-DE|. It can also be set \enquote{from the outside} for all
structures in the current group with \cs{tagpdfsetup} and the |text/lang| key.
\item[\PrintKeyName{ref}] This key allows to add references to
other structure elements, it adds the |/Ref| array to the
structure. The value should be a comma separated list of
structure labels set with the |label| key.
e.g. |ref={label1,label2}|. It can be used more than once in the
key/value argument and combines the references. See below
in section~\ref{sec:Refkey} for an extended discussion about the |/Ref|
array.
\item[\PrintKeyName{E}] This key sets the |/E| key, the expanded
form of an abbreviation or an acronym (I couldn't think of a
better name, so I stuck with E).
\end{description}
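The following sketch combines a few of the keys described above. It assumes
manual tagging (automatic paragraph tagging switched off); the label name
|intro| and the texts are made up for illustration:
\begin{taglstlisting}
\tagstructbegin{tag=P,label=intro,lang=de-DE}
 \tagmcbegin{tag=P}Ein deutscher Absatz.\tagmcend
\tagstructend

\tagstructbegin{tag=P,title=Remark,ref={intro}}
 \tagmcbegin{tag=P}A paragraph referring to the first one.\tagmcend
\tagstructend
\end{taglstlisting}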
\begin{docCommands}
{ {doc name=tagstructend}, {doc name=tag_struct_end:}}
\end{docCommands}
These commands end a structure. They don't end a group and it doesn't
matter if they are in another group than the starting commands.
\begin{docCommands}
{
{doc name=tagstructuse,doc parameter=\marg{label}},
{doc name=tag_struct_use:n,doc parameter=\marg{label}}
}\end{docCommands}
These commands insert a structure previously stashed away as kid into the
currently active structure. A structure should be used only once, if the
structure already has a parent you will get a warning.
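A minimal sketch of how the |stash| and |label| keys and \cs{tagstructuse}
could work together. The label name |cap1| is made up, and the content of the
\texttt{Figure} structure is only indicated:
\begin{taglstlisting}
% create the caption, but don't attach it to the current structure
\tagstructbegin{tag=Caption,stash,label=cap1}
 \tagmcbegin{tag=Caption}The caption text\tagmcend
\tagstructend
% later: insert it as kid of the then active structure
\tagstructbegin{tag=Figure}
 \tagstructuse{cap1}
 ...
\tagstructend
\end{taglstlisting}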
\subsubsection{Updating the \texttt{Ref} key in structures}\label{sec:Refkey}
Structures that cross reference other structures, e.g. citation commands, table of
contents entries or footnotes, often require a \texttt{Ref} key. \texttt{Ref}
can be added with the |ref| key of \cs{tagstructbegin} described above
but as it is a task that often has to be done automatically in code
there also exists a command that allows to extend the \texttt{Ref} key (and perhaps in future also other keys) later.
This command allows to add the value, the target structure of the \texttt{Ref} key,
with four methods: directly as object reference, through a label name set with the |label| key, through a destination name if a \cs{MakeLinkTarget} has been used in
the target structure---this also works if hyperref has not been loaded---and through
the structure number, which has been stored e.g. in a label.
\begin{docCommands}
{
{doc name=tag_struct_gput:nnn,doc parameter=\marg{structurenumber}\marg{keyword}\marg{value}},
}\end{docCommands}
The allowed \meta{keywords} are \texttt{ref}, \texttt{ref\_label}, \texttt{ref\_dest}
and \texttt{ref\_num}.
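As a sketch, a reference to a structure labelled |target| could be added to
the current structure as follows. This assumes that |\tag_get:n{struct_num}|
returns the number of the current structure and that the value for
\texttt{ref\_label} is the label name as given to the |label| key; the name
|target| is made up:
\begin{taglstlisting}
\ExplSyntaxOn
\tag_struct_gput:nnn
  { \tag_get:n {struct_num} } { ref_label } { target }
\ExplSyntaxOff
\end{taglstlisting}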
\subsubsection{Root structure}
A document should have at least one structure which contains the whole
document. A suitable tag is \texttt{Document}. Such a root is now always
added automatically. Its type can be changed with the key
\texttt{activate}.
\subsubsection{Attributes and attribute classes}\label{sec:attributes}
Structure elements can have so-called attributes. A single attribute is a
dictionary (or a stream but this is currently not supported by the package
as I don't know a use-case) with at least the required key \verb+/O+ (for
\enquote{Owner}) which describes the scope the attribute applies to. As an
example here is an attribute that can be attached to a tabular header (type TH)
and adds the info that the header is a column header:
\begin{taglstlisting}
<</O /Table /Scope /Column>>
\end{taglstlisting}
One or more such attributes can be attached to a structure element. It is
also possible to store such an attribute under a symbolic name in a
so-called \enquote{ClassedMap} and then to attach references to such
classes to a structure.
To use such attributes you must at first declare them in \verb+\tagpdfsetup+
with the key \texttt{role/new-attribute}. This key takes two arguments, a name and
the content of the attribute. The name should be a sensible key name, it is
converted to a pdf name with \verb+\pdf_name_from_unicode_e:n+, so slashes
and spaces are allowed. The content should be a dictionary without the
surrounding brackets.
\begin{taglstlisting}
\tagpdfsetup
{
role/new-attribute =
{TH-col}{/O /Table /Scope /Column},
role/new-attribute =
{TH-row}{/O /Table /Scope /Row},
}
\end{taglstlisting}
Attributes are only written to the \PDF{} when used, so it is not a problem
to predeclare a number of standard attributes.
It is your responsibility that the content of the dictionary is valid
\PDF{} and that the values are sensible!
Attributes can then be used with the key \PrintKeyName{attribute} or
\PrintKeyName{attribute-class} which both take a comma list of attribute
names as argument:
\begin{taglstlisting}
\tagstructbegin{tag=TH,
attribute-class= {TH-row,TH-col},
attribute = {TH-row,TH-col},
}
\end{taglstlisting}
\subsection{Task 3: Tree management}
When all the document content has been correctly marked and the data for
the trees has been collected they must be flushed to the \PDF{}. This is
done automatically (if the package has been activated) with an internal
command in an end document hook.
\begin{docCommand}{__tag_finish_structure:}{}\end{docCommand}
This will hopefully write all the needed objects and values to the \PDF{}.
(Beside the already mentioned \texttt{StructTreeRoot} and
\texttt{StructElem} objects, additionally a so-called \texttt{ParentTree}
is needed which records the parents of all the marked contents bits, a
\texttt{RoleMap}, perhaps a \texttt{ClassMap} and objects for the
attributes, and a few more values and dictionaries).
\subsection{A fully marked up document body}
The following shows the marking needed for a section, a sentence and a list
with two items. It is obvious that one wouldn't like to have to do this for
real documents. If tagging should be usable, the commands must be hidden as
much as possible inside suitable \LaTeX\ commands and environments.
\begin{taglstlisting}
\begin{document}
\tagstructbegin{tag=Document}
\tagstructbegin{tag=Sect}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H} %avoid page break!
\section{Section}
\tagmcend
\tagstructend
\tagstructbegin{tag=P}
\tagmcbegin{tag=P,raw=/Alt (x)}
a paragraph\par x
\tagmcend
\tagstructend
\tagstructbegin{tag=L} %List
\tagstructbegin{tag=LI}
\tagstructbegin{tag=Lbl}
\tagmcbegin{tag=Lbl}
1.
\tagmcend
\tagstructend
\tagstructbegin{tag=LBody}
\tagmcbegin{tag=P}
List item body
\tagmcend
\tagstructend %lbody
\tagstructend %Li
\tagstructbegin{tag=LI}
\tagstructbegin{tag=Lbl}
\tagmcbegin{tag=Lbl}
2.
\tagmcend
\tagstructend
\tagstructbegin{tag=LBody}
\tagmcbegin{tag=P}
another List item body
\tagmcend
\tagstructend %lbody
\tagstructend %Li
\tagstructend %L
\tagstructend %Sect
\tagstructend %Document
\end{document}
\end{taglstlisting}
\subsection{Interrupting the tagging}
Experience showed that it must be possible to interrupt tagging in some
places. For example various packages do trial typesetting to measure text
and this shouldn't create structures. There are therefore a number of
commands for various use cases\footnote{it is quite possible that some of
the commands will disappear again if we realize that they are not fitting!}.
Warning! Stopping tagging should be done only with care and when it is
ensured that no code inside the stopped part gets confused. Most importantly
currently tagging should not be stopped if a page break can occur
or the output routine is called.
\begin{docCommands}
{
{doc name=tag_suspend:n,doc parameter=\marg{label}},
{doc name=tag_resume:n,doc parameter=\marg{label}}
}
\end{docCommands}
These commands suspend and resume tagging in the current group by switching
\emph{local} booleans. They also stop the increasing of the counters which
keep track of paragraphs if the correct wrapper commands are used.
Restarting tagging is normally only needed if groups can't be used and then
must be done with care: |\tag_resume:n| should normally only restart tagging if
the corresponding stop command actually stopped tagging. This is implemented
through a local counter which keeps track of the level.
The \meta{label} can be used to identify the command in debugging messages. The label is not expanded and so can for example be a single command token.
The commands are the L3-layer versions of |\SuspendTagging| and |\ResumeTagging|
and will be available in the kernel with the November 2024 release.
\begin{taglstlisting}
\tag_suspend:n{\outercommand}
...
\tag_suspend:n{\innercommand}
...
\tag_resume:n{\innercommand}
...
\tag_resume:n{\outercommand}
\end{taglstlisting}
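At the document level the same pattern can be sketched with the wrappers
|\SuspendTagging| and |\ResumeTagging| mentioned above, here around a trial
typesetting that only measures some text. This assumes that the wrappers take
the same label argument; the box name \cs{trialbox} and the label |trial| are
made up:
\begin{taglstlisting}
\newsavebox\trialbox
\SuspendTagging{trial}
\sbox\trialbox{text that is only measured}
\ResumeTagging{trial}
The text is \the\wd\trialbox\ wide.
\end{taglstlisting}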
\begin{docCommands}
{
{doc name=tag_stop:},
{doc name=tag_start:},
{doc name=tagstop},
{doc name=tagstart},
{doc name=tag_stop:n,doc parameter=\marg{label}},
{doc name=tag_start:n,doc parameter=\marg{label}}
}
\end{docCommands}
These commands are now deprecated in favor of |\tag_suspend:n| and |\tag_resume:n|
but are still provided for some time.
\subsection{Lazy and automatic tagging}\label{sec:lazy}
A number of features of \PDF{} readers need a fully tagged \PDF{}. As
an example screen readers tend to ignore alternative text (see
section~\ref{sec:alt}) if the \PDF{} is not fully tagged. Also
reflowing a \PDF{} only works for me (even if real space chars are in
the \PDF{}) if the \PDF{} is fully tagged (recent versions of the
Adobe Reader manage to reflow also untagged \PDF{} but it is very
slow).
This means that even if you don't care about a proper structure you
should try to add at least some minimal tagging. With the now
available automatic tagging of paragraphs all that is needed is to
use |testphase=phase-II| in |\DocumentMetadata|. With lualatex this
can work quite OK if you don't have unbalanced paragraphs in your
document (pdflatex is more fragile).
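A minimal document with such lazy tagging could look like the following
sketch (the class and the text are arbitrary):
\begin{taglstlisting}
\DocumentMetadata{testphase=phase-II}
\documentclass{article}
\begin{document}
First paragraph.

Second paragraph.
\end{document}
\end{taglstlisting}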
\subsection{Adding tagging to commands}
As mentioned above the mc-markers should not be nested. Basically you write:
\begin{taglstlisting}
\tagmcbegin{..}some text ...\tagmcend
\tagmcbegin{..}some other text\tagmcend
\end{taglstlisting}
This is quite workable as long as you mark everything manually. But
when defining commands you have to ensure that they correctly push and
pop the mc-chunks where needed.
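The following is only a sketch of such a command, assuming the commands
|\tag_mc_end_push:| and |\tag_mc_begin_pop:n| provided by \pkg{tagpdf} for
this purpose. The command name \cs{myterm} and the choice of the tag
\texttt{Span} are made up for illustration:
\begin{taglstlisting}
\ExplSyntaxOn
\NewDocumentCommand\myterm{m}
 {
  \tag_mc_end_push:       % end the open mc-chunk (if any) and remember it
  \tag_struct_begin:n {tag=Span}
  \tag_mc_begin:n {tag=Span}
  \textbf{#1}
  \tag_mc_end:
  \tag_struct_end:
  \tag_mc_begin_pop:n {}  % restart the surrounding mc-chunk
 }
\ExplSyntaxOff
\end{taglstlisting}
With this the mc-chunks before, inside and after the command are siblings and
not nested.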
\section{Alternative text, ActualText and text-to-speech software}\label{sec:alt}
The \PDF{} format allows to add alternative text through the
\PrintKeyName{/Alt} and the \PrintKeyName{/ActualText} key. Both can
be added either to the marked content in the page stream or to the
object describing the structure.
The value of \PrintKeyName{/ActualText} (inserted by \texttt{tagpdf}
with \PrintKeyName{actualtext}) is meant to replace single characters
or rather small pieces of text. It can be used also without any
tagging (e.g. with the package accsupp). If the \PDF{} reader supports
this (Adobe Reader does, Sumatra does not) one can change with it how a
piece of text is copied and pasted e.g. to split up a ligature.
\PrintKeyName{/Alt} (inserted by \texttt{tagpdf} with
\PrintKeyName{alt}) is a key to improve accessibility: with it one can
add to a picture or something else an alternative text.
The file \texttt{ex-alt-actualtext.tex} shows some experiments I made
with both keys and text-to-speech software (the built-in one of Adobe and
NVDA). To sum them up:
\begin{itemize}
\item The keys have an impact on text-to-speech software only if the
document is fully tagged.
\item \PrintKeyName{/ActualText} should be at best used around short
pieces of marked content.
\item \PrintKeyName{/Alt} is used at best with a structure -- this
avoids problems with luatex where marked contents blocks can be
split over pages.
\item To some extent one can get a not so bad reading of math with the
alternative text.
\end{itemize}
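Following these observations, the two keys could be used as in this sketch.
It assumes that \pkg{graphicx} is loaded and that the image
\texttt{example-image} (from the \pkg{mwe} package) is available; the
alternative text is a placeholder:
\begin{taglstlisting}
% /Alt on a structure
\tagstructbegin{tag=Figure,alt={Description of the picture}}
 \tagmcbegin{tag=Figure}
  \includegraphics{example-image}
 \tagmcend
\tagstructend
% /ActualText on a short mc-chunk, e.g. around a ligature
\tagmcbegin{tag=Span,actualtext=ff}ff\tagmcend
\end{taglstlisting}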
\section{Standard types and new tags}\label{sec:new-tag}
The tags used to describe the type of a structure element can be
rather freely chosen. PDF 1.7 and earlier only require that in a
tagged PDF all types are either from a known set of standard
types or are \enquote{role mapped} to such a standard type. Such a
role mapping is a simple key-value in the RoleMap dictionary.
So instead of |H1| the type |section| could be used. The role mapping
can then be declared with the |role/new-tag| key:
\begin{taglstlisting}
\tagpdfsetup{role/new-tag = section/H1}
\end{taglstlisting}
In PDF 2.0 the situation is a bit more complicated. At first PDF~2.0
introduced \emph{name spaces}. That means that a type can have more
than one \enquote{meaning} depending on the name space it belongs to.
|section (name space A)| and |section (name space B)| are two
different types.
At second PDF 2.0 still requires that a tagged PDF maps all types to a
standard type, but now there are three sets of standard types (The meanings
of the PDF types can be looked up in the \PDF{}-references
\parencite{pdfspec-iso32000-1,pdfspec-iso32000-2_2020}):
\begin{enumerate}
\item The \emph{standard structure namespace for PDF 1.7}, also called
the \emph{default standard structure namespace}. The public name of
the namespace is |tag/NS/pdf|. This can be used to reference the
namespace e.g. in attributes. These are the structure names from PDF
1.7 (\texttt{StructTreeRoot} is a bit special, it is not really a structure
name but nevertheless listed here):
\ExplSyntaxOn %%
\clist_clear:N\l_tmpa_clist
\prop_map_inline:cn
{g__tag_role_NS_pdf_prop}
{
\str_if_eq:eeT
{#1}
{\use_i:nn #2}
{ \clist_put_right:Nn \l_tmpa_clist {#1} }
}
\clist_use:Nn \l_tmpa_clist {,\c_space_tl }. %%
\ExplSyntaxOff
\item The \emph{standard structure namespace for PDF 2.0}. The public
name of the namespace is |tag/NS/pdf2|. This can be used to
reference the namespace e.g. in attributes. These are more or less
the same types as in PDF 1.7. The following types have been removed from
this set\footnote{They still can be used in a PDF 2.0 document!}:\\
%
\ExplSyntaxOn %
\clist_clear:N\l_tmpa_clist
\prop_map_inline:cn { g__tag_role_NS_pdf_prop }
{
\prop_if_in:cnF
{ g__tag_role_NS_pdf2_prop }
{#1}
{ \clist_put_right:Nn \l_tmpa_clist {#1} }
}
\clist_use:Nn \l_tmpa_clist {,\c_space_tl },\\
\ExplSyntaxOff %
and the following are new:\\
\ExplSyntaxOn %
\clist_clear:N\l_tmpa_clist %
\prop_map_inline:cn { g__tag_role_NS_pdf_prop }
{
\str_if_eq:eeF {#1}
{\use_i:nn #2}
{ \clist_put_right:Nn \l_tmpa_clist {#1} }
}
\clist_use:Nn \l_tmpa_clist {,\c_space_tl }.
\ExplSyntaxOff %
\item MathML 3.0 as an \emph{other namespace}. The public name of
the namespace is |tag/NS/mathml|. This can be used to reference the
namespace e.g. in attributes. There are nearly 200 types in this
name space, so I refrain from listing them here.
\end{enumerate}
To support this more complicated setup the syntax of the
\texttt{role/new-tag} key has been extended. It now takes as argument
a key-value list with the following keys. A normal document shouldn't
need the extended syntax, the simple syntax |section/H1| should in
most cases do the right thing.
\begin{description}
\item[\PrintKeyName{tag}] This is the name of the new type as it
should then be used in \cs{tagstructbegin}.
\item[\PrintKeyName{tag-namespace}] This is the namespace of the new type.
The value should be a shorthand of a namespace. The allowed values are
currently |pdf|, |pdf2|, |mathml| and |user|. The default value (and
recommended value for a new tag) is |user|. The public name of the user
namespace is |tag/NS/user|. This can be used to reference the namespace
e.g. in attributes.
\item[\PrintKeyName{role}] This is the type the tag should be mapped
to. In a PDF 1.7 or earlier this is normally a type from the |pdf|
set, in PDF 2.0 from the |pdf|, |pdf2| or |mathml| set. It can also
be a user type, then this user tag must have been declared
before. The PDF format allows mapping to be done transitively. But
you should be aware that tagpdf can't (or more precisely won't)
check if some unusual role mapping makes really sense, this lies in
the responsibility of the author.
\item[\PrintKeyName{role-namespace}] The default value is the default
namespace of the role: |pdf2| for all types in this set, |pdf| for
the types which exist only in PDF 1.7, |mathml| for the MathML types,
and for previously defined user types whatever namespace has been
set there. With this key the value can be overwritten.
\item[unknown key] An unknown key is interpreted as a |tag/role|,
this preserves the old syntax. So these two calls are equivalent:
\begin{taglstlisting}
\tagpdfsetup{role/new-tag = section/H1}
\tagpdfsetup{role/new-tag = {tag=section,role=H1}}
\end{taglstlisting}
\end{description}
The exact effects of the keys depend on the PDF version. With PDF 1.7
or older the namespace keys are ignored, with PDF 2.0 the namespace
keys are used to set up the correct rolemaps. The |tag-namespace| key is
also used to define the default namespace if the type is used as a
role or as tag in a structure.
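As a sketch, a declaration with the extended syntax could look like this. The
tag name |sidebar| is made up, and both namespace keys only spell out what
should be the defaults according to the description above:
\begin{taglstlisting}
\tagpdfsetup{
  role/new-tag = {tag=sidebar, tag-namespace=user,
                  role=Div, role-namespace=pdf2},
}
\end{taglstlisting}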
\subsection{The \texttt{latex} namespace}
Starting with version 0.98 work has started to set up specific latex tags. In
\PDF{} 2.0 this is done in form of a special name space, with \PDF{} 1.7 or older the
tags are role mapped. This is work in progress and bound to change.
\subsection{Fallback RoleMap}
As mentioned above PDF 2.0 supports name spaces for tags. This is quite nice: first because it
avoids name clashes, but also because it allows to build a cleaner model of the document structure.
But sadly support for PDF 2.0 is still quite scarce and while most PDF readers have no problems
opening and rendering a PDF 2.0 file they don't \enquote{see} the role mapping if name spaces are used.
Therefore since version 0.98t \pkg{tagpdf} additionally adds in PDF 2.0 files a global |/RoleMap|
dictionary as a fallback for such processors.
\subsection{Mathml}
In PDF 2.0 mathml tags have their own name space and can be freely used.
In PDF 1.7 they can only be used if they are rolemapped to a standard type.
By default they are not added to the |/RoleMap| dictionary, but this can be
forced with |\tagpdfsetup{role/mathml-tags}|. Please note that this adds mathml
at the end of the document and overwrites tags with the same name without warning.
\section{Checking parent-child rules}\label{sec:parent-child}
The \PDF{} references formulate various rules about whether a
structure can be a child of another structure, e.g. a \texttt{Sect}
can not be a child of \texttt{P}. In the \PDF{} 1.7 reference these
rules were rather vague, in the \PDF{} 2.0 reference there is a quite
specific matrix, which sadly misses some of the tags from \PDF{}
1.7. The now released ISO norm 32005 addresses this problem and
extends the matrix to cover tags from \PDF{} 1.7 and 2.0 (but it still
misses the \texttt{math} tag and mathml tags).
The rules in the matrix are not a simple allowed/not allowed. Instead
some rules determine that structure elements can appear only once in a
parent, or that additional requirements can be found in the
descriptions of the standard structure types, e.g. \texttt{Caption}
often has to be the first element in the parent structure, and
elements like \texttt{Part} and \texttt{Div} inherit restrictions from
parent structures. External standards like \PDF/UA can add more
rules.
Altogether this doesn't make it easy to check if a structure tree is
conformant or not without slowing down the compilation a lot.
With version 0.98 some first steps to do checks (and to react to the
result of a change) have been implemented. Some checks will lead to
a warning directly, but the majority will only be visible if the
log-level is increased.
Typical messages then look like this:
\begin{taglstlisting}[mathescape]
Package tagpdf Info: The rule between parent 'Sect (from Sect/pdf2)'
(tagpdf) and child 'H10 (from H10/pdf2)' is '1 (0..n)'
Package tagpdf Info: The rule between parent 'H2 (from subsection/latex)'
(tagpdf) and child 'H1 (from section/latex)' is '-1 ($\emptyset$)'
\end{taglstlisting}
The descriptions of the parent and child are rather verbose as the checks
have to take role mapping and name spaces into account. The result of a
check is a number---negative if the relation is not allowed, positive if
allowed. The text in the parentheses shows the symbols used in the
\PDF-matrix.
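The log-level is set with \cs{tagpdfsetup}. A minimal sketch, assuming that
the key is named |debug/log| and accepts the value |vv|; which level is
needed to see a particular message is not specified here:
\begin{taglstlisting}
\tagpdfsetup{debug/log=vv}
\end{taglstlisting}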
Be aware
\begin{itemize}
\item This doesn't test all rules, it only implements (hopefully
correctly) the matrix.
\item There can be differences between \PDF~1.7 and 2.0,
e.g. \texttt{FENote} is role-mapped to \texttt{Note} in \PDF~1.7 and
then has different containment rules.
\item The special tag \texttt{MC} stands for mc-chunks, so
\enquote{real content} (the matrix has containment rules for this
too).
\item Currently the only negative number is \texttt{\textminus1}
but that is bound to change, depending on if (and how) it is
possible to \enquote{repair} a disallowed parent-child relation.
\item Warnings can be wrong.
\end{itemize}
\section{\enquote{Real} space glyphs}\label{sec:spacechars}
TeX uses only spaces (horizontal movements) to separate words. That
means that a \PDF{} reader has to use some heuristic when copying text
or reflowing the text to decide if a space is meant as a word boundary
or e.g. as a kerning. Accessible documents should use real space
glyphs (U+0020) from a font in such places.
With the key \PrintKeyName{activate/spaces} you can activate such space
glyphs.
With pdftex this will simply call the primitive
\verb+\pdfinterwordspaceon+. pdftex will then insert at various places
a char from a font called dummy-space. Attention! This means that at
every space there are additional font switches in the \PDF{}: from the
current font to the dummy-space font and back again. This will make
the \PDF{} larger. As \verb+\pdfinterwordspaceon+ is a primitive
function it can't be fine tuned or adapted. You can only turn it on
and off and insert manually such a space glyph with
\verb+\pdffakespace+.
With luatex (in luamode) |activate/spaces| is implemented with a
lua-function which is inserted in two callbacks and marks up the
places where it seems sensible to insert a space glyph. Later in the
process the space glyphs are injected -- the code will take the glyph
from the current font if this has a space glyph or switch to the
default latin modern font. The current code works reasonably well in
normal text. |activate/spaces| can be used without actually tagging a
document.
The key-value \PrintKeyName{debug/show=spaces} will show lines at the places where
in lua mode spaces are inserted and so can help you to find
problematic places. For listings -- which have a quite specific
handling of spaces -- you can find a suggestion in the example
\texttt{ex-space-glyph-listings}.
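Both keys can be combined in one setup call, for example in the preamble
while checking a document with luatex:
\begin{taglstlisting}
\tagpdfsetup{activate/spaces, debug/show=spaces}
\end{taglstlisting}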
\emph{Attention:} Even with real spaces copy \& pasting of code doesn't
necessarily give the correct results: you get spaces but not necessarily
the right number of spaces. The \PDF{} viewers I tried all copied four
real space glyphs as one space. I only got the four spaces with the
export to text or xml in Adobe Acrobat Pro.
\begin{docCommand}{pdffakespace}{}\end{docCommand}
This is in pdftex a primitive. It inserts the dummy space glyph.
\pkg{tagpdf} defines this command also for luatex -- attention: it can
perhaps insert break points.
\begin{docCommands}
{
{doc name=tag_space_off:},
{doc name=tag_space_on:}
}
\end{docCommands}
The commands allow switching the insertion of space chars on and off.
With pdftex they map to the primitives \cs{pdfinterwordspaceoff}
and \cs{pdfinterwordspaceon}, which insert whatsits and so act globally.
The luatex implementation uses an attribute which is also set globally to stay
more or less consistent with pdftex. In dvi-mode the commands do nothing.
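The following sketch shows one possible way to use the two commands from document level. The wrapper names \cs{MySpaceGlyphsOff} and \cs{MySpaceGlyphsOn} are hypothetical and only serve the illustration; the sketch assumes that \pkg{tagpdf} is loaded and that space glyphs have been activated.
\begin{taglstlisting}
\ExplSyntaxOn
% hypothetical document level wrappers around the expl3 commands
\NewDocumentCommand\MySpaceGlyphsOff{}{\tag_space_off:}
\NewDocumentCommand\MySpaceGlyphsOn{}{\tag_space_on:}
\ExplSyntaxOff
Text with space glyphs,
\MySpaceGlyphsOff
text without space glyphs,
\MySpaceGlyphsOn
text with space glyphs again.
\end{taglstlisting}
As the setting acts globally, a surrounding group will not restore the previous state; the insertion has to be switched on again explicitly.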
\section{Structure destinations}\label{sec:struct-dest}
Standard destinations (anchors for internal links) consist of a
reference to a page in the pdf and instructions how to display
it---typically they will put a specific coordinate in the left top
corner of the viewer and so give the impression that a link jumped to
the word in this place. But in reality they are not connected to the
content.
Starting with pdf~2.0 destinations can in a tagged PDF also point to a
structure (to a \texttt{/StructElem} object). GoTo links can then
additionally to the \texttt{/D} key which points to a standard page
destination also point to such a structure destination with an
\texttt{/SD} key. Programs that e.g. convert such a PDF to html can then
create better links. (According to the reference, PDF viewers should prefer
the structure destination over the page destination, but as far as is
known this isn't done yet.)
At first structure destinations (and GoTo links making use of them) could
natively only be created with the dvipdfmx backend. With pdftex and
lualatex it was only possible to create a restricted type which used only
the \enquote{Fit} mode. Starting with \TeX{}live 2022 (earlier in miktex)
both engines know new keywords which allow creating structure
destinations easily, and support has already been added to the \PDF\
management and \pkg{tagpdf}. In most cases it should simply work, but one
should be aware that, as one now has a destination that is actually tied to
the content, it gets more important to consider the context and
the place where such destinations are created. It now makes a difference
whether the destination is created before the structure is opened or after, so
in some cases code that places destinations should be changed to place them
inside the structure they belong to. One also has to consider the pages
connected to the destinations: The structure destination is bound to the
page where the structure \emph{begins}. If this differs from the page of
the page destination (e.g. if the destination is created by a
\verb+\phantomsection+ in the middle of a longer paragraph) then it may
be necessary to surround destinations with a dummy structure (a Span or an
Artifact) to get the right page number.
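A rough sketch of such a dummy structure could look as follows. The command name \cs{MyStructTarget} is hypothetical, the sketch assumes that \pkg{hyperref} provides \cs{phantomsection}, and it has not been tested with all engines, so it may need adjusting.
\begin{taglstlisting}
\ExplSyntaxOn
% hypothetical helper: destination inside a dummy Span structure
\NewDocumentCommand\MyStructTarget{}
  {
    \tag_mc_end_push:          % interrupt an open mc
    \tagstructbegin{tag=Span}  % dummy structure, begins on the current page
    \phantomsection            % the destination is created inside the Span
    \tagstructend
    \tag_mc_begin_pop:n{}      % restart the interrupted mc
  }
\ExplSyntaxOff
\end{taglstlisting}
The idea is that the structure destination is then bound to the page where the Span begins, which is the page on which the destination itself is placed.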
\section{Storing and reusing boxes}\label{sec:savebox}
\TeX{} allows storing material in boxes and using these boxes once or
multiple times in other places. This poses some challenges to tagging. The
listings in the following examples use low-level \TeX{} box commands so
that changes in the \LaTeX{} commands that improve tagging do not interfere
in case you want to test this. To keep the examples short they don't show
the needed \cs{ExplSyntaxOn}/\cs{ExplSyntaxOff}.
\subsection{Boxes without tagging commands}
If no tagging commands were used (or if they were inactive) when the box
was stored then there is no problem to use this box with
pdf\LaTeX{}/generic mode in various places. So
\begin{taglstlisting}
\newbox\mybox
The\setbox\mybox\hbox{yellow} duck
The \box\mybox{} sun
\end{taglstlisting}
will produce (assuming para tagging is activated) the paragraph structures
\enquote{The duck} and \enquote{The yellow sun}.
With lua\LaTeX{}/lua mode this is different: The nodes in the box will have
the mc-attribute value attached which was active when the box was saved
and this value is recorded as kid of the first paragraph. So when the lua
code later wanders through the box to find all kids of the structure it will
also find the content of the \cs{usebox}. This means with lua\LaTeX{} we
get the two paragraph structures \enquote{The duck yellow} and \enquote{The
sun}.
The solution here is to reset the attributes before using the box:
\begin{taglstlisting}
The\setbox\mybox\hbox{yellow} duck
The \tag_mc_reset_box:N\mybox\box\mybox{} sun
\end{taglstlisting}
The box can in both modes be used without problems many times.
\subsection{Boxes with tagging commands}
We assume in the following that the box contains only well balanced tagging
commands and no parts that are \enquote{untagged}. It should be possible to
copy the whole box inside a \verb+\tagstructbegin+/\hspace{0pt}\verb+\tagstructend+
pair. So the following is fine as box content
\begin{taglstlisting}
box=\tagstructbegin{...}\tagmcbegin{} balanced content\tagmcend\tagstructend
box=
\tagmcbegin{}text\tagmcend
\tagstructbegin{...}\tagmcbegin{} balanced content\tagmcend\tagstructend
\tagmcbegin{}text\tagmcend
\end{taglstlisting}
but this is not (this case could probably be handled nevertheless with a bit
of care, at least in lua mode)
\begin{taglstlisting}
box= text\tagmcend\tagstructbegin{...}...\tagstructend\tagmcbegin{}text
\end{taglstlisting}
and this is absolutely unusable:
\begin{taglstlisting}
box= text\tagmcend\tagstructbegin{...}\tagmcbegin{}text
\end{taglstlisting}
We also assume that we want to move the structure of the box to the place
where the box is used (if the structure should stay where the box is saved,
simply save it and that will happen). For this we must add a structure that
we can stash and label.
\begin{taglstlisting}
\tag_mc_end_push: % interrupt an open mc
\tagstructbegin{tag=NonStruct,stash}
\edef\myboxnum{\tag_get:n{struct_num}} % store structure number
\setbox\mybox\hbox %or \vbox or ...
{content}
\tagstructend
\tag_mc_begin_pop:n{}% restart open mc
\end{taglstlisting}
At the place where the box is then used we also have to inject this
structure:
\begin{taglstlisting}
\tag_mc_end_push: % interrupt an open mc
\tag_struct_use_num:n {\myboxnum} % use structure
\box\mybox % use box
\tag_mc_begin_pop:n{}% restart open mc
\end{taglstlisting}
With pdf\LaTeX{} boxes with tagging commands can currently be used only
once. The tagging commands set labels and reusing the box gives multiple
label warnings.
With lua\LaTeX{} it is possible to reset the attributes as done with the
untagged box and then to reuse at least the content.
\subsection{Detecting tagging commands}
It is possible to detect if a box contains tagging commands by comparing
the state of the mc and structure counter:
\begin{verbatim}
\edef\statebeforebox{\inteval{\tag_get:n{struct_counter}+\tag_get:n{mc_counter}}}
\setbox\mybox ...
%compare numbers against \statebeforebox
\end{verbatim}
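The comparison itself could be sketched as follows. This is only an illustration under the assumption that the values returned by \verb+\tag_get:n+ expand to plain numbers; the messages written with \cs{typeout} are made up. The state is stored with \cs{edef} so that it is evaluated before the box is set.
\begin{taglstlisting}
\ExplSyntaxOn
\newbox\mybox
% store the sum of both counters before the box is set
\edef\statebeforebox
  {\inteval{\tag_get:n{struct_counter}+\tag_get:n{mc_counter}}}
\setbox\mybox\hbox{content}
% unchanged counters mean that the box contains no tagging commands
\int_compare:nNnTF
  {\tag_get:n{struct_counter}+\tag_get:n{mc_counter}} = {\statebeforebox}
  {\typeout{box~without~tagging~commands}}
  {\typeout{box~with~tagging~commands}}
\ExplSyntaxOff
\end{taglstlisting}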
\subsection{Putting everything together}
To tag boxes that can be both (without tagging commands or with balanced
tagging commands) the following strategy can be used:
\begin{itemize}
\item when storing the box put around it a structure as needed by the
tagged variant:
\begin{verbatim}
\tag_mc_end_push: % interrupt an open mc
\tagstructbegin{tag=NonStruct,stash}
\edef\myboxnum{\tag_get:n{struct_num}} % store structure number
\edef\statebeforebox{\inteval{\tag_get:n{struct_counter}+\tag_get:n{mc_counter}}}
\setbox\mybox\hbox %or \vbox or ...
{content}
%check if there is tagging content and store that
\tagstructend
\tag_mc_begin_pop:n{}% restart open mc
\end{verbatim}
\item when using the box the first time
\begin{itemize}
\item if it has no tagging commands then reset the attribute and use
the box.
\begin{verbatim}
The \tag_mc_reset_box:N\mybox\box\mybox{} sun
\end{verbatim}
The stashed \texttt{NonStruct} structure is then thrown away.
\item if there is a structure then use the stashed structure
\begin{verbatim}
\tag_mc_end_push: % interrupt an open mc
\tag_struct_use_num:n {\myboxnum} % use structure
\box\mybox % use box
\tag_mc_begin_pop:n{}% restart open mc
\end{verbatim}
\end{itemize}
\item if the box is used a second time then throw an error with
pdf\LaTeX{}. With lua\LaTeX{} reset the attributes and issue a
warning.
\end{itemize}
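The strategy for storing the box and for its first use can be sketched in one piece as follows. The boolean \verb+\l_my_box_tagged_bool+ is a hypothetical name introduced only for this illustration, and the handling of a second use is left out. This is a sketch under these assumptions, not a tested implementation.
\begin{taglstlisting}
\ExplSyntaxOn
\newbox\mybox
\bool_new:N \l_my_box_tagged_bool % hypothetical flag
% storing the box
\tag_mc_end_push:
\tagstructbegin{tag=NonStruct,stash}
\edef\myboxnum{\tag_get:n{struct_num}}
\edef\statebeforebox
  {\inteval{\tag_get:n{struct_counter}+\tag_get:n{mc_counter}}}
\setbox\mybox\hbox{content}
% record whether the content used tagging commands
\int_compare:nNnTF
  {\tag_get:n{struct_counter}+\tag_get:n{mc_counter}} = {\statebeforebox}
  {\bool_set_false:N \l_my_box_tagged_bool}
  {\bool_set_true:N \l_my_box_tagged_bool}
\tagstructend
\tag_mc_begin_pop:n{}
% first use of the box
\bool_if:NTF \l_my_box_tagged_bool
  {
    \tag_mc_end_push:
    \tag_struct_use_num:n {\myboxnum} % inject the stashed structure
    \box\mybox
    \tag_mc_begin_pop:n{}
  }
  {
    \tag_mc_reset_box:N \mybox % untagged content: reset the attributes
    \box\mybox
  }
\ExplSyntaxOff
\end{taglstlisting}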
\section{Accessibility is not only tagging}
A tagged \PDF{} is needed for accessibility but this is not enough. As
already mentioned there are more requirements:
\begin{itemize}
\item The language must be declared by adding a \texttt{/Lang xx-XX} to
the \PDF{} catalog or -- if the language changes for a part of the
text -- to the structure or the marked content. Setting the document
language can be done with the \texttt{lang} option of
\cs{DocumentMetadata}. For settings in marked content and structure
the \texttt{lang} key can be used too.
\item All characters must have a Unicode representation or a suitable
alternative text. With lualatex and open type (Unicode) fonts this
is normally not a problem. With pdflatex it could need additional
\verb+\pdfglyphtounicode+ commands.
\item Hard and soft hyphens must be distinct. In luamode this is now handled through
the \texttt{activate/softhyphen} key. For pdftex no solution is known.
\item Spaces between words should be space glyphs and not only a
horizontal movement. See section~\ref{sec:spacechars}.
\item Various small infos must be present in the catalog dictionary,
info dictionary and the page dictionaries, e.g. metadata like title.
This can be done with the options of \cs{DocumentMetadata}. See the
documentation of \texttt{l3pdfmeta} for details.
\end{itemize}
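The language settings from the first item can be illustrated with a small sketch. The text and the language codes are made up, and the sketch assumes that automatic paragraph tagging is active, so that the running mc-chunk has to be closed before the new structure is opened.
\begin{taglstlisting}
\DocumentMetadata{lang=en-UK,testphase=phase-III}
\documentclass{article}
\begin{document}
An English sentence with a
% close the running mc-chunk, then open a structure with its own language
\tagmcend
\tagstructbegin{tag=Span,lang=de-DE}%
\tagmcbegin{}deutschen Einschub\tagmcend
\tagstructend
% resume the content of the paragraph
\tagmcbegin{} in the middle.
\end{document}
\end{taglstlisting}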
\section{Debugging}
While developing commands and tagging a document, it can be useful to get
some info about the current structure. For this a show command is provided:
\begin{docCommand}{ShowTagging}{\marg{key-val}}\end{docCommand}
This command takes as argument a key-val list which implements a number of show options.
\begin{description}
\item[\PrintKeyName{mc-data}] This key is relevant for luamode only.
It shows the data of all mc-chunks created so far.
It is accurate only after shipout, so typically should be issued after a
\cs{newpage}. The value is a positive integer and sets the first mc-chunk shown. If no value
is given, 1 is used and so all mc-chunks created so far are shown.
\item[\PrintKeyName{mc-current}] This key shows the number and the tag of
the currently open mc-chunk. If no chunk is open it shows only the
state of the absolute counter. It works in all modes,
but the output in luamode looks different.
\item[\PrintKeyName{struct-stack}] This key shows the current
structure stack. Typically it will contain at
least |root| and |Document|.
With the value |log| the info is only
written to the log-file, |show| stops the compilation and shows on the terminal.
If no value is used, then the default is |show|.
\item[\PrintKeyName{debug/structures}] This key is only available if
the package \pkg{tagpdf-debug} has been loaded too. It takes as
value a number (the default is 0), and shows on the terminal and
in the log information about all structures with a number equal to or
larger than that number. The output avoids showing PDF object numbers
to make it more usable for test suites.
\end{description}
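A few typical calls, using only the keys and values described above (the number 3 is an arbitrary example value), could look like this:
\begin{taglstlisting}
% write the structure stack to the log only
\ShowTagging{struct-stack=log}
% show number and tag of the currently open mc-chunk
\ShowTagging{mc-current}
% mc-data is accurate only after shipout
\newpage
% luamode only: show the mc-chunks starting with number 3
\ShowTagging{mc-data=3}
\end{taglstlisting}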
\section{To-do}
\begin{itemize}
\item Add commands and keys to enable/disable the checks.
\item Check/extend the code for language tags.
\item Think about math (progress: examples using luamml, associated files exists).
\item Think about Links/Annotations
(progress: mostly done, see section~\ref{sec:link+annot} and the code in \pkg{l3pdffield})
\item Keys for alternative and actualtext. How to define the input encoding? Like in Accsupp?
(progress: keys are there, but encoding interface needs perhaps improving)
\item Check twocolumn documents
\item Examples
\item Write more Tests
\item Unicode
\item Hyphenation char
\item Think about included (tagged) \PDF{}. Can one handle them?
\item Improve the documentation (progress: it gets better)
\item Tag as proof of concept the documentation (nearly done)
\item Document the code better (progress: mostly done)
\item Create dtx (progress: done)
\item Find someone to check and improve the lua code
\item Move more things to lua in the luamode
\item Find someone to check and improve the rest of the code
\item Check differences between \PDF{} versions 1.7 and 2.0. (progress:
WIP, namespaces done)
\item bidi?
\end{itemize}
\makeatletter % fix TOC of History
\addtocontents{toc}{\def\string\l@subsection{\string\@dottedtocline{2}{1.5em}{3em}}}
\makeatother
\section{History}
This section lists important changes during the development of the package.
More can be found in the \texttt{CHANGELOG.MD} and by checking the git
commits.
\subsection{Changes in 0.3}
In this version I improved the handling of alternative and actual text. See
section~\ref{sec:alt}. This change meant that the package relies on the
module \texttt{l3str-convert}.
I no longer try to (pdf-)escape the tag names: it is a bit unclear how
best to do it with luatex. This will perhaps change again later.
\subsection{Changes in 0.5}
I added code to handle attributes and attribute classes, see
section~\ref{sec:attributes} and corrected a small number of code errors.
I added code to add \enquote{real} space glyphs to the \PDF{}, see section
\ref{sec:spacechars}.
\subsection{Changes in 0.6}
\textbf{Breaking change!} The attributes used in luamode to mark the
MC-chunks are no longer set globally. I thought that global attributes
would make it easier to tag, but it only leads to problems when e.g. header
and footer are inserted. So from this version on the attributes are set
locally and the effect of a \verb+\tagmcbegin+ ends with the current group.
This means that in some cases more \verb+\tagmcbegin+ are needed and this
affected some of the examples, e.g. the patching commands for sections with
KOMA. On the other side it means that quite often one can omit the
\verb+\tagmcend+ command.
\subsection{Changes in version 0.61}
\begin{itemize}
\item internal code adaptions to expl3 changes.
\item dropped the compresslevel key -- probably not needed.
\end{itemize}
\subsection{Changes in version 0.8}
\begin{itemize}
\item As a first step to include the code proper in the \LaTeX\ kernel
the module name has changed from \texttt{uftag} to \texttt{tag}. The
commands starting with |\uftag| will stay valid for some time but
then be deprecated.
\item \textbf{Breaking change!} The argument of \texttt{role/new-attribute} (old key name: \texttt{newattribute})
option should no longer add the dictionary bracket \verb+<<..>>+,
they are added by the code.
\item \textbf{Breaking change!} The package now requires the new PDF
management as provided for now by the package
\pkg{pdfmanagement-testphase}. \pkg{pdfmanagement-testphase}
prepares the ground for better support for tagged PDF in \LaTeX{}. It
is part of a larger project to automatically generate tagged PDF, see
\url{https://www.latex-project.org/news/2020/11/30/tagged-pdf-FS-study/}
\item Support to add associated files to structures has been added with
new keys \texttt{AF}, \texttt{AFinline} and \texttt{AFinline-o}.
\item \textbf{Breaking change!} The support for other 8-bit input
encodings has been removed. utf8 is now the required encoding.
\item The keys |lang|, |ref| and |E| have been added for structures.
\item The new hooks of \LaTeX\ are used to tag many paragraphs
automatically. The small red numbers around paragraphs in the
documentation show them in action. The main problem here is not to
tag a paragraph, but to avoid tagging too many: paragraphs pop up in
many places.
\end{itemize}
\subsection{Changes in version 0.81}
\begin{itemize}
\item Hook code to tag links (URI and GoTo type) have been added. So
normally they should simply work if tagging is activated.
\item Commands and keys to allow automatic paragraph tagging have been
added. See section~\ref{sec:paratagging}. As can be seen in this
documentation the code works quite well already, but one should be
aware that \enquote{paragraphs} can appear in many places and
sometimes there are even more paragraph begins than ends.
\item A key to test if local or global setting of the mc-attributes in
luamode is more sensible, see \ref{sec:global-local} for more
details.
\item New commands to store and reset mc-tags.
\item PDF 2.0 namespaces are now supported.
\end{itemize}
\subsection{Changes in version 0.82}
A command |\tag_if_active:TF| to test if tagging is active has been added.
This allows external packages to write conditional code.
The commands |\tag_struct_parent_int:| and |\tag_struct_insert_annot:nn|
have been added. They allow adding annotations to the structure.
\subsection{Changes in version 0.83}
|\tag_finish_structure:| has been removed, it is no longer a public
command.
\subsection{Changes in version 0.90}
\begin{itemize}
\item Code has been cleaned up and better documented.
\item \textbf{More engines supported} The generic mode of \pkg{tagpdf}
now works (theoretically, it is not much tested) with all engines
supported by the \PDF\ management. So compilations with Xe\LaTeX{} or
with dvips should work. But it should be noted that these engines and
backends don't support the |interwordspace| option. With Xe\LaTeX{}
it is perhaps possible to implement something with
|\XeTeXinterchartoks|, but for the dvips route I don't see an option
(apart from lots of manual macros everywhere).
\item \textbf{MC-attributes are global again} In\sidenote{Breaking
change!} version 0.6 the attributes used in luamode to mark the
MC-chunks were no longer set globally. This avoided a number of
problems with header and footer and background material, but further
tests showed that it makes it difficult to correctly mark things like
links which have to interrupt the current marking code---the
attributes couldn't easily escape groups added by users. See
section~\ref{sec:global-local} for more details.
\item \textbf{key global-mc removed:} Due to the changes in the attribute
keys this key is no longer needed.
\item \textbf{key check-tags removed:} It doesn't fit. Checks are handled
over the logging level.
\item |\tagpdfget| has been removed, use the expl3 version if needed.
\item The show commands |\showtagpdfmcdata|, |\showtagpdfattributes|,
|\showtagstack| have been removed and replaced by a more flexible
command |\ShowTagging|.
\item The commands |\tagmcbegin| and |\tagmcend| no longer ignore
following spaces or remove earlier ones. While this is nice in some
places, it also ate spaces in places where this wasn't expected. From
now on both commands behave exactly like the expl3 versions.
\item The lua-code to add real space glyphs has been separated from the
tagging code. This means that |activate/spaces| now works also if
tagging is not active.
\item The key |activate| has been added, it opens the first structure, see
above.
\end{itemize}
\subsection{Changes in version 0.92}
\begin{itemize}
\item support for page breaks in pdftex has been added, see
section~\ref{sec:splitpara},
\item header and footer are tagged as artifacts automatically, see
section~\ref{sec:header-footer}.
\item keys \texttt{alttext-o} and \texttt{actualtext-o} have been removed.
\texttt{alttext} and \texttt{actualtext} will now expand once.
\end{itemize}
\subsection{Changes in version 0.93}
\begin{itemize}
\item Support for associated files in the root element (key
\texttt{root-AF}) has been added. This allows e.g. adding a css-file
which is used if the \PDF\ is converted to html.
\item First steps have been done to adapt the package to planned changes
in \LaTeX{}: The command \cs{DocumentMetadata} will be added to the
format and will take over the role of \cs{DeclareDocumentMetadata}
from \pkg{pdfmanagement-testphase} and additionally will also load
the pdf management code. This will simplify the documents as it will
no longer be needed to load the package.
\item The package has now support for \enquote{structure destinations}.
This is a new type of destinations in \PDF~2.0. For pdftex and luatex
this requires new binaries. They will be included in texlive 2022,
miktex already has the new pdftex, the new luatex will probably
follow soon.
\item The commands \cs{tagpdfifluatexT}, \cs{tagpdfifluatexTF} and
\cs{tagpdfifpdftexT} have been removed.
\end{itemize}
\subsection{Changes in version 0.94}
In this version a small package, \pkg{tagpdf-base} has been added. It
provides no-op versions of the main expl3 user commands for packages that
want to support tagging but can't be sure if the \pkg{tagpdf} package has
been loaded.
\subsection{Changes in version 0.95}
Small bug fixes.
\subsection{Changes in version 0.96}
\begin{itemize}
\item The \texttt{alttext} key has been renamed to \texttt{alt}, the
other key name exists as alias.
\item The new command |\tag_struct_object_ref:n| allows to create the
object reference of a structure.
\item a new key \texttt{parent} has been added to allow structures to
choose their parent structure.
\item a new option \texttt{paratag} allows to change the tag name used
for the automatically tagged paragraphs.
\item the commands |\tag_start:|, |\tag_stop:|, |\tag_stop:n| and
|\tag_start:n| allow to stop and start tagging (for example in trial
typesetting).
\item Small bug fixes.
\end{itemize}
\subsection{Changes in version 0.98}
\begin{itemize}
\item The declarations of tag namespaces have been externalized and are
now read from files when \pkg{tagpdf} is loaded.
\item The \PDF{} format (and some of the standards) declares various
parent-child rules for structure tags. A first step to implement these
rules and check if they are fulfilled has been done. More
information can be found in section~\ref{sec:parent-child}.
\item As a side effect of the new rule checking, the requirements for new
tags have been tightened: Adding a new tag with add-new-tag now
requires that the target role is defined. Unknown roles raise an error.
\item |\tagmcbegin| no longer requires that a tag is set, instead it will
pick up the tag name from the surrounding structure.
\item Structure destinations are now created also with \PDF
\textless\,2.0. They shouldn't harm and can improve the html export.
\end{itemize}
\subsection{Changes in version 0.98a}
Small bug fixes in code and documentation.
\subsection{Changes in version 0.98b}
The main change is that from now on every structure has an ID and an IDtree is
added. The ID of a structure can be retrieved with |\tag_get:n|,
see~\ref{sec:retrieve}.
\subsection{Changes in version 0.98e}
\begin{itemize}
\item The main change is that the automatic paratagging uses now a
two-level structure. This accompanies development in the \LaTeX\ github
in the \texttt{latex-lab} package regarding the tagging of blocks like
lists or verbatim. See~\ref{sec:paratagging} and also
\texttt{latex-lab-block-tagging.dtx} for more background.
\item The command |\tag_struct_end:n| has been added to improve debugging.
\end{itemize}
\subsection{Changes in version 0.98k}
The luamode has been adapted and now allows also the compilation with
dvilualatex. By default it will insert specials for \texttt{dvips} into the
dvi. But be aware that \texttt{dvips} can normally not be used as it can't
handle open type fonts; an extended version would be needed which isn't in
texlive yet. It is also possible to use \texttt{dvipdfmx} as backend (which
already has support for open type fonts), for this you need to use
\texttt{backend=dvipdfmx} in the \cs{DocumentMetadata} command. Real space
chars will work, but are currently not taken from the current font. This
will be improved in the next luaotfload version. The compilation with
dvilualatex is not much tested yet.
\subsection{Changes in version 0.98l}
In 2023 the primitives to write literal code into the pdf have been
extended in all engines and now allow to delay the expansion of their
argument to the shipout. This made it possible to greatly simplify and
speed up the code used in generic mode to number the MC-chunks. In most
cases building the structure should now need only two or three
compilations. The new code requires a current pdfmanagement-testphase and
is then used automatically if the new engines are detected.
\subsection{Changes in version 0.99f}
Deprecated |\tag_start:|, |\tag_stop:|, |\tag_stop:n| and
|\tag_start:n| in favor of |\tag_suspend:n| and |\tag_resume:n|.
\printbibliography[heading=bibintoc]
\appendix
\section{Some remarks about the \PDF{} syntax}
This is not meant as a full reference, only as background to make the
examples and remarks easier to understand.
\begin{description}
\item[postfix notation] \PDF{} uses in various places postfix
notation. This means that the operator is behind its arguments:
\begin{tikzpicture}[baseline=(c.base),alt={Illustration of postfix notation}]
\node[arg](a1) {18};
\node[arg,right=of a1.east](a2) {0};
\node[operator,right= of a2.east](c) {obj};
\draw[->] (c.south) --++(0,-2mm) -| (a1);
\draw[->] (c.south) --++(0,-2mm) -| (a2);
\end{tikzpicture}
\begin{tikzpicture}[baseline=(c.base),alt={Illustration of postfix notation}]
\node[arg](a1) {18};
\node[arg,right=of a1.east](a2) {0};
\node[operator,right= of a2.east](c) {R};
\draw[->] (c.south) --++(0,-2mm) -| (a1);
\draw[->] (c.south) --++(0,-2mm) -| (a2);
\end{tikzpicture} (a reference (operator R) to an object)
\begin{tikzpicture}[baseline=(c.base),alt={Illustration of postfix notation}]
\node[arg](a1) {1};
\node[arg,right = of a1.east](a2) {0};
\node[arg,right = of a2.east](a3) {0};
\node[arg,right = of a3.east](a4) {1};
\node[arg,right = of a4.east](a5) {100.2};
\node[arg,right = of a5.east](a6) {742};
\node[operator,right = of a6.east](c) {Tm};
\draw[->] (c.south) --++(0,-2mm) -| (a6);
\draw[->] (c.south) --++(0,-2mm) -| (a5);
\draw[->] (c.south) --++(0,-2mm) -|(a4);
\draw[->] (c.south) --++(0,-2mm) -|(a3);
\draw[->] (c.south) --++(0,-2mm) -| (a2);
\draw[->] (c.south) --++(0,-2mm) -|(a1);
\end{tikzpicture}
\begin{tikzpicture}[baseline=(c.base),alt={Illustration of postfix notation}]
\node[arg](a1) {/P};
\node[arg,right = of a1.east](a2) {<<...>>};
\node[operator,right = of a2.east](c) {BDC};
\draw[->] (c.south) --++(0,-2mm) -| (a1);
\draw[->] (c.south) --++(0,-2mm) -| (a2);
\end{tikzpicture}
\item[Names] \PDF{} knows a sort of variable called a
\enquote{name}. Names start with a slash and may include any regular
characters, but not delimiter or white-space characters. Uppercase
and lowercase letters are considered distinct: \texttt{/A} and
\texttt{/a} are different names. \verb+/.notdef+ and
\verb+/Adobe#20Green+ are valid names.
Quite a number of the options of \texttt{tagpdf} actually define
such a name which is later added to the \PDF{}. I \emph{strongly}
recommend not using spaces and exotic chars in such
names. While it is possible to escape such names, it is rather a pain
when moving them through the various lists and commands and quite
probably I forgot some place where it is needed.
\item[Strings]There are two types of strings: \emph{Literal strings}
are enclosed in round parentheses. They normally contain a mix of
ascii chars and octal numbers:
\verb+(gr\374\377ehello[]\050\051)+.
\emph{Hexadecimal strings} are enclosed in angle brackets. They
allow for a representation of all characters of the whole Unicode
range. This is the default output of lualatex.
\texttt{<003B00600243013D0032>}.
\item[Arrays] Arrays are enclosed by square brackets. They can contain
all sorts of objects including more arrays. As an example, here is an
array which contains five objects: a number, an object reference, a
string, a dictionary and another array. Be aware that despite the
spaces \texttt{15 0 R} is \emph{one} element of the array.
\mbox{\texttt{[0 15 0 R (hello) <<...>> [1 2 3]]}}
\begin{tikzpicture}[baseline=(c.base),alt={Illustration of array}]
\node[arg](a1) {0};
\node[arg,right = of a1.east](a2) {15 0 R};
\node[arg,right = of a2.east](a3) {(hello)};
\node[arg,right = of a3.east](a4) {<<...>>};
\node[arg,right = of a4.east](a5) {[1 2 3]};
\end{tikzpicture}
\item[Dictionaries] Dictionaries are enclosed by double angle
brackets. They contain key-value pairs. The key is always a
name. The value can be all sorts of objects including more
dictionaries. It doesn't matter in which order the keys are given.
Dictionaries can be written all in one line:\\
\texttt{<</Type /Page /Contents 3 0 R /Resources 1 0 R /MediaBox [0 0 595.276 841.89] /Parent 5 0 R>>}\\
but at least for examples a layout with line breaks and indentation is more readable:
\begin{taglstlisting}
<<
/Type /Page
/Contents 3 0 R
/Resources 1 0 R
/MediaBox [0 0 595.276 841.89]
/Parent 5 0 R
>>
\end{taglstlisting}
\item[(indirect) objects] These are enclosed by the keywords
\texttt{obj} (which has two numbers as prefix arguments) and
\texttt{endobj}. The first argument is the object number, the
second a generation number -- if a \PDF{} is edited objects with a
larger generation number can be added. As with pdflatex/lualatex
the \PDF{} is always newly created, we can safely assume that the number is
always 0. Objects can be referenced in other places with the
\texttt{R} operator. The content of an object can be all sorts of
things.
\item[streams] A stream is a sequence of bytes. It can be long and is
used for the real content of \PDF{}: text, fonts, content of
graphics. A stream starts with a dictionary which at least sets the
\texttt{/Length} name to the length of the stream followed by the
stream content enclosed by the keywords \texttt{stream} and
\texttt{endstream}.
Here is an example of a stream, an object definition and reference. In the
object 2 (a page object) the \texttt{/Contents} key references the
object 3 and this then contains the text of the page in a stream.
\texttt{Tf}, \texttt{Tm} and \texttt{TJ} are (postfix) operators, the
first chooses the font with the name \texttt{/F15} at the size 10.9,
the second displaces the reference point on the page and the third
inserts the text.
\begin{taglstlisting}
% a page object (shortened)
2 0 obj
<<
/Type/Page
/Contents 3 0 R
/Resources 1 0 R
...
>>
endobj
%the /Contents object (/Length value is wrong)
3 0 obj
<</Length ...>>
stream
BT
/F15 10.9 Tf 1 0 0 1 100.2 746.742 Tm [(hello)]TJ
ET
endstream
endobj
\end{taglstlisting}
In such a stream the \texttt{BT}--\texttt{ET} pair encloses texts while
drawing and graphics are outside of such pairs.
\item[Number tree] This is a more complex data structure that is meant to
index objects by numbers. At its core is an array with number-value
pairs. A simple version of a number tree which has the keys 0 and 3 is
\begin{taglstlisting}
6 0 obj
<<
/Nums [
0 [ 20 0 R 22 0 R]
3 21 0 R
]
>>
endobj
\end{taglstlisting}
This maps 0 to an array and 3 to the object reference \texttt{21 0 R}.
Number trees can be split over various nodes -- root, intermediate and
leaf nodes. We will need such a tree for the \emph{parent tree}.
\end{description}
\end{document}