\MakeShortVerb{\|}
% this needs the typehtml.sty, either from ctan or I'll send it you.
% It is exactly the stuff from the dtx file on ctan, except the
% examples cut down a bit to fit two column linewidth.
\makeatletter
\def\multispan{\omit\@multispan}
\def\@multispan#1{%
\@multicnt#1\relax
\loop\ifnum\@multicnt>\@ne \sp@n\repeat}
\def\sp@n{\span\omit\advance\@multicnt\m@ne}
\makeatother
\title{\textsf{typehtml}: A \LaTeX\ package to typeset HTML}
\author{David Carlisle}
\begin{Article}
\section{Introduction}
This package enables the processing of HTML codes. The
\verb|\dohtml| command
allows fragments of HTML to be placed within a \LaTeX\
document,
\begin{verbatim}
\dohtml
<html>
  html markup ...
</html>
\end{verbatim}
The \verb|<html>|\ldots\verb|</html>| is \emph{required}. (It is
anyway a good idea to have these tags in an HTML document.)
The \verb|\htmlinput| command is similar, but takes a file
name as argument. In that case the file need not start
and end with \verb|<html>|\ldots\verb|</html>|.
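
As a minimal sketch of the two commands in use (the file name
\texttt{notes.html} and the HTML content are hypothetical, chosen only
for illustration), a complete document might look like this:
\begin{verbatim}
\documentclass{article}
\usepackage{typehtml}
\begin{document}
% Inline fragment: the html tags
% are required here.
\dohtml
<html>
<h1>A Heading</h1>
Some <em>emphasised</em> text.
</html>
% Whole file: notes.html is a
% hypothetical file name.
\htmlinput{notes.html}
\end{document}
\end{verbatim}
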
This package covers most of the HTML2 DTD, together with the
mathematics extensions from HTML3.\footnote
{The draft specification of HTML3 has expired, and the W3C group
are currently devising a new proposed extension of HTML, so the
mathematics typesetting part of this package may need substantial
revision once a final specification of the HTML mathematics markup
is agreed.}
The rest of HTML3 may be added at a later date.
Its current incarnation has not been extensively tested, having been
thrown together during a couple of weeks in response to a
question on \texttt{comp.text.tex} about the availability of such a
package.
The package falls into three sections. Firstly the options section
allows a certain amount of customisation, and enabling of
extensions. Not all these options are fully operational at present.
Secondly comes a section that implements a kind of SGML parser. This
is not a real conforming SGML parser (not even a close approximation
to such a thing!). The assumption (sadly false in the anarchic WWW)
is that any document will have been validated by a conforming SGML
parser before it ever gets to the stage of being printed by this
package. Finally there is a set of declarations that essentially maps
the declarations of the HTML DTD into \LaTeX\ constructs.
\section{Options}
\subsection{HTML Level}
The options \texttt{html2} (the default) and \texttt{html3} control the
HTML variant supported. Using the \texttt{html3} option will use up a
lot more memory to support the extra features, and the math entity
(symbol) names. Against my better judgement there is also a
\texttt{netscape} option to allow some of the non-HTML tags accepted
by that browser.
\subsection{Headings}
The six options \texttt{chapter}, \texttt{chapter*}, \texttt{section},
\texttt{section*}, \texttt{subsection} and \texttt{subsection*}
determine to which \LaTeX\ sectional command the HTML element
\texttt{h1} is mapped. (\texttt{h2}--\texttt{h6} will
automatically follow suit.) The default is \texttt{section*}.
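
Assuming that these options are passed in the usual way through the
optional argument of \verb|\usepackage|, a preamble selecting HTML3
support and mapping \texttt{h1} to unnumbered chapters could be
sketched as follows (the choice of the \textsf{report} class is only
an illustration):
\begin{verbatim}
% report is used here because it
% provides \chapter.
\documentclass{report}
% html3:    HTML3 math support
% chapter*: h1 becomes \chapter*
\usepackage[html3,chapter*]{typehtml}
\end{verbatim}
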
\subsection{Double Quote Handling}
Most HTML pages use |"| as a quotation mark in text, for
example:
\begin{verbatim}
quoted "like this" example
\end{verbatim}
This slot in the ISO latin-1 encoding is for `straight' double
quotes. Unfortunately the standard \TeX\ fonts in the OT1 encoding
do not have such a character, only left and right quotes, ``like
this''. By default this package uses the \texttt{straightquotedbl}
option which uses the \LaTeX\ command |\textquotedbl| to render
|"|. If used with the T1 encoded fonts |\usepackage[T1]{fontenc}|
then the straight double quote from the current font is used. With
OT1 fonts, the double quote is taken from the |\ttfamily| font,
which looks \texttt{\char'042}like this\texttt{\char'042}. That is fairly
horrible, but better than the alternative, which is ''like this''.
The \texttt{smartquotedbl} option redefines |"| so that it produces
alternately an open double quote `` then a close ''. As there is a
chance of it becoming confused, it is reset to `` at the beginning
of every paragraph, whatever the current mode.
Neither of these options affects the use of |"| as part of the SGML
syntax to surround attribute values.
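
The two quote styles described above might be selected as in the
following sketch. Only one of the two \textsf{typehtml} lines would be
used in a real preamble, and combining the package with T1 fonts is
simply the arrangement suggested in the text:
\begin{verbatim}
% T1 fonts contain a straight
% double quote character.
\usepackage[T1]{fontenc}
% Default behaviour, stated
% explicitly:
\usepackage[straightquotedbl]{typehtml}
% Alternative: alternate between
% open and close quotes.
% \usepackage[smartquotedbl]{typehtml}
\end{verbatim}
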
In principle the package ought to have similar options dealing with
the single quote, but there the situation is more complicated due to
its dual use as an apostrophe, so currently the package takes no
special precautions: all single quotes are treated as a closing
quote/apostrophe. Also the conventions of `open' and `close' quotes
only really apply to English. If someone wants to suggest what the
package should do with |"| in other languages\ldots
\subsection{Images}
The default option is \texttt{imgalt}. This means that all inline
images (the HTML \texttt{img} element) are replaced by the text
specified by the \texttt{alt} attribute, or \textsf{[image]} if no
such attribute is specified.
The \texttt{imggif} option\footnote{one day\dots} uses the
\verb|\includegraphics| command so that inline images appear as
such in the printed version.
The \texttt{imgps} option\footnotemark[9] is similar to
\texttt{imggif} but first replaces the extension \texttt{.gif} at
the end of the source file name by \texttt{.ps}. This will enable
drivers that cannot include GIF files to be used, as long as the
user keeps the image in both PostScript and GIF formats.
\subsection{Hyperref}
Several options control how the HTML anchor tag is treated.
The default \texttt{nohyperref} option ignores \texttt{name} anchors, and
typesets the body of \texttt{href} anchors using |\emph|.
The \texttt{ftnhyperref} option is similar to \texttt{nohyperref},
but adds a footnote showing the destination address of each link,
as specified by the \texttt{href} attribute.
If the \texttt{hyperref} option is specified, the hypertext markup
in the HTML file will be replicated using the
hypertext specials of the Hyper\TeX\ group. If in addition the
\textsf{hyperref} package is loaded, the extra features of that
package may be used, for instance producing `native PDF' specials
for direct use by Adobe Distiller rather than producing the specials
of the Hyper\TeX\ conventions.
The \texttt{dviwindo} option converts the hypertext information in
the HTML into the |\special| conventions of Y\&Y's \emph{dviwindo}
previewer for Microsoft Windows.
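
As a sketch of how one of these anchor treatments might be chosen
(assuming, as for the other options, that they are given as package
options; only one line would be active at a time):
\begin{verbatim}
% Footnote each link destination:
\usepackage[ftnhyperref]{typehtml}
% Or emit HyperTeX specials:
% \usepackage[hyperref]{typehtml}
% Or use dviwindo specials:
% \usepackage[dviwindo]{typehtml}
\end{verbatim}
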
\subsection{Big Integrals}
\LaTeX\ does not treat integral signs as variable sized symbols,
in the way that it treats delimiters such as brackets. In common
with summation signs and a few other operators, they come in
just two fixed sizes, a small version for inline mathematics, and a
large version used in displays. In fact by default \LaTeX\ always
uses the same two sizes (from the 10\,pt math extension font) even if
the document class has been specified with a size option such as
\texttt{12pt}, or if a size command such as |\large| has been used.
The standard \textsf{exscale} package loads the math extension font
at larger sizes if the current font size is larger than 10\,pt.
The HTML3 math description explicitly states that integral signs
should be treated like delimiters and stretch if applied to a large
math expression. By default this package ignores this advice and
treats integral signs in the standard way; however, an option
\texttt{bigint} does cause integral signs to `stretch' (or at least
be taken from a suitably large font). The standard Computer Modern
fonts use a very `sloped' integral which means that they are
not really suitable for being stretched. Some other math fonts, for
instance Lucida, have more vertical integral signs, and one could
imagine in those cases making an integral sign with a `repeatable'
vertical middle section so that it could grow to an arbitrary size, in
the way that brackets grow.
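
A preamble along the following lines is one way the pieces mentioned
in this subsection might be combined. Loading \textsf{exscale}
together with \texttt{bigint} is an assumption made for illustration,
not a documented requirement:
\begin{verbatim}
\documentclass[12pt]{article}
% Scale the math extension font
% with the current font size.
\usepackage{exscale}
% html3 enables the math element;
% bigint lets integrals stretch.
\usepackage[html3,bigint]{typehtml}
\end{verbatim}
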
\section{Latin-1 characters}
The SGML character entities for the ISO-Latin1 characters such as
\texttt{\&eacute;} are recognised by this style, although as usual,
some of them such as the Icelandic thorn character,
\texttt{\&thorn;}, \verb|\th|, produce an error if the old `OT1'
encoded fonts are being used. These characters will print correctly
if `T1' encoded fonts are used, for example by declaring
\verb|\usepackage[T1]{fontenc}|~.
HTML also allows direct 8-bit input of characters according to the
ISO-latin1 encoding. To use this you need to enable latin-1 input
for \LaTeX\ with a declaration such as
\verb|\usepackage[latin1]{inputenc}|~.
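
Putting the two declarations together, a preamble prepared for both
entity names and direct 8-bit input could be sketched as follows (the
order of the three lines is an assumption):
\begin{verbatim}
% T1 fonts: needed for characters
% such as thorn.
\usepackage[T1]{fontenc}
% Accept ISO latin-1 8-bit input.
\usepackage[latin1]{inputenc}
\usepackage{typehtml}
\end{verbatim}
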
\section{Mathematics}
The HTML3 \texttt{math} element is fairly well supported, including the
\texttt{box}
and \texttt{class} attributes. (Currently only the \texttt{chem} value
for class is
supported, and as far as I can see the \texttt{box} attribute is only in the
report, not in the DTD.) The super and subscripts are supported,
including the shortref maps; however, only the default right
alignment is
implemented so far. The convention described in the draft report
for using white space to distinguish superscript positioning is
fairly \emph{horrible}!
The documentation that I could find on HTML3 did not include a full
list of the entity names to be used for the symbols. This
package currently \emph{only} defines the following entities, which
should be enough for testing purposes at least.
\begin{itemize}
\item
|gt| ($>$) |lt| ($<$) (Already in the HTML2 DTD)
\item
Some Greek letters.
|alpha| ($\alpha$)
|beta| ($\beta$)
|gamma| ($\gamma$)
|Gamma| ($\Gamma$)
\item
Integral and Sum. $\int$ grows large if the \texttt{bigint} package
option is given.
|int| ($\int$)
|sum| ($\sum$)
\item
Braces (The delimiters (\,)[\,] also stretch as expected in the \texttt{box}
element)
|lbrace| ($\lbrace$)
|rbrace| ($\rbrace$)
\item
A random collection of mathematical symbols:
|times| ($\times$)
|cup| ($\cup$)
|cap| ($\cap$)
|vee| ($\vee$)
|wedge| ($\wedge$)
|infty| ($\infty$)
|oplus| ($\oplus$)
|ominus| ($\ominus$)
|otimes| ($\otimes$)
\item
A minimal set of trig functions:
|sin| ($\sin$)
|cos| ($\cos$)
|tan| ($\tan$)
\item
Also in the special context as attributes to \texttt{above} and
\texttt{below} elements the entities:
|overbrace| ($\overbrace{\quad}$)
|underbrace| (\,\smash{$\underbrace{\quad}$}\,) and any (\TeX) math accent name.
\end{itemize}
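As an illustration of how these entity names would appear in a source
document, the following untested sketch assumes the \texttt{html3}
option is in force and uses the explicit \texttt{sup} and
\texttt{sub} elements rather than the shortref forms:
\begin{verbatim}
\dohtml
<html>
<math>
 &int; &alpha;<sup>2</sup>
 &times; &beta;<sub>1</sub>
</math>
</html>
\end{verbatim}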
\section{SGML Minimisation features}
SGML (and hence HTML) supports various minimisation features that aim
to make it easier to enter the markup `by hand'. These features make
the kind of `casual' attempt at parsing SGML as implemented in this
package somewhat error prone.
Two particular features are enabled in HTML. The so called \texttt{shorttag}
feature means that the name of a tag may be omitted if it may be
inferred from the context. Typically in HTML this is used in
examples like
\begin{verbatim}
<title>A Document Title</>
\end{verbatim}
The end tag is shortened to |</>| and the system infers that
\texttt{title} is the element to be closed.
The second form of minimisation enabled in HTML is the \texttt{omittag}
feature. Here a tag may be omitted altogether in certain
circumstances.
A typical example is the HTML list, where each list item is started
with |<li>| but the closing |</li>| at the end of the item may be
omitted and inferred by the following |<li>| or |</ul>| tag.
This package is reasonably robust with respect to omitted
tags. However it only makes a half-hearted attempt at supporting the
\texttt{shorttag} feature. The \texttt{title} example above would work, but nested
elements, with multiple levels of minimised end tags will probably
break this package.
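
To make the \texttt{omittag} case concrete, the two hypothetical
lists below are intended to be equivalent. The first omits the end
tag of each item, the second writes every tag in full:
\begin{verbatim}
<ul>
<li>first item
<li>second item
</ul>

<ul>
<li>first item</li>
<li>second item</li>
</ul>
\end{verbatim}
In the first form each item is closed implicitly, either by the next
start tag of a list item or by the end tag of the list itself.
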
It would be possible to build a \LaTeX\ system that had full
knowledge of the HTML (or any other) DTD and in particular the
`content model' of every element. This would produce a more robust
parsing system but would take longer than I was prepared to
spend\ldots\ If you need a fully conforming SGML
parser, it probably makes sense to use an existing one (excellent
parsers are freely available) and then convert the output of
the parser to a form suitable for \LaTeX. In that way all such
concerns about SGML syntax features such as minimisation will have
been resolved by the time \LaTeX\ sees the document.
\section{Examples}
\subsection{A section}
This document uses the \texttt{subsection*} option.
\begin{verbatim}
<h1>HTML and LaTeX</h1>
\end{verbatim}
\dohtml
<html>
<h1>HTML and LaTeX</h1>
</html>
\subsection{An itemised list}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Latin1 Characters}
\begin{verbatim}
é ö
\end{verbatim}
\dohtml
é ö
\subsection{Images}
Currently only the \texttt{alt} attribute is supported.
\begin{verbatim}
An image of me
\end{verbatim}
\dohtml
This is an image of me
\subsection{A Form}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Styles of Mathematics}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Integrals}
Stretchy integrals with the \texttt{bigint} option.
\begin{verbatim}
\end{verbatim}
\dohtml
And the same integral with the standard integral sign.
\begingroup
\makeatletter
\let\HTML@bigint\int
\dohtml
\endgroup
\subsection{Oversized delimiters}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Roots, Overbraces etc}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Arrays}
Most of the array specification is supported. So far most of the
effort has gone into writing the HTML parser, so the column
spacing is not yet ideal, as may be seen in the following examples,
but that is (hopefully!) a small detail that can be corrected in a
later release.
\begin{verbatim}
\end{verbatim}
\dohtml
Repeat that element, but change the \texttt{array} attributes as follows:
\begin{verbatim}
\end{verbatim}
\dohtml
and finally an example of \texttt{colspec}
\begin{verbatim}
\end{verbatim}
\dohtml
\subsection{Tables}
HTML3 tables are not yet supported, but there is a minimal amount to
catch simple cases.
\def\table[#1]{\noindent\begin{minipage}\linewidth\centering}
\def\endtable{\end{minipage}}
\begin{verbatim}
\end{verbatim}
\dohtml
\section{Concluding Remarks}
Some parts of this package are still rather `rough'. In particular
some of the spacing in the mathematics examples above is not perfect.
I plan to revise the package and improve such details when (if?)
a mathematics proposal for HTML to replace the HTML3 draft is
published. Considering that it started off as an example just to show
that \TeX\ is capable of processing markup languages that do not
look like the traditional `backslash' commands, the package has proved
surprisingly capable of handling a wide variety of `real world' HTML
documents. Of the core HTML language the most noticeable feature not
yet supported is graphics inclusion. I plan to support that better in
a future release.
A more difficult conceptual problem is that it is
hard to linearise a hypertext document automatically. A typical
`document' will consist of many HTML files interconnected by links.
Currently one must invoke |\dohtml| or |\htmlinput| separately on each
of these files, and manually order them into a page order for the
typeset version. It would be nice to develop heuristics to traverse
the HTML document and build up the linear typeset version
automatically; however \TeX\ may not be the ideal language for writing
a web-crawler\ldots
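
The manual ordering described above amounts to something like the
following sketch, in which all three file names are hypothetical:
\begin{verbatim}
% Page order is chosen by hand;
% the links between the files
% are not followed.
\htmlinput{index.html}
\htmlinput{intro.html}
\htmlinput{details.html}
\end{verbatim}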
\end{Article}