\def\latextohtml{\LaTeX{2}{\tt{HTML}} }
\def\htmladdnormallink#1#2{ #1\footnote{#2}}
\def\bull{$\bullet $ }
% MYFIG
% #1 - name without the .ps extension
% #2 - caption
% Frames the picture.
\newcommand{\myfig}[2]
{\begin{figure*}
\centerline{%
\epsfig{file=#1.ps,height=.9\textheight}}
\caption{#2}
\label{fig:#1}
\end{figure*}}
\title{Text to Hypertext Conversion with \latextohtml}
\author[Nikos Drakos]{Nikos Drakos,\\ Computer Based Learning Unit,\\
University of Leeds, Leeds LS2 9JT, UK.\\
email: {\tt nikos@cbl.leeds.ac.uk}\\
www: {\tt http://cbl.leeds.ac.uk/nikos/personal.html}\\
}
\begin{article}
\begin{abstract}
\latextohtml is a conversion tool that allows existing documents
written in \LaTeX\ to become part of a global multimedia system.
This paper presents some of the reasons for using such a system and
describes the basic conversion process.
\end{abstract}
\section{World Wide Web --- A Global Multimedia System}
\begin{quotation}
Imagine a system that links all the text, data, digital sounds,
graphics and video on all the world's computers into a single
interlinked hypermedia ``web''. This is the potential of the
Internet-based
World Wide Web (WWW or W3) project \ldots \cite{levy:www}
\end{quotation}
The World Wide Web merges hypermedia techniques with networked
document retrieval to provide a global information system of linked
documents. These are traversed by ``clicking'' in textual or iconic
active areas, or searched via query mechanisms~\cite{tbl:www}. Hypertext links
may point to a different location in the same document or to another
document, which may well be located on another continent!

Documents are not limited to textual information
and may include high resolution images, audio and video samples.
WWW also encompasses most of the services currently available on the
Internet such as Usenet news, ftp, wais, archie, etc. Access
to these services as well as the invocation of arbitrary computer
programs (e.g.\ a database access or a simulation) is completely
transparent to the user who sees them all
as part of some document and interacts with them in a uniform
and intuitive way.

Multimedia documents are written in a language designed specifically
for the World Wide Web called HTML (HyperText Markup Language) which
is based on SGML (Standard Generalized Markup Language). Documents
are written by information providers who just place them on the WWW
using a ``server'' program. Then anyone with access to the Internet
can use a ``client'' or ``browser'' program to access and view
available documents. Clients and servers communicate via HTTP
(the HyperText Transfer Protocol). Apart from navigation
facilities, browsers also offer features such as full text searches,
``cut and paste'', text or audio annotations, personal ``hotlists'',
and saving and printing in multiple formats. Such browser and server programs are
freely available for most popular computer configurations.

With the explosive growth of the World Wide Web (500-fold since the
first graphical browsers were made available this year \cite{vern:www}), and a
potential audience of 15 million in more than 50 countries, providing
information via the WWW is becoming an extremely attractive proposition.
\section{\LaTeX\ to HTML Conversion: Why?}
HTML is quite a simple markup language to learn and use. It allows basic
formatting commands, bulleted lists, ``inlined'' images, and hypertext
links to other documents, multimedia sources, internet services or
computer programs. But despite (and
because of) its simplicity it has created a few headaches for
information providers:
\begin{itemize}
\item there are no intuitive authoring tools (yet);
\item yet another hypertext language has to be learned;
\item existing documents available in other formats have to be reprocessed;
\item hypertext document ``webs'' are difficult to maintain;
\item it is difficult or impossible to create highly formatted
documents in HTML.
\end{itemize}
\latextohtml can address these problems to a large degree. The authoring
problem simply disappears, existing documents can be reused immediately,
and a complex web of interlinked documents
can be generated from a single source document. The automatic
inclusion of formatted information such as tables or mathematical
equations as inlined images also bypasses another serious problem with
HTML. An additional benefit is that the paper-based version of a
document can also be obtained from the same source.

The utility of a conversion tool like \latextohtml can be seen from
the variety of contexts in which it has been applied. Some examples
are listed below.
\begin{itemize}
\item Electronic books (e.g.\ the one produced by the Computational
Science Education
Project\footnote{http://compsci.cas.vanderbilt.edu/csep.html}, which
is sponsored by the US Department of Energy; this is one of the most
complex documents currently available via the WWW).
\item General reports (e.g.\ the annual report of the Institute of
Astronomy at Cambridge\footnote{
http://cast0.ast.cam.ac.uk/sub\-$\_$dir/cambridge/annual\-$\_$report/annual$\_$report.html}).
\item User
manuals\footnote{http://cs.indiana.edu:80/elisp/w3/docs.html}.
\item System
documentation\footnote{http://archie.ac.il:8001/papers/papers.html}.
\item Scientific papers such as those on the MIT Transit
Project\footnote{http://www.ai.mit.edu/projects/transit/tn-cat.html}.
\item Electronic journals (e.g.\ Complexity
International\footnote{http://life.anu.edu.au/ci/ci.html} --- a new
Australian electronic journal).
\end{itemize}
\section{\LaTeX\ to HTML Conversion: How?}
The basic conversion process relies on the ability to distinguish
between the {\em structure}, the {\em content} and the {\em
formatting} information in a \LaTeX\ document.

On the basis of sectioning information, a document is broken into
separate parts and an iconic navigation mechanism is constructed in
HTML which reflects this structure and allows a user to ``jump''
between different parts. The cross-references, citations, footnotes,
the table of contents and the lists of figures and tables are also
translated into hypertext links. Formatting information which has
equivalent ``tags'' in HTML (lists, quotes, paragraph breaks, type
styles, etc.) is also converted appropriately.
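As an illustration, consider the small source fragment below. The
comments indicate the kind of HTML construct that each element might
give rise to. This is a sketch rather than a specification: the tags
named in the comments are assumptions, the section titles and labels
are made up, and the exact output depends on the version of the
translator and on its options.
\begin{verbatim}
\section{Method}\label{sec:method}
% A sectioning command starts a new
% part with its own navigation links.
Items can be listed:
\begin{itemize}      % HTML <UL>
\item {\em first};   % <LI>, <EM>
\item second.        % <LI>
\end{itemize}

\section{Results}\label{sec:results}
% \ref becomes a link to the part
% holding the label; \cite becomes a
% link to the bibliography entry.
See Section~\ref{sec:method}
and~\cite{tbl:www}.
\end{verbatim}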
Although in most cases the loss of some formatting information
(e.g.\ page margins or line widths) is harmless, there are occasions where
the format has meaning, e.g.\ when dealing with tables or user-defined
environments. Another problem is the reproduction of mathematical
equations, which must retain both their precise format and any
of the predefined special mathematical symbols.

The innovative solution in such cases relies on the ability of HTML
browsers to display inlined images inside the main text. Any part of a
\LaTeX\ document for which it is not obvious how it should be
translated directly into HTML is extracted from the main document and
then passed through a pipeline which converts it into an image. Each image
is then placed at the correct position in the final HTML document.
Special care is taken to preserve contextual information that may
affect the contents of each image (counter values, labels, references,
active style files, etc.). Some examples of converted documents can be
seen in Figure~\ref{fig:mosaic}.
\myfig{mosaic}{A converted document displayed using Mosaic}
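For example, neither of the two constructs below has an obvious direct
equivalent in HTML, so each would be a candidate for the image
pipeline described above. This is only a sketch: the equation and the
table contents were chosen for illustration, and which constructs are
actually rendered as images depends on the version of the translator.
\begin{verbatim}
% A numbered equation: the counter
% value and the label must survive
% the extraction from the main text.
\begin{equation}\label{eq:gauss}
\int_{-\infty}^{\infty} e^{-x^2}\,dx
  = \sqrt{\pi}
\end{equation}

% A table: the column layout carries
% meaning that would otherwise be lost.
\begin{tabular}{|l|l|}
\hline
Construct & Rendered as   \\ \hline
equation  & inlined image \\
tabular   & inlined image \\
itemize   & HTML list     \\ \hline
\end{tabular}
\end{verbatim}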
\section{Hypermedia Extensions to \LaTeX}
Apart from the obvious hypertext links within a \LaTeX\ document
(e.g.\ navigation between sections, cross-references and citations), it is
also possible to take full advantage of the HTML links to arbitrary
multimedia sources (e.g.\ audio or video), electronic forms, and other
remote documents or internet services.

This can be done with some new commands defined in a separate style
file ({\tt html.sty}) which are processed in a special way by the
\latextohtml translator. This style file defines commands for
embedding external hypertext links, for extending the basic {\tt
\verb#\#ref-\verb#\#label} mechanism to operate between remote
documents, and for specifying that some text should only appear in the
paper-based version or only in the HTML document. In most cases these
commands have no effect when processed in the conventional way.
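A document using these extensions might look like the sketch below.
The command and environment names shown are assumptions based on the
{\tt html.sty} documentation and should be checked against the
installed version. The URLs, the location of the label file and the
label name are hypothetical placeholders.
\begin{verbatim}
\documentstyle[html]{article}
\begin{document}

% External link: the text becomes an
% anchor in HTML; on paper typically
% only the text is printed.
See the \htmladdnormallink{project
page}{http://www.example.org/proj.html}.

% Cross-reference into a remote,
% separately converted document.
\externallabels
  {http://www.example.org/manual}
  {/usr/local/doc/manual/labels.pl}
Installation is described in the
manual\externalref{sec:install}.

% Conditional text.
\begin{latexonly}
This sentence appears only on paper.
\end{latexonly}
\begin{htmlonly}
This sentence appears only in HTML.
\end{htmlonly}

\end{document}
\end{verbatim}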
Another command allows the inclusion of arbitrary HTML markup
directly in a \LaTeX\ document. This can be used to take advantage
of new HTML facilities as soon as they become available (HTML is
currently evolving towards a new specification called HTML+).
A particularly
good use of this feature is in the creation of interactive
electronic forms from within a \LaTeX\ document.
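A minimal sketch of such a form is shown below, assuming that the
environment for raw HTML is called {\tt rawhtml} (this should be
checked against the installed {\tt html.sty}). The script URL and the
field name are hypothetical, and the form will only work in browsers
that support forms. When the document is processed by \LaTeX\ in the
conventional way, the contents of the environment are expected to be
ignored.
\begin{verbatim}
\begin{rawhtml}
<FORM METHOD="POST"
 ACTION="http://www.example.org/cgi-bin/comment">
Your e-mail address:
<INPUT NAME="email" SIZE=30>
<INPUT TYPE="submit" VALUE="Send">
</FORM>
\end{rawhtml}
\end{verbatim}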
\section{Concluding Remarks}
Conversion tools like \latextohtml provide an easy migration path
from familiar concepts towards authoring complex and format-rich
hypermedia documents. In this way, familiarity with a system
like \LaTeX\ makes it possible to contribute to and benefit from
a rapidly expanding global hypermedia network.
\bibliographystyle{plain}
\begin{thebibliography}{1}
\bibitem{tbl:www}
T.~Berners-Lee, R.~Cailliau, J.~Groff, and B.~Pollermann.
\newblock Worldwide web: The information universe.
\newblock {\em Electronic Networking: Research, Applications and Policy}, (1),
1992.
\bibitem{levy:www}
Joe Levy.
\newblock The world in a web.
\newblock {\em {\it The} Guardian}, page~19, November 11 1993.
\bibitem{vern:www}
Vern Paxson.
\newblock Growth trends in wide-area {TCP} connections.
\newblock {\em IEEE Network}, To Appear 1993.
\newblock Available at ftp://ftp.ee.lbl.gov/WAN-TCP-growth-trends.revised.ps.Z.
\end{thebibliography}
\appendix
\section{Further Information}
\latextohtml is written in Perl and requires freely available
software. \htmladdnormallink{More information on how to get, install
and use it is available via the
WWW}{http://cbl.leeds.ac.uk/nikos/\-tex2html/doc/latex2html/\-latex2html.html}
or using anonymous ftp from {\tt ftp.tex.ac.uk} in
{\tt pub/archive/support/latex2html}. A new release is planned for early
December 1993.

Several computers on the Internet have public access World Wide Web
clients accessible by telnet, for example:
\begin{itemize}
\item {\tt telnet info.cern.ch} (direct connection; no username or
password required);
\item {\tt telnet ukanaix.cc.ukans.edu} (``Lynx'' requires a vt100
terminal; log in as {\tt www}).
\end{itemize}
Information on the World Wide Web is also available via anonymous ftp from
{\tt ftp.germany.eu.net} in {\tt pub/infosystems/www}. The Mosaic clients are
in the directory {\tt pub/infosystems/www/ncsa/Web}.
\end{article}