@* The character set.
One of the main goals in the design of \.{WEB} has been to make it readily
portable between a wide variety of computers. Yet \.{WEB} by its very
nature must use a greater variety of characters than most computer
programs deal with, and character encoding is one of the areas in which
existing machines differ most widely from each other.
To resolve this problem, all input to \.{WEAVE} and \.{TANGLE} is converted
to an internal seven-bit code that is essentially standard ASCII, the
``American Standard Code for Information Interchange.'' The conversion
is done immediately when each character is read in. Conversely,
characters are converted from ASCII to the user's external
representation just before they are output.
Such an internal code can be accessed by users of \.{WEB} by means of
constructions like \.{@@'A'}, which should be distinguished from
\.{'A'}. The former is transformed by
\.{TANGLE} into an integer that is the internal code of \.A, but
the latter, a |char| constant, is not touched by
\.{WEB}, and will be interpreted by the \cee\ compiler according to
the machine's character set. (Actually, of course, it gets translated
into \.{WEB}'s internal code just like any other character in the
input file, but then it gets translated back at output time.)
@^ASCII code@>
Here is a table of the standard visible ASCII codes (\.{ } stands for
a blank space):
$$\def\:{\char\count255\global\advance\count255 by 1}
\count255='40
\vbox{
\hbox{\hbox to 40pt{\it\hfill0\/\hfill}%
\hbox to 40pt{\it\hfill1\/\hfill}%
\hbox to 40pt{\it\hfill2\/\hfill}%
\hbox to 40pt{\it\hfill3\/\hfill}%
\hbox to 40pt{\it\hfill4\/\hfill}%
\hbox to 40pt{\it\hfill5\/\hfill}%
\hbox to 40pt{\it\hfill6\/\hfill}%
\hbox to 40pt{\it\hfill7\/\hfill}}
\vskip 4pt
\hrule
\def\^{\vrule height 10.5pt depth 4.5pt}
\halign{\hbox to 0pt{\hskip -24pt\O{#0}\hfill}&\^
\hbox to 40pt{\tt\hfill#\hfill\^}&
&\hbox to 40pt{\tt\hfill#\hfill\^}\cr
04&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
05&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
06&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
07&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
10&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
11&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
12&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
13&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
14&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
15&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
16&\:&\:&\:&\:&\:&\:&\:&\:\cr\noalign{\hrule}
17&\:&\:&\:&\:&\:&\:&\:\cr}
\hrule width 280pt}$$
We introduce new types to distinguish between the transliterated characters
and the characters in the outside world. Let all
``interesting'' values that a |char| variable may take lie between
|first_text_char| and |last_text_char|; for the ASCII code we can
take |first_text_char=0| and |last_text_char=0177|. We will tell \.{WEB}
to convert all input characters in this range to its own code, and balk at
characters outside the range. We make two assumptions:
|first_text_char>=0| and |char| has room for at least eight bits.
@^system dependencies@>
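The range test just described can be sketched in plain \cee\ as follows.
This is only an illustration under stated assumptions: the helper names
|is_text_char| and |convert_or_balk| are hypothetical and do not occur in
\.{WEB} itself, and returning |-1| is merely one possible way to balk.

```c
#include <assert.h>

#define first_text_char 0    /* lowest interesting value of a char */
#define last_text_char 0177  /* highest interesting value of a char */

/* Hypothetical helper: does c lie in the range that gets converted? */
static int is_text_char(int c)
{
  return c >= first_text_char && c <= last_text_char;
}

/* Hypothetical helper: look c up in a conversion table that has
   last_text_char+1 entries, or return -1 to balk at a character
   outside the range. */
static int convert_or_balk(const char *table, int c)
{
  if (!is_text_char(c)) return -1;
  return table[c];
}

/* Small self-check of the two helpers above. */
static void range_demo(void)
{
  static char table[last_text_char + 1];
  table['A'] = 0101;                      /* pretend conversion entry */
  assert(convert_or_balk(table, 'A') == 0101);
  assert(convert_or_balk(table, 0200) == -1);
  assert(convert_or_balk(table, -1) == -1);
}
```

The caller passes the character as an |int| so that values outside the
range, including negative ones, can be rejected before any table lookup.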
@d first_text_char = 0 /* lowest interesting value of a |char| */
@d last_text_char = 0177 /* highest interesting value of a |char| */
@=
typedef char ASCII; /* type of characters inside \.{WEB} */
typedef char outer_char; /* type of characters outside \.{WEB} */
@ The \.{WEAVE} and \.{TANGLE} processors convert between ASCII code and
the user's external character set by means of arrays |xord| and |xchr|
that are analogous to PASCAL's |ord| and |chr| functions.
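A minimal sketch of how such tables might be used on input and output is
given here. It assumes, for the sake of illustration only, that the
external character set is itself ASCII, so that both tables are the
identity; the routine names |init_identity_tables|, |to_internal|,
|to_external| and |round_trip_ok| are hypothetical, as is the choice of
mapping out-of-range input to a blank.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char ASCII;       /* type of characters inside */
typedef char outer_char;  /* type of characters outside */

static ASCII xord[0200];       /* external to internal */
static outer_char xchr[0200];  /* internal to external */

/* Assumption for this sketch only: the outside world uses ASCII too. */
static void init_identity_tables(void)
{
  int i;
  for (i = 0; i < 0200; i++) {
    xord[i] = (ASCII)i;
    xchr[i] = (outer_char)i;
  }
}

/* Convert n external characters as they are read in; anything
   beyond 0177 becomes a blank (040) in this sketch. */
static void to_internal(const outer_char *in, ASCII *buf, size_t n)
{
  size_t k;
  for (k = 0; k < n; k++) {
    unsigned char c = (unsigned char)in[k];
    buf[k] = (c <= 0177) ? xord[c] : (ASCII)040;
  }
}

/* Convert n internal codes back just before they are output. */
static void to_external(const ASCII *buf, outer_char *out, size_t n)
{
  size_t k;
  for (k = 0; k < n; k++) out[k] = xchr[(unsigned char)buf[k] & 0177];
}

/* Returns 1 if s survives conversion in both directions unchanged. */
static int round_trip_ok(const char *s)
{
  ASCII buf[64];
  outer_char out[64];
  size_t n = strlen(s);
  if (n > sizeof buf) return 0;
  init_identity_tables();
  to_internal(s, buf, n);
  to_external(buf, out, n);
  return memcmp(s, out, n) == 0;
}
```

On a machine with a different external code only the contents of the two
tables would change; the conversion loops would stay the same.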
@=
ASCII xord[last_text_char+1]; /* specifies conversion of input characters */
outer_char xchr[0200]; /* specifies conversion of output characters */
@ Every system supporting \cee\ must be able to read and write the 95
visible characters of standard ASCII above (although not necessarily using the
ASCII codes to represent them). Conversely, these characters, plus
the newline, are sufficient to write any \cee\ program. Other
characters are desirable mainly in strings, and they can be referred
to by means of escape sequences like \.{'\t'}.
The basic implementation of \.{WEB}, then, only has to assign an
|xord| to these 95 characters (newlines are swallowed by the reading
routines). The easiest way to do this is to assign the characters to
their positions in |xchr| and then invert the correspondence:
@c
common_init()
{
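strcpy(xchr+040," !\"#$%&'()*+,-./0123456789\
:;<=>?@@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~");
/* the visible characters go to positions |040| through |0176| */
xchr[0177]=' '; /* code |0177| is not visible; it is output as a blank */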
@;
@;
}
@ The following system-independent code makes the |xord| array contain
a suitable inverse to the information in |xchr|.
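The inversion can be tried out in isolation with a small stand-alone
sketch such as the following. It assumes that |xchr| holds blanks in the
positions below |040| and the 95 visible characters from |040| to |0176|;
the routines |init_xchr|, |invert_xchr| and |inverse_ok| are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define first_text_char 0
#define last_text_char 0177

static char xchr[0200];                /* internal to external */
static char xord[last_text_char + 1];  /* external to internal */

/* The 95 visible ASCII characters, blank through tilde. */
static const char visible[] =
  " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  "[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";

/* Assumed layout: blanks everywhere, visible characters at 040..0176. */
static void init_xchr(void)
{
  assert(strlen(visible) == 95);
  memset(xchr, ' ', sizeof xchr);
  memcpy(xchr + 040, visible, 95);
}

/* Make xord a suitable inverse of xchr. */
static void invert_xchr(void)
{
  int i;
  for (i = first_text_char; i <= last_text_char; i++) xord[i] = '\040';
  for (i = 1; i < 0177; i++) xord[(unsigned char)xchr[i]] = (char)i;
}

/* Returns 1 if every visible code maps to itself via xchr, then xord. */
static int inverse_ok(void)
{
  int i;
  for (i = 040; i <= 0176; i++)
    if (xord[(unsigned char)xchr[i]] != i) return 0;
  return 1;
}
```

Because the second loop runs upward, the blank ends up with code |040|:
that is the last position below |0177| whose |xchr| entry is a blank.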
@= {
int i; /* to invert the correspondence */
for (i=first_text_char; i<=last_text_char; i++) xord[i]='\040';
for (i=1; i<0177; i++) xord[xchr[i]]=i;
}
@ Some \cee\ compilers accept an extended character
set, so that one can type things like \.^^Z\ instead of \.{!=}.
If that's the case in your system, you should change this module,
assigning positions |01| to |037| in the most convenient way;
for example, at MIT you can just say
$$\hbox{|for (i=1; i<=037; i++) xchr[i]=i;|}$$
since \.{WEB}'s character set is essentially identical to MIT's,
even with respect to characters less than |040| (see the definitions
below). If, however, your changes do not conform to these
definitions, you should change the definitions as well.
@^system dependencies@>
@^notes to myself@>
@= /* nothing needs to be done */
@
@d text_char = char /* the data type of characters in text files */
@=
typedef char ascii_code; /* ascii codes from 0 to 127 */
typedef FILE *text_file;
@ One of the \ASCII{} codes below 040 has been given a
symbolic name in \.{TIE} because it is used with a special
meaning.
@d tab_mark = '\t' /* \ASCII{} code used as tab-skip */
@ When we initialize the |xord| array and the remaining
parts of |xchr|, it will be convenient to make use of an
index variable, |i|.