4.1038 Unicode and the TEI (1/74)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Fri, 15 Feb 91 13:00:39 EST

Humanist Discussion Group, Vol. 4, No. 1038. Friday, 15 Feb 1991.

Date: Fri, 15 Feb 91 15:45:20 MET
From: Harry Gaylord <galiard@let.rug.nl>
Subject: Unicode

I have been asked by several people to say something about the
implications of the arrival of Unicode for TEI. Several useful comments
in general have appeared on Humanist, TEI-L, and 10646 about relevant
issues. Yet it is difficult to say anything succinctly at this point.

One thing is clear. No character set so far has tackled the problem of
encoding the language of a stretch of text. This was already pointed
out in P1 and elsewhere. It seems to me very important regardless of
which coded character set one uses.

There are advantages and disadvantages to both Unicode and ISO 10646 as
they are currently formulated. One hopes they will be merged into a
single ISO standard. There is no need for two multi-byte standards in
use across different systems, or, even worse, within a single system.

Unicode, 10646, and the 8859 family of coded character sets have
different understandings of what a character is and how it will be
used. Unicode says nothing about the imaging of texts on a screen or
printing on paper. In a Unicode file the Greek letter alpha + IOTA
SUBSCRIPT + ROUGH BREATHING MARK + GRAVE ACCENT would be coded as four
16-bit units. The software used to image this text would have to
recognize this combination of one spacing and three non-spacing
characters and put the image on your screen. In ISO 10646 and the 8859
family the approach has been to make each combination a different coded
character; this combination would therefore be a single character in
10646 -- 32 bits if one were using the full 10646 set, or possibly 16
or 8 bits if one were using one of the compression techniques.
Software rendering Unicode would also have to know that two accents
above a letter must be positioned differently than a single one.

On the other hand, some languages have so many different combinations
that it is common practice to use "floating accents", or graphic
character combination encoding. An example of this is Hebrew, which
has 23 consonants and 5 final forms; its vowels and other signs are
imaged in relation to the consonants. If one had a coded character for
each possible combination, the repertoire would be enormous. Therefore
present systems, e.g. Nota Bene SLS, encode them separately, and so do
Unicode and 10646. It is uneconomical to do otherwise.
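A short sketch makes the point concrete (again using the Hebrew code
points as eventually published, which follow the separate-encoding
approach described above):

```python
import unicodedata

# Hebrew SHIN carrying a dagesh, a shin dot, and a qamats vowel:
# one consonant plus three marks, each a separate coded character.
word = "\u05E9\u05BC\u05C1\u05B8"
assert len(word) == 4

# Each mark reports a nonzero canonical combining class, i.e. it is
# imaged in relation to the consonant rather than spacing on its own.
classes = [unicodedata.combining(ch) for ch in word]
assert classes[0] == 0            # the consonant itself
assert all(c > 0 for c in classes[1:])  # the three floating marks
```

Encoding every consonant-plus-marks combination as its own character
would multiply the repertoire enormously, which is exactly why the
floating-accent approach is used here.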

Two basic criticisms of the present 10646 proposal are the very large
number of wasted control character positions and the inadequate
provision for graphic character combination encoding. On the latter
point there is an appendix describing the way this can be done under
another ISO standard, but the appendix is not a required part of the
standard itself.

The TG on character sets is in contact with Unicode and ISO about our
concerns with their work.

We must remember that the final outcome of what is delivered is still
very uncertain. The standards have to be formulated, and then hardware
manufacturers have to be convinced of their importance and persuaded to
implement them. This all takes time. It is also important to note that
the big players have people working in the Unicode consortium and the
ISO 10646 committee.

One concern that I have is the need to represent text as it appears in
older books and manuscripts. Neither standard, as far as I can see,
includes the long s of earlier English printing, yet we need it for
many scholarly purposes. From the standpoint of both standards it
would be classified as a "presentational variant" of s and placed in a
completely different section of the character set. This is even more
true of letter shapes as they appear in manuscripts.
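As it turned out, the published Unicode tables did eventually assign
the long s its own code point (U+017F), treated much as anticipated
here: as a compatibility variant of the ordinary s, placed in a
different block. A sketch of that behaviour in a modern
implementation:

```python
import unicodedata

# LATIN SMALL LETTER LONG S sits in a different block from the
# ordinary s and carries a compatibility mapping back to it.
long_s = "\u017F"
assert unicodedata.name(long_s) == "LATIN SMALL LETTER LONG S"

# Compatibility normalization folds the variant into a plain s,
# reflecting its status as a presentational variant.
assert unicodedata.normalize("NFKC", long_s) == "s"
assert long_s.upper() == "S"
```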

There is room in each proposal for private use characters which can be
used by agreement of two or more parties. Yet the more that is included
in a standard as standard, the better off we are.

There are currently attempts to combine the work of the Unicode
consortium and the committee for 10646. Let's hope they are successful
and that the results improve on both.

Harry Gaylord