4.1011 Unicode (2/130)
Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 11 Feb 91 15:16:53 EST
Humanist Discussion Group, Vol. 4, No. 1011. Monday, 11 Feb 1991.
(1) Date: Mon, 11 Feb 91 16:12:36 GMT (10 lines)
From: DEL2@PHOENIX.CAMBRIDGE.AC.UK
Subject: Unicode
(2) Date: Wed, 6 Feb 91 15:45:50 PST (120 lines)
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: [reply to Douglas de Lacey, Re: Unicode 1.0]
(1) --------------------------------------------------------------------
Date: Mon, 11 Feb 91 16:12:36 GMT
From: DEL2@PHOENIX.CAMBRIDGE.AC.UK
Subject: Unicode
For those of you interested in the Unicode (character-encoding) debate:
for e-mail responses the deadline has been extended to 25 February. So
order your copy now from microsoft!asmusf@uunet.uu.net, and make sure
your voice is heard.
Regards, Douglas de Lacey.
(2) --------------------------------------------------------------------
Date: Wed, 6 Feb 91 15:45:50 PST
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: [reply to Douglas de Lacey, Re: Unicode 1.0]
I sent the following reply letter to Mr. de Lacey's recent
comments on Unicode 1.0. Since he posted his comments to
TEI-L, I am also forwarding my reply to that list. (Ken Whistler)
[Whistler asked that this also be posted on Humanist. -- Allen]
Mr. de Lacey,
Asmus Freytag forwarded your comments to several of us who are currently
working on the Unicode 1.0 draft. While formal resolution of commentary
will await decisions by the Unicode Technical Committee, I thought it
might prove useful to clarify a few things now. These are my own
opinions, and do not necessarily reflect the decisions of the UTC.
Many of the bizarre characteristics of the symbols area that you note
(encoding of fractions, Roman numerals, etc.) are simply the price we
have had to pay to preserve interconvertibility with other, important
and already-implemented character encodings. We fully expect that
any "smart" Unicode implementation will ignore most of the fraction
hacks, for example, and encode fractions in a uniform and productive
way. There is, in fact, a dual argument fraction operator in Unicode
(U+20DB) to support such implementations.
The coexistence of composite Latin letters (e.g. E ACUTE) with
productive composition using non-spacing diacritics is also forced
by compromises between competing requirements for mapping to old
standards and implementation needs of the various parties which
will use Unicode. While this has been (accurately) criticized as
leading to non-unique encoding--in the sense that alternative,
correct "spellings" of the "same text" can be generated--it is
my considered opinion, after long arguments with proponents of
other approaches, that uniqueness is not obtainable. In other
words, we could design a scheme which could theoretically lead
to unique encoding, but it would be unacceptable as a practical
character encoding--so we wouldn't get it anyway. Unicode
started out as you envision it--with only baseforms and non-spacing
diacritics for Latin/Greek/Cyrillic, so that all accented letters
would be composed. But that allowed for no acceptable evolutionary
path from where we are to where we would like to be. The other
approach, which tries to encode every single combination anyone
could use (i.e. ISO DIS 10646), is necessarily incomplete, in
that it refuses to acknowledge productivity in application of diacritics
(e.g. for IPA).
So Unicode is admittedly a chimera--but a practical, real chimera
that will be implemented, rather than an impractical and
unimplementable one.
You identify a problem which arises from non-uniqueness, namely:
>two encodings of an identical text may thus turn out to be very
>different; and for anyone using computer comparison of texts this could be
>quite problematic.
I would imagine this also disturbs the dreams of many who are working
on the text encoding initiative. But again, I think there is no
way to guarantee uniqueness. Furthermore, the entire notion of
"identical text" requires rigorous definition before algorithmic
comparisons by computer make any sense. Is a text on a Macintosh
comparable to the "identical text" on an IBM PC? Well, perhaps,
once considerations of several layers of hardware, software, and
text formatting, together with character set mapping, are resolved.
Such comparisons involve appropriate filters, so that canonical
forms are properly compared. All Unicode implementers I know
of are fully aware of the problem of canonical form for text
representation. (By the way, it might be fair to say that this
is an order-of-magnitude more critical problem for corporate
database implementors than it is for text analysis.)
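To make the canonical-form point concrete, here is a minimal sketch in Python. It assumes the standard `unicodedata` module, which follows the present-day Unicode standard rather than the 1.0 draft discussed in this letter. The `canonical` helper is a hypothetical name for the kind of filter described above, and the choice of NFC as the canonical form is an assumption made for illustration only.

```python
import unicodedata

# Two correct "spellings" of the same text: a composite letter versus a
# base letter followed by a non-spacing diacritic.
precomposed = "caf\u00e9"   # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
composed = "cafe\u0301"     # ends in "e" + U+0301 COMBINING ACUTE ACCENT


def canonical(text: str) -> str:
    """Hypothetical comparison filter: map text to one canonical form.

    NFC is used here as an assumption; any single agreed form would do.
    """
    return unicodedata.normalize("NFC", text)


# A naive code-by-code comparison treats the two spellings as different.
raw_equal = precomposed == composed

# Comparing through the filter treats them as the same text.
filtered_equal = canonical(precomposed) == canonical(composed)

print(raw_equal, filtered_equal)
```

The raw comparison fails because the two strings differ in length and content, while the filtered comparison succeeds. Which canonical form to settle on is a design decision for the application doing the comparing.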
Another thing to keep distinct in understanding Unicode is that
not everything which can appear on a page can be encoded in Unicode
plain text. Changes of font, changes of language, or metatextual
references to a particular glyph:
>"There are three possible forms of LATIN SMALL LETTER G CEDILLA (U+0123)
>and they look like ..."
require a higher level of text structure than simply a succession
of characters one after another. Unicode is definitely not going
to be defining a bunch of ESCAPE code sequences to be embedded into
text with particular semantics such as "change font to...". Modern
text editing, analyzing, and rendering software deals with such things
by means of distinctions on a "plane above" the text itself. The plain answer
to the question, "could the whole of the manual as printed be
sensibly encoded in Unicode?", is clearly no, since it requires
a layer of formatting and distinguishes multiple fonts.
The particular case of the GREEK SMALL LETTER SCRIPT THETA is just
baggage dragged along from mistakes made in earlier encodings (thus
also the other admitted glyphs encoded separately in the Greek block).
There is a scheme for indicating preferential rendering (where possible)
using ligatures (such as Greek "kai"). The ZERO WIDTH JOINER (U+200D)
and ZERO WIDTH NON-JOINER (U+200C) can be used as rendering hints
for ligatures, as well as serving as an important part of the
proper implementation of cursive scripts such as Arabic.
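As a rough illustration of how such hints sit in plain text, the following Python sketch inserts a ZERO WIDTH NON-JOINER between two letters and then strips the hints again for comparison. It assumes the standard `unicodedata` module, which reflects the present-day standard. The "fi" example and the `strip_join_hints` helper are hypothetical, and whether a renderer honours the hint depends entirely on the font and rendering software.

```python
import unicodedata

ZWJ = "\u200d"    # ZERO WIDTH JOINER
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER

# The same two letters, without and with a hint asking the renderer
# not to form a ligature (illustrative example only).
plain = "fi"
hinted = "f" + ZWNJ + "i"


def strip_join_hints(text: str) -> str:
    """Hypothetical helper: drop joiner hints before comparing texts."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# Both hints are invisible format characters (general category "Cf").
categories = (unicodedata.category(ZWJ), unicodedata.category(ZWNJ))

print(unicodedata.name(ZWJ), unicodedata.name(ZWNJ), categories)
print(plain == hinted, plain == strip_join_hints(hinted))
```

The hinted string is one character longer than the plain one, so a comparison tool that ignores rendering hints would need a step like `strip_join_hints`. In a cursive script such as Arabic the joiners can affect letter shaping, so stripping them is not always harmless there.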
I don't think there is a LATIN CAPITAL LETTER WYNN to be found. This
is a good case for following the "How to Request Adding a Character
to Unicode" guidelines. If you can provide clear textual evidence that
wynn appears in regular use with a case distinction, then a capital
form would be a good candidate for addition.
The Greek semicolon was unified with MIDDLE DOT (U+00B7).
The diacritic ordering algorithm (centre-out) is meant to apply
independently to diacritics on top and to diacritics on the bottom.
The issue of how to specify unambiguously side-by-side ordering
within diacritics at the same vertical level is a good one, and I think
it will have to be addressed in the final draft.
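The independence of top and bottom diacritics can be sketched in Python. This assumes the standard `unicodedata` module and the canonical combining classes of the present-day standard, which postdate the draft discussed here, so it shows how the idea was later realised rather than what the 1.0 draft specifies. The `nfd` helper is a hypothetical convenience name.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW (attaches below the base)
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT (attaches above the base)
DIAERESIS = "\u0308"   # COMBINING DIAERESIS (attaches above the base)


def nfd(text: str) -> str:
    """Hypothetical helper: canonical decomposition with mark reordering."""
    return unicodedata.normalize("NFD", text)


# One mark above and one below: the typed order carries no information,
# so both spellings reduce to the same canonical sequence.
above_then_below = "a" + ACUTE + DOT_BELOW
below_then_above = "a" + DOT_BELOW + ACUTE
different_levels_equal = nfd(above_then_below) == nfd(below_then_above)

# Two marks on the same side: the order is significant (centre-out),
# so the two sequences stay distinct.
acute_first = "a" + ACUTE + DIAERESIS
diaeresis_first = "a" + DIAERESIS + ACUTE
same_level_equal = nfd(acute_first) == nfd(diaeresis_first)

print(different_levels_equal, same_level_equal)
```

Marks with different combining classes are sorted into a fixed order, while marks that share a class keep the order in which they were entered. Side-by-side placement at one vertical level is not modelled by this sketch.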
I hope these clarifications are helpful.
--Ken Whistler