4.1070 Responses to Humanists on Unicode (2/287)
Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Thu, 21 Feb 91 17:21:25 EST
Humanist Discussion Group, Vol. 4, No. 1070. Thursday, 21 Feb 1991.
(1) Date: Tue, 19 Feb 91 19:15:04 PST (117 lines)
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: Re: CHAR ENCODING AND TEXT PROCESSING
Forwarded from TEI-L (UICVM)
(2) Date: Tue, 19 Feb 91 20:52:12 PST (170 lines)
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: Re: Unicode
Forwarded from TEI-L (UICVM)
(1) --------------------------------------------------------------------
Date: Tue, 19 Feb 91 19:15:04 PST
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: Re: CHAR ENCODING AND TEXT PROCESSING
Dear Mr. Cover,
I would like to respond to your recent note, and the implications of
the abstract you have made from Gary Simons's article. (In this I
am speaking personally, and my opinions do not necessarily
represent those of the Unicode Technical Committee.)
First of all, I want to make it clear that Unicode is not, nor does
it purport to be, a text description language. It is a character
encoding. We need to code the LATIN CAPITAL LETTER A and the
ARABIC LETTER ALEF and the DEVANAGARI LETTER A in order for any
text to be encoded, and for any textual process to be programmed
to operate on that text. However, assigning 16-bit values to
those characters (0041, 0627, and 0905, respectively) does not,
ipso facto, specify whether the LATIN CAPITAL LETTER A is being
used in an English, Czech, or Rarotongan text, or the ARABIC
LETTER ALEF in Arabic, Sindhi, or Malay, or the DEVANAGARI LETTER A
in Hindi or Nepali. Trying to mix the character encoding with
specification of textual language is guaranteed to mess up the
character encoding; the appropriate place to handle this is at
a metalevel of text/document description above the level of
the character encoding.
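To put this in concrete terms, here is a minimal sketch in modern
Python (the unicodedata module is today's tooling, long after the
fact, and the tagging structure at the end is purely hypothetical
markup): the code points identify characters and nothing more.

  import unicodedata

  # Each code point names a character, not a language:
  # U+0041 LATIN CAPITAL LETTER A, U+0627 ARABIC LETTER ALEF,
  # U+0905 DEVANAGARI LETTER A.
  for ch in ("\u0041", "\u0627", "\u0905"):
      print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))

  # The language of the text has to live at a higher level of
  # description; this tagging structure is purely illustrative.
  tagged = {"lang": "cs", "text": "Ahoj"}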
On the other hand, the bidirectional text problem is specifiable
independent of any particular language--or even script, for that
matter, since the generic problem is the same for Hebrew as it
is for Arabic (scripts). The fundamental reason why Unicode is
going to great lengths to include a bidirectional plain text model
is that without an explicit statement of how to do this, the
content of texts which contain both left-to-right and right-to-left
scripts mixed can be compromised or corrupted when such texts
are interchanged. If we do not come down squarely in favor of
an implicit model (or an explicit model with direction-changing
controls, or a visual order model), then bidirectional Unitext will
regularly get scrambled, and no one will know how to interpret a
number embedded in bidi text, etc., etc.
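As a small illustration of what an implicit model leans on, here is
a modern Python sketch (the categories shown come from today's
character property tables, not from anything fixed at the time):
each character carries its own directional property, so a reordering
algorithm needs no embedded controls to interpret mixed text.

  import unicodedata

  # Latin letters, Hebrew letters, and digits in one string; each
  # character reports a bidirectional category ('L' left-to-right,
  # 'R' right-to-left, 'EN' European number, 'WS' whitespace).
  mixed = "abc \u05D0\u05D1\u05D2 123"
  for ch in mixed:
      print("U+%04X  %s" % (ord(ch), unicodedata.bidirectional(ch)))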
Regarding form/function distinctions, I think you are preaching to
the converted. I do not think you will be able to find another
multilingual character encoding of this scope which has been
developed with such a meticulous attention to the distinctions you
mention:
character vs. glyph (i.e. "graph" as you quote Simons)
We have been educating people about this for years. Granted, there
are glyphs encoded as characters in Unicode, too, but the main
reason they got there is that Unicode has to be interconvertible
to a lot of other "character" standards which couldn't distinguish
the two. And why does Unicode have to be interconvertible? A) Because
that is the only way to get it accepted and move into the future,
and B) Because that serves the purpose of creating better software
to handle text processing requirements for preexisting data.
glyph vs. image
Also clearly distinguished in our discussions. I think
Unicoders are supportive of the concept of proceeding to develop
a definitive registry of glyphs. This would be most helpful to
font foundries and font vendors, but also would help the software
makers in performing the correct operations to map characters
(in particular language and script contexts) into glyphs for
rendering as images. But registry of glyphs is a different task
from encoding of characters. For one thing, the universe of glyphs is
much larger than the universe of characters. Unicode 1.0 is aimed
at completing the character encoding as expeditiously and
correctly as possible, rather than at taking on the larger glyph registry
problem.
language vs. script
Also clearly distinguished. Unicode characters, taken by blocks,
can be assigned to scripts. Hence the characters from 0980 to
09F9 are all part of the Bengali script. But no one is confusing
that with the fact that some subset of those is used in writing
the Bengali language and another subset in writing Assamese. (A
small sketch of such a block test follows these distinctions.)
script vs. writing system
Again, I think you will find us sympathetic and not unaware of the
distinctions involved. For example, most of us have worked on
or are currently working on implementations of the Japanese writing system for
one product or another on computer. Anyone with a smattering of
knowledge of Japanese knows that the writing system is a complicated
mix of two syllabaries, Han characters (kanji), and an adapted
form of European scripts which can be rendered either horizontally
or rotated for vertical rendering. It is a complicated writing
system which is difficult to implement properly on computer--but
that is a separate issue from how to encode the characters.
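To return to the Bengali example above with a concrete sketch (the
helper below is a hypothetical illustration in modern Python, not
anything taken from the standard): a block-range test identifies the
script of a character and says nothing about the language using it.

  def in_bengali_block(ch):
      # The block U+0980..U+09F9 belongs to the Bengali script; whether
      # a given text is Bengali or Assamese is not recoverable from
      # the code point itself.
      return 0x0980 <= ord(ch) <= 0x09F9

  print(in_bengali_block("\u0985"))   # True  (BENGALI LETTER A)
  print(in_bengali_block("A"))        # False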
You quote Gary Simons as stating that: "We need computers,
operating systems, and programs that can potentially work in any
language and can simultaneously work with many languages at the
same time." I can guarantee you that this is the passionate
concern of those who have been working on Unicode for the last
two years. It is precisely because the character encoding
alternatives (ISO 2022, ISO DIS 10646, various incomplete
corporate multilingual sets, and font-based encodings which
confuse characters and font-glyphs) are so dismal that we have
worked so hard to design a multilingual character set with the
correct attributes for support of multilingual operating
systems, multilingual applications, multilingual text interchange
and email, multilingual displays and printers, multilingual
input schemes, and yes, multilingual text processing.
Don't expect the holy grail by Tuesday, but if we really think
all those things are worth aiming for, it is vitally important
that those who build the operating systems, the networks,
the low-level software components, and the high-level applications
reach a reasonably firm consensus about the character encoding
now.
--Ken Whistler
(2) --------------------------------------------------------------------
Date: Tue, 19 Feb 91 20:52:12 PST
From: Ken Whistler <whistler@ZARASUN.METAPHOR.COM>
Subject: Re: Unicode
Dear Mr. Reuter,
I addressed some of your concerns in my reply to Robin Cover, but I
would like to respond to a few of the specific points which you have
raised. (Disclaimer: These are personal opinions, and do not
necessarily reflect the position of the Unicode Technical Committee.)
Regarding your point a., that Unicode seems biased towards display
rather than other forms of data processing: first of all, you
must understand that Unicode has been visited with the sins of
our fathers. The medial and final sigma are already distinguished
in the Greek standard. We cannot unify them without Hellenic
catastrophe. (In fact the Classicists inform us that there are
good reasons why we must introduce a third sigma, the "lunate
sigma", in order to have a correct and complete encoding.) Nobody
likes the Roman numerals, or the parenthesized letters, or the
squared Roman abbreviations, ... The general reaction has been
Sheesh! But important Chinese, Japanese, and Korean standards which
have to be interconvertible with Unicode have already encoded
such stuff, and we are stuck with it. Why? Because the design
goal of a perfect, de novo, consistent, and principled character
encoding is unattainable (believe me, we tried), and because the
higher goal of attaining a usable, implementable, and
well-engineered character encoding in a finite time is greatly
furthered by including as much as possible of the preexisting
character encoding standards.
You also noted that the semantic overlaps are very acute in the
mathematical symbol area. Nobody can tell us how many distinct
semantic usages there are for "tilde", for example. Should we
encode 1, 3, 7, or 16 of them?? We made what I think is the best
compromise we could under the circumstances. The TILDE OPERATOR
is encoded as a math operator (distinct from accents, whether
spacing or non-spacing), but no further attempt is made to separate
all the possible semantics applicable. Note that if we started
trying to distinguish "difference" from "varies with" from "similar"
from "negation", etc., we would be forcing applications (and users)
to encode the correct semantic--even when they don't know or
can't distinguish them. This has the potential for being
WORSE for text processing, rather than better. Over-differentiation
in encoding is just as bad as under-differentiation.
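For concreteness, the compromise looks like this in code point terms
(a modern Python sketch; the character names are today's, and the
exact assignments were still settling at the time): one tilde as
ordinary punctuation, one as a non-spacing accent, one as a math
operator, and no further semantic splitting of the operator.

  import unicodedata

  # ASCII tilde, combining (non-spacing) tilde, and the math operator;
  # finer senses ("similar", "varies with", "negation", ...) are
  # deliberately not separated into distinct characters.
  for cp in (0x007E, 0x0303, 0x223C):
      print("U+%04X  %s" % (cp, unicodedata.name(chr(cp))))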
I don't understand your concern about not distinguishing hacek and
superscript v. Unicode does not encode superscript v at all.
Except for those superscripts grandfathered in from other standards
(remember the sins of our [grand]fathers), superscript variants
of letters are considered rendering forms outside the scope of
Unicode altogether. If someone uses a font which has hacek
rendered in a form which looks like a superscript v, that is a
separate issue. From a Unicode point of view that would simply
be mapping the character HACEK onto the glyph {LATIN SMALL V} in
some particular typeface for rendering above some other glyph.
A font vendor could do that. It might even be the correct thing
to do, for example, in building a paleographic font for manuscript
typesetting.
Regarding your point b., concerning the layout of Unicode: first
of all, I am sensitive about your using the term "code page" in
referring to the Unicode charts. "Code page" is properly
applied to 8-bit (or to some double 8-bit) encodings which can
be "swapped in" or "swapped out" to change the interpretation of
a particular numeric value as a character. Unicode values are
fixed, unambiguous, and unswappable for anything else. The charts
are simply a convenient packaging unit for human visual consumption
and education. The fact that we tried to align new scripts with
high byte boundaries resulted from the implementation requirement
that software have easy and quick tests for script identity.
The subordering within script blocks does attempt to follow
existing standards, where feasible. We tried the alternative of
simply enumerating all the characters in a script and then packing
them in next to each other in what would pass for the "best"
alphabetic order, but that introduces other problems AND makes
the relevant "owners" of that script gag at the introduction of
a layout unfamiliar to them. In the end, all such processes (case
folding, sorting, parsing, rendering, etc.) depend on table lookup
of attributes and properties. There is no hard-coded
shortcut which will always work--even for 7-bit ASCII. The
compromise which pleased the most competing interests (and which,
by the way, got us to a conclusion on this issue) was to follow
national standards orders as applicable. You might note that the
one REALLY BIG case where we have to depart from this is in
unifying 18,000+ Han characters. The only way to do this is to
depart from ALL of the Asian standards--so nobody can convert
from a Chinese, Japanese, or Korean standard to Unicode by a
fixed offset! Believe me, that has occasioned much more grumbling
(to put it mildly) than any ordering issue for Greek or Cyrillic!
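A tiny illustration of why lookup tables, not arithmetic shortcuts,
are the rule (a modern Python sketch; the "add 0x20" trick is
exactly the kind of hard-coded shortcut meant here):

  # The classic ASCII shortcut "lowercase = uppercase + 0x20" already
  # fails outside A-Z, and case mapping elsewhere is table-driven:
  print(chr(ord("@") + 0x20))     # '`'  -- not a letter at all
  print("\u00DF".upper())         # 'SS' -- one character maps to two
  print(len("\u0130".lower()))    # 2    -- dotted capital I becomes 'i' plus a combining dot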
Concerning your point c., about diacritics being specified as
following a baseform rather than preceding it: Clearly we had
to come down on one side or the other. Not specifying it would
be disastrous. So we made a choice. Granted, having diacritics
follow rather than precede baseforms favors rendering algorithms
over parsing algorithms. To have made the opposite choice would
have reversed the polarity of benefits and costs. It is a tradeoff with no
absolutely right answer. Nevertheless, I think the choice made was
the correct one.
First, the rendering involved is not really as you have characterized
it. "Non-spacing" diacritics are NOT backspacing. Such terminology
is more properly applied to spacing diacritics (such as those coded in
ISO 8859-1 or ISO DIS 10646), which for proper rendering require
sending a BACKSPACE control code between a baseform and an accent.
That's the way composite characters used to be printed on daisy-wheel
printers, for example. But that is a defective
rendering model which ignores the complex typographical relationship
between baseforms and diacritics. The kind of rendering model we
are talking about involves "smart" fonts with kerning pair tables.
The "printhead" is not trundled back so that an accent can be
overstruck; instead, a diacritic "draws itself" appropriately, in
whatever medium, on a baseform in context. The technology for
doing this is fairly well understood but quite complex. I think
it would be fair to say that if I were writing a text processing
program (and I have), I would rather have system support for such
rendering and deal with the look-ahead problem than have to deal
with font rendering problems in my program.
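The ordering convention itself is easy to show with today's tooling
(a minimal Python sketch; normalization forms as such came later
than this discussion): the non-spacing mark is coded after its
baseform, and the pair is treated as a single unit.

  import unicodedata

  # Baseform first, then the non-spacing diacritic that follows it:
  decomposed  = "e\u0301"      # 'e' + COMBINING ACUTE ACCENT
  precomposed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE

  print(len(decomposed), len(precomposed))                        # 2 1
  print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True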
Second, the "state" that has to be maintained in parsing diacritics
is quite different from the "state" that Unicode claims to eliminate.
Parse states have to be maintained for all kinds of things. If I
am parsing Unicode which uses non-spacing diacritics, then I have
to maintain a parse state to identify text elements; but even parsing
for word boundaries, for example (an elementary operation in editing)
has to maintain state to find boundaries which may depend on
combinations of punctuation, or on ambiguous interpretation of some
characters which can only be disambiguated in context, etc., etc.
More complicated parsing often maintains elaborate parse trees
with multiple states. The "statefulness" that Unicode is trying to
eliminate is a state in which the interpretation of the bit
pattern for a character changes, depending on which state you are
in. This is the "code page sickness", where one time the value 94 means
"o-umlaut", and next time it means "i-circumflex", and next time it
means "partial differential symbol", depending on what code page you
are using, and what code page shift state you happen to be in.
The two-byte encodings currently in use are horrible in this respect,
since they may mix single-byte and two-byte interpretations in ways
that make it very difficult to figure out what a particular byte at
some random location is supposed to represent. You have to
find an anchor position from which you can parse sequentially,
maintaining state, until you get to the byte in question to find out
what it means. Unicode eliminates THAT kind of state maintenance.
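A sketch of the contrast, using Shift JIS as a stand-in for the mixed
single/double-byte encodings described (modern Python, and the
fixed-width form shown is UTF-32 rather than the 16-bit form under
discussion, but the point is the same): the byte 0x41 means "A" on
its own and something else entirely after a lead byte.

  # In a mixed single-/double-byte encoding the meaning of a byte
  # depends on what came before it:
  print(b"\x41".decode("shift_jis"))      # 'A'
  print(b"\x83\x41".decode("shift_jis"))  # katakana 'a' -- 0x41 is now a trail byte

  # In a fixed-width Unicode form every code unit stands on its own:
  data  = "A\u30A2".encode("utf-32-le")
  units = [int.from_bytes(data[i:i+4], "little") for i in range(0, len(data), 4)]
  print([hex(u) for u in units])          # ['0x41', '0x30a2']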
I find myself agreeing with your statement that "there seem to be
conflicts of interest between different applications, which
*necessarily* lead to ambiguities or difficulties for someone."
The way I would put it, following a distinction made elegantly
by Joe Becker, is that there is no way that any encoding of
CODE ELEMENTS (i.e. the "characters" assigned numbers in Unicode)
will automatically result in one-to-one mappability to all the
TEXT ELEMENTS which might ever be of interest to anyone or have
to be processed as units for one application or another. Your
mention of "ch" as a collation unit for Spanish is one obvious
example. Fixing the CODE ELEMENTS of Unicode should not preclude
efforts to identify appropriate TEXT ELEMENTS for various
processes. Such TEXT ELEMENTS will have to be identified as to
their appropriate domain of application--and that does include
language as well as other factors. But it is not the job of
the character encoding to do that work. The character encoding
should be designed so as not to impede TEXT ELEMENT identification
and processing--for example, it would be crazy to refuse to encode
LATIN SMALL LETTER I because it could be composed of a dotless-i
baseform and a non-spacing dot over! But character encoding cannot
BE the TEXT ELEMENT encoding, however much we might desire
a simpler world to work with.
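A toy sketch of the TEXT ELEMENT point (modern Python; the key
function is entirely hypothetical and not a real locale
implementation): a tailored collation can treat "ch" as a single
unit without the character encoding knowing anything about it.

  def traditional_spanish_key(word):
      # Treat "ch" as one collation element ordered between "c" and "d";
      # purely illustrative, lowercase ASCII-plus-"ch" only.
      alphabet = "abc\u0001defghijklmnopqrstuvwxyz"   # \u0001 marks the slot for "ch"
      key, i = [], 0
      while i < len(word):
          if word[i:i + 2] == "ch":
              key.append(alphabet.index("\u0001"))
              i += 2
          else:
              key.append(alphabet.index(word[i]))
              i += 1
      return key

  words = ["cosa", "chico", "dama"]
  print(sorted(words))                               # ['chico', 'cosa', 'dama']  (raw code point order)
  print(sorted(words, key=traditional_spanish_key))  # ['cosa', 'chico', 'dama']  ("ch" sorts as one letter)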
--Ken Whistler