4.1091 Character Sets -- ISO10646 (1/53)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Tue, 26 Feb 91 22:37:00 EST

Humanist Discussion Group, Vol. 4, No. 1091. Tuesday, 26 Feb 1991.

Date: Tue, 26 Feb 91 14:24:00 CST
From: Michael Sperberg-McQueen <U35395@UICVM>
Subject: ISO 10646 is *not* a variable-width character set

Lest there be confusion, I have to point out that Pierre Mackay appears
to misrepresent ISO 10646 in his note on it and Unicode of 24 February
1991. He says:

    ISO 10646 is a descendant of ISO 2022. It is based on "octets"
    which come in groups of variable length. That is no great
    problem for a communications standard, but is a real pain for a
    computer coding standard.

This simply isn't true.

(Warning to non-technical readers: alphabet soup ahead.)

Mackay's statement flatly contradicts the version of ISO 10646 I have in
hand (the Draft Proposal of January 1989) and runs counter to every
account I have read in the last four years of current work in ISO JTC1
WG2. ISO 10646 is not 'descended' from ISO 2022 in any way I can see --
if anything, 10646 is designed to make the heart of 2022 unnecessary
wherever 10646 has been implemented. The text says quite clearly that
ISO 10646 includes no 'floating' diacritics (which do, as Mackay says,
make life very hard for software developers and programming language
standardizers): "When conforming to this International Standard these
[i.e. all -MSM] diacritics shall be used as free-standing characters."
(Clause 18.7, p. 15).

10646 does allow for various forms of use which involve compression of
its thirty-two-bit characters to 24, 16, or 8 bits, but these have
nothing to do with the issue of floating diacritics or code-page
switching, which are the major problems for programs which rely on
having fixed-width characters. Anyone who implements 10646 in a program
and does not use a fixed width for characters has no one to blame but
the programmer.
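
To make the fixed-width point concrete, here is a minimal sketch in
Python. It assumes four octets per character and uses the modern
UTF-32 codec purely as a stand-in; this is not the encoding defined in
the 1989 Draft Proposal, and the helper name nth_char is made up for
illustration.

```python
def nth_char(buf: bytes, n: int, width: int = 4) -> str:
    """Return character n (0-based) from a fixed-width buffer.

    Assumption: every character occupies exactly `width` octets,
    encoded big-endian (modern UTF-32 is used as a stand-in).
    """
    if n < 0:
        raise IndexError("character index out of range")
    # With a fixed width, the position of character n is plain arithmetic.
    start = n * width
    chunk = buf[start:start + width]
    if len(chunk) != width:
        raise IndexError("character index out of range")
    return chunk.decode("utf-32-be")


text = "na\u00efve caf\u00e9"          # accented letters as single characters
buf = text.encode("utf-32-be")         # no byte-order mark is added

assert len(buf) == 4 * len(text)       # octet count is a fixed multiple
assert nth_char(buf, 2) == "\u00ef"    # third character found without scanning
```

The lookup never has to examine the preceding characters, which is the
property that programs relying on fixed-width characters depend on.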

Unicode's designers are well aware of the problem, and I think Mackay is
right in describing their intentions as being for fixed-width 16-bit
bytes. As it happens, though, that is not *quite* what the document now
actually defines: it does allow floating diacritics, and that means it
does allow variable-width characters. There are serious reasons for the
inclusion, which the designers have stated quite clearly in the postings
which have been distributed on this and other lists. They are the same
reasons that lead some to be unhappy with ISO 8859 and ISO 10646 and to
prefer ISO 6937: floating diacritics make for a much larger character
repertoire, and for compatibility with existing character
representations, at the cost of not knowing whether the thirtieth
character in a line is the thirtieth byte or the thirty-first, ...
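
The cost described above can be sketched in Python. This relies on the
modern Unicode data shipped with Python's unicodedata module, not on
the 1991 Unicode draft under discussion, and the function name
count_base_characters is hypothetical. Treating every code point with
a nonzero combining class as a floating diacritic is only a rough
approximation.

```python
import unicodedata


def count_base_characters(s: str) -> int:
    """Count code points that are not combining (floating) marks.

    Approximation: a code point counts as floating when its canonical
    combining class is nonzero.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "caf\u00e9"   # e-acute as one free-standing code point
floating = "cafe\u0301"     # plain e followed by a floating acute accent

# The same text to a reader, but a different number of code units.
assert len(precomposed) == 4
assert len(floating) == 5

# Only by scanning for floating marks do we learn there are four characters.
assert count_base_characters(floating) == 4

# Composing the sequence recovers one unit per character in this case.
assert unicodedata.normalize("NFC", floating) == precomposed
```

With the floating form, finding the thirtieth character means scanning
from the start of the line; with the free-standing form it is a direct
index.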

Mackay, of all people, should know better than to confuse what is
intended for the long run and what is defined in the document. Unicode
does *not* now provide a fixed-width encoding for all characters. And
ISO 10646, by contrast, does.

-Michael Sperberg-McQueen
University of Illinois at Chicago