4.1068 Codes -- Unicode v. ISO10646 (5/170)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Thu, 21 Feb 91 01:14:24 EST

Humanist Discussion Group, Vol. 4, No. 1068. Thursday, 21 Feb 1991.


(1) Date: Wed, 20 Feb 1991 21:35:29 EST (19 lines)
From: "Allen Renear, CIS, Brown Univ." <ALLEN@BROWNVM>
Subject: Character Set Lists (again)

(2) Date: Fri, 15 Feb 91 09:42:36 EST (35 lines)
From: Edwin Hart <HART@APLVM.BITNET>
Subject: Having more than one global-multibyte code
Forwarded from: ISO10646 (JHUVM)

(3) Date: Fri, 15 Feb 91 17:29:00 CET (44 lines)
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: Re: Having more than one global-multibyte code
Forwarded from: ISO10646 (JHUVM)

(4) Date: Fri, 15 Feb 91 17:45:00 CET (27 lines)
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: Re: Interworking with Unicode
Forwarded from: ISO10646 (JHUVM)

(5) Date: Mon, 18 Feb 91 15:39:00 CET (45 lines)
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: AEILS Newsletter
Forwarded from: ISO10646 (JHUVM)

(1) --------------------------------------------------------------------
Date: Wed, 20 Feb 1991 21:35:29 EST
From: "Allen Renear, CIS, Brown Univ. 401-863-7312" <ALLEN@BROWNVM>
Subject: Character Set Lists (again)


Again I am cross-posting a selection of postings on character set
issues. This batch is from the listserv list ISO10646 at JHUVM, which
is for discussing multibyte character set proposals such as ISO 10646
and Unicode. Tomorrow I will post a selection from the character set
discussion on the listserv list TEI-L at UICVM, which is for discussing
the Text Encoding Initiative Guidelines.

I believe the discussions of character set standards taking place on
these lists are very important to computing humanists -- and I think
you will find them surprisingly intellectually engaging as well.
(There are also delightful asides: most recently on Sanskrit acrostics
and diacritical marks in Serbo-Croatian crossword puzzles.)

-- Allen
(2) --------------------------------------------------------------37----
Date: Fri, 15 Feb 91 09:42:36 EST
From: Edwin Hart <HART@APLVM.BITNET>
Subject: Having more than one global-multibyte code

Gentlemen,

I made a statement to the SHARE membership to the effect that having two
multibyte codes would be a disaster for the information industry. I still
feel that way. At this point, Unicode and DIS 10646 are incompatible. To
solve this problem, we have three choices:

1. Merge Unicode and 10646 someway (probably mutually disagreeable to
both parties)
2. Support 10646 and ignore Unicode
3. Ignore 10646 and support Unicode

What appears more likely is that we will have both codes available and used.
Unicode will not disappear because certain people in ISO have ignored it.
ISO 10646 will not disappear because people coding in Unicode want to ignore
it. Because of the different philosophies behind 10646 and Unicode, conversion
between the two coding schemes will not be as "easy" as converting between
ASCII and EBCDIC where the repertoires are the same except for 3 ASCII and
3 EBCDIC characters. I said "easy" because I was the editor of a 100-page
SHARE position paper to IBM. That paper described several classes of problems
converting character data from ASCII to EBCDIC and vice versa. The paper
described problems converting between two coding schemes with the same
coding philosophy. Converting between two codes with different philosophies
AND different repertoires AND with 30,000 defined code positions may prove
impossible without losing some of the information. That is the disaster we
are facing. It may keep several consultants in business for decades.
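
[Editorial illustration, not part of the original posting. A minimal
Python sketch of the information loss described above, assuming the
cp037 code page as a stand-in for EBCDIC. The helper name round_trip and
the sample strings are hypothetical. Text that stays inside the shared
repertoire survives a round trip; a character that the target code lacks
does not.]

```python
def round_trip(text: str, via: str) -> str:
    """Encode text into the `via` code, substituting '?' for any
    character the target repertoire lacks, then decode it back."""
    return text.encode(via, errors="replace").decode(via)


ascii_text = "PRICE [1] = 5"
ebcdic_text = "PRICE = 5\u00a2"  # U+00A2 CENT SIGN: in cp037, not in ASCII

# Shared repertoire: ASCII -> EBCDIC (cp037) -> ASCII loses nothing.
ascii_ok = round_trip(ascii_text, "cp037") == ascii_text

# Repertoire mismatch: EBCDIC text pushed through ASCII loses the cent sign.
ebcdic_bytes = ebcdic_text.encode("cp037")
recovered = round_trip(ebcdic_bytes.decode("cp037"), "ascii")
lossy = recovered != ebcdic_text

print(ascii_ok, lossy, recovered)
```

[With only one unmappable character the damage is easy to spot. The
concern raised in the posting is the same effect across two large
repertoires built on different principles.]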

I am not trying to blame ISO or the Unicode Consortium for this problem. I
am concerned with finding a solution to avoid having the problem.

Ed Hart
(3) --------------------------------------------------------------46----
Date: Fri, 15 Feb 91 17:29:00 CET
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: Re: Having more than one global-multibyte code

Dear Colleagues

> I made a statement to the SHARE membership to the effect that having two
> multibyte codes would be a disaster for the information industry. I still
> feel that way.

I completely agree.

> At this point, Unicode and DIS 10646 are incompatible. To
> solve this problem, we have three choices:
>
> 1. Merge Unicode and 10646 someway (probably mutually disagreeable to
> both parties)
> 2. Support 10646 and ignore Unicode
> 3. Ignore 10646 and support Unicode

There are two aspects. First the formal one. Unicode is not now an
International Standard and never will become one as long as the ISO
procedures are not followed. This implies that it will not become a
European Standard either. Conformance to these is a requirement for any
government contract in Europe in the future. Products offering only
Unicode will be excluded from this market. Thus choice 3 is unrealistic.
Choice 1 is only possible if parties are prepared to cooperate under the
established rules as are given in ISO Directives.

> What appears more likely is that we will have both codes available and used.
> Unicode will not disappear because certain people in ISO have ignored it.
> ISO 10646 will not disappear because people coding in Unicode want to ignore
> it. Because of the different philosophies behind 10646 and Unicode, conversion

If people prefer short-term thinking it is at their own risk. But it is
an historical fact that ISO standards always win in the long run. About
1970 almost nobody used ASCII, and new non-IBM computers were built
around EBCDIC. We see the same thing happening with SNA vs OSI.

This does not mean that I consider 10646 perfect. But we should spend
all our energy to improve it, even if the eventual result would have a
strong likeness to Unicode.

Best regards, Johan van Wingen
(4) --------------------------------------------------------------29----
Date: Fri, 15 Feb 91 17:45:00 CET
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: Re: Interworking with Unicode

Dear Colleagues

> Through the discussion of "unbounded repertoire", we have learned that the
> concept of character identity is different in Unicode. The difference might
> first appear only academic and philosophical, but practical implications are
> very serious. My conclusion is that code conversion between Unicode and
> existing character sets may be impossible without good knowledge of the script
> and the writing systems in general. Here's why:

The concept of character is used in many other fields of information
technology, for example programming languages (SC22), data base systems
(SC21 WG5), text communication and networks (X400), Text and Office
Systems (SC18: ODA, SGML). Georges Clemenceau (French Prime Minister in
1918) said: War is too important a matter to be left to the generals.
In the same way I say: Characters are too important a subject to be left
to linguists. The nature of characters in programming languages has
been under study for two years, and it appeared to be difficult and
little understood. Anyhow the concept that emerges shall be adapted to
many requirements, not only to those of the linguists.

Best regards, Johan van Wingen
Liaison representative from ISO/IEC JTC1/SC2 Characters and Information
Coding to SC22, Languages (for Information Technology)
(5) --------------------------------------------------------------47----
Date: Mon, 18 Feb 91 15:39:00 CET
From: "J. W. van Wingen" <BUTPAA@HLERUL2.BITNET>
Subject: AEILS Newsletter

Last Friday I received a copy of the Feb 1991 issue of the AEILS
Newsletter. It contains a Table 1, Comparison of ISO/IEC 10646 and
Unicode. This is a nice and informative scheme, but I think it is
rather unwittingly biassed towards Unicode.

With Unicode there are two aspects, that of procedure, and that of
content.

With the first, there is no regular way to discuss it in standards
committees. We have no control over what would happen with a vote based
on a national position. To put it quite bluntly, with ISO 10646 we know
what the effect will be of a negative vote from the Netherlands. We
prefer results based on Rights over those based on Courtesy from the
designers only. In other words, if the choice is between International
Democracy and Californian Dictate, we know very well where to stand.

Now that ECMA (European Computer Manufacturers Association) is opposing
Unicode, the rest of Europe will follow. Given the strict rules for
government procurement, Unicode will not be accepted, and use of it in
files will severely restrict their portability.
This may not bother people in the US. In fact they seem to think that it
is only a question of competing proposals. This is a grave mistake. You may
start printing your own dollar notes, because the official design does
not satisfy you. And in fact, it appears that many of your friends and
relations are prepared to accept them, even prefer them to official
money. But as soon as you arrive far from home, people tell you that
your values are worthless. This is not an invented example. People who
have experienced the variety of pound notes in Scotland, and how these
are refused for a change in London, know what I mean.

With the second, there is a misleading claim by Unicode that it is a
fixed two-byte code. However, as a result of using floating diacritics,
languages using characters with many diacritics will have to code some
characters with 2 bytes and others with 4, 6, or 8, as in Vietnamese. By contrast, ISO
10646 contains a code for all the separate characters, and is thus a
true fixed byte code, which may take 2, 3 or 4 bytes, but the same fixed
number for the whole of the file. There is considerable confusion about
this with people who have not read the original documents.
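
[Editorial illustration, not part of the original posting. A minimal
Python sketch of the byte counts at issue, assuming two-byte code units
(encoded here as UTF-16 without a byte order mark) and using the
normalization forms of present-day Unicode through Python's unicodedata
module. Both postdate the 1991 drafts under discussion, so this shows the
general effect of floating diacritics and is not a reconstruction of
either proposal. The helper name utf16_len is hypothetical.]

```python
import unicodedata


def utf16_len(text: str) -> int:
    """Bytes needed when each code unit takes two bytes (UTF-16, no BOM)."""
    return len(text.encode("utf-16-be"))


word = "Vi\u1ec7t"  # "Viet" with U+1EC7, e with circumflex and dot below

# One code per letter: every letter is a single precomposed character.
composed = unicodedata.normalize("NFC", word)
composed_sizes = [utf16_len(ch) for ch in composed]

# Floating diacritics: each letter becomes a base letter plus its marks.
decomposed = unicodedata.normalize("NFD", word)
decomposed_sizes = [
    utf16_len(unicodedata.normalize("NFD", ch)) for ch in composed
]

print(composed_sizes, utf16_len(composed))
print(decomposed_sizes, utf16_len(decomposed))
```

[In the composed form every letter occupies the same number of bytes. In
the decomposed form the letter carrying two diacritics occupies three
code units, so the number of bytes per letter varies within one word.]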

Best regards, Johan van Wingen
Mail to P. O. Box 486, 2300AL LEIDEN, The Netherlands