3.681 standards for character sets (92)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Wed, 1 Nov 89 20:51:21 EST

Humanist Discussion Group, Vol. 3, No. 681. Wednesday, 1 Nov 1989.

Date: 31 October 1989 17:37:31 CST
From: "Michael Sperberg-McQueen 312 996-2477 -2981" <U35395@UICVM>
Subject: character set standards, ISO, SGML

Joe Giampapa asks about character sets better than ASCII, and urges
someone to take the problem on. Since character sets have great
symbolic value for humanists (who wants to look at text in
transliteration, or strewn with garbage, when it could be seen in its
correct form?), it's probably worth mentioning what exists, so that
Humanists can support the standards, or lobby their support staffs
to do so.

First, there are the vendor-dependent eight-bit character sets (that of
the IBM PC, the related one of the IBM PS/2, the system set of the Mac,
and the various character sets created for the Mac by users and
third-party vendors). These are all non-standard, though useful for (a)
processing on one's own machine and (b) interchange among like machines.
They are *not* a solution. For interchange among unlike machines,
something more standard is needed -- preferably a real national or
international standard.
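
To make the interchange problem concrete, here is a minimal sketch
using present-day Python codec names as stand-ins for the vendor sets.
That "cp437" represents the IBM PC set and "mac_roman" the Mac system
set is an assumption of this illustration, as are the helper names.

```python
# Stand-ins (assumed) for the IBM PC set, the Mac system set, and ISO 8859-1.
CODECS = ("cp437", "mac_roman", "iso8859_1")

def encode_everywhere(ch):
    """Return the byte(s) each character set uses for the same character."""
    return {codec: ch.encode(codec) for codec in CODECS}

def decode_everywhere(raw):
    """Return what the same byte(s) mean under each character set."""
    return {codec: raw.decode(codec) for codec in CODECS}

if __name__ == "__main__":
    # One character (a-umlaut), three different byte values.
    for codec, raw in encode_everywhere("\u00e4").items():
        print(codec, raw.hex())
    # One byte value, three different readings.
    for codec, ch in decode_everywhere(b"\x84").items():
        print(codec, repr(ch))
```

A file written on one machine and read on another without translation
comes out as exactly the kind of garbage described above.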

And such a standard does exist. Yes, Virginia, there *is* an eight-bit ASCII. Or,
more correctly, there is an ISO eight-bit character set for Western
European languages which is a simple extension of ASCII. Or, more
correctly still, the international standard ISO 8859 parts 1-8 defines a
family of eight-bit codes for single-byte representations of the
Latin-based alphabets in the official languages of Western and Eastern
Europe, Latin-and-Cyrillic, Latin-and-(Modern)-Greek, Latin-and-Arabic,
and Latin-and-Hebrew. EBCDIC code pages have been defined which
correspond to each of these codes, which means EBCDIC to ASCII
translation may someday be less fraught with problems than it is now.
There should soon be an ANSI version of ISO 8859-1, but if it's out
I haven't seen it.
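
The properties claimed here (one byte per character, ASCII preserved
as a subset, and a matching EBCDIC code page) can be checked with a
short sketch. The codec names are those of a present-day Python
library, and treating "cp500" as an EBCDIC page covering the 8859-1
repertoire is an assumption of this example, not something the
standard itself names.

```python
def byte_length(text, codec):
    """Number of bytes the text occupies in the given character set."""
    return len(text.encode(codec))

def ascii_is_subset(text, codec):
    """True if pure-ASCII text has identical bytes in ASCII and in codec."""
    return text.encode("ascii") == text.encode(codec)

def via_ebcdic(text, ebcdic_codec="cp500"):
    """Translate 8859-1 text to an EBCDIC page and back again."""
    latin1_bytes = text.encode("iso8859_1")
    ebcdic_bytes = latin1_bytes.decode("iso8859_1").encode(ebcdic_codec)
    return ebcdic_bytes.decode(ebcdic_codec)

if __name__ == "__main__":
    sample = "fa\u00e7ade, M\u00fcller"
    print(byte_length(sample, "iso8859_1") == len(sample))
    print(ascii_is_subset("plain text", "iso8859_5"))
    print(via_ebcdic(sample) == sample)
```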

A competing standard from the same ISO working committee, ISO 6937,
defines a character set with non-spacing diacritics ("dead keys") which
handles (of course) an even greater variety of Latin-based languages.
Using 6937, however, some
characters take one byte and others (the composite characters) take two
or more. This wreaks havoc with computer languages and programs built
around the assumption that each character has a length of one byte.
That's why 6937 has received so little vendor support, and why 8859 was
developed.
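
The difficulty can be sketched with a toy encoder in the style of 6937:
a diacritic byte precedes the base letter, so an accented letter takes
two bytes. This is not an implementation of the standard. The function
name is hypothetical, and the diacritic byte values in the table are an
assumption of this illustration and should be checked against the
standard before being relied on.

```python
import unicodedata

# Assumed byte values for a few non-spacing diacritics (illustrative only).
DIACRITIC_BYTE = {
    "\u0300": 0xC1,  # grave
    "\u0301": 0xC2,  # acute
    "\u0302": 0xC3,  # circumflex
    "\u0303": 0xC4,  # tilde
    "\u0308": 0xC8,  # diaeresis / umlaut
    "\u0327": 0xCB,  # cedilla
}

def encode_6937_like(text):
    """Encode text with each diacritic byte placed before its base letter."""
    out = bytearray()
    for ch in text:
        # Split an accented letter into base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if ord(base) > 0x7F:
            raise ValueError("base character outside ASCII: %r" % ch)
        for mark in marks:
            out.append(DIACRITIC_BYTE[mark])
        out.append(ord(base))
    return bytes(out)

if __name__ == "__main__":
    word = "M\u00fcller"
    print(len(word), len(encode_6937_like(word)), len(word.encode("iso8859_1")))
```

Six characters become seven bytes here but six bytes in 8859-1, so any
program that counts, indexes, or truncates by byte goes wrong on the
first form and not on the second.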

There are other standard character sets (e.g. that developed by the
American Library Association, ANSI Z39.47-1985, which also uses dead
keys and is respected by a number of library automation systems), and the
European Computer Manufacturers Association runs an international
character-set registry on behalf of ISO. But ISO 8859 seems clearly to
have more support than any competitor, as a character set for
general-purpose data processing in North America and Western Europe.
Further development is proceeding, and perhaps Harry Gaylord, who serves
on one of the responsible standards committees, will report on it.

For further information subscribe to ISO8859@JHUVM and read its old
logs.

Joe Giampapa also mentions an "SGML initiative" which was to come up
with something; I assume he means the Text Encoding Initiative, which is
working with SGML and which will certainly make recommendations for the
use and documentation of character sets. SGML (ISO 8879 -- similar but
different number) itself, Humanists will be relieved to hear, does not
require any particular character set. The choice of SGML, therefore,
does not require any commitment to any particular solution to the
character set problem.

SGML does provide a method for naming a character using "safe"
characters and special delimiters (a-umlaut might be encoded
"&aumlaut;") and defines public sets of such character names, which will
be useful for interchange among systems. Both character-set use and
standard character names must be part of any sound set of TEI
recommendations.
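
As a rough sketch of the naming method, the following replaces
characters with "&name;" references and expands them again. The table
and function names are hypothetical, and "aumlaut" is simply the
illustrative name used above; the public sets defined with SGML have
their own names (for instance "auml"), which a real system would use
instead.

```python
import re

# Hypothetical name table for illustration; not one of the public sets.
ENTITY_TO_CHAR = {"aumlaut": "\u00e4", "eacute": "\u00e9"}
CHAR_TO_ENTITY = {char: name for name, char in ENTITY_TO_CHAR.items()}

def to_safe(text):
    """Replace each character in the table with its &name; reference."""
    return "".join(
        "&%s;" % CHAR_TO_ENTITY[ch] if ch in CHAR_TO_ENTITY else ch
        for ch in text
    )

def from_safe(text):
    """Expand known &name; references; leave unknown ones untouched."""
    def expand(match):
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", expand, text)

if __name__ == "__main__":
    safe = to_safe("M\u00e4dchen")
    print(safe)                      # uses only seven-bit "safe" characters
    print(from_safe(safe) == "M\u00e4dchen")
```

The encoded form survives any system that handles plain ASCII, whatever
eight-bit set (if any) the machines at either end happen to use.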

The TEI's character-set working group is headed by Steven DeRose; he is
between addresses right now so I can't post his e-mail address, but
anyone interested in working on character set problems may contact me
and I will pass your name to him when I talk to him. If your
language is *not* catered to by ISO 8859, you should definitely
speak up, so we can get cracking on your problems.


-Michael Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago