5.0412 Answers to Character Code Questions (2/193)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Fri, 25 Oct 1991 20:08:46 EDT

Humanist Discussion Group, Vol. 5, No. 0412. Friday, 25 Oct 1991.

(1) Date: Fri, 25 Oct 91 09:06:18 CDT (95 lines)
From: john@utafll.uta.edu (John Baima)
Subject: Unicode

(2) Date: Fri, 25 Oct 91 17:10:02 MET (98 lines)
From: Harry Gaylord <galiard@let.rug.nl>
Subject: Unicode and ISO10646

(1) --------------------------------------------------------------------
Date: Fri, 25 Oct 91 09:06:18 CDT
From: john@utafll.uta.edu (John Baima)
Subject: Unicode

I'd like to answer John Hughes's questions by referring to Unicode
rather than ISO 10646 which will go through at least one more DIS
(Draft International Standard) and could flip-flop yet again.

(1) Do either ISO 10646 or Unicode provide support for or some protocol
for distinguishing between left-to-right and right-to-left scripts?
[stuff deleted]

A: Yes. Unicode is designed to handle both left to right and right to
left languages.

(2) How do ISO 10646 and Unicode specify that compound characters be
coded? For example, does ISO 10646 or Unicode support floating
diacritics?

A: Unicode supports (and is the prime advocate of) floating diacritical
marks. The Unicode philosophy is to make productive use of accents and
diacritical marks. They do, however, have all of the precomposed
characters in all of the ISO 8859 standards (and several other ISO
standards as well).
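
As a rough illustration of what "productive use" of floating marks means, here is a minimal sketch assuming a present-day Python and its standard unicodedata module (which tracks a far later Unicode version than the one under discussion). The base letter and the two marks are arbitrary choices made for the example.

```python
import unicodedata

# A base letter followed by two floating (combining) marks.
base = "a"
marks = "\u0323\u0302"  # COMBINING DOT BELOW, COMBINING CIRCUMFLEX ACCENT
composed_by_hand = base + marks

def split_base_and_marks(text):
    """Return (base characters, combining marks) found in text."""
    bases = [ch for ch in text if unicodedata.combining(ch) == 0]
    floating = [ch for ch in text if unicodedata.combining(ch) != 0]
    return bases, floating

bases, floating = split_base_and_marks(composed_by_hand)
mark_names = [unicodedata.name(ch) for ch in floating]

# A precomposed ISO 8859-1 character keeps the same code value in Unicode.
e_acute = bytes([0xE9]).decode("latin-1")
e_acute_name = unicodedata.name(e_acute)
```

Any base letter can take any sequence of marks this way, so a combination needs no code point of its own, while the ISO 8859 precomposed characters remain available under their old values.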

(3) Do either ISO 10646 or Unicode provide support for variable width
characters (e.g., "m" is wider than "i")?

A: This is a function of rendering and not encoding. It is not
discussed in Unicode. It is assumed that both fixed width and variable
width characters will be used.

(4) Do either ISO 10646 or Unicode provide support for combining
variable width characters with diacritics so that the positioning of the
diacritic is relative to the width of the character.

A: Same as (3)
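
The point of answers (3) and (4) can be sketched as follows. The width tables and the accent width are invented numbers standing in for what a real font would supply; nothing here comes from Unicode or ISO 10646 themselves.

```python
# Hypothetical advance widths in arbitrary units; a real font would supply these.
PROPORTIONAL = {"i": 3, "m": 10}
FIXED = {"i": 6, "m": 6}
ACCENT_WIDTH = 2  # assumed width of a floating acute accent glyph

def line_width(text, widths):
    """Total advance of text under a given width table."""
    return sum(widths[ch] for ch in text)

def accent_offset(base, widths):
    """Horizontal offset that centers the accent over the base glyph."""
    return (widths[base] - ACCENT_WIDTH) / 2

text = "mimi"
codes = [ord(ch) for ch in text]  # identical whichever font renders them
```

The code values in `codes` never change; only the renderer's width table decides how wide the line is and where an accent sits.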

(5) Generally, for those of us not familiar with ISO 10646 and Unicode,
how will these standards affect future operating systems and
applications in ways that are beneficial to persons involved in
multilingual word processing and related tasks? How specifically do you
envision them making our tasks easier.

A: Apple, IBM, Microsoft, NeXT, Lotus, Word Perfect, Claris, Sun
Microsystems, Metaphor and others comprise the Unicode Consortium.
Unicode will allow us to share "plain Unicode files" like we share
"plain ASCII" today. However, Unicode 1.1 can handle virtually all of
the known languages, excepting some archaic scripts. This includes
literally thousands of languages. General purpose applications (word
processing, presentation software, communications software, databases,
etc.) will be able to handle Hebrew and Arabic as well as English or
whatever.

(6) Finally, I have no quibbles with John Baima's unhappiness about DOS.
However, I fail to see how Unicode, ISO 10646, Windows, Type 1 fonts,
TrueType fonts, or ATM will solve the problem I described earlier on
HUMANIST about printer drivers, changes to printer ROMs, [stuff deleted]

A: The BIG difference is that when a font fails to print on a given
printer, whose job is it to correct the problem? With DOS solutions, it
is each software application. However, if I use Adobe Type 1 fonts and
ATM under Windows and that font does not print on a Brand X printer, it
is the responsibility of the printer manufacturer and Adobe. Not only is
it not my responsibility as an application developer, I *cannot* solve
the problem. The problem gets fixed once, not N times for N different
DOS applications.

These ROM problems are, more often than not, related only to downloading
characters to the printer. However, ATM does not use downloaded fonts
but prints the text in graphics mode. All Toshiba dot matrix
printers, for example, use the same graphics standard but widely
different character downloading (if you include the old Toshiba
printers).

Finally, I would like to encourage HUMANISTs to think about Unicode.
If it sounds like a "good thing", please post why (or why not). I would
like to make a collection of these responses and pass them on to the
ISO and Unicode lists. I have been told by one of the computer OS
manufacturers that a good requirement definition as to why Unicode would
be a "good thing" (and help sell computers) may help push them to
insert this feature sooner rather than later.

John Baima
john@utafll.uta.edu

(2) --------------------------------------------------------------------
Date: Fri, 25 Oct 91 17:10:02 MET
From: Harry Gaylord <galiard@let.rug.nl>
Subject: Unicode and ISO10646

Let me answer briefly the questions which John Hughes raised in Vol. 5,
No. 0409. Control codes do not form a part of the required sections of
10646. Reference is made to the Control codes of ISO 6429:1988. Until
now control functions have been encoded in positions 0 - 31 (C0) and 128
- 159 (C1) or are introduced by an escape sequence in ISO standards. In
the recent ISO discussion documents it has become apparent that UNICODE
contains control/formatting codes in places where a graphic character
would be expected. These have now been placed in an appendix which means
devices can conform to the requirements of 10646 without using them. If
this sounds confusing, the result is simple. There are going to be two
sets of control codes which are not mutually interchangeable.
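
The two conventions can be told apart mechanically. The sketch below is only an illustration: the function name is made up, the ranges are the ones quoted above, and the final lookup relies on a present-day Python unicodedata module rather than on any 1991 draft.

```python
import unicodedata

def control_set(code):
    """Classify a code value as C0, C1, or neither, using the ranges quoted above."""
    if 0 <= code <= 31:
        return "C0"
    if 128 <= code <= 159:
        return "C1"
    return None

ESC = 0x1B  # ESCAPE, itself a C0 code, introduces escape sequences

# A Unicode formatting code that lives outside both control ranges,
# in a position where a graphic character would otherwise be expected.
rlo = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
rlo_set = control_set(ord(rlo))
rlo_category = unicodedata.category(rlo)  # "Cf" means a format character
```

A device that looks for control functions only in C0, C1 and escape sequences will not recognise `rlo` as a control at all, which is the interchange problem in miniature.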

1. Your question about bidi (bidirectionality) can illustrate the
problem well.

UNICODE offers a relatively brutal way of handling this.

(a) The bracket and parenthesis marks are right or left in shape
relative to the direction of the language.

(b) There are codes for starting and overriding left to right and right
to left directions.
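
Points (a) and (b) can be inspected in a present-day Python, whose unicodedata module exposes the relevant character properties. This is a sketch under that assumption; the property data comes from a much later Unicode version than the one described here, and the dictionary name is made up.

```python
import unicodedata

# (a) Brackets and parentheses are "mirrored": shape follows text direction.
paren_is_mirrored = unicodedata.mirrored("(") == 1

# Letters carry an inherent direction class.
alef_class = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
latin_class = unicodedata.bidirectional("a")

# (b) Codes for starting (embedding) and overriding a direction.
DIRECTION_CODES = {
    "\u202a": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202d": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e": "RLO",  # RIGHT-TO-LEFT OVERRIDE
    "\u202c": "PDF",  # POP DIRECTIONAL FORMATTING (ends an embedding or override)
}
reported = {ch: unicodedata.bidirectional(ch) for ch in DIRECTION_CODES}

# A Hebrew word forced to display left to right by an override.
overridden = "\u202d" + "\u05e9\u05dc\u05d5\u05dd" + "\u202c"
```

The override changes nothing in the stored letters; it is one more code in the character stream that a renderer must honour.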

ISO 6429 offers two ways of changing directionality.

(i) Select Presentation Directions (SPD) There are 7 possible parameters
for this to indicate the character and line progression. For Hebrew this
is 3 "the direction of the character path is from right to left; the
direction of the line progression is from top to bottom."

(ii) Start Reversed String (SRS) can have two parameters. They indicate
the start and end of reversing the current direction. A technical
committee of ECMA is currently working on making this ISO control
function more powerful. An example of what they are working on as far
as I understand it is to have a parameter in which the Arabic or Hebrew
text is presented right to left, but numbers from left to right.
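
For comparison with the Unicode codes, the two ISO 6429 functions are control sequences rather than characters. The sketch below builds them as strings. The byte layouts (CSI Ps1;Ps2 SP S for SPD, CSI Ps [ for SRS, with 1 to start and 0 to end) are given from memory of ECMA-48 and should be treated as assumptions to check against the standard; the function names are made up.

```python
CSI = "\x1b["  # 7-bit form of CONTROL SEQUENCE INTRODUCER (ESC followed by "[")

def spd(direction, effect=0):
    """SELECT PRESENTATION DIRECTIONS, assumed layout: CSI Ps1 ; Ps2 SP S."""
    return f"{CSI}{direction};{effect} S"

def srs(start):
    """START REVERSED STRING, assumed layout: CSI Ps [ (1 = start, 0 = end)."""
    return f"{CSI}{1 if start else 0}["

def reversed_run(text):
    """Wrap text so that it is presented against the current direction."""
    return srs(True) + text + srs(False)

# Parameter 3: character path right to left, lines top to bottom (Hebrew).
hebrew_page = spd(3)

# Digits inside right-to-left text, presented left to right.
number_in_hebrew = reversed_run("1991")
```

Unlike the Unicode codes, these sequences are several bytes long and are parsed by the device, not counted as characters of the text.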

So the answer to your first question is that 10646 in itself does not
have any protocol for bidi, but it can be used with two different sets of
protocols which can achieve this. This is certainly going to cause some
problems.

2. Let me break this question into its parts:

a. There will be 4 levels of code extension. In one floating accents
will not be included, in two floating accents will be available with the
ISO mechanism only, in three the combining method of UNICODE but not the
ISO can be used, in four both UNICODE and ISO methods can be used. I
assume that a file will have to have a header indicating the level.
There is provision for what you call floating accents. There are also a
large number of preformed compound characters. Thus your &eacute; can be
coded as [00E9] or [0065]+[0301], i.e. LATIN SMALL LETTER E + COMBINING
ACUTE ACCENT and in one other way with the ISO SGCI. Software designers
will have to decide how they will handle this. The basic task for them
is to design tables showing which multiple encodings are possible and
perhaps to normalize them in some way.
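
One way such normalizing can look today is sketched here with Python's unicodedata.normalize. The normalization forms NFC and NFD were defined by later versions of Unicode and are not part of the drafts discussed in this message; the helper name is made up.

```python
import unicodedata

precomposed = "\u00e9"       # [00E9] LATIN SMALL LETTER E WITH ACUTE
combining = "\u0065\u0301"   # [0065]+[0301] e + COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after mapping both to one canonical form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == combining              # a naive comparison fails
normalized_equal = same_text(precomposed, combining)

decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", combining)
```

Search, sorting and retrieval software would compare the normalized forms, so that the two codings of the same letter are treated as one.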

b. Your Hebrew question is another question entirely. This is a problem
of imaging, not character encoding. It would be absurd to include your
1000s of possible combinations. A program will have to read the stream
of characters and produce a readable image of them on screen or printer.

3. Variable width in presentation is only marginally provided for in both
10646 and UNICODE. This also is a question of imaging, not character
encoding.

4. Ibid.

5. I cannot predict what will happen in the future, but if 10646 takes
off it will mean that character sets are provided for at the system level
where they should be, not on application level. Everyone who has a 10646
system will have access to the necessary characters for multilingual
processing in an enormous number of languages. That can only be an
improvement. There will be some hiccups with sorting out which
extension level and control function wins out. There will also be some
revision of what characters are included within the short term, I think.
It will mean that your word processor, full text data retrieval system,
and SGML application (if compliant) will use the same multilingual
character set. With networks supporting it we can exchange these files
anywhere without corruption. If machines running EBCDIC move up to
10646, other major compatibility problems will be solved.
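
The EBCDIC problem, and what a shared character set buys, can be sketched with codecs that ship with a present-day Python. Code page 037 is just one EBCDIC variant picked for the example, and UTF-16 is used only as a stand-in for a 16-bit universal encoding; neither choice comes from the 10646 drafts.

```python
text = "Unicode"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # one EBCDIC code page shipped with Python

# The same letter has different code values on the two kinds of machine.
a_in_ascii = "A".encode("ascii")
a_in_ebcdic = "A".encode("cp037")

# Reading EBCDIC bytes as if they were Latin-1 corrupts the text.
misread = ebcdic_bytes.decode("latin-1")

# With one character set on both ends the text survives the exchange.
shared = text.encode("utf-16-be")
round_trip = shared.decode("utf-16-be")
```

The corruption in `misread` is exactly what disappears once both systems agree on a single coded character set.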

6. The problem of printing as you can see from the above is not dealt
with by a character coding standard. The whole issue of presentation
forms will have to be handled outside of this standard. This will be
done by using PostScript download fonts etc. The number of glyphs which
can be placed on paper and screen is considerably larger than 2^16 which
is the maximum number of characters in this standard.

Apologies for the length of this answer. I will send longer and more
detailed information to the listservers soon.

[some reformatting by the editor. -- ahr]