4.0976 Unicode (1/84)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Tue, 5 Feb 91 14:16:56 EST

Humanist Discussion Group, Vol. 4, No. 0976. Tuesday, 5 Feb 1991.

Date: Tue, 05 Feb 91 10:56:20 GMT
From: DEL2@PHOENIX.CAMBRIDGE.AC.UK
Subject: Unicode 1.0

I recently requested a copy of the draft spec of Unicode 1.0 character
encoding. Although not able to give it all the time I'd have liked, my
brief look does raise a number of comments. I'm grateful to have the
opportunity to plug my comments into the general discussion (via TEI,
HUMANIST and the UNICODE team themselves: microsoft!asmusf@uunet.uu.net).

(a) There are a number of significant typos; is anyone keeping a master
record of these?

(b) Robin Cover <ZRCC1001@SMUVM1> has raised the question why there are
not separate encodings for Hebrew SIN and SHIN. They are certainly at
least as distinct as, say, LATIN E followed by ACUTE and LATIN E ACUTE.
I take it that the reason the latter case has two encodings is because of
previous ISO encodings; but since those are in any case ASCII encodings
(and Unicode is intended as a replacement for ASCII) how relevant is
that? The question also raises a more fundamental problem in my mind.
There are a number of situations where a glyph (or conglomerate of
glyphs) can reasonably be encoded in alternative ways; HYPHEN
(U+2010=U+002d) would be a case in point. We are told that some of
these redundancies are there so that natural pairing can be used "if
desired" (page 6). However, this pairing is not applied consistently
(eg CAPITAL DOTTED I). But what worries me is that two
encodings of an identical text may thus turn out to be very different;
and for anyone using computer comparison of texts this could be quite
problematic. So over against those who complained that, eg, separate
codings for GREEK ALPHA+GRAVE are not available I would voice the
opposite disquiet: the encodings are too comprehensive. If ALL
accentuation was added as a separate code I think comparison of texts
would be easier.
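
To make the comparison problem in (b) concrete, here is a minimal Python
sketch. It assumes the normalization forms (NFD here) that later versions
of the Unicode standard define and that Python's standard unicodedata
module exposes; nothing like this is in the 1.0 draft under discussion, and
the helper name same_text is purely illustrative.

```python
import unicodedata

# Two encodings of the same visible text (capital E with acute accent).
precomposed = "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "E\u0301"   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT


def same_text(a, b):
    """Hypothetical helper: compare after canonical decomposition (NFD).

    Assumption: the normalization forms of later Unicode versions, not
    anything specified in the 1.0 draft.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Plain code-by-code comparison treats the two encodings as different.
naive_equal = precomposed == decomposed

# After decomposition both become "E" followed by the combining accent.
normalized_equal = same_text(precomposed, decomposed)

# HYPHEN (U+2010) and HYPHEN-MINUS (U+002D) are not canonically
# equivalent, so decomposition does not reconcile that pair.
hyphens_equal = same_text("\u2010", "\u002d")
```

Decomposing everything before comparing amounts to treating all
accentuation as separate codes, which is the approach argued for above.
The hyphen pair shows that not every redundancy is removed this way.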

The ordering of the accents would then of course be important, and I
don't think the algorithm given (centre-out) is terribly helpful; which
is nearest the centre in GREEK ROUGH BREATHING+ACUTE+IOTA SUBSCRIPT?
Wouldn't an additional algorithm (clockwise starting at twelve o'clock)
be useful?
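
For what it is worth, the ordering question can be explored with a short
Python sketch. It assumes the "canonical combining class" numbers that
later versions of the standard assign to each mark (available through
Python's unicodedata module). This is neither the centre-out rule of the
draft nor the clockwise rule suggested above, only an illustration of how
one fixed ordering makes differently ordered marks comparable.

```python
import unicodedata

ALPHA = "\u03b1"      # GREEK SMALL LETTER ALPHA
ROUGH = "\u0314"      # COMBINING REVERSED COMMA ABOVE (rough breathing)
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
IOTA_SUB = "\u0345"   # COMBINING GREEK YPOGEGRAMMENI (iota subscript)

# Combining class of each mark; marks with different classes are sorted
# into ascending class order during canonical decomposition.
classes = [unicodedata.combining(c) for c in (ROUGH, ACUTE, IOTA_SUB)]

# The same three marks, with the iota subscript typed last or first.
a = ALPHA + ROUGH + ACUTE + IOTA_SUB
b = ALPHA + IOTA_SUB + ROUGH + ACUTE
reordered_equal = (unicodedata.normalize("NFD", a)
                   == unicodedata.normalize("NFD", b))

# Marks sharing a class (both marks above the letter here) are NOT
# reordered, so their relative order remains significant.
c = ALPHA + ACUTE + ROUGH + IOTA_SUB
same_class_equal = (unicodedata.normalize("NFD", a)
                    == unicodedata.normalize("NFD", c))
```

Under this scheme the subscript sorts after the marks above the letter
wherever it was typed, while the breathing and the accent keep the order
in which they were entered.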

(c) While we're on Greek, I couldn't find a Greek semicolon (raised dot).
Maybe I just didn't look hard enough, but full punctuation would be
useful. But see my comment (e) below.
I also failed to locate LATIN CAPITAL LETTER WYNN.

(d) In general I approve of the policy that by adding the special Coptic
forms to the Greek alphabet one can generate Coptic text, with hard copy
generated by choosing an appropriate font. (And mutatis mutandis for
other languages.) However, there are some drawbacks to this policy; I
foresee the following problems:

(i) It may be necessary to indicate to someone (if only the compositor)
where to change font. Could a coding for change-of-language be
incorporated?

(ii) In some Greek texts it may be important to indicate where ligatures
are used; there seems no way in this encoding to distinguish between
GREEK KAPPA + GREEK ALPHA + GREEK IOTA on the one hand and the ligature
which stood for "kai" on the other. I am sometimes in the position of
needing to say (as indeed the authors of the manual were) something like
"There are three possible forms of LATIN SMALL LETTER G CEDILLA (U+0123)
and they look like ..." How could I encode my ellipsis? Could the whole
of the manual as printed be sensibly encoded in Unicode? Oddly, there
are some forms which are exclusively graphic variants (ie one would not
find them together in a "natural" text) which do attract separate
codings; GREEK SMALL LETTER SCRIPT THETA for instance. Perhaps
consistency is unattainable, but to me it is a desideratum.

(e) The encoding of special numerals seemed odd. As well as a select
group of fractions (thirds, quarters and eighths, I think) there is the
top half of fractional 1/nnn (U+215f). How is its use envisaged?
Wouldn't a generalised "fractional line" be better (let's call it
U+nnnn) so that <number string1>nnnn<number string2> is to be
interpreted as a fraction?
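
A minimal Python sketch of how such a generalised fractional line might be
interpreted follows. The code point is an assumption: U+2044 (FRACTION
SLASH in later versions of the standard) stands in for the placeholder
U+nnnn, and the function name parse_fractions is hypothetical.

```python
import re
from fractions import Fraction

# Assumption: U+2044 FRACTION SLASH stands in for the proposed "U+nnnn".
FRACTION_LINE = "\u2044"

# <number string1> FRACTION_LINE <number string2>
_FRACTION = re.compile(r"(\d+)" + FRACTION_LINE + r"(\d+)")


def parse_fractions(text):
    """Return every <digits><fraction line><digits> run as a Fraction.

    A zero denominator is not handled in this sketch; Fraction would
    raise ZeroDivisionError for it.
    """
    return [Fraction(int(num), int(den))
            for num, den in _FRACTION.findall(text)]


# Any numerator and denominator can be written, not only the handful
# of fractions that have their own codes.
result = parse_fractions("1\u2044128 and 3\u20444")
```

Here result holds the two fractions 1/128 and 3/4, built from ordinary
digits plus the one separator code.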

Similarly, Roman 12 (XII) is encoded as U+216b, but 13 (XIII) must be
(presumably) U+2169 2162. Why not a single code for "roman numbers
follow here:" (or just use ROMAN CAPITAL LETTER X &c)?

If codes for general *modes* like "Greek font", "roman numeral",
"fraction" were included, then many ambiguities and problems could be
reduced. My Greek semicolon, for instance, could be "GREEK FONT + ;"

This contribution could be better thought-out, but it was this or
nothing. If the latter seems preferable, please discard!

Sincerely,
Douglas de Lacey.