4.0972 Unicode Encoding for Hebrew (1/69)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 4 Feb 91 00:07:56 EST

Humanist Discussion Group, Vol. 4, No. 0972. Monday, 4 Feb 1991.

Date: Fri, 01 Feb 91 16:28:56 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1>
Subject: UNICODE ENCODING FOR HEBREW


With the "UNICODE vs. ISO 10646" war heating up, I wonder whether other
HUMANISTS (as semitists and orientalists) have looked at the UNICODE code
points for Hebrew. Have you? Particularly, I would be interested in the
opinions of Israeli scholars in biblical studies.

Relevance: as has been summarized already on this forum, UNICODE is
finalizing the draft specification for its 16-bit fixed-width character
encoding scheme, and will close the comment period on February 15th. This
emerging "standard" is backed by IBM, Apple, Microsoft, Metaphor, NeXT,
Sun, Xerox, The Research Libraries Group, and other powerful commercial
groups -- so the consequences cannot be taken lightly. ECMA has thrown its
weight solidly behind the ISO 10646 group in opposing UNICODE. I don't
know what the Unicode consortium will do about the ECMA/ISO opposition, but
it seems prudent that humanities scholars address both groups with their
concerns.

My specific concern relates to the absence of unique codes for Hebrew (and
Aramaic) SHIN and SIN. We could argue about the precise orthographic
stratum to which "dotted" SHIN and SIN belong (relative to vowel points,
accents, cantillation and punctuation marks), but the results would
probably not be determinative for this discussion. It is materially
relevant (1) that these ARE historically distinct consonants, and (2) that
the original Phoenician writing system IS underspecified for Hebrew. But
the concern is simply pragmatic: should software designers and programmers
be forced to reckon with SHIN and SIN as double-width (two-code) characters when in
linguistic terms these are distinct consonants? UNICODE makes 05C1
"HEBREW POINT SHIN DOT" and 05C2 "HEBREW POINT SIN DOT". If this decision
can be defended historically in terms of Hebrew orthography, it can hardly
be defended as optimal for any text processing other than printing of
non-vocalized text.
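
To make the practical cost concrete: under the encoding described above,
deciding whether a given letter is SHIN or SIN means inspecting the marks
that follow 05E9. What follows is only a rough Python sketch, assuming the
code points quoted above (05E9, 05C1, 05C2). It relies on Python's standard
unicodedata module, which reflects the later published standard rather than
the draft under discussion. The function name classify_shin is hypothetical.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN (undifferentiated base letter)
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT


def classify_shin(text, i):
    """Classify the letter at index i as 'shin', 'sin', or 'undotted'.

    Returns None if text[i] is not the base letter 05E9.
    Hypothetical helper, for illustration only.
    """
    if text[i] != SHIN:
        return None
    j = i + 1
    # Scan every combining mark attached to the base letter; the
    # distinguishing dot is not guaranteed to come first.
    while j < len(text) and unicodedata.combining(text[j]) != 0:
        if text[j] == SHIN_DOT:
            return "shin"
        if text[j] == SIN_DOT:
            return "sin"
        j += 1
    return "undotted"
```

Note that the dot need not be adjacent to the letter: a daghesh (05BC) may
sit between them, so a simple two-character comparison is not enough.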

Nothing seems lost or compromised in the UNICODE scheme if undifferentiated
(viz., "undotted") "HEBREW LETTER SHIN" is left at code point 05E9, for modern
and epigraphic Hebrew, and two of the unassigned slots (05F6 - 05FF) are
used for linguistically and/or orthographically distinct SHIN and SIN. This
would seem a small concession considering that five slots are given to final
forms -- an overspecification that could be handled in applications (as
would seem preferable in keyboarding, for example, as I assume it is in
Arabic). I am not a programmer, so perhaps I overestimate the negative
consequences of current UNICODE for implementors in having to work around
this problem, and the performance penalty. Programmers already have to deal
with five cases of overspecification at applications level (the five final
forms), so in principle having three characters for SHIN/SIN, SHIN and SIN
would be consistent and economical.
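
As an illustration of handling the final forms at applications level, here
is a minimal Python sketch that folds the five final forms onto their
ordinary counterparts before comparison or searching. It assumes the code
point values of the ISO 8859/8-based consonant block (05D0 - 05EA) as they
appear in the later published standard. The names FINAL_TO_BASE and
fold_final_forms are hypothetical.

```python
# Map each final form to its ordinary (non-final) letter.
# Code point values are assumed from the ISO 8859/8-based ordering.
FINAL_TO_BASE = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}

_FOLD_TABLE = str.maketrans(FINAL_TO_BASE)


def fold_final_forms(text):
    """Return text with the five final forms replaced by base letters."""
    return text.translate(_FOLD_TABLE)
```

An application could apply such a fold to both the search key and the text
searched, leaving the stored text itself unchanged.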

Less problematic (in my view) but not entirely felicitous is that the
"dot" for daghesh (05BC) is also used for mappiq and for the "dot" in
shureq.
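
The consequence is that software must infer the function of 05BC from the
letter that carries it. The following Python sketch shows one very rough
heuristic and is an assumption for illustration, not a rule of Hebrew
orthography: it ignores rare cases and all context beyond the base letter.
The function name describe_dot is hypothetical.

```python
DOT = "\u05BC"  # the single "dot" code point shared by all three functions
VAV = "\u05D5"  # HEBREW LETTER VAV
HE = "\u05D4"   # HEBREW LETTER HE


def describe_dot(base):
    """Guess the function of 05BC from the base letter carrying it.

    A deliberately crude heuristic: the code point alone cannot say
    which of the three functions is intended.
    """
    if base == VAV:
        # On vav the dot may mark the vowel or double the consonant.
        return "shureq or daghesh (ambiguous without context)"
    if base == HE:
        return "mappiq"
    return "daghesh"
```

For example, describe_dot applied to BET (05D1) yields "daghesh", while VAV
cannot be resolved without looking at the surrounding vowels.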

Time is short, but those interested in the UNICODE draft should contact
Asmus Freytag at Microsoft: Tel: (1 206) 882-8080; FAX: (1 206) 883-8101;
Email: microsoft!asmusf@uunet.uu.net

For those who cannot obtain a draft copy in time, UNICODE currently
contains 84 16-bit characters for Hebrew:

(1) 31 code points for "cantillation marks and accents"
(2) 20 code points for "points and punctuation"
(3) 27 code points for consonants (based upon ISO 8859/8)
(4) 3 code points for Yiddish digraphs (double-vav; vav-yod; double-yod)
(5) 2 code points for "additional punctuation" (geresh; gershayim)
(6) 1 code point for Ladino/Judezmo (point VARIKA)

Robin Cover

BITNET: zrcc1001@smuvm1
Internet: robin@ling.uta.edu
Internet: robin@txsil.lonestar.org