8.0068 Encoding Question: Orthography and Palaeography (1/58)

Elaine Brennan (EDITORS@BROWNVM.BITNET)
Tue, 21 Jun 1994 22:57:22 EDT

Humanist Discussion Group, Vol. 8, No. 0068. Tuesday, 21 Jun 1994.

Date: Fri, 17 Jun 1994 12:25:06 -0400 (EDT)
From: David J Birnbaum <djbpitt+@pitt.edu>
Subject: Orthography and Palaeography

I am currently elaborating an inventory of early Cyrillic characters
needed for electronic text processing, for which I am trying to
distinguish characters (informational units) from glyphs (presentational
units). As a modern example, changing letters means changing characters,
while changing typefaces means changing glyphs. My encoding should
represent all character information, but should exclude glyphic
information.

From a manuscript perspective, I would like to say that items of
orthographic interest (pertaining to the distribution of signs) are
character data, while items of palaeographic interest (pertaining to
variation in the shapes that may represent the same sign) are glyph level.
But individual decisions turn out to be subjective: if I think the scribe
knew that two signs were different, I regard them as characters, while if
he didn't, they are glyphs.

The equation of characters and letters of the alphabet is tempting, but it
doesn't provide sufficient granularity, since certain units that are not
considered independent letters are nonetheless traditional objects of
orthographic study. For example, some Slavic manuscripts distinguish
narrow and broad omicron (in addition to omega, and sometimes other o-type
letters), and their distribution can be orthographically (and even
linguistically) significant. Yet narrow and broad omicron have never been
considered separate letters of the alphabet.

The comparison to phonemes (= characters) and phones (= glyphs) is also
seductive, but characters don't necessarily influence meaning. For
example, medieval Slavic manuscripts traditionally spell the sound [u] as
either a single <u> letter or an <ou> digraph. Their distribution is
sometimes governed by orthographic rules (e.g., write <ou> at the
beginning of a word and <u> elsewhere), and the rules are not always
carried out with complete regularity. One thing Slavic philologists study
is the orthographic norms of individual manuscripts and the deviations
from these norms, which means that unifying <u> and <ou> as a single
character during input would prohibit such analysis.
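
A minimal sketch of that point, assuming a hypothetical transcription in
which the digraph is entered as the Latin placeholder "ou" and the single
letter as "u". The function names and sample words are invented for
illustration, and the placeholder scheme cannot tell the digraph apart
from a true o followed by u, which a real encoding would have to do.
Because the two spellings stay distinct at input, deviations from the
rule can be listed, and unification remains available at analysis time:

```python
# Hypothetical transcription: the digraph is entered as "ou" and the single
# letter as "u"; both spell the sound [u]. Sample words are invented.

def u_spellings(word):
    """Yield (position, spelling) for each [u] spelling in a word."""
    i = 0
    while i < len(word):
        if word.startswith("ou", i):
            yield i, "ou"
            i += 2
        elif word[i] == "u":
            yield i, "u"
            i += 1
        else:
            i += 1

def deviations(words):
    """List spellings that break the rule: <ou> word-initially, <u> elsewhere."""
    found = []
    for word in words:
        for pos, spelling in u_spellings(word):
            expected = "ou" if pos == 0 else "u"
            if spelling != expected:
                found.append((word, pos, spelling))
    return found

def normalize(word):
    """Analysis-time unification of <ou> and <u>; the input stays untouched."""
    return word.replace("ou", "u")

sample = ["ouho", "ruka", "uze", "drougu"]
print(deviations(sample))              # spellings that violate the rule
print([normalize(w) for w in sample])  # unified forms, derived on demand
```

Had the two spellings been unified during input, deviations() would have
nothing left to report.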

I should add that I'm not talking about font design: fonts are inventories
of glyphs, not of characters. (For example, the English ff ligature is a
glyph but not a character. I would expect an early Cyrillic font to
contain many more symbols than an early Cyrillic character set.)
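
As one illustration of the ligature case, assuming Python's standard
unicodedata module: Unicode carries a compatibility code point for the ff
ligature (U+FB00), and compatibility normalization folds it back into the
two characters f + f. This shows only how one standard treats the example;
it is not a recommendation for how an early Cyrillic inventory should be
drawn up.

```python
import unicodedata

ligature = "\ufb00"  # a single presentation-form code point
print(unicodedata.name(ligature))

# Compatibility normalization (NFKC) replaces the ligature with its
# underlying characters, so the two spellings compare equal afterwards.
folded = unicodedata.normalize("NFKC", "o" + ligature + "er")
print(folded == "offer")

# One code point before folding, two characters after.
print(len(ligature), len(unicodedata.normalize("NFKC", ligature)))
```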

One can most safely err on the side of overinclusion (and normalize during
analysis, rather than input), but I'd like to avoid that if possible. Has
anyone tackled a similar problem who can offer some guidance?

(For more information on character / glyph differences, interested persons
may consult "The Unicode Standard," volume 1, Reading: Addison-Wesley,
1991, ISBN 0-201-56788-1.)

Thanks,

David

Professor David J. Birnbaum          djbpitt+@pitt.edu
The Royal York Apartments, #802
3955 Bigelow Boulevard               voice: 1-412-687-4653
Pittsburgh, PA 15213 USA             fax:   1-412-624-9714
