Text in attribute values in the TEI encoding scheme
Christian Wittern
wittern at kanji.zinbun.kyoto-u.ac.jp
Mon Mar 10 19:44:34 EST 2003
Dear council members,
I would like to re-open the discussion of text in attribute values by
formulating my position more extensively below. I propose to first
discuss this here and if necessary turn it into an agenda item for the
conference call next week.
All the best,
Christian
Text in attribute values in the TEI encoding scheme
Author: Christian Wittern
Date: 2003-03-11
_________________________________________________________________
The problem
The underlying model of the TEI encoding scheme up to P4 has been that
of some `Urtext' plus markup. If somebody wanted to see the Urtext, all
that was needed, so it seems, was to strip off the tags. Conveniently,
Emacs' PSGML even has a function for doing exactly this, hiding the
tags and presenting only the textual content. For exactly this reason,
redundancy has been built in for things like the <corr> and <sic>
elements, which express the same information content in two
complementary ways. This places an unnecessary burden on implementors
and forces the encoder to give priority to one view of the text or the
other. There is no way to state a fact impartially; one has to be
either pro or con.
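To make the point concrete, here is a minimal sketch in Python (standard library only). The sample sentence is invented for illustration; the two encodings follow the P4 pattern of <sic> with a corr attribute and <corr> with a sic attribute. Stripping the tags yields a different `Urtext' depending on which element the encoder chose, and the other half of the information, sitting in the attribute value, is dropped without trace.

```python
import xml.etree.ElementTree as ET

# Two P4-style encodings of the same fact: the source reads "thier"
# and the editor believes "their" was intended. (Invented sample text.)
AS_SIC = '<p>They lost <sic corr="their">thier</sic> way.</p>'
AS_CORR = '<p>They lost <corr sic="thier">their</corr> way.</p>'


def strip_tags(xml_fragment: str) -> str:
    """Return only the element content, discarding tags and attributes."""
    return "".join(ET.fromstring(xml_fragment).itertext())


print(strip_tags(AS_SIC))   # the encoder sided with the source
print(strip_tags(AS_CORR))  # the encoder sided with the correction
```

The same fact, encoded twice, gives two different results once the tags are hidden, which is exactly the forced choice described above.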
The only exception I have encountered so far in the TEI encoding
scheme is text-critical encoding, where lemma and readings are (at
least in the parallel segmentation method used inline) placed side by
side at the same level, providing different paths through the text. No
mythical `Urtext' can be discovered by simply hiding the tags; only a
very confusing array of text fragments would be left over for
inspection.
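A small sketch of this, again in Python with only the standard library. The line of text and the witness sigla A, B and C are invented, and the helper text_for_witness is a hypothetical name of mine, not part of any TEI tool. It assumes each witness is listed in exactly one <lem> or <rdg> of every <app>.

```python
import xml.etree.ElementTree as ET

# Parallel segmentation: lemma and readings sit side by side in <app>.
LINE = ET.fromstring(
    '<l>The <app>'
    '<lem wit="A">quick</lem>'
    '<rdg wit="B">quik</rdg>'
    '<rdg wit="C">swift</rdg>'
    '</app> fox</l>'
)


def text_for_witness(elem, witness):
    """Follow one path through the text: the readings of one witness."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "app":
            # Pick the lem/rdg whose wit attribute lists this witness.
            for reading in child:
                if witness in reading.get("wit", "").split():
                    parts.append(text_for_witness(reading, witness))
                    break
        else:
            parts.append(text_for_witness(child, witness))
        parts.append(child.tail or "")
    return "".join(parts)


print("".join(LINE.itertext()))     # tags simply hidden
print(text_for_witness(LINE, "B"))  # one witness, properly interpreted
```

Hiding the tags runs all three readings together; only an interpretation step that knows what <app> means yields a readable text.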
One of the consequences of this model is that in many cases attribute
values can hold text of much the same type as the element content. As
a result, no meta information about this text can be given.
Specifically, none of the following can be expressed for the content
of attribute values:
* language identification
* linguistic annotation
* segmentation
* linking
* expansions
* corrections
* character annotation
This is just a list of things that came to my mind; there might be
more that I have not thought of yet. It shows clearly, however, that
the decision to place a piece of text in an attribute value rather
than in the content of an element has rather severe consequences for
what can later be done with that text.
While the character encoding group has mainly focused on the fact that
any markup construct that we might think of for representing
characters cannot live in an attribute value, the fact that the
language of text in attribute values cannot be specified has even
more severe consequences for proper text processing. Language
identification of attribute values has been ambiguous in P4 and is
one of the major areas of friction with the xml:lang attribute used in
other XML specifications.
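The following Python sketch illustrates the language problem. The sample sentence is invented, and language_of is a hypothetical helper, not an existing API. As I read the XML 1.0 specification, a language declared with xml:lang covers the attribute values of the element as well as its content, so the Latin expansion below can only be treated as English.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# English prose; the expansion held in the attribute is Latin.
DOC = ET.fromstring(
    '<p xml:lang="en">See the notes, '
    '<abbr expan="exempli gratia">e.g.</abbr> the first.</p>'
)


def language_of(elem, root):
    """Nearest xml:lang on elem or one of its ancestors, else None."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    while elem is not None:
        if XML_LANG in elem.attrib:
            return elem.attrib[XML_LANG]
        elem = parent_of.get(elem)
    return None


abbr = DOC.find("abbr")
print(language_of(abbr, DOC))   # the language in scope for <abbr>
print(repr(abbr.get("expan")))  # a bare string with nothing to annotate
```

The attribute value is an unstructured string: there is no place to record that it is Latin, nor to attach any of the other annotations listed above.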
Towards a new understanding of text and markup
Practical experience with markup over the past 15 years has gradually
led to a different view of text and markup. Text and markup cannot be
separated as easily as the attempt to recover the `text under the
markup' suggests; rather, they form a functional unit that only
through proper interpretation (for example through a stylesheet or any
other transformation) makes the text available to the reader for
inspection. Given this new understanding, the above-mentioned lack of
annotation for attribute values could easily be avoided by changing
the whole encoding scheme so that any place that might hold textual
information is allocated to element content, and attribute values are
reserved for token-like meta information about textual items.
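As a rough sketch of what such a change could look like for a single case, the Python function below moves the text of an attribute into element content. The function name promote_attribute and the wrapper element <choice> are assumptions made for this illustration only; they are not taken from P4, and the sketch handles only elements without child elements.

```python
import xml.etree.ElementTree as ET


def promote_attribute(elem, attr, wrapper="choice"):
    """Rebuild elem so that the text held in `attr` becomes element content.

    The wrapper name is hypothetical; only simple elements (text, no
    children) are handled in this sketch.
    """
    new = ET.Element(wrapper)
    original = ET.SubElement(new, elem.tag)
    original.text = elem.text
    alternative = ET.SubElement(new, attr)  # attribute name becomes a tag
    alternative.text = elem.get(attr)
    new.tail = elem.tail
    return new


old = ET.fromstring('<sic corr="their">thier</sic>')
new = promote_attribute(old, "corr")
print(ET.tostring(new, encoding="unicode"))
```

Once both versions are element content, neither has priority over the other, and each can carry its own language identification, linking or further markup, while attributes remain free for token-like values.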
This would require a major overhaul of the whole TEI encoding scheme
and introduce a wealth of incompatibilities with P4. However, if the
argument above is accepted as valid, this will be a necessary step,
and now would be the time to take it.
_________________________________________________________________
Date: 2003-03-11, Author: Christian Wittern.
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN