Text in attribute values in the TEI encoding scheme

Christian Wittern wittern at kanji.zinbun.kyoto-u.ac.jp
Mon Mar 10 19:44:34 EST 2003



Dear council members,

I would like to re-open the discussion of text in attribute values by
formulating my position more extensively below.  I propose to first
discuss this here and if necessary turn it into an agenda item for the
conference call next week.

All the best,

Christian

Text in attribute values in the TEI encoding scheme

    Author: Christian Wittern
      Date:    2003-03-11
     _________________________________________________________________

The problem

   The underlying model of the TEI encoding scheme up to P4 has been that
   of some `Urtext' plus markup. If somebody wanted to see the Urtext,
   all that was needed, so it seemed, was to strip off the tags.
   Conveniently, Emacs' PSGML even has a function for doing exactly this,
   hiding the tags and presenting only the textual content. For exactly
   this reason, redundancy has been built in for things like the <corr>
   and <sic> elements, which express the same information content in
   two complementary ways. This places an unnecessary burden on
   implementors and forces the encoder to give priority to one view of
   the text or the other. There is no way to state a fact impartially;
   one has to be either pro or con.
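
   To make the redundancy concrete, here is a minimal sketch in Python.
   It assumes the P4 convention in which <sic> carries the correction in
   a corr attribute and <corr> carries the original in a sic attribute;
   the sample sentence and the helper strip_tags are hypothetical and
   serve only as illustration. Stripping the tags yields a different
   `Urtext' depending on which of the two elements the encoder chose.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the same fact encoded in the two complementary ways.
SIC_VIEW = '<p>The <sic corr="quick">qiuck</sic> fox</p>'
CORR_VIEW = '<p>The <corr sic="qiuck">quick</corr> fox</p>'


def strip_tags(xml_text):
    """Return only the character data; tags and attribute values are dropped."""
    return "".join(ET.fromstring(xml_text).itertext())


sic_text = strip_tags(SIC_VIEW)    # the uncorrected reading survives
corr_text = strip_tags(CORR_VIEW)  # the corrected reading survives
print(sic_text)
print(corr_text)
```

   In both cases the text held in the attribute value disappears
   together with the tags.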

   The only exception I have encountered so far in the TEI encoding
   scheme is text-critical encoding, where lemma and readings are (at
   least in the parallel segmentation method used inline) apposed at the
   same level, providing different paths through the text. No mythical
   `Urtext' can be discovered by simply hiding the tags; only a very
   confusing array of text fragments would be left over for
   inspection.
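
   The following sketch illustrates this with a hypothetical line
   encoded by parallel segmentation. The witness sigla A and B, the
   sample text, and the helper witness_text are assumptions made for the
   example. Hiding the tags runs the readings together, while following
   one witness gives one coherent path through the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical line with one variant, encoded by parallel segmentation.
FRAGMENT = (
    '<l>The <app>'
    '<lem wit="A">quick</lem>'
    '<rdg wit="B">swift</rdg>'
    '</app> fox</l>'
)


def strip_tags(xml_text):
    """Return all character data in document order, ignoring the markup."""
    return "".join(ET.fromstring(xml_text).itertext())


def witness_text(elem, wit):
    """Collect the text of one witness, skipping readings it does not attest."""
    parts = [elem.text or ""]
    for child in elem:
        is_reading = child.tag in ("lem", "rdg")
        if not is_reading or wit in child.get("wit", "").split():
            parts.append(witness_text(child, wit))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
stripped = strip_tags(FRAGMENT)
text_a = witness_text(root, "A")
text_b = witness_text(root, "B")
print(stripped)
print(text_a)
print(text_b)
```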

   One of the consequences of this model is that in many cases attribute
   values can hold text of much the same type as the element content.
   No meta-information about this text can be attached to it, however.
   Specifically,
     * language identification
     * linguistic annotation
     * segmentation
     * linking
     * expansions
     * corrections
     * character annotation

   cannot be expressed for the content of attribute values. This is just
   a list of things that came to my mind; there may be more that I have
   not thought of yet. It clearly shows, however, that the decision to
   place a piece of text in an attribute value rather than in the content
   of an element has rather severe consequences for what can later be
   done with that text.
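
   The structural reason is that an XML attribute value is a flat
   string: it cannot contain elements, so none of the annotations listed
   above has anywhere to attach. The sketch below checks this with
   Python's standard XML parser. The fragments and the helper name
   is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Report whether the parser accepts the fragment as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Markup inside an attribute value: '<' is not allowed there.
NESTED = '<sic corr="<foreign>vici</foreign>">vicy</sic>'
# Escaping the angle brackets parses, but yields characters, not elements.
ESCAPED = '<sic corr="&lt;foreign&gt;vici&lt;/foreign&gt;">vicy</sic>'

nested_ok = is_well_formed(NESTED)
escaped = ET.fromstring(ESCAPED)
corr_value = escaped.get("corr")
child_count = len(list(escaped))
print(nested_ok, repr(corr_value), child_count)
```

   The escaped form is accepted, but the value is an ordinary string
   with no child elements that could carry a language or a link.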

   While the character encoding group has mainly focused on the fact that
   any markup construct that we might think of for representing
   characters cannot live in an attribute value, the fact that the
   language of text in attribute values cannot be specified has even
   more severe consequences for proper text processing. Language
   identification of attribute values has been ambiguous in P4 and is
   one of the major areas of friction with the xml:lang attribute used in
   other XML specifications.

Towards a new understanding of text and markup

   Practical experience with markup over the past 15 years has gradually
   led to a different view of text and markup. Text and markup cannot be
   separated as easily as seems possible when trying to recover the
   `text under the markup'; rather, they form a functional unit that only
   through proper interpretation (for example through a stylesheet or any
   other transformation) makes the text available to the reader for
   inspection. Given this new understanding, the above-mentioned lack of
   annotation for attribute values could easily be avoided by changing
   the whole encoding scheme so that any place that might hold textual
   information is allocated to element content, and attribute values are
   reserved for token-like meta-information about textual items.
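
   As an illustration only, the sketch below contrasts the attribute
   form with one possible element form. The wrapper element <choice> and
   the sample sentence are assumptions made for the example and are not
   part of P4; the helper text_with_lang is likewise hypothetical. Once
   the correction is element content, it can carry its own xml:lang
   token like any other piece of text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Attribute form: the correction is an attribute value.
ATTR_FORM = '<p xml:lang="en">She said <sic corr="bonjour">bonjuor</sic>.</p>'
# Element form (hypothetical): both versions are element content.
ELEM_FORM = (
    '<p xml:lang="en">She said <choice>'
    '<sic xml:lang="fr">bonjuor</sic>'
    '<corr xml:lang="fr">bonjour</corr>'
    '</choice>.</p>'
)


def text_with_lang(elem, inherited=None):
    """Yield (language, text) pairs for element content, inheriting xml:lang."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text:
        yield lang, elem.text
    for child in elem:
        yield from text_with_lang(child, lang)
        if child.tail:
            # Text after a child is in the language of the enclosing element.
            yield lang, child.tail


attr_pairs = list(text_with_lang(ET.fromstring(ATTR_FORM)))
elem_pairs = list(text_with_lang(ET.fromstring(ELEM_FORM)))
print(attr_pairs)
print(elem_pairs)
```

   In the attribute form the correction never appears among the
   language-tagged text; in the element form it does, tagged as French.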

   This would require a major overhaul of the whole TEI encoding scheme
   and introduce a wealth of incompatibilities with P4. However, if this
   is indeed considered a valid concern, it will be a necessary step,
   and now would be the time to take it.
     _________________________________________________________________



-- 

 Christian Wittern 
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN



