on spec grp 4, coded values (was "Re: [tei-council] datatypes")

Sun Sep 18 23:30:50 EDT 2005

[Note: despite the In-Reply-To field, this commentary is based on the
 new version Lou just put out at
 http://www.tei-c.org.uk/Drafts/DTYPES/ (the main site is down).]

On Specification group 4: Datatypes: coded values
-- ------------- ----- -- ---------- ----- ------

* tei.data.code: the change permits any pointer; the recommendation
  in ED W 90 is for local pointers only. We've talked about this
  somewhat over the past few weeks or so, but IIRC, no one has said
  anything on this issue except Lou and Syd. Lou has said (at least
  twice) that he thinks it should be a generic pointer, but has not
  explained why.

  I think there are several good reasons to restrict these kinds of
  references to local pointers.

  - It mimics what P4 had.

  - It provides a system (as did ID/IDREF) where the attribute value
    itself may be used as a code for a useful semantic distinction
    (thus the name). E.g., new= and old= of <handShift> -- the real
    details are provided by the <hand> to which they point, but
    mnemonic values like "#Scribe1brown" and "#Scribe2red" can convey
    a lot of meaning by themselves, or at least be easy for humans to
    associate with the appropriate <hand>.

  - By changing the values of the xml:id= attributes of the elements
    pointed at (e.g., <hand>), the encoder gets to create the
    vocabulary used.

  - People want to know where things point. (Or go -- some of the
    bigger complaints about P4 are about places where it says that
    something should be documented "in the TEI header" without saying
    where in the TEI header.)

    If we stick to local pointers, we can define the place or places
    (likely in the TEI header) where each tei.data.code attribute is
    supposed to point, document it accordingly, and perhaps write a
    Schematron rule to verify that it does so.

    If we permit any pointer, documenting where these things point
    becomes somewhat harder. E.g., if the new= of <handShift> points
    to an external file, does it still need to point at a <tei:hand>?
    Presumably so. But unless the external file is another encoded
    document that happened to be written by the same scribes, it is
    likely to be just a project repository for <hand> elements. What
    goes in the <text> element of that document? We also need to
    rewrite the prose so that it is clear that the <hand> in one TEI
    document may well be documenting the handwriting in another.

  In either case (whether tei.data.code is declared as a pointer or a
  local pointer), a user could easily switch to the other by
  modifying her customization ODD file. I think we get a more
  coherent system (and an easier job for ourselves, to boot) if we
  provide a local pointer mechanism; users who change it to general
  pointers will know up front that they are making an extension that
  may change some of the semantics of some bits of the Guidelines.
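
  To make the "verify that it does so" idea concrete, here is a rough
  sketch in Python rather than Schematron. It assumes that new= and
  old= of <handShift> must be of the form "#id" and must resolve to a
  <hand> in the same document. The function names and the sample
  document are made up for illustration, and the sample is
  deliberately stripped down (it is not valid TEI).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def check_hand_pointers(xml_text):
    """Return a list of problems with new=/old= on <handShift>."""
    root = ET.fromstring(xml_text)
    # Map each xml:id in the document to the local name of its element.
    ids = {el.get(XML_ID): local_name(el.tag)
           for el in root.iter() if el.get(XML_ID) is not None}
    problems = []
    for el in root.iter():
        if local_name(el.tag) != "handShift":
            continue
        for attr in ("new", "old"):
            value = el.get(attr)
            if value is None:
                continue
            if not value.startswith("#") or len(value) == 1:
                problems.append(f"{attr}={value!r} is not a local pointer")
            elif ids.get(value[1:]) != "hand":
                problems.append(f"{attr}={value!r} does not point at a <hand>")
    return problems

# Hypothetical, simplified sample document (an assumption, not real TEI).
SAMPLE = """<TEI>
  <teiHeader>
    <hand xml:id="Scribe1brown"/>
    <hand xml:id="Scribe2red"/>
  </teiHeader>
  <text>
    <p>... <handShift old="#Scribe1brown" new="#Scribe2red"/> ...
       <handShift new="hands.xml#Scribe3"/>
       <handShift new="#Scribe4"/></p>
  </text>
</TEI>"""

for problem in check_hand_pointers(SAMPLE):
    print(problem)
```

  With general pointers the second check could not be done by looking
  at one document alone, which is part of why documenting (and
  testing) the target is harder in that case.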

* tei.data.enumerated: the change permits whitespace in the values.
  I'm not sure about this. For consistency with tei.data.name and
  tei.data.code (as proposed in ED W 90), I think whitespace should
  not be permitted. Especially in cases of "open" valLists,
  disallowing whitespace would help users understand that the value
  should be a code, not a paragraph. On the other hand, is there any
  really strong reason to insist on things like
    <lg type="couplet-alexandrine">
  instead of
    <lg type="Alexandrine couplet">?
  Even if we use token without the whitespace restriction, for
  consistency it should probably be xsd:token.

* tei.data.key: the change is from 'xsd:token' to 'rng:string'. I
  think it should be left as 'xsd:token' just for consistency. There
  is no difference in validation at all (since there are no
  enumerated values).
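
  To illustrate why validation is unaffected: with no enumerated
  values, either datatype accepts every attribute value, so the
  choice only shows up once a value is compared against something. A
  minimal Python sketch, assuming the usual whitespace collapsing for
  token (the function names and the sample value are made up for
  illustration):

```python
import re

def collapse(value):
    """Whitespace handling of token: tabs and newlines become spaces,
    runs of spaces collapse to one, leading/trailing spaces are dropped."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")

def matches_listed(value, listed, normalize):
    """True if value equals one of the listed values; when normalize is
    set, both sides are whitespace-collapsed before comparison."""
    if normalize:
        return collapse(value) in {collapse(v) for v in listed}
    return value in listed

raw = "  scribe \n one "
print(repr(collapse(raw)))
# string-like comparison: the raw value does not match the listed one
print(matches_listed(raw, ["scribe one"], normalize=False))
# token-like comparison: it matches after collapsing
print(matches_listed(raw, ["scribe one"], normalize=True))
```

  Since tei.data.key has no list of values to compare against, the
  two comparisons never come into play there.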

* tei.data.name: the change is from any string that does not contain
  whitespace (but may include, e.g., punctuation marks, currency
  symbols, math symbols, etc.) to an XML NMTOKEN. I am not sure why
  we'd want to exclude the non-letter, non-digit characters (other
  than .-_:, which are permitted in NMTOKEN). Why shouldn't the
  Tibetan Paluta character be allowed?
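
  A small Python sketch of what the change excludes. The second
  function is only a rough stand-in for NMTOKEN, built on Unicode
  general categories (letters, marks, decimal digits, plus . - _ :);
  the real production in the XML spec is defined by explicit
  character ranges and differs in details, so treat this as an
  approximation. The function names are made up for illustration.

```python
import unicodedata

def is_no_whitespace_string(value):
    """Earlier definition: at least one character, none of them
    XML whitespace (space, tab, newline, carriage return)."""
    return len(value) > 0 and not any(ch in " \t\n\r" for ch in value)

def is_roughly_nmtoken(value):
    """Approximation of NMTOKEN (an assumption, not the spec's exact
    character ranges): letters, combining marks, decimal digits,
    and the four punctuation characters . - _ :"""
    def ok(ch):
        cat = unicodedata.category(ch)
        return ch in ".-_:" or cat[0] in "LM" or cat == "Nd"
    return len(value) > 0 and all(ok(ch) for ch in value)

samples = ["couplet-alexandrine", "$5.00", "a+b", "\u0f85", "two words"]
for s in samples:
    print(repr(s), is_no_whitespace_string(s), is_roughly_nmtoken(s))
```

  The currency, math, and Tibetan punctuation samples pass the first
  test but not the second, which is the set of values the change
  would rule out.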

* tei.data.names: the change is from multiple occurrences of
  whitespace-delimited tokens to only one. Despite your claim, Lou,
  this really must be an error. I bet you meant 
     tei.data.names = list { tei.data.name+ }
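
  For anyone unsure what the list version buys us, here is a minimal
  Python sketch of the intended behaviour: split the attribute value
  on whitespace and require one or more tokens, each of which is a
  valid single name. The is_name() test used here (non-empty, no
  whitespace) is just a placeholder assumption for whatever
  tei.data.name ends up being.

```python
import re

def is_name(token):
    """Placeholder for tei.data.name (an assumption): any non-empty
    string without XML whitespace."""
    return len(token) > 0 and not re.search(r"[ \t\n\r]", token)

def is_names(value):
    """Mimic list { name+ }: split on whitespace, then require at
    least one token and every token to be a valid name."""
    tokens = [t for t in re.split(r"[ \t\n\r]+", value) if t]
    return len(tokens) >= 1 and all(is_name(t) for t in tokens)

for value in ["Scribe1brown", "Scribe1brown Scribe2red", "  a \n b  ", "", "   "]:
    print(repr(value), is_names(value))
```

  Without the list wrapper, only the single-token values would be
  accepted, which is why I think the draft definition is a slip.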

* tei.data.name(s): the change is from the name "token(s)" to
  "name(s)". I dunno. Feels very much like out of the frying pan into
  the fire. While "token" is confusing because W3C mis-named their
  datatype, "name" is confusing because in fact these values are most
  likely not proper nouns at all, and have nothing to do with the
  <name> element, the 'names' class, or the 'naming' class.
  How about
  - term
  - sign
  - TOKEN
  - character-sequence
  - character-sequence-sans-whitespace
  - charSeq
  - noWScharSeq
  - datum ('data' for plural)
  - the Latin or French word for "token"
  Oh dear. I'm almost ready to live with the confusion over "token".