on spec grp 4, coded values (was "Re: [tei-council] datatypes")

Mon Sep 19 10:43:46 EDT 2005

Syd Bauman wrote:
> [Note: despite the In-Reply-To field, this commentary is based on the
>  new version Lou just put out at
>  http://www.tei-c.org.uk/Drafts/DTYPES/ (the main site is down).]
> 
> On Specification group 4: Datatypes: coded values
> -- ------------- ----- -- ---------- ----- ------
> 
> * tei.data.code: the change permits any pointer; the recommendation
>   in ED W 90 is for local pointers only. We've talked about this
>   somewhat over the past few weeks or so, but IIRC, no one has said
>   anything on this issue except Lou and Syd. Lou has said (at least
>   twice) that he thinks it should be a generic pointer, but has not
>   expressed why.

It seems to me self-evident that once we changed from using IDREF to 
URIref, which we did long ago, the distinction between "local" and 
"generic" pointer became meaningless.

> 
>   I think there are several good reasons to restrict these kinds of
>   references to local pointers.
> 
>   - It mimics what P4 had.
> 

Not a good reason.

>   - It provides a system (as did ID/IDREF) where the attribute value
>     itself may be used as a code for a useful semantic distinction
>     (thus the name). E.g., new= and old= of <handShift> -- the real
>     details are provided by the <hand> to which they point, but
>     mnemonic values like "#Scribe1brown" and "#Scribe2red" can convey
>     a lot of meaning by themselves, or at least be easy for humans to
>     associate with the appropriate <hand>.
> 

You can still do that if you choose to.

>   - By changing the values of the xml:id= attributes of the elements
>     pointed at (e.g., <hand>), the encoder gets to create the
>     vocabulary used.

You can still do that too.

> 
>   - People want to know where things point. (Or go -- some of the
>     bigger complaints about P4 are about places where it says that
>     something should be documented "in the TEI header" without saying
>     where in the TEI header.)
> 

True, but irrelevant.

>     If we stick to local pointers, we can define the place or places
>     (likely in the TEI header) to where each tei.data.code attribute
>     is supposed to point, document it accordingly, and perhaps write
>     a Schematron rule to verify that it does so.
> 

We can specify the target element type, and maybe we should do so in 
some cases, but I don't see any advantage at all to trying to specify 
"where" the element instance is. It's all out there in cyberspace, man!

>     If we permit any pointer, documentation where these things point
>     becomes somewhat harder. E.g., if the new= of <handShift> points
>     to an external file, does it still need to point at a <tei:hand>?

Yes, obviously.

>     Presumably so. But unless the external file is another encoded
>     document that happened to be written by the same scribes, it is
>     likely to be just a project repository for <hand> elements. What
>     goes in the <text> element of that document? We also need to
>     rewrite the prose so that it is clear that the <hand> in one TEI
>     document may well be documenting the handwriting in another.
> 

The thing that <handShift> points to must be a <hand> element. It doesn't 
matter where that <hand> element is located, or what else is in the 
document containing it.
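
To make that concrete, here is a minimal sketch in Python of the check I have in mind. It is an illustration only: the function name points_at_hand, the `documents` mapping (document URI to parsed root element) and the file names in the usage note are invented for the example, the TEI namespace URI is assumed, and only bare-name fragments are handled, with no relative URI resolution.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the TEI one is an assumption for this sketch.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def points_at_hand(ref, current_uri, documents):
    """Return True if `ref` resolves to a <hand> element.

    `documents` is a hypothetical mapping from document URI to a parsed
    root element. Where the target document lives plays no part in the
    test; only the type of the element pointed at does.
    """
    doc_uri, _, fragment = ref.partition("#")
    # An empty document part means "the current document".
    root = documents.get(doc_uri or current_uri)
    if root is None or not fragment:
        return False
    for element in root.iter():
        if element.get(XML_ID) == fragment:
            return element.tag == "{%s}hand" % TEI_NS
    return False
```

With two parsed documents registered as "edition.xml" and "hands.xml", both "#Scribe1brown" (same document) and "hands.xml#Scribe2red" (a separate repository of hands) pass the same test, while a pointer to a <p> fails it.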

>   In either case (whether tei.data.code is declared as a pointer or a
>   local pointer), a user could easily switch to the other by
>   modifying her customization ODD file. I think it will provide a
>   more coherent system (and be easier for us to boot) if we provide a
>   local pointer mechanism; users who change it to general pointers
>   will know up front they are making an extension that may change
>   some of the semantics of some bits of the Guidelines.
> 

I don't know of any way of defining a local pointer, other than to say 
that the URL reference must begin with a # -- in which case you break 
things for the obsessive person who includes the full URL even when the 
pointer is to somewhere in the current document.
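
A rough sketch of the two possible tests, in Python, may make the problem clearer. The function names and the example.org URL are invented for the illustration; urljoin and urldefrag are from the standard urllib.parse module.

```python
from urllib.parse import urldefrag, urljoin


def is_local_by_syntax(ref):
    """The only purely syntactic test available: starts with '#'."""
    return ref.startswith("#")


def is_local_by_resolution(ref, base_uri):
    """Resolve `ref` against the current document's URI, then compare.

    This accepts the full-URL form too, but it needs to know the base
    URI, which a schema datatype has no access to.
    """
    target, fragment = urldefrag(urljoin(base_uri, ref))
    return bool(fragment) and target == urldefrag(base_uri)[0]
```

For a document at http://example.org/doc.xml, the reference "http://example.org/doc.xml#Scribe1brown" fails the syntactic test but passes the resolution test, which is exactly the obsessive-person case.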

> * tei.data.enumerated: the change permits whitespace in the values.
>   I'm not sure about this. For consistency with tei.data.name and
>   tei.data.code (as proposed in ED W 90), I think whitespace should
>   not be permitted. Especially in cases of "open" valLists,
>   disallowing whitespace would help users understand that the value
>   should be a code, not a paragraph. On the other hand, is there any
>   really strong reason to insist on things like 
>     <lg type="couplet-alexandrine"> 
>   instead of
>     <lg type="Alexandrine couplet">?
>   Even if we use token w/o the restriction, for consistency it should
>   probably be xsd:token.

I am open minded (or open-minded) on this question. I finally came down 
on the side of allowing *normalised* whitespace within the value because 
(a) there are quite a few real-life examples in P4
(b) I assumed that whoever defined xsd:token that way was probably not a 
complete idiot.

There are quite a few things which are linguistically single tokens but 
which include a space, all right?
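
By "normalised" I mean roughly the following, sketched here in Python. This is an illustration of whitespace collapsing as I understand it for xsd:token, not a validator; the function names are invented for the example.

```python
import re

# XML whitespace: space, tab, line feed, carriage return.
_XML_WS = re.compile(r"[ \t\n\r]+")


def collapse(value):
    """Replace each run of XML whitespace with one space, then trim."""
    return _XML_WS.sub(" ", value).strip(" ")


def is_normalised(value):
    """True if collapsing whitespace leaves the value unchanged."""
    return value == collapse(value)


def has_no_whitespace(value):
    """The stricter alternative: no whitespace anywhere in the value."""
    return _XML_WS.search(value) is None
```

Under this reading "Alexandrine couplet" is already normalised and so acceptable, whereas the stricter no-whitespace rule would accept only something like "couplet-alexandrine".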

> * tei.data.key: the change is from 'xsd:token' to 'rng:string'. I
>   think it should be left as 'xsd:token' just for consistency. There
>   is no difference in validation at all (since there are no
>   enumerated values).

There is all the difference in the world. key is *explicitly* defined as 
something that is to be validated externally and over which the TEI 
should therefore place no (additional) syntactic constraints. We cannot 
possibly second guess what syntactic constraints every database system 
in the world might impose, ergo our only choice is to not impose any at all.

> 
> * tei.data.name: the change is from any string that does not contain
>   whitespace (but may include, e.g., punctuation marks, currency
>   symbols, math symbols, etc.) to an XML NMTOKEN. I am not sure why
>   we'd want to exclude the non-letter, non-digit characters (other
>   than .-_:, which are permitted in NMTOKEN). Why shouldn't the
>   Tibetan Paluta character be allowed?

I assume the latter is not a serious suggestion. I thought the reason 
was fairly obvious: ease of (XML) processing.

> 
> * tei.data.names: the change is from multiple occurrences of
>   whitespace-delimited tokens to only one. Despite your claim, Lou,
>   this really must be an error. I bet you meant 
>      tei.data.names = list { tei.data.name+ }
> 

Yes, sorry. It is right in CVS...
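
For anyone not fluent in the compact syntax, the intended pattern (one or more whitespace-separated values, each a valid tei.data.name) can be sketched in Python as below. The function is_names is invented for the illustration, and the default check on each item is a placeholder that accepts any non-empty token; it is not the real definition of tei.data.name.

```python
import re

# XML whitespace: space, tab, line feed, carriage return.
_XML_WS = re.compile(r"[ \t\n\r]+")


def is_names(value, is_name=bool):
    """Mimic `list { tei.data.name+ }`.

    Split the value on XML whitespace, require at least one token, and
    require every token to satisfy `is_name`. The default `is_name` is
    a placeholder, not the actual tei.data.name datatype.
    """
    tokens = [t for t in _XML_WS.split(value) if t]
    return len(tokens) >= 1 and all(is_name(t) for t in tokens)
```

The erroneous version corresponds to accepting exactly one token; the "+" is what allows "a b c" as well as "a", while still rejecting an empty value.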

> * tei.data.name(s): the change is from the name "token(s)" to
>   "name(s)". I dunno. Feels very much like out of the frying pan into
>   the fire. While "token" is confusing because W3C mis-named their
>   datatype, "name" is confusing because in fact these values are most
>   likely not proper nouns at all, and have nothing to do with the
>   <name> element, the 'names' class, or the 'naming' class.
>   How about
>   - term
>   - sign
>   - TOKEN
>   - character-sequence
>   - character-sequence-sans-whitespace
>   - charSeq
>   - noWScharSeq
>   - datum ('data' for plural)
>   - the Latin or French word for "token"
>   Oh dear. I'm almost ready to live with the confusion over "token".
> 

We did discuss this a bit on the list and nobody came up with a better 
suggestion. I think using "token" would really be asking for confusion 
-- precisely because we do mean something different from a datatype 
which the W3C calls "token" -- whether it's in caps or not. I also 
considered "ident" and "label". The key thing about it, surely though, 
is that it is a way of naming something, even if it's not a proper name?

We can iron this one out maybe later. At least we agree on what it is, 
and until someone comes up with a better token, let's call it a name!