[tei-council] datatype issues (part 1) continued,,,

Tue Sep 13 05:49:19 EDT 2005

Syd Bauman wrote:

>>5. tei.data.regexp is used only in two rather obscure places: do we
>>need it? 
> 
> 
> I don't think we need it, although again, it may be a useful place to
> put an explanation. 
> 

I propose to change this to tei.data.formula which maps to xsd:token

The only places it is used at present are attributes like metDecl 
which have as their value a string of gobbledegook in some special 
syntax defined by the TEI. This seems a useful category of information. 
Calling it regexp suggests to me that it is a (Unix style) regular 
expression, which it isn't necessarily -- though obviously one could 
define a reg exp that matched any such formula!
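
For what it is worth, mapping to xsd:token means the value is whitespace-collapsed before anything else looks at it, so single internal spaces survive but runs, tabs and line breaks do not. A minimal sketch of that rule follows; the function name and the sample formula string are made up for illustration and are not taken from the Guidelines.

```python
import re

def collapse_xsd_token(value: str) -> str:
    """Apply the XML Schema 'collapse' whitespace rule used by xsd:token."""
    # Step 1: tab, line feed and carriage return each become a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    # Step 2: runs of spaces shrink to one; leading/trailing spaces go.
    return re.sub(r" +", " ", replaced).strip(" ")

# Hypothetical formula-like attribute value, split over two lines.
print(repr(collapse_xsd_token("  ((SU|ST)*|(US|TS)*)\n   S  ")))
```

So a formula typed this way may be broken across lines in the source file without changing its value, which is presumably what we want for this kind of string.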

Whether or not an attribute the value of which *was* a Unix regular 
expression should use this datatype is a bridge I would prefer to cross 
when we actually have such an attribute. I would have thought it would 
be much better as content anyway.

> 
>>8. In earlier discussion I had proposed that tei.data.token should
>>differ from rng:token in that the former should not permit included
>>whitespace. Thinking about this again, I think I might have been
>>wrong: it might be less confusing to use <rng:token> directly
>>wherever we want a "tei.data.token", thus allowing people to use
>>XML whitespace normalization in attribute values in the same way as
>>they can in content.
> 
> 
> There is no XML whitespace normalization of any content in TEI, yet,
> is there? When we're done straightening out the classes and stuff,
> there may be one or two obscure places where it is useful.
> 
> 
>>If we do define tei.data.token as proposed (i.e. as an xsd:token
>>with a facet saying that whitespace is not allowed), we should
>>really give it a different name, or expect to spend the rest of
>>eternity explaining why our usage differs from W3C and RNG's (ok,
>>we were there first, but still).
> 
> 
> I think a "no internal whitespace" restriction is a really good thing
> to have[1]. But I think you are absolutely right, we should change
> the name. It's not our fault that W3C and RelaxNG deliberately use
> the term "token" in a manner that is counter-intuitive to end users
> (although perhaps makes sense to those writing validators).
> Nonetheless, if we use the same term in the more normal way, we are
> dooming users to even more confusion. Problem is, it's hard to come
> up with an alternative. How about tei.data.term?
> 

Not bad, but we do use "term" rather a lot in slightly different ways 
elsewhere in the Guidelines, and, crucially, in my book a "term" is 
taken from a human language not an artificial one. So my proposal is now

1. rename tei.data.token as tei.data.ident, mapping it to NMTOKEN.
2. tei.data.tokens is a list of tei.data.ident
3. tei.data.enumerated is a ref to a tei.data.ident
4. tei.data.key is mapped to NCName

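I hesitated a long time over NMTOKEN and NCName. Both allow hyphens and 
underscores; the former may also contain colons and may begin with a 
digit, hyphen, or full stop, while the latter must begin with a letter 
or underscore and may not contain a colon. Syd's proposed pattern allows 
hyphen and underscore and also comma. I am open to 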
persuasion that both tei.data.key and tei.data.ident should have the 
same mapping; less to defining something which is not either NMTOKEN or 
NCName.
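
To see concretely what each mapping would accept, here is a rough sketch that tests a few candidate values against both. The two patterns are ASCII-only approximations of the XML productions (the real ones also admit many non-ASCII letters, combining characters and extenders), and the sample values are invented for illustration.

```python
import re

# ASCII-only approximations; NOT the full XML 1.0 character classes.
# NMTOKEN: one or more name characters, any of which may come first.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")
# NCName: starts with a letter or underscore, and never contains a colon.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_nmtoken(s: str) -> bool:
    return NMTOKEN.fullmatch(s) is not None

def is_ncname(s: str) -> bool:
    return NCNAME.fullmatch(s) is not None

# Hypothetical attribute values, chosen to hit the edge cases.
for sample in ["line-group", "line_group", "2nd", "-x", "tei:key", "a,b", "a b"]:
    print(f"{sample!r:14} NMTOKEN={is_nmtoken(sample)!s:5} NCName={is_ncname(sample)}")
```

Neither mapping accepts a comma or internal whitespace, so a pattern that admits comma would indeed be a third thing, neither NMTOKEN nor NCName.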

The distinction between tei.data.key and tei.data.ident is that the 
latter need not actually map onto anything anywhere, it's just a name.
So in order of descending tightness of constraint, we have

tei.data.enumerated : the value is defined by a valList (type=closed)
tei.data.code : the value is defined by a pointer to something which 
must exist
tei.data.key : the value is defined by an enumeration elsewhere e.g. a 
database key
tei.data.ident : the value is a name or identifier of some kind but not 
necessarily enumerated or enumerable

Does that make sense?