on spec grp 4, coded values (was "Re: [tei-council] datatypes")
Syd Bauman
Syd_Bauman at Brown.edu
Sun Sep 18 23:30:50 EDT 2005
[Note: despite the In-Reply-To field, this commentary is based on the
new version Lou just put out at
http://www.tei-c.org.uk/Drafts/DTYPES/ (the main site is down).]
On Specification group 4: Datatypes: coded values
-- ------------- ----- -- ---------- ----- ------
* tei.data.code: the change permits any pointer; the recommendation
in ED W 90 is for local pointers only. We've talked about this
somewhat over the past few weeks or so, but IIRC, no one has said
anything on this issue except Lou and Syd. Lou has said (at least
twice) that he thinks it should be a generic pointer, but has not
explained why.
I think there are several good reasons to restrict these kinds of
references to local pointers.
- It mimics what P4 had.
- It provides a system (as did ID/IDREF) where the attribute value
itself may be used as a code for a useful semantic distinction
(thus the name). E.g., new= and old= of <handShift> -- the real
details are provided by the <hand> to which they point, but
mnemonic values like "#Scribe1brown" and "#Scribe2red" can convey
a lot of meaning by themselves, or at least be easy for humans to
associate with the appropriate <hand>.
- By changing the values of the xml:id= attributes of the elements
pointed at (e.g., <hand>), the encoder gets to create the
vocabulary used.
- People want to know where things point. (Or go -- some of the
bigger complaints about P4 are about places where it says that
something should be documented "in the TEI header" without saying
where in the TEI header.)
If we stick to local pointers, we can define the place or places
(likely in the TEI header) to which each tei.data.code attribute
is supposed to point, document it accordingly, and perhaps write
a Schematron rule to verify that it does so.
If we permit any pointer, documenting where these things point
becomes somewhat harder. E.g., if the new= of <handShift> points
to an external file, does it still need to point at a <tei:hand>?
Presumably so. But unless the external file is another encoded
document that happened to be written by the same scribes, it is
likely to be just a project repository for <hand> elements. What
goes in the <text> element of that document? We also need to
rewrite the prose so that it is clear that the <hand> in one TEI
document may well be documenting the handwriting in another.
In either case (whether tei.data.code is declared as a pointer or a
local pointer), a user could easily switch to the other by
modifying her customization ODD file. I think it will provide a
more coherent system (and be easier for us to boot) if we provide a
local pointer mechanism; users who change it to general pointers
will know up front they are making an extension that may change
some of the semantics of some bits of the Guidelines.
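To make the local-pointer check discussed above concrete, here is a minimal sketch in Python (not Schematron). Everything in it is an assumption made for illustration: the sample document is not valid TEI (no namespace, <hand> placed directly in the header), and the name check_hand_shifts is hypothetical. It only shows the kind of rule that becomes easy to state once new= and old= are restricted to local pointers.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predeclared XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified sample; not a valid TEI document.
SAMPLE = """<TEI>
  <teiHeader>
    <hand xml:id="Scribe1brown"/>
    <hand xml:id="Scribe2red"/>
  </teiHeader>
  <text>
    <handShift old="#Scribe1brown" new="#Scribe2red"/>
    <handShift new="hands.xml#Scribe3"/>
    <handShift new="#Scribe9"/>
  </text>
</TEI>"""


def check_hand_shifts(xml_text):
    """Return (attribute, value, reason) for each old=/new= that fails the rule."""
    root = ET.fromstring(xml_text)
    # Collect the xml:id of every <hand> in this same document.
    hands = {el.get(XML_ID) for el in root.iter("hand") if el.get(XML_ID)}
    problems = []
    for shift in root.iter("handShift"):
        for attr in ("old", "new"):
            value = shift.get(attr)
            if value is None:
                continue
            if not value.startswith("#"):
                # Anything other than "#id" points outside this document.
                problems.append((attr, value, "not a local pointer"))
            elif value[1:] not in hands:
                problems.append((attr, value, "no <hand> with that xml:id"))
    return problems


for problem in check_hand_shifts(SAMPLE):
    print(problem)
```

With general pointers the second branch cannot be checked from the document alone, since the target would have to be fetched and inspected.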
* tei.data.enumerated: the change permits whitespace in the values.
I'm not sure about this. For consistency with tei.data.name and
tei.data.code (as proposed in ED W 90), I think whitespace should
not be permitted. Especially in cases of "open" valLists,
disallowing whitespace would help users understand that the value
should be a code, not a paragraph. On the other hand, is there any
really strong reason to insist on things like
<lg type="couplet-alexandrine">
instead of
<lg type="Alexandrine couplet">?
Even if we allow whitespace (i.e., use token without the
no-whitespace restriction), for consistency it should probably be
declared as xsd:token.
* tei.data.key: the change is from 'xsd:token' to 'rng:string'. I
think it should be left as 'xsd:token' just for consistency. There
is no difference in validation at all (since there are no
enumerated values).
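The point about validation can be sketched in Python. This is an illustration under stated assumptions, not a validator: collapse() imitates the whitespace collapsing applied to token values, and is_valid() is a hypothetical helper. Without an enumeration every string is accepted under either datatype; the two only diverge when the value is compared against a list of allowed values.

```python
import re


def collapse(value):
    """Imitate 'collapse' whitespace handling: runs of space, tab, LF, CR
    become one space, and leading/trailing spaces are dropped."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def is_valid(value, datatype, allowed=None):
    """Hypothetical check: datatype is "token" or "string"."""
    if allowed is None:
        # No enumeration: any string is acceptable under both datatypes.
        return True
    candidate = collapse(value) if datatype == "token" else value
    return candidate in allowed


raw = "  Alexandrine   couplet\n"
print(is_valid(raw, "token"), is_valid(raw, "string"))
print(is_valid(raw, "token", {"Alexandrine couplet"}))
print(is_valid(raw, "string", {"Alexandrine couplet"}))
```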
* tei.data.name: the change is from any string that does not contain
whitespace (but may include, e.g., punctuation marks, currency
symbols, math symbols, etc.) to an XML NMTOKEN. I am not sure why
we'd want to exclude the non-letter, non-digit characters (other
than .-_:, which are permitted in NMTOKEN). Why shouldn't the
Tibetan Paluta character be allowed?
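A rough Python sketch of the difference between the two definitions. Note the loud assumption: ASCII_NMTOKEN is an ASCII-only simplification invented here for illustration; a real NMTOKEN also admits a large set of non-ASCII letters and digits, so this pattern is stricter than the real thing and should not be used for validation. It is only meant to show which kinds of values stop being legal.

```python
import re

# "Any string that does not contain whitespace" (space, tab, LF, CR).
NO_WHITESPACE = re.compile(r"[^ \t\n\r]+")

# ASSUMPTION: ASCII-only approximation of NMTOKEN (letters, digits, . - _ :).
ASCII_NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")


def classify(value):
    """Report which of the two (approximate) definitions accept the value."""
    return {
        "no_whitespace": NO_WHITESPACE.fullmatch(value) is not None,
        "ascii_nmtoken": ASCII_NMTOKEN.fullmatch(value) is not None,
    }


for sample in ("couplet-alexandrine", "$100", "a+b", "two words"):
    print(sample, classify(sample))
```

Values such as "$100" and "a+b" are accepted by the whitespace-free definition but rejected once the datatype is narrowed to name characters.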
* tei.data.names: the change is from multiple occurrences of
whitespace-delimited tokens to only one. Despite your claim, Lou,
this really must be an error. I bet you meant
tei.data.names = list { tei.data.name+ }
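What the list pattern is meant to accept can be sketched in Python as follows. This is an illustration only; split_names is a hypothetical helper, and it does not check the individual tokens against tei.data.name. It shows the "one or more whitespace-delimited tokens" behaviour, as opposed to a single token.

```python
import re

# Whitespace characters that delimit list items: space, tab, LF, CR.
XML_WHITESPACE = re.compile(r"[ \t\n\r]+")


def split_names(value):
    """Split an attribute value into one or more whitespace-delimited tokens."""
    tokens = [t for t in XML_WHITESPACE.split(value) if t]
    if not tokens:
        # The '+' in the pattern requires at least one item.
        raise ValueError("expected at least one name")
    return tokens


print(split_names("Scribe1brown Scribe2red"))
print(split_names("  solo\n"))
```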
* tei.data.name(s): the change is from the name "token(s)" to
"name(s)". I dunno. Feels very much like out of the frying pan into
the fire. While "token" is confusing because W3C mis-named their
datatype, "name" is confusing because in fact these values are most
likely not proper nouns at all, and have nothing to do with the
<name> element, the 'names' class, or the 'naming' class.
How about
- term
- sign
- TOKEN
- character-sequence
- character-sequence-sans-whitespace
- charSeq
- noWScharSeq
- datum ('data' for plural)
- the Latin or French word for "token"
Oh dear. I'm almost ready to live with the confusion over "token".