[tei-council] comments on edw90

Syd Bauman Syd_Bauman at Brown.edu
Wed Aug 31 16:49:11 EDT 2005


> Sebastian and I have done some testing of the feasibility of doing
> local over-rides as described here. Our conclusions are that this
> is not possible on technical grounds,

This is quite a shame, as such a capability could provide the means
for significant simplification of the TEI scheme, and much of the work
put into ED W 90 was predicated on this possibility. Such is life.


> Our recommendation is that
> - the tei.typed class should be removed
> - elements bearing a type attribute (and former members of the
>   typed class) should be checked to see whether their valLists
>   constitute an open or closed list
> - for closed list, the datatype will be an alternation of the
>   possible values

I presume you mean "set of permissible values" or some such, not
"datatype". (Up until now we've been using the term "datatype" to
express an abstract grouping of values with similar syntactic
constraints and similar semantics, presumably implemented by some form
of indirection. However, there is the potential for lots of confusion,
because the child of <attDef> which is used to abstractly declare the
attribute type is called <datatype>. This is quite an unfortunate
situation, for which I must take the blame -- Sebastian specifically
asked me if any of the names he chose for these elements were
problematic before he cemented them into our processing chain, and I
missed this obvious thorn.)

> - for open (or semi) lists, the datatype will be tei.enumerated,
>   i.e. a single token not containing whitespace

I'm confused. What then is the difference between tei.data.enumerated
and tei.data.token?
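
To make the question concrete, here is a minimal sketch, assuming
both are declared as "a single token not containing whitespace". The
two definitions and the <demo> element are my own illustration, not
taken from ED W 90; under these assumptions a validator cannot tell
the two apart.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Assumed definition: one token with no internal whitespace -->
  <define name="tei.data.token">
    <data type="token">
      <param name="pattern">\S+</param>
    </data>
  </define>
  <!-- Assumed definition: the very same constraint under another name -->
  <define name="tei.data.enumerated">
    <ref name="tei.data.token"/>
  </define>
  <start>
    <!-- Hypothetical element, present only to make the grammar complete -->
    <element name="demo">
      <attribute name="type"><ref name="tei.data.enumerated"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```

If that is right, any difference between the two would live only in
the documentation, not in the validation.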

In some ways this may feel like a big step backwards, as it is
essentially the system we were using before Paris. But I think this
works almost as well as the system Council sketched out in Paris. If
a TEI-schema-designer wants to restrict users to the values provided
in an open (or semi) list, all she needs to do is change the type of
<valList> from "open" to "closed". (Hmmm... I just realized that this
is actually currently not true. I think she'd have to copy-and-paste
the entire <valList> from the tagdoc into her ODD, which isn't nearly
as nice. Sebastian, could we arrange the process so that just
specifying, e.g.
          <elementSpec module="core" ident="note" mode="change">
            <attList>
              <attDef ident="place">
                <valList type="closed"/>
              </attDef>
            </attList>
          </elementSpec>
would work? It can't be mistaken for an attempt to remove all the
possible values, because it makes no sense to have an empty
<valList>, especially if the type is "closed"; of course it also
isn't valid à la the current TD.)


> 1.2 Over-riding of attributes
> 
> I think we can actually make explicit some of what Syd is
> describing here by using the RelaxNG method of defining facets. So
> we could for example say that a datatype was basically a positive
> integer, but with the added constraint that it has a value less
> than 43 by a construct such as
>     <datatype>
>       <rng:data type="nonNegativeInteger">
>         <rng:param name="maxInclusive">42</rng:param>
>       </rng:data>
>     </datatype>

Yes, I think this is doable, but it really only addresses a small
subset of what I think we set forth in Paris to do. (And note that
quite a few of the proposed datatypes make use of this.) In
particular, since the restriction is a facet, we can't use this
feature to have one TEI datatype for numeric representations
(tei.data.numeric), and say "this attribute is one of
tei.data.numeric, but must be a non-negative integer".
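
A rough sketch of the limitation, assuming tei.data.numeric is
declared as a choice of W3C numeric types (that definition, the
<demo> element, and the attribute names are hypothetical):

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Assumed definition of the generic numeric datatype -->
  <define name="tei.data.numeric">
    <choice>
      <data type="decimal"/>
      <data type="double"/>
    </choice>
  </define>
  <start>
    <element name="demo">
      <!-- Accepts any tei.data.numeric value -->
      <attribute name="quantity"><ref name="tei.data.numeric"/></attribute>
      <!-- There is no way to write "tei.data.numeric, but restricted";
           the W3C type has to be named directly instead -->
      <attribute name="count"><data type="nonNegativeInteger"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```

The second attribute ends up with no formal connection to
tei.data.numeric at all, which is the loss I am worried about.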


> If (and only if) there is a 1:1 mapping between a TEI datatype and
> [a W3C Schema] datatype we could presumably also do
>    <datatype>
>      <rng:data>
>        <rng:ref name="tei.nonNegativeInteger"/>
>        <rng:param name="maxInclusive">42</rng:param>
>      </rng:data>
>    </datatype>

I tried a quick test, and both nxml-mode and trang objected to such a
schema.
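
As far as I can tell the objection is to be expected: RELAX NG lets a
data pattern contain only param and except children, so there is
nowhere to put a reference. A sketch of the form that does validate,
with the full definition restated under its own name (the define name
and the <demo> element are hypothetical):

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Hypothetical name; the definition is restated, not derived -->
  <define name="tei.data.count.upTo42">
    <data type="nonNegativeInteger">
      <param name="maxInclusive">42</param>
    </data>
  </define>
  <start>
    <element name="demo">
      <attribute name="n"><ref name="tei.data.count.upTo42"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```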


> In the general case, however, we think that all datatype definitions
> should be complete and appropriate. In practice, we think the vast
> majority of TEI attributes are already catered for by a very small
> number of datatypes. (of the 500+ attributes listed by Syd, about
> 400 are covered by derivations of tei.data.token, tei.data.pointer
> and tei.data.uboolean)

I'm not exactly sure what you mean by "derivations" here, but if it
means adding restrictions to a generic datatype, which is just what
we can't do, then we have a problem for lots of those attributes, no?


> We conclude that
> 
> - datatypes should be expressed as <rng:data> expressions

I'm not sure I see why we want to impose such a restriction. E.g., for
the TEI sex datatype, why would we prefer

     <rng:data type="string">
       <rng:param name="pattern">\s*(f|m|u|x)\s*</rng:param>
     </rng:data>

to either

    <rng:choice>
      <rng:value>f</rng:value>
      <rng:value>m</rng:value>
      <rng:value>u</rng:value>
      <rng:value>x</rng:value>
    </rng:choice>

or far better

    <valList type="closed">
      <val>f</val> <desc>female</desc>
      <val>m</val> <desc>male</desc>
      <val>u</val> <desc>unknown or undetermined</desc>
      <val>x</val> <desc>not applicable or indeterminable</desc>
    </valList>


> - for commonly occurring cases (see below) we should define a small 
>   number of macros, which will be named in the way Syd proposes for
>   datatypes

I think I may be confused: for commonly occurring cases of what? In the
previous bullet point did "datatype" mean "the declaration of the
allowed values of a particular attribute" or "an abstract constraint
which can be applied to the allowed values of any given attribute"?


> - it should be possible to map all datatypes to W3C basic datatypes, 
>   possibly with additional constraints

If I understand this correctly, I don't think there's any immediate
problem with it. (I'm presuming that anything that is expressed as a
list of values is something that can be mapped.) I think it is
probably fine as a goal for the current short-term project.

However, as a long-term principle I think it is a very bad idea to tie
TEI to W3C datatypes. While I am far from a computer scientist who has
studied these issues, it's clear that W3C datatypes leave a lot to be
desired. It is quite reasonable to expect that other datatype
libraries will be published (e.g., ISO DSDL part 5), or that we
would want to create a datatype library ourselves, perhaps using DTLL
if & when it becomes fully worked out.


> The TEI has always proposed additional constraints in remarks,
> valDesc, and descriptive prose. We think we should use the
> Schematron language to express some of these: the primary use case
> being constraints on acceptable GIs as targets for various pointing
> attributes.
> 
> We are not sure where these constraints go in ODD-world, but
> probably not in the <datatype>. We recommend using Schematron for
> them because (a) we know it does the job (b) it is a candidate ISO
> recommendation.

I agree with all of the above, although I think perhaps we should
avoid features of ISO Schematron that are not available in Schematron
1.x, as processors for the former are hard to come by. (That may
change by the time this becomes an issue, of course.)
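
For the sake of discussion, a sketch of what such a constraint might
look like in Schematron 1.5. The rule itself is only an example: I am
assuming who= on <sp> holds a single local pointer of the form "#id",
that it ought to point at a <person>, and that the TEI namespace is
http://www.tei-c.org/ns/1.0.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Assumed namespace for TEI elements -->
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <pattern name="Targets of who">
    <!-- Assumption: who= carries exactly one pointer such as "#p1" -->
    <rule context="tei:sp[@who]">
      <!-- True when some person has an xml:id equal to the text after "#" -->
      <assert test="//tei:person/@xml:id = substring-after(@who, '#')">
        The who attribute of sp should point to a person element.
      </assert>
    </rule>
  </pattern>
</schema>
```

Only plain XPath 1.0 is used in the test, so nothing here should
depend on features specific to ISO Schematron.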


> I think anything we can do to reduce the complications consequent on
> the whitespace rules of XML is an unalloyed Good Thing, and propose
> to be even more draconian than Syd suggests.

Hear hear!


> My suggestion is that we allow only token, nmtoken, and
> tei.data.token.

I'm not sure *where* this restriction occurs, since in the next
paragraph you propose we keep "tei.data.tokens".

- token: do you mean "rng:token" or "xsd:token"?

- xsd:NMTOKEN: interesting; where would you want to use it? I have
               found no good use for this datatype for any of the 541
               attributes I looked at. In every case that NMTOKEN is
               currently used, I think we should be using xsd:Name (or
               perhaps xsd:NCName), except for the 1 oddball case of
               unit= on <timeline>, which should be an enumerated list
               or folded into interval=.
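
For reference, a small sketch of how the three W3C types differ. The
<demo> element and its attribute names are hypothetical; the comments
summarize the lexical rules as I understand them.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="demo">
      <!-- NMTOKEN: any run of name characters, so "42" and "-x" pass -->
      <attribute name="a"><data type="NMTOKEN"/></attribute>
      <!-- Name: must start with a letter, "_" or ":", so "42" fails -->
      <attribute name="b"><data type="Name"/></attribute>
      <!-- NCName: like Name, but with no colon anywhere -->
      <attribute name="c"><data type="NCName"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```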


> While sympathising with the motivation for it, I feel that the
> distinction Syd proposes between "tei.data.string" and
> "tei.data.tokens" will only confuse people. If the value of a
> sequence of tokens is to be interpreted as a single string, then it
> probably shouldn't be an attribute at all.

Really good point. As I said, I'm very back-and-forth on this issue,
and Lou's argument tips me back. The only attribute that I can think
of that does not fit the "tei.data.tokens" semantic and should remain
an attribute is the value= of <metSym>. And since it's a single
attribute, IMHO it doesn't have to be declared as a "datatype" (i.e.,
with indirection), and even if Council thinks it does, we could just
use tei.data.tokens and live with it. (Remember, the validation would
be exactly the same, it's only that the prose explanation might not
fit perfectly well. Does anyone use this attribute, anyway?)


> These I like:
> tei.data.token, tei.data.tokens, tei.data.pointer, tei.data.pointers

Me too.
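
A minimal sketch of how these four might be declared, assuming the
plural forms are simply whitespace-separated lists of the singular
ones and that a pointer is an xsd:anyURI. These definitions and the
<demo> element are my guesses for illustration, not agreed text.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Assumed: one token with no internal whitespace -->
  <define name="tei.data.token">
    <data type="token">
      <param name="pattern">\S+</param>
    </data>
  </define>
  <!-- Assumed: one or more such tokens separated by whitespace -->
  <define name="tei.data.tokens">
    <list><oneOrMore><ref name="tei.data.token"/></oneOrMore></list>
  </define>
  <!-- Assumed: a single URI reference -->
  <define name="tei.data.pointer">
    <data type="anyURI"/>
  </define>
  <!-- Assumed: one or more URI references separated by whitespace -->
  <define name="tei.data.pointers">
    <list><oneOrMore><ref name="tei.data.pointer"/></oneOrMore></list>
  </define>
  <start>
    <element name="demo">
      <attribute name="targets"><ref name="tei.data.pointers"/></attribute>
      <attribute name="keys"><ref name="tei.data.tokens"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```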


> Constraining tei.data.token/s further as NMTOKEN/S/NCName/QNAME etc. is 
> possible, but I am not sure how many elements would benefit from it

Between half a dozen and a dozen attributes, I suspect. Most, if not
all, of which should be xsd:Name. 


> Names I would prefer:
> for tei.data.uboolean -> tei.data.truthValue

I *like* it. Unless there are rousing objections, I'll plan to change
this in EDW90 and the corresponding database later this week.


> Names I'm not sure about
> tei.data.temporalExpression: how does this map to ISO 8601?
> (I assume it doesn't include dateRanges, for example)

Hey, you thought of this name! See separate thread James started for
ISO 8601 alignment discussion.


> tei.data.duration
>   We should adopt a consistent policy as to whether quantities like this 
> include their units, or whether the units are supplied as a separate 
> attribute. I think I prefer the second option, as being more
> flexible.

If we're going to use W3C datatypes, then at least in those cases
where W3C puts the unit in with the quantity (xsd:duration explicitly,
and the various date and time formats implicitly) we'd have to do the
same. 
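
To illustrate the two policies side by side, a sketch with
hypothetical attribute names (dur=, quantity=, unit=) on a
hypothetical <demo> element:

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="demo">
      <choice>
        <!-- Unit inside the value, e.g. dur="PT1H30M" -->
        <attribute name="dur"><data type="duration"/></attribute>
        <!-- Unit carried separately, e.g. quantity="90" unit="min" -->
        <group>
          <attribute name="quantity"><data type="decimal"/></attribute>
          <attribute name="unit"><data type="token"/></attribute>
        </group>
      </choice>
      <empty/>
    </element>
  </start>
</grammar>
```

With the first form the W3C lexical rules decide where the unit goes;
only the second form leaves that decision to us.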


> tei.data.probability
>   Not convinced we need this. There are very few candidates in the
> EDW90 table (I find 1, to be exact!)

My fault; the table had typos. This one is for expressing a range from 0
to 1 (or 0% to 100% or none to all). Currently only 3 attributes make
use of it (scope= of <handNote>, usage= of <language>, weights= of
<alt>). Since (IIRC) Council agreed in Paris that whenever 2 or more
attributes share the same constraint, a datatype should be abstracted
out, I did so. (There was even some discussion that there should be a
datatype even if only 1 attribute has a particular constraint, IIRC.)
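
A minimal sketch of the constraint, assuming the 0 to 1 form and
xsd:decimal as the base type (my choice for illustration; the
percentage and "none to all" forms are not covered, and the <demo>
element is hypothetical):

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Assumed definition: a decimal number from 0 to 1 inclusive -->
  <define name="tei.data.probability">
    <data type="decimal">
      <param name="minInclusive">0</param>
      <param name="maxInclusive">1</param>
    </data>
  </define>
  <start>
    <element name="demo">
      <attribute name="scope"><ref name="tei.data.probability"/></attribute>
      <empty/>
    </element>
  </start>
</grammar>
```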


> tei.data.numeric
>   I'm now coming round to the view that we also need a
>   tei.data.integer

I'm wondering if the concept of "positive integer or 0" is simple
enough that we don't need to bother creating a TEI datatype for it,
and could just use xsd:nonNegativeInteger directly when needed.


> tei.data.language
>   I agree that we need to document exactly what this means somewhere and 
> providing a TEI name for it is a good way of doing so.

JC> Just to make sure I'm understanding this...would that datatype
JC> then be used for validation of @xml:lang's format? It seems
JC> strange to me to be using a tei.datatype to validate and/or
JC> document use of a non-TEI element/attribute.

I agree with Lou. There are 3 reasons to make a datatype
(tei.data.language) that maps directly to xsd:language.

* It occurs more than twice (I'm not sure this is a compelling
  argument on its own): langKey= & otherLangs= of <textLang>, ident=
  of <language>, mainLang= of <hand>, and xml:lang= of everything.

* Although it maps directly to xsd:language, the explanation of
  xsd:language is both hard to find and hard to read & understand.
  (Quite unlike the explanation of xsd:nonNegativeInteger, which is
  easy to find, and not all that hard to read & understand -- besides,
  it's obvious enough that almost no one bothers.)

* As Christian pointed out, it would be nice to have someplace in the
  reference documentation to plunk the explanation of how xml:lang= is
  related to ident= of <language>.
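
In schema terms this would be a thin wrapper, something like the
sketch below. The wrapper definition is an assumption on my part, and
the content model shown for <language> is simplified for illustration.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Assumed definition: a direct mapping to xsd:language -->
  <define name="tei.data.language">
    <data type="language"/>
  </define>
  <start>
    <!-- Simplified stand-in for the real <language> declaration -->
    <element name="language">
      <attribute name="ident"><ref name="tei.data.language"/></attribute>
      <optional>
        <attribute name="xml:lang"><ref name="tei.data.language"/></attribute>
      </optional>
      <text/>
    </element>
  </start>
</grammar>
```

The validation is identical to using xsd:language directly; what we
gain is a TEI-named place to hang the documentation.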


I will post a reply on tei.data.code and tei.data.key issue separately.



