[tei-council] datatype issues (part 1)

Syd Bauman Syd_Bauman at Brown.edu
Sun Sep 11 10:06:56 EDT 2005


> 1. <specDesc ident="tei.data.xxxx"/> will extract the <desc> from
> the referenced macroSpec. It would be nice also to be able to
> extract the <content> or <stringVal> part for display.

Yes, it would.


> 2. The definition of <macroSpec> allows it to contain multiple
> <content> or <stringVal> children. Why?

I can't come up with a good reason off the top of my head.


> 3. tei.data.certainty is defined as either an enumeration ("high",
> "low", "medium", "unknown") or a reference to tei.data.probability,
> which is a real value between 0,1 or an integer between 0,100. I
> wonder if it wouldn't be less confusing to restrict the values for
> tei.data.certainty to the literal only, since any attribute for
> which we want to allow either kind of value can do so by giving an
> alternation of datatypes (I think)

I don't see that it is less confusing, but it may have the advantage
of making it easier on the user who wants to modify his ODDs so that
certainty is expressed only as one or the other. It makes no
difference to the constraint expressed in the end, so seems like a
fine idea to me.
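
A rough sketch of that equivalence, in Python rather than RELAX NG and
with made-up function names (none of this is TEI tooling). It assumes
tei.data.probability means "a real in [0,1] or an integer in [0,100]",
as described in the quoted message, and only illustrates that moving
the alternation from the datatype to the attribute accepts the same
values:

```python
# Hypothetical names; this only mimics the datatypes as described in the thread.
CERTAINTY_WORDS = {"high", "low", "medium", "unknown"}


def is_probability(value: str) -> bool:
    """Real number in [0, 1] or integer in [0, 100] (assumed reading)."""
    try:
        number = float(value)
    except ValueError:
        return False
    if 0.0 <= number <= 1.0:
        return True
    # Outside [0, 1], only plain integers up to 100 are accepted.
    return value.strip().isdigit() and 0 <= int(value) <= 100


def is_certainty_current(value: str) -> bool:
    """Current shape: the datatype itself is words OR a probability."""
    return value in CERTAINTY_WORDS or is_probability(value)


def is_certainty_literal(value: str) -> bool:
    """Proposed shape: the datatype is the enumeration only."""
    return value in CERTAINTY_WORDS


def attribute_accepts(value: str) -> bool:
    """An attribute that wants both kinds declares the alternation itself."""
    return is_certainty_literal(value) or is_probability(value)
```

A user who wants only one kind of value would then change the
attribute's alternation rather than the shared datatype.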


>  4. tei.data.language is isomorphic with xsd:language: do we
>  need it?

We've discussed this at least twice in the past, and have concluded
that while tei.data.language is not entirely necessary, it would be a
good place to put information about how one uses xsd:language in TEI:
namely, that it is co-referenced with ident= of <language> (optionally
if it starts with "i-" or with a 2- or 3-letter code; required if it
starts with "x-"). Otherwise this information won't show up in the
reference documentation at all, except perhaps in the tagdoc for
<language> itself.
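
The co-reference rule could be sketched as follows. This is Python
with a hypothetical function name, and it reads "2 letters, or 3
letters" as "the first subtag is a 2- or 3-letter code", which is an
interpretation rather than anything stated in the documentation:

```python
import re


def language_element_status(code: str) -> str:
    """Say whether a matching <language ident="..."> is 'required',
    'optional', or 'unspecified' for a given language code.

    Hypothetical helper; the rule is the one summarized above.
    """
    lowered = code.lower()
    if lowered.startswith("x-"):
        # Private-use codes must be documented in a <language> element.
        return "required"
    primary = lowered.split("-", 1)[0]
    if lowered.startswith("i-") or re.fullmatch(r"[a-z]{2,3}", primary):
        # Registered codes may be documented, but need not be.
        return "optional"
    # Anything else is not covered by the rule as stated.
    return "unspecified"
```

For example, language_element_status("x-klingon") gives "required",
while "en-US" and "i-navajo" both give "optional".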


> 5. tei.data.regexp is used only in two rather obscure places: do we
> need it? 

I don't think we need it, although again, it may be a useful place to
put an explanation. 

> If we do, is the reference to appx. F of the xsd spec really the
> canonical place to define what sort of regexp we mean ?

Are you asking
* whether we want to use W3C XSD regular expressions or would we do
  better to use some other regular expression language, or
* whether appendix F is the canonical place to refer to W3C XSD
  regular expressions?

I think the answer to the former is "yes, W3C XSD regular expressions",
but I could be convinced otherwise. The answer to the latter is a
definitive "I don't know".
(Especially since in the draft of XSD 1.1 it is in appendix H, not
F.) If we agree that this is the regexp language we want to use, I
will chase down the proper canonical reference to it (I'm presuming
there is one ... oh dear).


> 6. tei.data.sex defines four alphabetic values (m f x u) which
> correspond to ISO 5218 numeric codes 1 2 0 and 9. Should we not
> rather use the ISO codes?

Ooohhh ... really good question. Better conformance to external
standards, or more human-readable values? Tough choice in this case.
Part of me really wants to just use
  "not known" | "male" | "female" | "not specified"
and avoid the question.


> 7. Furthermore, where (as with sex) the datatype is a closed
> enumeration, it makes sense to represent this in the macroSpec as a
> <rng:choice> containing several <rng:value>s. But there is
> currently no scope to provide a gloss for what each value means,
> since <valList> is not allowed within <macroSpec>.

Errr... I'm not sure I internalized that bit of information (that
<valList> is not permitted within <macroSpec>) when I thought about
these. I know META has been disbanded, but this really seems like an
issue that should be looked at and quite possibly changed. <valList>s
are really the right thing for the job, here.


> 8. In earlier discussion I had proposed that tei.data.token should
> differ from rng:token in that the former should not permit included
> whitespace. Thinking about this again, I think I might have been
> wrong: it might be less confusing to use <rng:token> directly
> wherever we want a "tei.data.token", thus allowing people to use
> XML whitespace normalization in attribute values in the same way as
> they can in content.

There is no XML whitespace normalization of any content in TEI, yet,
is there? When we're done straightening out the classes and stuff,
there may be one or two obscure places where it is useful.

> If we do define tei.data.token as proposed (i.e. as an xsd:token
> with a facet saying that whitespace is not allowed), we should
> really give it a different name, or expect to spend the rest of
> eternity explaining why our usage differs from W3C and RNG's (ok,
> we were there first, but still).

I think a "no internal whitespace" restriction is a really good thing
to have[1]. But I think you are absolutely right, we should change
the name. It's not our fault that W3C and RELAX NG deliberately use
the term "token" in a manner that is counter-intuitive to end users
(although it perhaps makes sense to those writing validators).
Nonetheless, if we use the same term in the more normal way, we are
dooming users to even more confusion. Problem is, it's hard to come
up with an alternative. How about tei.data.term?


> Same applies, mutatis mutandis, to tei.data.tokens: it might indeed
> be simpler to define that as xsd:token rather than as a list of our
> weird tei.data.tokens. On the other hand....

I don't think it matters much. I think it is useful to have the
parallelism between tei.data.pointer(s) and tei.data.token(s), and
whichever others may come up. But since they'll provide the same
constraint, it's not very important.


Note
----
[1] Note that I said "internal". Important detail: because W3C Schema
    regular expressions are implicitly anchored,
       <foo type="
       bar"/>
    is not permitted using the current definition of tei.data.token.
    Since
      <foo type=
       "bar"/>
    is, of course, permitted, I don't see this as a problem at all,
    and even think it makes thinking about the attribute easier ("no
    whitespace" vs. "no whitespace, except leading and trailing, which
    will get trimmed off before comparison for validation"). I think
    the restriction on no internal whitespace is important; I think
    the restriction on leading & trailing is a nicety worth having.
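
The anchoring point in this note can be illustrated with a small
Python sketch. The pattern below is a hypothetical stand-in (the
actual tei.data.token definition is not quoted in this thread), and
re.fullmatch is used only to imitate implicit anchoring. The sketch
follows the premise of the note, that the pattern is tested against
the attribute value as normalized by the XML parser, without any
further whitespace collapsing:

```python
import re

# Hypothetical stand-in for the tei.data.token pattern: one or more
# non-whitespace characters. Not the actual TEI definition.
TOKEN_PATTERN = r"\S+"


def normalize_attribute(raw: str) -> str:
    """Imitate XML attribute-value normalization for a CDATA attribute:
    line ends become a single newline, then each tab, newline, or
    carriage return becomes one space. Nothing is trimmed."""
    return re.sub(r"[\t\n\r]", " ", raw.replace("\r\n", "\n"))


def is_valid_token(value: str) -> bool:
    """fullmatch requires the whole value to match, which is how an
    implicitly anchored pattern behaves."""
    return re.fullmatch(TOKEN_PATTERN, value) is not None


# type="<newline>bar" reaches the validator as " bar" and is rejected;
# type="bar" is accepted, however the tag itself is broken across lines.
leading_newline = normalize_attribute("\nbar")
plain = normalize_attribute("bar")
```

Under these assumptions is_valid_token(plain) is true, while
is_valid_token(leading_newline) and is_valid_token("foo bar") are
both false.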



