[tei-council] Datatype : roundup

Thu Sep 22 01:58:21 EDT 2005

Thank you for the succinct summary, Lou. I'm going to address it
quickly now, but can't really give it thorough treatment.

> 1. No-one has dissented from the basic objective of providing tei
> datatypes which together cover the full range of current
> requirements as identified in Syd's edw90 table.

I'm not 100% sure what you mean by this. I thought we had agreed
*not* to have a TEI datatype for xsd:boolean. I also thought it might
be worth not bothering with the indirection for xsd:nonNegativeInteger
and xsd:Name. At the moment I don't care very much, but I think some
users really get fed up with indirection, and when it's almost
completely pointless ...

> a. tei.data.notation : renamed as tei.data.pattern and explicitly
> tied to regular expression syntax

Sounds good to me.

> [some debate is needed on how we define this: syd's original
> proposal suggested we should support only the W3C rather restricted
> version of regexps, i.e. the pattern has to be "anchored". Is that
> OK, or are we supporting apache-style perl-compatible regexps? or
> just the original syntax built into grep (but not egrep)?]

We initially went with W3C some 2 or 3 years ago using the (perhaps
flawed) logic that it was a regular expression language that any XML
software would have to know in order to support the W3C Schema
"pattern" facet anyway. And although they are anchored, I think the
characterization of them as "restricted" is very misleading. The W3C
regular expression language is basically the Perl language with quite
a few useful extensions like
  \i == matches any char that's legal as an initial char in an
        xsd:Name  
  \c == matches any char that's legal in an xsd:Name
  \p{Co} == matches any char from the PUA
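
To make the "anchored" point concrete, here is a rough Python sketch.
Python's `re` module is not the W3C regular expression language (it
has no \i, \c, or \p{...}), so this only imitates the anchoring
behaviour: `re.fullmatch` stands in for a W3C "pattern" facet and
`re.search` for a grep-style match. The helper names are invented for
illustration.

```python
import re

def matches_anchored(pattern: str, value: str) -> bool:
    """Imitate W3C Schema semantics: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None

def matches_unanchored(pattern: str, value: str) -> bool:
    """Imitate grep/Perl default semantics: the pattern may match anywhere."""
    return re.search(pattern, value) is not None

if __name__ == "__main__":
    for value in ["123", "abc123def"]:
        print(value,
              matches_anchored(r"[0-9]+", value),
              matches_unanchored(r"[0-9]+", value))
```

For single-line values, wrapping a pattern in `.*` on both sides gets
the unanchored behaviour back, which is one reason anchoring need not
be read as a restriction.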

> b. split the currently defined tei.data.name into two:
> tei.data.ident and tei.data.name -- the former is used for those
> cases where the name concerned *must* be an XML-compatible name and
> maps to xsd:Name ; the latter for names of any kind excluding
> spaces (mapping to NMTOKEN?)

I like tei.data.ident (although again, it's not clear to me that
abstracting to a TEI datatype gains us much -- I guess the schema
extender who knows ahead of time she will never want to use colonized
names or names > 31 chars long could re-declare it as
  xsd:NCName { maxLength = "31" }
so I'll take that back; I'm all in favor.)
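
For anyone who wants to see what that re-declaration would accept,
here is a small Python sketch. It assumes an ASCII-only approximation
of NCName (the real production allows many more Unicode letters), and
the function name is made up for this example.

```python
import re

# ASCII-only approximation of the XML NCName production (an assumption;
# the real production admits a much larger set of Unicode characters).
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_short_ncname(value: str, max_length: int = 31) -> bool:
    """True if value looks like a colon-free XML name of at most max_length chars."""
    return len(value) <= max_length and _ASCII_NCNAME.fullmatch(value) is not None

if __name__ == "__main__":
    for candidate in ["persName", "tei:persName", "1abc", "a" * 32]:
        print(candidate, is_short_ncname(candidate))
```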

As for tei.data.name:
* the name is really bad; I'd prefer to live with the confusion of
  tei.data.token. (Remember, the string "xsd:token" will only appear
  a few times in all of P5; in the declarations of at most half a
  dozen datatypes.)
* I think we should probably be more permissive than NMTOKEN.

> c. add a pattern to the list of alternatives proposed for
> tei.data.temporal which supports right-truncated times (just don't
> say i didnt tell you it'll all end in tears)

OK, I won't say that. But what do you think could happen to make it
end in tears?

> d. define  tei.data.probability as a value between 0 and 1 only

I really don't see why not permit percentages. Users had the choice
in P4, when we couldn't even validate it. Now we can.
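
As a sketch of what accepting both forms might look like on the
processing side, assuming the two lexical forms are a plain number
between 0 and 1 or a number between 0 and 100 followed by "%" (my
assumption, not a worked-out datatype proposal):

```python
def parse_probability(text: str) -> float:
    """Return a probability in [0, 1] from either '0.25' or '25%'."""
    s = text.strip()
    if s.endswith("%"):
        # Percentage form: scale down to the 0..1 range.
        value = float(s[:-1]) / 100.0
    else:
        value = float(s)
    # NaN fails this comparison too, so it is rejected along with out-of-range values.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"probability out of range: {text!r}")
    return value

if __name__ == "__main__":
    print(parse_probability("0.25"), parse_probability("25%"))
```

Either spelling normalizes to the same value, so downstream software
only ever has to deal with one representation.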

> e. define tei.data.numeric as double, rather than decimal

Oh dear. This is the painful one. I was almost hoping to not have to
tackle this. "Tackle what?" you ask. If I understand correctly, Lou
is now proposing to declare that all TEI numbers be an xsd:double
rather than the proposed xsd:double | xsd:long.

What's the difference? Since xsd:double can represent a *much* larger
number range than xsd:long, why would anyone care? Because it doesn't
represent it exactly. In order to gain its incredible range, the trade
off is precision. When I wrote EDW90 I had no idea how precise an
xsd:double is (it's not easy to find out), but I knew it was less
than the 19 digits you can get for integers from xsd:long. I figured
it's not hard for an application to tell the difference (if it has a
dot or a letter e, it's an xsd:double, otherwise it's an xsd:long),
so why not just use 'em both?
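
A minimal Python sketch of that test follows. The regular expressions
are my approximations of the xsd:long and xsd:double lexical forms,
not copied from the spec, and the function name is hypothetical. It
also sends integers too big for xsd:long to the xsd:double branch,
which goes slightly beyond the dot-or-e rule.

```python
import re

# Approximate lexical forms (assumptions, not taken verbatim from XML Schema).
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DOUBLE = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
)
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def classify_number(text: str) -> str:
    """Guess whether a lexical value would be read as xsd:long or xsd:double."""
    s = text.strip()
    # No dot and no exponent: an integer, provided it fits in 64 bits.
    if _INTEGER.fullmatch(s) and LONG_MIN <= int(s) <= LONG_MAX:
        return "xsd:long"
    if _DOUBLE.fullmatch(s):
        return "xsd:double"
    raise ValueError(f"not a number: {text!r}")

if __name__ == "__main__":
    for sample in ["42", "4.2", "1e16", "9223372036854775808"]:
        print(sample, classify_number(sample))
```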

But (Lou seems to be asking) do we really need the xsd:long? I've
just spent some time delving into this, and it seems that an
xsd:double can represent numbers with ~16 significant digits;[1]
every integer up to 2^53 (9,007,199,254,740,992, about 9e15) is
exact. So my addition of xsd:long to our datatype seems to only
support those users who wish to precisely indicate integers between
roughly 9e15 and 9.2e18 (the xsd:long maximum is
9,223,372,036,854,775,807).
That is to say, if you wanted to record the combined gross domestic
product of the EU and USA to the penny, you *might* need an xsd:long
to do so. If you don't mind representing it in euros or dollars, an
xsd:double will do. (I have never seen a representation of GDP that
could not be expressed in a 32-bit floating point number, e.g.
$5.625e13 is the CIA estimate for 2004.)
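
In case it helps whoever checks the math, the precision claim can be
tested directly. This sketch assumes Python floats are IEEE 754 64-bit
doubles (true on every common platform) and treats them as a stand-in
for xsd:double; the helper name is invented.

```python
import sys

LARGEST_EXACT = 2**53   # 9,007,199,254,740,992: every integer up to here is exact
LONG_MAX = 2**63 - 1    # 9,223,372,036,854,775,807: largest xsd:long

def survives_double(n: int) -> bool:
    """True if integer n comes back unchanged after a trip through a 64-bit float."""
    return int(float(n)) == n

if __name__ == "__main__":
    # Number of decimal digits a double is guaranteed to preserve.
    print("guaranteed digits:", sys.float_info.dig)
    print("2^53 exact?", survives_double(LARGEST_EXACT))
    print("2^53 + 1 exact?", survives_double(LARGEST_EXACT + 1))
    print("long max exact?", survives_double(LONG_MAX))
    # The $5.625e13 GDP figure expressed in cents.
    print("GDP in cents exact?", survives_double(5_625_000_000_000_000))
```

On such a platform the first integer that fails the round trip is
2^53 + 1, and the GDP figure still fits exactly even when counted in
cents.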

The executive summary is that 
* I now think xsd:double alone will do
* Someone should check my math on this; it is quite complicated stuff
* If someone has a use-case where we'd want to represent a 16-digit
  or greater integer precisely (i.e., to the 1s place), we should
  reconsider xsd:long. (Yes, a credit card number along with the
  security code on the back is 19 digits long, but it's not really a
  number, it's a string that just happens to be composed of only
  digits.)

Note
----
[1]
http://cch.loria.fr/documentation/IEEE754/numerical_comp_guide/ncg_math.doc.html#555