[tei-council] Datatype : roundup

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Thu Sep 22 05:19:11 EDT 2005



Syd Bauman wrote:
> Thank you for the succinct summary, Lou. I'm going to address it
> quickly now, but can't really give it thorough treatment.
> 
> 
> 
>>1. No-one has dissented from the basic objective of providing tei
>>datatypes which together cover the full range of current
>>requirements as identified in Syd's edw90 table.
> 
> 
> I'm not 100% sure on what you mean by this. I thought we had agreed
> *not* to have a TEI datatype for xsd:boolean. I also thought it might
> be worth not bothering with the indirection for xsd:nonNegativeInteger
> and xsd:Name. At the moment I don't care very much, but I think some
> users really get fed up with indirection, and when it's almost
> completely pointless ...

It is not completely pointless. It provides us with a space in which to 
define TEI-specific semantics in addition to the basic datatyping. So we 
can say not just that this is a number, but that this is a number which 
is used to count something.

> 
> 
> 
>>[some debate is needed on how we define this: syd's original
>>proposal suggested we should support only the W3C rather restricted
>>version of regexps, i.e. the pattern has to be "anchored". Is that
>>OK, or are we supporting apache-style perl-compatible regexps? or
>>just the original syntax built into grep (but not egrep)?]
> 
> 
> We initially went with W3C some 2 or 3 years ago using the (perhaps
> flawed) logic that it was a regular expression language that any XML
> software would have to know in order to support the W3C Schema
> "pattern" facet anyway.

I am not sure who the "we" in that sentence is -- possibly the SO work 
group?

> And although they are anchored, I think the
> characterization of them as "restricted" is very misleading.

Well, as one who has done a lot of programming in various 
pattern-matching languages, I think the characterization is not VERY 
misleading. But it hardly matters... I am quite happy for us to stick 
with the W3C regexp language if others agree, for the good pragmatic 
reason given above, provided that we make explicit what its shortcomings 
are.
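
To make the "anchored" point concrete, here is a minimal sketch in Python. Python's re syntax is not the W3C regexp language, and the pattern and function names are invented for illustration only; the sketch shows just the difference between whole-value matching (which is how a W3C Schema pattern facet behaves) and the unanchored search you get from grep.

```python
import re

# Hypothetical pattern, for illustration only: a run of two to four digits.
PATTERN = r"[0-9]{2,4}"

def w3c_style_match(pattern: str, value: str) -> bool:
    """Emulate the implicit anchoring of a W3C Schema pattern facet:
    the whole value must match, as if the pattern were wrapped in ^(...)$."""
    return re.fullmatch(pattern, value) is not None

def grep_style_match(pattern: str, value: str) -> bool:
    """Unanchored search: succeeds if the pattern matches anywhere in the value."""
    return re.search(pattern, value) is not None

print(w3c_style_match(PATTERN, "1999"))     # whole value is digits
print(w3c_style_match(PATTERN, "x1999y"))   # extra characters, so no whole-value match
print(grep_style_match(PATTERN, "x1999y"))  # unanchored search still finds the digits
```

To get unanchored behaviour out of an anchored language, one has to write the leading and trailing ".*" explicitly, which is the kind of shortcoming we would need to document.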


> The W3C
> regular expression language is basically the Perl language with quite
> a few useful extensions like
>   \i == matches any char that's legal as an initial char in an
>         xsd:Name  
>   \c == matches any char that's legal in an xsd:Name
>   \p{Co} == matches any char from the PUA


Aaargh, so it's not even a clean subset!

> 
> 
> so I'll take that back; I'm all in favor.)
> 
Good

> As for tei.data.name:
> * the name is really bad; I'd prefer to live with the confusion of
>   tei.data.token. (Remember, the string "xsd:token" will only appear
>   a few times in all of P5; in the declarations of at most half a
>   dozen datatypes.)

That's irrelevant to the issue here: we want people to use the TEI name 
and not be confused when they talk to others about it. No-one has yet 
proposed a better name than name.

> * I think we should probably be more permissive than NMTOKEN.
> 

We can tweak the definition if you like, but I don't understand why you 
would want to.


> 
> 
>>c. add a pattern to the list of alternatives proposed for
>>tei.data.temporal which supports right-truncated times (just don't
>>say I didn't tell you it'll all end in tears)
> 
> 
> OK, I won't say that. But what do you think could happen to make it
> end in tears?
> 
> 
(a) difficulties in implementation
(b) confusion caused by lack of timezone information

> 
>>d. define tei.data.probability as a value between 0 and 1 only
> 
> 
> I really don't see why not permit percentages. Users had the choice
> in P4, when we couldn't even validate it. Now we can.
> 
> 

simplicity, clarity, precision...
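
As a rough illustration of what "a value between 0 and 1 only" would accept and reject, here is a minimal sketch in Python. The function name is hypothetical, and Python's Decimal parser is looser than a schema datatype would be (it accepts exponent notation, for instance), so this is an approximation of the intent rather than a definition of tei.data.probability.

```python
from decimal import Decimal, InvalidOperation

def is_probability(text: str) -> bool:
    """Return True if text parses as a finite number in the closed range [0, 1].

    Hypothetical helper: percentages such as "50%" are rejected because they
    do not parse as plain numbers.
    """
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return False
    if not value.is_finite():  # excludes NaN and Infinity
        return False
    return Decimal(0) <= value <= Decimal(1)

for candidate in ["0", "0.25", "1", "1.5", "-0.1", "50%"]:
    print(candidate, is_probability(candidate))
```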

> 
>>e. define tei.data.numeric as double, rather than decimal
>


I think this is a mistake, actually. Decimal was a better choice, since 
it can represent any number, real or integer, no matter how big. It 
does mean you can't use scientific notation, which some folks on TEI-L 
suddenly woke up and asked for. I now think maybe we should have a 
different datatype for that.
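
The trade-off can be sketched in Python, with its decimal module standing in for xsd:decimal and float standing in for xsd:double. The regular expression below is my reading of the xsd:decimal lexical form (optional sign, digits, optional fraction, no exponent); the names and the sample value are invented for illustration.

```python
import re
from decimal import Decimal

# Assumed lexical form of xsd:decimal: optional sign, digits with an
# optional fractional part, and no exponent.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

def is_xsd_decimal(text: str) -> bool:
    """True if the whole string has the decimal lexical form above."""
    return XSD_DECIMAL.fullmatch(text) is not None

BIG = "12345678901234567890.123456789"  # illustrative value only

exact = Decimal(BIG)   # every digit is kept
approx = float(BIG)    # rounded to the nearest 64-bit double

print(str(exact) == BIG)            # the decimal value round-trips unchanged
print(Decimal(approx) == exact)     # the double does not hold the same value
print(is_xsd_decimal("5.625e13"))   # scientific notation is not a decimal
```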

Credit card numbers, by the way, are tei.data.ident, clearly.


> 
> Oh dear. This is the painful one. I was almost hoping to not have to
> tackle this. "Tackle what?" you ask. If I understand correctly, Lou
> is now proposing to declare that all TEI numbers be an xsd:double
> rather than the proposed xsd:double | xsd:long.
> 
> What's the difference? Since xsd:double can represent a *much* larger
> number range than xsd:long, why would anyone care? Because it doesn't
> represent it exactly. In order to gain its incredible range, the trade
> off is precision. When I wrote EDW90 I had no idea how precise an
> xsd:double is (it's not easy to find out), but I knew it was less
> than the 19 digits you can get for integers from xsd:long. I figured
> it's not hard for an application to tell the difference (if it has a
> dot or a letter e, it's an xsd:double, otherwise it's an xsd:long),
> so why not just use 'em both?
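
The "dot or letter e" test described above could be sketched as follows in Python. The function name is hypothetical, and the sketch only classifies; it does not check that the string is a well-formed number, and it ignores special values such as INF and NaN.

```python
def classify_tei_number(text: str) -> str:
    """Classify a numeric string by the rule of thumb quoted above:
    a dot or a letter e means xsd:double, anything else xsd:long.

    Hypothetical helper: no validation of the string is attempted.
    """
    if any(ch in text for ch in ".eE"):
        return "xsd:double"
    return "xsd:long"

for sample in ["42", "3.14", "5.625e13", "9223372036854775807"]:
    print(sample, classify_tei_number(sample))
```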
> 
> But (Lou seems to be asking) do we really need the xsd:long? I've
> just spent some time delving into this, and it seems that an
> xsd:double can represent numbers with up to ~16 significant
> digits.[1] So my addition of xsd:long to our datatype seems to only
> support those users who wish to precisely indicate integers between
> 1e16 (10,000,000,000,000,000) and 1e19 (10,000,000,000,000,000,000).
> That is to say, if you wanted to record the combined gross domestic
> product of the EU and USA to the penny, you *might* need an xsd:long
> to do so. If you don't mind representing it in euros or dollars, an
> xsd:double will do. (I have never seen a representation of GDP that
> could not be expressed in a 32-bit floating point number, e.g.
> $5.625e13 is the CIA estimate for 2004.)
> 
> The executive summary is that 
> * I now think xsd:double alone will do
> * Someone should check my math on this, it is quite complicated stuff 
> * If someone has a use-case where we'd want to represent a 16-digit
>   or greater integer precisely (i.e., to the 1s place), we should
>   reconsider xsd:long. (Yes, a credit card number along with the
>   security code on the back is 19 digits long, but it's not really a
>   number, it's a string that just happens to be composed of only
>   digits.)
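
One way to check the arithmetic, sketched in Python on the assumption that its float is an IEEE 754 64-bit double (true on all common platforms). The names are invented for illustration. A double has a 53-bit significand, so every integer up to 2**53 (a 16-digit number) is exact, but not every integer beyond that.

```python
import sys

# Integers are exactly representable in a double up to 2**mant_dig.
DOUBLE_EXACT_LIMIT = 2 ** sys.float_info.mant_dig   # 2**53 on IEEE 754 doubles
XSD_LONG_MAX = 2 ** 63 - 1                          # largest xsd:long value

def survives_double(n: int) -> bool:
    """True if integer n comes back unchanged after conversion to a double."""
    return int(float(n)) == n

print(DOUBLE_EXACT_LIMIT)                       # a 16-digit integer
print(len(str(XSD_LONG_MAX)))                   # number of digits in the largest xsd:long
print(survives_double(DOUBLE_EXACT_LIMIT))      # still exact at the limit
print(survives_double(DOUBLE_EXACT_LIMIT + 1))  # first integer that is lost
print(survives_double(56250000000000))          # the 5.625e13 GDP figure
```

So the gap that xsd:long would fill starts at about 9e15 rather than exactly 1e16, but the conclusion is much the same.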
> 
> Note
> ----
> [1]
> http://cch.loria.fr/documentation/IEEE754/numerical_comp_guide/ncg_math.doc.html#555
> 