[tei-council] Datatype : roundup

Thu Oct 6 10:57:38 EDT 2005

SB> I really don't see why we shouldn't permit percentages. Users had the
SB> choice in P4, when we couldn't even validate it. Now we can.

LB> simplicity, clarity, precision...

SB> While I suppose it is simpler for software writers to have 1
SB> system rather than 2, there is nothing more clear nor more
SB> precise about "0.824" than "82.4%". While I have some sympathy
SB> with the idea of reducing choices for users, this is one place
SB> where I think users like the choice.

LB> I see no evidence for this assertion at all. Since both
LB> representations mean exactly the same thing, and are exactly
LB> inter-convertible, I think it just looks silly not to come down
LB> on one side of the fence or the other.

If we ask a random sampling of 50-200 TEI users whether they'd prefer
to always use decimal representation (whether xsd:decimal or
xsd:double) like "0.824" or always use percentages like "82.4%", I'll
buy you a beer if > 0.80 of them all agree on one or the other.
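
For what it's worth, the inter-convertibility itself is not in dispute.
A minimal Python sketch (the function names are mine, purely for
illustration) that moves between the two spellings exactly, using
decimal arithmetic so that no rounding creeps in:

```python
from decimal import Decimal


def percent_to_fraction(text: str) -> Decimal:
    """Convert a string such as "82.4%" to the exact decimal fraction 0.824."""
    if not text.endswith("%"):
        raise ValueError(f"not a percentage: {text!r}")
    # scaleb(-2) shifts the decimal point two places to the left, exactly.
    return Decimal(text[:-1]).scaleb(-2)


def fraction_to_percent(value: Decimal) -> str:
    """Convert an exact decimal fraction such as 0.824 back to "82.4%"."""
    # The "f" format keeps plain notation even when the exponent is positive.
    return f"{value.scaleb(2):f}%"
```

So software can cope with either form; the question is only which
spelling (or both) users get to type.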

LB> I think this is a mistake, actually. Decimal was a better choice,
LB> since it can represent any number, real or integer, no matter how
LB> big. It means you can't use scientific notation, which some
LB> folks on TEI-L suddenly woke up and asked for.

SB> So you think we have more users who really want to represent
SB> numbers with greater than 16 (decimal) digits of precision than
SB> users who want to represent numbers in scientific notation? As I
SB> had hoped my example would demonstrate, that much precision is
SB> not something we humans generally deal with.

LB> No, I think that if there are 10 people in the world who want to
LB> use a numeric datatype, 8 of them might want to use what one
LB> might call unscientific notation, and 9.9 of them will want to
LB> represent values representable to an accuracy less than 8 decimal
LB> digits!

That seems like a pretty strong argument in favor of using
xsd:double! If 20% of people in the world want to use scientific
notation, but only 1% want > 8 places of precision, let alone > 16
places of precision, we should most certainly use xsd:double, not
xsd:decimal. I'd much prefer to tell the far less than 1% that they
can't represent their *huge* numbers precisely than to tell the 20%
they can't represent their numbers with scientific notation.

My point is this: even in the hard sciences, representation of any
number to > 16 digits of precision is almost unheard of.
Representation of numbers that are extremely unwieldy using decimal
notation (which is, after all, why scientific notation was invented)
is quite common.

* the speed of light is defined by CGPM to only 9 digits of
  precision, according to Wikipedia. Most of us would probably prefer
    <measure quantity="3e8" unit="m/s">C</measure>
  to
    <measure quantity="300000000" unit="m/s">C</measure>
  or the more precise
    <measure quantity="299792458" unit="m/s">C</measure>
  or
    <measure quantity="1.08e9" unit="km/h">C</measure>
  to
    <measure quantity="1080000000" unit="km/h">C</measure>
  or the more precise
    <measure quantity="1079252849" unit="km/h">C</measure>
  but any of these is perfectly OK if quantity is an xsd:double.

* Avogadro's number is usually described as roughly 6.022e23. Can you
  imagine needing to express this as <num
  value="602214199000000000000000">? Kind of defeats the purpose,
  especially when there's really no solid agreement on what the
  number *is* past that first "1".

* If I understand correctly, the computers NASA used to send men to
  the moon had "double precision" capability, but these instructions
  used two 16-bit registers, i.e., roughly the same precision as
  today's xsd:int and xsd:float.
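
As a rough check on the examples above, here is a small Python sketch.
It assumes only that Python's float is an IEEE 754 double, i.e. the same
value space as xsd:double; Python's number parser is more lenient than
the XML Schema lexical rules, so this says nothing about validation,
only about the values:

```python
import sys

# The scientific and plain spellings of each quantity parse to the
# identical double, so nothing is lost by writing the shorter form.
assert float("3e8") == float("300000000")
assert float("1.08e9") == float("1080000000")
assert float("6.02214199e23") == float("602214199000000000000000")

# Number of decimal digits a double is guaranteed to round-trip.
print(sys.float_info.dig)        # 15 on IEEE 754 platforms

# The 9-digit speed of light fits with room to spare.
print(float("299792458"))        # 299792458.0
```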

BTW, the one obvious reason to want to use xsd:decimal (whether we
use xsd:double alone as I'm now advocating, or xsd:double|xsd:long,
which I'd be perfectly happy with, too) is encoding books of
mathematical tables and tables of physical constants and the like.

I just skimmed through my copy of CRC Standard Mathematical Tables,
and found
 - tens, perhaps hundreds of pages of things to 5 & 6 places
 - a few pages of things to 8 places
 - one large multi-page table to 9 places (compound interest --
   I wonder what that tells us! :-)
 - one entry (the number of seconds per radian) to 11 places
 - only 3 tables that would require > 16 digits precision:
   sums of reciprocal powers, Bernoulli numbers, and Euler numbers.
   (Anyone remember what those things are? I sure don't :-)
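
To illustrate what those few tables would lose under xsd:double, a short
Python sketch. The 23-digit value is made up for the purpose (it is not
taken from the CRC tables), and Python's Decimal and float stand in for
xsd:decimal and xsd:double respectively:

```python
from decimal import Decimal

# An illustrative 23-digit value (hypothetical, not from any real table).
text = "12345678901234567890.123"

as_decimal = Decimal(text)   # keeps every digit of the lexical form
as_double = float(text)      # rounds to the nearest IEEE 754 double

# The decimal reproduces the original string exactly.
assert str(as_decimal) == text

# The double cannot hold all 23 digits, so its exact value differs.
assert Decimal(as_double) != as_decimal
```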

LB> I now think maybe we should have a different datatype for
LB> [scientific notation].

SB> If we split scientific notation out to a different datatype, won't we
SB> need a disjunction of the two datatypes in most if not all instances
SB> anyway? And the disjunction (whether of two separate TEI datatypes or
SB> of two xsd: datatypes inside a TEI datatype) might be a bit confusing
SB> for implementers. But it shouldn't be impossible to deal with (could
SB> always just assume that if it's not in scientific notation, it is an
SB> xsd:decimal).

LB> I don't think that sort of "assumption" is something an XML
LB> validator can do, is it?

True, but the validator doesn't have to do any such thing. It just
has to say "yes, this thing is a valid xsd:decimal OR a valid
xsd:double". Other processes will have to actually figure out what
the value *means*, e.g. to sort by it, or generate a scaled value of
it for use in an SVG or whatever.
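
A minimal sketch of that division of labour, in Python. The two regular
expressions are my approximations of the xsd:decimal and xsd:double
lexical forms, written from memory and not checked against the spec,
and the function names are hypothetical:

```python
import re
from decimal import Decimal
from typing import Union

# Approximate lexical patterns (assumptions; verify against XML Schema
# Part 2 before relying on them).
DECIMAL_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
DOUBLE_RE = re.compile(
    r"([+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN)"
)


def is_valid_numeric(text: str) -> bool:
    """The validator's job: accept anything that is a decimal OR a double."""
    text = text.strip()
    return bool(DECIMAL_RE.fullmatch(text) or DOUBLE_RE.fullmatch(text))


def interpret_numeric(text: str) -> Union[Decimal, float]:
    """A downstream process's job: decide what the value means."""
    text = text.strip()
    if DECIMAL_RE.fullmatch(text):
        return Decimal(text)      # plain notation: keep it exact
    if DOUBLE_RE.fullmatch(text):
        return float(text)        # scientific notation, INF, or NaN
    raise ValueError(f"not a valid numeric value: {text!r}")
```

Here "299792458" comes back as an exact Decimal and "3e8" as a float,
while "82.4%" is rejected by both functions. Only interpret_numeric
makes the "if it's not scientific notation, treat it as a decimal"
assumption; the validator never has to.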

LB> But we could certainly say the numeric datatype maps onto
LB> xsd:decimal|xsd:float if that would make you happy.

[I presume you mean "xsd:double" -- we haven't considered xsd:float,
and I see no reason why we should.]

I think the worst that could happen in this case is some developers
curse us for having been so cavalier with their time and effort.
Fine, let's go with 
   data.numeric = xsd:decimal | xsd:double