on spec grp 2, datatypes normalized (was "Re: [tei-council] datatypes")
Syd Bauman
Syd_Bauman at Brown.edu
Sun Sep 18 11:15:56 EDT 2005
It seems to make sense to divide these comments up along with the
clever groupings Lou has used. Thus this one is
On Specification group 2: Datatypes: normalised
-- ------------- ----- -- ---------- ----------
* tei.data.temporal (I like the name better than the previous clumsy
"temporalExpression"): the change removes the capability to express
a time without seconds. The change also brings the datatype closer
in line with being "normalized" to W3C Schema part 2, in the sense
that the only component whose value space is not a date or time has
been removed.
However, as is often the case, I don't think the datatypes
described in W3C Schema part 2 would do a particularly good job of
modeling or representing humanities data. (Remember, they were
written by corporations for corporations.) One might even go so far
as to say W3C is a bit dyslexic in the area of dates and times. I
concede that I may have missed or be misunderstanding some
pertinent detail thereof; and I am going to resist the urge to bash
W3C date & time datatypes in general, and will try to stick to the
detailed issue at hand. However, so that no one gets the wrong
idea, I will note that as much as I can pick holes in W3C
representations, they've done a better job than I would have done
at first crack, and have what I think is a very good examination of
how dates, times, and durations are compared and thus sorted.
The problem at hand boils down to the W3C conception of a time:
"time represents an instant of time ..." (3.2.8). Thus, to the W3C,
"09:07:51" is just a less precise description of some instant,
which would be more precisely described as "09:07:51.209831". But
in the humanities, I submit, we often need to encode times less
precisely than to the instant, or even the second. Requiring
seconds (or even thousandths thereof) when you're logging incoming
support phone calls or web hits makes sense. But requiring them for
"They had all frozen at the same time, on a snowy night, seven
years before, and after that it was always ten minutes to five in
the castle" or "Six in the evening" is problematic. "16:50" says
something different than "16:50:00" just as "that building is 14.2
meters high" is different than "that building is 14200 millimeters
high". The former time denotes a minute (the 1,011th minute of the
day), the latter time denotes a second (the 60,601st second of the
day). The latter height implies one used a ruler with millimeter
markings; the former does not.
Rather than suggest TEI should require precision to the second, I'm
actually surprised that no one has suggested, for consistency at
least, that TEI should also permit precision to the hour alone. (After
all, pending this discussion we permit precision to the year,
month, day, minute, and second.)
All that said, if we do go with the EDW90 recommendation of
permitting times to the minute, the prose should note that it is
possible that some processors might not be able to properly compare
or sort them. (Unlikely, though, as the W3C algorithm for doing so
applies to times without seconds as well as with; see 3.2.7.4.)
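The minute-versus-second distinction above can be sketched in Python.
This is only an illustration: describe_time and sort_key are
hypothetical helpers, not part of TEI or W3C Schema, and the ordering
used (by starting second, coarser value first on a tie) is an
assumption rather than the W3C comparison algorithm.

```python
import re

# Hypothetical helpers for illustration only; not a TEI or W3C API.
_TIME = re.compile(r"^([01][0-9]|2[0-3]):([0-5][0-9])(?::([0-5][0-9]))?$")


def describe_time(value):
    """Return (precision, ordinal) for an hh:mm or hh:mm:ss string."""
    m = _TIME.match(value)
    if m is None:
        raise ValueError("not an hh:mm or hh:mm:ss time: %r" % value)
    hours, minutes = int(m.group(1)), int(m.group(2))
    if m.group(3) is None:
        # No seconds given: the value names a whole minute of the day.
        return ("minute", hours * 60 + minutes + 1)
    seconds = int(m.group(3))
    # Seconds given: the value names a single second of the day.
    return ("second", (hours * 60 + minutes) * 60 + seconds + 1)


def sort_key(value):
    """Order times by their starting second, whatever their precision."""
    precision, ordinal = describe_time(value)
    start = (ordinal - 1) * 60 if precision == "minute" else ordinal - 1
    # Assumed tie-break: the coarser (minute) value sorts first.
    return (start, precision != "minute")


print(describe_time("16:50"))     # a minute of the day
print(describe_time("16:50:00"))  # a second of the day
print(sorted(["16:50:00", "09:07:51", "16:50", "06:00"], key=sort_key))
```

The point of returning the precision alongside the ordinal is that
"16:50" and "16:50:00" stay distinguishable even though they begin at
the same moment, and sorting still works on a mix of both.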
* tei.data.duration: The change removes the capability to express a
duration in a more common syntax, sticking only with the W3C syntax
that uses the ISO 8601 "format with time-unit designators". Here the
difference, as far as I know, is only syntax. My reasoning for
recommending we allow more readable syntax is that I think many, if
not most, applications within the TEI world would be using these
sorts of values more for regularization than for normalization.
Admittedly, the W3C datatype performs both functions, if almost
unreadably. (Why couldn't they have recommended 8601 alternate
syntax?)
So I still prefer
xsd:duration | xsd:token { pattern =
"[0-9]+(\.[0-9]+)? (year|month|week|d|h|min|s|ms|μs)" }
(Note that this is *not* the same as the one in EDW90; this one is,
IMHO, much better, and conforms to NIST recommendations (which use
SI wherever possible); however, it does not permit expression of a
negative duration.) But again, unlike tei.data.temporal, where the
semantics are actually different, in this case it is only syntax,
so I don't care very much. I just think TEI users are going to be
much happier encoding <person age="3 month"> than <person
age="P3M"> and <event dur="13 ms"> than <event dur="PT0.013S">.
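Since the difference is only syntax, the readable form can be turned
into the W3C form mechanically. Below is a minimal Python sketch,
assuming the pattern above with its group closed. The function name
to_xsd_duration and the unit table are hypothetical, and mapping
"week" to seven days is an assumption, made because xsd:duration has
no week designator.

```python
import re
from decimal import Decimal

# The readable pattern from above, anchored so the whole value must match.
READABLE = re.compile(
    r"^([0-9]+(?:\.[0-9]+)?) (year|month|week|d|h|min|s|ms|μs)$"
)

# Hypothetical table: unit -> (belongs after "T", designator, multiplier).
_UNITS = {
    "year": (False, "Y", Decimal(1)),
    "month": (False, "M", Decimal(1)),
    "week": (False, "D", Decimal(7)),  # assumption: 1 week = 7 days
    "d": (False, "D", Decimal(1)),
    "h": (True, "H", Decimal(1)),
    "min": (True, "M", Decimal(1)),
    "s": (True, "S", Decimal(1)),
    "ms": (True, "S", Decimal("0.001")),
    "μs": (True, "S", Decimal("0.000001")),
}


def to_xsd_duration(value):
    """Convert e.g. '3 month' or '13 ms' to an xsd:duration string."""
    m = READABLE.match(value)
    if m is None:
        raise ValueError("not a readable duration: %r" % value)
    is_time, designator, factor = _UNITS[m.group(2)]
    amount = Decimal(m.group(1)) * factor
    # xsd:duration permits a decimal fraction only in the seconds field.
    if designator != "S" and amount != amount.to_integral_value():
        raise ValueError("fraction not expressible in xsd:duration: %r" % value)
    # normalize() drops trailing zeros; "f" avoids exponent notation.
    text = format(amount.normalize(), "f")
    return ("PT" if is_time else "P") + text + designator


print(to_xsd_duration("3 month"))
print(to_xsd_duration("13 ms"))
```

Note that the conversion is not total: a value such as "1.5 year"
matches the readable pattern but has no direct xsd:duration
equivalent, so the sketch rejects it rather than guess.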
* tei.data.sex: We've already hashed this through, and it's only
syntax, so again, I don't care much. But I am not sure this one
meets "Syd's rule". I think for many TEI applications "u", "m",
"f", and "x" (or "not known", "male", "female", "not specified")
will be demonstrably significantly better -- the humans will be
able to proofread the texts.