on spec grp 2, datatypes normalized (was "Re: [tei-council] datatypes")

Sun Sep 18 11:15:56 EDT 2005

It seems to make sense to divide these comments up along the same
clever groupings Lou has used. Thus this one is

On Specification group 2: Datatypes: normalised
-- ------------- ----- -- ---------- ----------

* tei.data.temporal (I like the name better than the previous clumsy
  "temporalExpression"): the change removes the capability to express
  a time without seconds. The change also brings the datatype closer
  in line with being "normalized" to W3C Schema part 2, in the sense
  that the only component whose value space is not a date or time has
  been removed.

  However, as is often the case, I don't think the datatypes
  described in W3C Schema part 2 would do a particularly good job of
  modeling or representing humanities data. (Remember, they were
  written by corporations for corporations.) One might even go so far
  as to say W3C is a bit dyslexic in the area of dates and times. I
  concede that I may have missed or be misunderstanding some
  pertinent detail thereof; and I am going to resist the urge to bash
  W3C date & time datatypes in general, and will try to stick to the
  detailed issue at hand. However, so that no one gets the wrong
  idea, I will note that as much as I can pick holes in W3C
  representations, they've done a better job than I would have done
  at first crack, and have what I think is a very good examination of
  how dates, times, and durations are compared and thus sorted.

  The problem at hand boils down to the W3C conception of a time:
  "time represents an instant of time ..." (3.2.8). Thus, to the W3C,
  "09:07:51" is just a less precise description of some instant,
  which would be more precisely described as "09:07:51.209831". But
  in the humanities, I submit, we often need to encode times less
  precisely than to the instant, or even the second. Requiring
  seconds (or even thousandths thereof) when you're logging incoming
  support phone calls or web hits makes sense. But requiring them for
  "They had all frozen at the same time, on a snowy night, seven
  years before, and after that it was always ten minutes to five in
  the castle" or "Six in the evening" is problematic. "16:50" says
  something different than "16:50:00" just as "that building is 14.2
  meters high" is different than "that building is 14200 millimeters
  high". The former time denotes a minute (the 1,011th minute of the
  day), the latter time denotes a second (the 60,601st second of the
  day). The latter height implies one used a ruler with millimeter
  markings; the former does not.

  Far from suggesting that TEI should require precision to the
  second, I'm actually surprised that no one has suggested, for
  consistency at least, that TEI should also permit precision to the
  hour alone. (After all, pending this discussion we permit precision
  to the year, month, day, minute, and second.)

  All that said, if we do go with the EDW90 recommendation of
  permitting times to the minute, the prose should note that it is
  possible that some processors might not be able to properly compare
  or sort them. (Unlikely, though, as the W3C algorithm for doing so
  applies to times without seconds as well as with; see 3.2.7.4.)
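
  To make the minute-versus-second distinction concrete, here is a
  minimal Python sketch. Nothing in it comes from EDW90 or from W3C
  Schema part 2: the names TIME_RE, parse_time, and sort_key are
  hypothetical, and sorting a minute-precision value by the start of
  its minute is my own assumption, not the W3C comparison algorithm.

```python
import re

# Hypothetical pattern: "hh:mm" with an optional ":ss" (no fractions, no zone).
TIME_RE = re.compile(r"([01][0-9]|2[0-3]):([0-5][0-9])(?::([0-5][0-9]))?")


def parse_time(value):
    """Return (unit, ordinal): the 1-based minute or second of the day."""
    m = TIME_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a time: {value!r}")
    hours, minutes = int(m.group(1)), int(m.group(2))
    if m.group(3) is None:
        # No seconds given: the value denotes a whole minute.
        return ("minute", hours * 60 + minutes + 1)
    # Seconds given: the value denotes a single second.
    return ("second", (hours * 60 + minutes) * 60 + int(m.group(3)) + 1)


def sort_key(value):
    """Seconds since midnight at which the denoted minute or second starts."""
    unit, ordinal = parse_time(value)
    return (ordinal - 1) * (60 if unit == "minute" else 1)


print(parse_time("16:50"))     # ('minute', 1011)
print(parse_time("16:50:00"))  # ('second', 60601)
print(sorted(["16:50:30", "09:07:51", "16:50"], key=sort_key))
```

  The point is only that "16:50" and "16:50:00" can keep their stated
  precision and still be sorted together.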

* tei.data.duration: The change removes the capability to express a
  duration in a more common syntax, sticking only with the W3C syntax
  that uses the 8601 "format with time-unit designators". Here the
  difference, as far as I know, is only syntax. My reasoning for
  recommending we allow more readable syntax is that I think many, if
  not most, applications within the TEI world would be using these
  sorts of values more for regularization than for normalization.
  Admittedly, the W3C datatype performs both functions, if almost
  unreadably. (Why couldn't they have recommended 8601 alternate
  syntax?)

  So I still prefer
     xsd:duration | xsd:token { pattern =
     "[0-9]+(\.[0-9]+)? (year|month|week|d|h|min|s|ms|&mu;s)" }
  (Note that this is *not* the same as the one in EDW90; this one is,
  IMHO, much better, and conforms to NIST recommendations (which use
  SI wherever possible); however, it does not permit expression of a
  negative duration.) But again, unlike tei.data.temporal, where the
  semantics are actually different, in this case it is only syntax,
  so I don't care very much. I just think TEI users are going to be
  much happier encoding <person age="3 month"> than <person
  age="P3M"> and <event dur="13 ms"> than <event dur="PT0.013S">.
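
  Since the difference is only syntax, a processor could map the
  readable form onto the W3C form mechanically. The following Python
  sketch is one way to do that, under stated assumptions: the name
  to_xsd_duration is hypothetical, the pattern is the one above with
  its group closed and the mu written as U+03BC, a week is turned
  into 7 days because xsd:duration has no week designator, and
  fractional values are accepted only for s, ms, and the microsecond
  unit because xsd:duration allows a fraction only on seconds.

```python
import re
from decimal import Decimal

MU = "\u03bc"  # Greek small letter mu; assumed to be what "&mu;" stands for

READABLE = re.compile(
    rf"([0-9]+(?:\.[0-9]+)?) (year|month|week|d|h|min|s|ms|{MU}s)"
)

# Sub-second units, expressed as a number of seconds.
SCALE = {"s": Decimal("1"), "ms": Decimal("0.001"), MU + "s": Decimal("0.000001")}
DATE_UNITS = {"year": "Y", "month": "M", "d": "D"}
TIME_UNITS = {"h": "H", "min": "M"}


def to_xsd_duration(value):
    """Convert e.g. '3 month' or '13 ms' to the xsd:duration lexical form."""
    m = READABLE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a readable duration: {value!r}")
    number, unit = Decimal(m.group(1)), m.group(2)
    if unit in SCALE:
        # format(..., "f") avoids exponent notation such as 5E-7.
        return "PT" + format(number * SCALE[unit], "f") + "S"
    if number != number.to_integral_value():
        raise ValueError(f"fractional {unit} has no exact xsd:duration form")
    count = int(number)
    if unit == "week":
        return f"P{count * 7}D"
    if unit in DATE_UNITS:
        return f"P{count}{DATE_UNITS[unit]}"
    return f"PT{count}{TIME_UNITS[unit]}"


print(to_xsd_duration("3 month"))  # P3M
print(to_xsd_duration("13 ms"))    # PT0.013S
```

  Going the other way (from "P3M" back to "3 month") is not attempted
  here.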

* tei.data.sex: We've already hashed this through, and it's only
  syntax, so again, I don't care much. But I am not sure this one
  meets "Syd's rule". I think for many TEI applications "u", "m",
  "f", and "x" (or "not known", "male", "female", "not specified")
  will be demonstrably significantly better -- the humans will be
  able to proofread the texts.