[tei-council] date attributes: summary, problems, and some suggestions

Syd Bauman Syd_Bauman at Brown.edu
Mon Jan 22 17:23:37 EST 2007


First, a listing of the attributes that are directly involved with
dating. ("dating" as in "timing", not as in "courtship" :-)


State of Play, P4
----- -- ----- --
  <date> calendar=
         certainty=
         value=
  <time> type=
         value=
         zone=
  <dateRange> calendar=
              exact=
              from=
              to=
  <timeRange> exact=
              from=
              to=
  <dateStruct> a.temporalExpr ( value=, key=, reg=, type=, full= )
               calendar=
               exact=
  <timeStruct> a.temporalExpr ( value=, key=, reg=, type=, full= )
               zone=
  <distance> a.temporalExpr ( value=, key=, reg=, type=, full= )
             exact=
  <offset> a.temporalExpr ( value=, key=, reg=, type=, full= )

Additionally there is admin/@date, birth/@date, docDate/@value,
@crdate of all pointer elements, teiHeader/@date.created,
teiHeader/@date.updated, and writingSystemDeclaration/@date; also
<day>, <hour>, <minute>, <month>, <second>, <week>, and <year> are all
members of a.temporalExpr. Did I miss any?

The rule in P4 for all of the attributes that held a date/time value
is pretty simple. It boils down to "if you can use ISO 8601, do so; if
not, document whatever you do in <stdVals> in the header".


State of Play, P5
----- -- ----- --
  <date> att.datePart ( value=, dur= )
         att.editLike ( cert= )
         att.datable  ( notBefore=, notAfter=, from=, to= )
         att.typed    ( type=, subtype= )
         calendar=
         precision=
  <time> att.datePart ( value=, dur= )
         att.editLike ( cert= )
         att.datable  ( notBefore=, notAfter=, from=, to= )
         att.typed    ( type=, subtype= )
  <distance> att.datePart ( value=, dur= )
             att.typed    ( type=, subtype= )
             exact=

Additionally there is birth/@date, change/@date, death/@date,
docDate/@value, and when/@absolute. Furthermore, the following
elements are members of att.datable:

    <acquisition>, <affiliation>, <age>, <binding>, <birth>,
    <custEvent>, <death>, <education>, <faith>, <floruit>,
    <langKnowledge>, <langKnown>, <nationality>, <origDate>,
    <origin>, <persEvent>, <persName>, <persState>, <persTrait>,
    <provenance>, <relation>, <residence>, <seal>, <sex>, and
    <socecStatus>


Problems
--------
* Some are distressed by the fact that attributes that are of the
  same datatype (data.temporal) and serve similar functions have
  different names, in particular:
      value=  of  <date>, <time>, <distance>, and <docDate>
      date=   of  <birth>, <change>, and <death>
  I am not bothered by this in the least, because I think the
  semantics are clearer with these names, and the combined
  alternative (dateValue=) is at least cumbersome if not misleading
  (i.e., on <time>).
  Suggestion: leave names as they are. 

* We haven't implemented classes as well as we could.
  Suggestions: 

  - Put <docDate> into att.datePart. This has the disadvantage of
    giving <docDate> a dur= attribute, but I'm not sure it is worth
    making another class just for this one case. Thoughts?

  - Create a new attribute class for the date= of <birth>, <death>,
    and <change>. (Any suggestions for the name?) 

  - If we keep <distance>[1] we may wish to reconsider its class
    membership, as value= is a bit silly on <distance>. It needs only
    dur= from att.datePart, making two cases that benefit from
    splitting att.datePart. (See <docDate>, above.)

* The precision= attribute is superfluous, as the precision is
  represented in the value of the value=, dur=, notBefore=, notAfter=,
  from=, or to= attribute(s).
  One might argue that we should instead change this attribute to
  indicate that a time is precise only to the minute or hour (as
  opposed to second or fraction thereof) and thus not require our
  extension to W3C datatypes. However, this may become a non-issue
  depending on outcomes of the items discussed below; besides, I
  wouldn't make this argument, so ...
  Suggestion: delete it.
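
To illustrate the claim that precision is already carried by the
value itself, here is a minimal Python sketch. The function name and
the simplified patterns are my own assumptions (no time zones, no
negative years, only a few of the W3C types); it is not part of any
TEI tooling.

```python
import re

# Hypothetical helper: simplified lexical patterns, most precise first.
# Time zones, negative years, and time-only values are deliberately ignored.
_PATTERNS = [
    ("second", re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?")),
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("year", re.compile(r"\d{4}")),
]


def implied_precision(value):
    """Return the precision implied by the lexical form, or None."""
    for label, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            return label
    return None
```

Under these assumptions a value such as "1909-10" already says
"precise to the month", which is the information precision= would
otherwise repeat.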

* Users want a method of expressing things like "Oct 27 of 1909, 1910,
  or 1911" or "an Oct 27, but I don't know which one". The W3C format
  that expresses only a month and day explicitly (xsd:gMonthDay) means
  "a set of one-day long, annually periodic instances". These users
  don't want the entire set, they want only one. ISO 8601:2004 does
  not seem to have even a method to represent the set, let alone a
  singleton. (James, can you verify that? How would one represent
  month & day, no year, in 8601?)
  Suggestion: I haven't got one, thus defer to P5 1.1.
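
To make the distinction concrete, the following Python sketch expands
an xsd:gMonthDay value over a short list of candidate years, i.e. the
"Oct 27 of 1909, 1910, or 1911" case. The function name is
hypothetical and the sketch ignores any time zone suffix; it only
illustrates the difference between the whole recurring set and a few
candidates, and is not a proposal for how TEI should encode this.

```python
from datetime import date


def candidate_dates(g_month_day, years):
    """Expand a gMonthDay value such as '--10-27' over candidate years."""
    # Assumes the lexical form --MM-DD with no time zone suffix.
    month = int(g_month_day[2:4])
    day = int(g_month_day[5:7])
    # Note: date() raises ValueError for --02-29 in a non-leap year.
    return [date(year, month, day) for year in years]
```

For example, candidate_dates("--10-27", [1909, 1910, 1911]) gives
three specific days, whereas the bare gMonthDay value stands for
every Oct 27.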

* Currently we are using one attribute for both regularization
  and normalization.[2]
  Solutions: see below

* We have created a dating datatype (data.temporal) based mostly on
  W3C datatypes. W3C datatypes have several shortcomings when compared
  to ISO 8601, which itself has several shortcomings with respect to
  the flexibility scholars need. On the other hand, W3C datatypes have
  known software support, which the others do not, and a large
  percentage of users will never need anything more.
  Solutions: see below

* At least one user has expressed a need to express dates in other
  than the [proleptic] Gregorian calendar. He believes this would be a
  requirement of many historians were they to use TEI.
  Solutions: see below

Below
-----
Two different suggestions have been floated for trying to get a handle
on the last three problems, to which I will add two more. 

The basic idea is to provide two capabilities: 
* simple date format: conform to W3C spec, easily validatable, software
  support in the world-at-large
* complex date format: should conform to ISO 8601 if possible

Note that "simple" and "complex" are mostly just labels: it is
possible to have a W3C date expression that is more complex than some
other format. The complex date format could be split into two: those
that conform to ISO 8601 and those that don't; this would give us
three formats, W3C, ISO, and User-generated.

Note that P4 has only complex format dates. Further note that right
now our P5 dates are very like the simple date format, except that a
single complexity has been added: expressing times precise only to the
minute or hour. This complexity is validatable, but enjoys no support
in the world of XSLT 2.0. If we go with *any* of the following
systems, I recommend that our "simple date" formats revert to being
truly W3C-only, and thus those who need to express times less
precisely than to the second would be forced to use the "complex date"
format.

The question is at what level to apportion these capabilities. Here
are the four possibilities I have come up with. Note: the names are
ones I have MADE UP on the spot, and are merely stand-ins for whatever
Council eventually decides they should be named.

attribute level: each of the dating attrs is split into two
datatype level: we provide one datatype for each date format, user
                chooses which for each attribute
class level: for each attribute set, we provide two (or more) classes,
             one for each format, user chooses which for each element
all-in-one: syntax of attribute value differentiates

datatype level
-------- -----
We create two or three datatypes, one for each date format. 

data.w3cTemporal = xsd:date | xsd:gYear | xsd:gMonth | xsd:gDay |
                   xsd:gYearMonth | xsd:gMonthDay | xsd:time |
                   xsd:dateTime

data.isoTemporal = [if & when a datatype library is written, plug it
                    in here; in the meantime, a bunch of gnarly
                    regexes might do the trick.]

data.usrTemporal = xsd:token [3] or whatever user chooses to use

(Latter two could easily be rolled into one 'data.looseTemporal'.)

The user, at schema-creation time (perhaps with easy radio buttons in
Roma) chooses which datatype to use for any given attribute. (A nice
UI feature would allow user to select a datatype for *all* dating
attributes at one shot.)
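
As a rough sketch of what the data.w3cTemporal union accepts, here is
a lexical check in Python. The patterns are simplified assumptions of
mine (they check shape only, not ranges such as month 13), and the
names are made up; real validation should be left to a schema
processor.

```python
import re

# Optional time zone suffix shared by all eight types.
TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Simplified lexical shapes; ranges (e.g. month <= 12) are NOT checked.
W3C_FORMS = {
    "xsd:dateTime": r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?",
    "xsd:date": r"-?\d{4,}-\d{2}-\d{2}",
    "xsd:gYearMonth": r"-?\d{4,}-\d{2}",
    "xsd:gYear": r"-?\d{4,}",
    "xsd:gMonthDay": r"--\d{2}-\d{2}",
    "xsd:gMonth": r"--\d{2}",
    "xsd:gDay": r"---\d{2}",
    "xsd:time": r"\d{2}:\d{2}:\d{2}(\.\d+)?",
}
COMPILED = {name: re.compile(body + TZ) for name, body in W3C_FORMS.items()}


def w3c_type_of(value):
    """Return the name of the first matching W3C type, or None."""
    for name, pattern in COMPILED.items():
        if pattern.fullmatch(value):
            return name
    return None
```

Note that "17:23" (a time precise only to the minute) matches none
of these, so under a W3C-only datatype it would have to go in one of
the looser formats.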

attribute level
--------- -----
For every dating attribute, we use two actual attributes. E.g.:
  <date> value.w3c=  notBefore.w3c=  from.w3c=
         value=      notBefore=      from=
or
  <date> value=        notBefore=        from=
         value.loose=  notBefore.loose=  from.loose=
or
  <date> normValue=  normNotBefore=  normFrom=
         regValue=   regNotBefore=   regFrom=
or whatever. In each case, those on the top line would be the simple
date formats declared with datatype data.w3cTemporal, and those on
the bottom line would be the complex date format declared with
datatype data.looseTemporal (remember, names are just made up).

Each element or class simply declares twice as many dating attributes
as it used to.

class level
----- -----
Declare two or three datatypes as above. Then, for each class that
uses dating attributes (currently att.datePart and att.datable, plus
we'll need some more) create a second and perhaps third class that is
similar, but uses the corresponding datatype.

Thus instead of having the user choose which datatype to use for each
attribute, the user can choose which class for each element. Two
scenarios are possible:

* classes use different attribute names, and can thus be used
  simultaneously if user desires

* classes have same attribute names, and thus are definitionally
  mutually exclusive. I don't know how we'd build this concept into
  ODD or this capability into Roma.

Presuming the first scenario, the classes would be set up something
like the following.

  att.datable.w3c: notBefore= notAfter= from= to=
  att.datable.iso: iso-notBefore= iso-notAfter= iso-from= iso-to=
  att.datable.usr: usr-notBefore= usr-notAfter= usr-from= usr-to=

(Again, latter two could be rolled into one.)

all-in-one
----------
We keep the current set of attributes (rather than doubling or
tripling them) and we continue to have only one datatype. However,
like RFC 3066 language tags, and thus just like our xml:lang= values,
we differentiate the syntax of the value with a prefix:

  no prefix   = entire value is a simple format (i.e. W3C) date
  prefix 'i-' = remainder of value is an ISO 8601 date
  prefix 'x-' = remainder of value is a user-defined format date
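
A processor could dispatch on the prefix roughly as in the Python
sketch below. The function name and the returned labels are
hypothetical; only the 'i-' and 'x-' prefixes come from the scheme
described above.

```python
def classify_date_value(value):
    """Split an attribute value into (format label, remainder)."""
    if value.startswith("i-"):
        return ("iso", value[2:])    # remainder is an ISO 8601 date
    if value.startswith("x-"):
        return ("user", value[2:])   # remainder is user-defined
    return ("w3c", value)            # no prefix: whole value is W3C
```

Since W3C date and time values begin with a digit or a hyphen, never
with a letter, the prefixes cannot collide with an unprefixed value.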


Choosing one of these requires some thought and discussion. Up front I
can only say that I don't like the 'attribute level' solution. It's
just too confusing for the average user, most of whom have little or
nothing to gain. I.e., to borrow a phrase from Perl, while it does
make the hard things possible, it does not make the easy things easy.
The others all do, presuming we make simple format dates the default
for the class or datatype level.


Notes
-----
[1] Lou is arguing that we drop the <distance> element altogether.
    Although I'm interested in arguments for keeping it, it's hard to
    see what purpose it serves that couldn't be handled equally well
    by typed use of <date>, <time>, and <measure>. At the moment I
    lean towards nuking it.

[2] By "regularization" I mean changing the format so that it is
    consistent, and thus can be easily processed by software. E.g.
    if the original text is "5 foot 7 in.", a regularized version
    might read "5 ft 7 in".
    By "normalization" I mean changing the format and possibly the
    value so it is regular with respect to a particular external
    standard. So in the 5'7" example, a normalized version might read
    "1702 mm".
    In many cases, of course, they're the same thing.
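
    The arithmetic behind the example, as a small Python sketch. The
    function names and the digit-grabbing parse are assumptions made
    for illustration; the only fixed fact used is 1 in = 25.4 mm.

```python
import re

MM_PER_INCH = 25.4


def _feet_and_inches(text):
    # Naive parse: take the first two runs of digits as feet and inches.
    feet, inches = (int(n) for n in re.findall(r"\d+", text)[:2])
    return feet, inches


def regularize(text):
    """Same value, consistent format: '5 foot 7 in.' -> '5 ft 7 in'."""
    feet, inches = _feet_and_inches(text)
    return f"{feet} ft {inches} in"


def normalize(text):
    """Re-express against an external standard (millimetres)."""
    feet, inches = _feet_and_inches(text)
    return f"{round((feet * 12 + inches) * MM_PER_INCH)} mm"
```

    Here 5 ft 7 in is 67 in, and 67 x 25.4 = 1701.8, which rounds to
    the 1702 mm given above.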

[3] Remember, an xsd:token is not a 'token' in any normal person's use
    of the term. It is a string. A string upon which whitespace
    normalization is performed before matching for validity.
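
    The whitespace handling can be sketched in Python as follows
    (the function name is made up; the behaviour shown is the
    "collapse" treatment: tabs and line breaks become spaces, runs
    of spaces become one, and leading and trailing spaces are
    dropped).

```python
import re


def collapse_whitespace(value):
    """Approximate the whitespace normalization applied to xsd:token."""
    # Only space, tab, LF, and CR count as whitespace here.
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")
```

    So "  27  Oct\t1909\n" and "27 Oct 1909" are the same value as
    far as validation is concerned.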



