[tei-council] Feature request 1933198: 'precision'

Thu Apr 9 09:37:43 EDT 2009

A detailed response from Tim.

---------- Forwarded message ----------
Date: Thu, 09 Apr 2009 10:48:47 +0800
From: Tim Finney <tjf2n at virginia.edu>
To: David Sewell <dsewell at virginia.edu>
Cc: Markus Flatscher <markus.flatscher at gmail.com>
Subject: Re: More on dates, contd. (David: low priority)

David,

Thanks for telling me about this.

The proposal looks good but has a few rough edges.

Here is the proposal:

---

(1) Creation of a new element, <precision/> (empty element);

(2) element inherits global attributes and those from att.dimensions
(@min, @max, @atLeast, @atMost are the essential ones);

(3) Create a new attribute class (“att.qualifier” vel sim.) with the
attributes @degree, @locus, @target (currently defined ad hoc for
<certainty/>);

(4) define new attribute @stdDeviation for <precision/>, with datatype
numeric.

---

Here's what I think:

(1) is OK.

(2) Isn't @atLeast the same as @min and @atMost the same as @max? In my
view it is essential to have @assertedValue (or, better, @mostLikely) as
well so that you can give what you think is the most probable value
within an interval.

I would add @circa too, which can be used in place of @min and @max when
someone is encoding what has been given as an approximate quantity with
no mention of the range (i.e. interval) of possible values. I don't
think that @circa and @mostLikely are interchangeable, but I could live
with just @mostLikely.

See here for definitions of confidence interval and confidence level.
The most probable value is often but not always the central value within
the range defined by the interval:

http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

You might also take a look at the section called "Confidence interval
for the simple matching distance" in chapter three of this:

http://alpha.reltech.org/tfinney/ATV/book/index.html

In the humanities, our assertions are rarely based on a statistical
analysis. However, a person often has a reasonable idea of the upper and
lower limits of probable values and an idea of the most likely value
within the interval defined by these limits.
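
A minimal sketch of the idea in the preceding paragraphs: an interval with
an optional most likely value that need not sit at the centre. The class
and field names are hypothetical and are not part of the proposal or of
the TEI; the figures reuse a "c. 1753" style date purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UncertainQuantity:
    """Interval of probable values plus an optional most likely value.

    Hypothetical illustration only: 'minimum', 'maximum' and 'most_likely'
    loosely mirror @min, @max and the suggested @mostLikely.
    """

    minimum: float
    maximum: float
    most_likely: Optional[float] = None

    def __post_init__(self) -> None:
        # The limits must describe a real interval.
        if self.minimum > self.maximum:
            raise ValueError("minimum must not exceed maximum")
        # If a most likely value is given, it has to lie inside the interval.
        if self.most_likely is not None and not (
            self.minimum <= self.most_likely <= self.maximum
        ):
            raise ValueError("most likely value must lie within the interval")

    def midpoint(self) -> float:
        """Central value of the interval (not necessarily the most likely)."""
        return (self.minimum + self.maximum) / 2


# Illustrative figures: the most likely year is not the midpoint.
birth = UncertainQuantity(minimum=1751, maximum=1754, most_likely=1753)
print(birth.midpoint(), birth.most_likely)
```

The check in `__post_init__` is one possible design choice: it rejects a
most likely value that falls outside the stated limits, which an encoder
would presumably regard as an error.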

(3) OK, but I would prefer to use @conf (for confidence) rather than
@degree seeing that "confidence" is the conventional term in this
context. I would not be unhappy with @cert, but think that @conf is
better.

I think that the guidelines should give some guidance about what to use
for @conf in different situations. If you really do know what the
confidence level is then you could give it, e.g. 95%. (This necessarily
implies that you have done a statistical analysis.) If you haven't done
an analysis then you need to use forensic categories instead (e.g.
beyond reasonable doubt, more probable than not, doubtful). I believe
that in these circumstances, the encoder should not create an illusion
by plucking a percentage (or probability) out of the air but should
instead use categories (e.g. high, medium, low) which correspond to
notional ranges of confidence (e.g. >95%, >50%, >5%). See my essay for
a slightly deeper treatment:

http://alpha.reltech.org/tfinney/uncertainty/uncertainty.html

The forensic categories (high, medium, low) also need to include
"unknown" for cases when the confidence level is unknown (e.g. a circa
date).
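
As a rough sketch of how the categories might relate to notional ranges,
the following maps a known confidence level onto high, medium, low or
unknown. The thresholds are the example figures given above (>95%, >50%,
>5%); the names are hypothetical, and treating a level at or below 5% as
"unknown" is an assumption of this sketch, not something stated above.

```python
from typing import Optional

# Notional lower bounds, taken from the example ranges in the message.
# Order matters: the first bound that the level exceeds wins.
NOTIONAL_LOWER_BOUND = {"high": 0.95, "medium": 0.50, "low": 0.05}


def category_for(level: Optional[float]) -> str:
    """Return a category name for a confidence level between 0 and 1.

    None stands for a level that is not known (e.g. a bare circa date).
    """
    if level is None:
        return "unknown"
    if not 0.0 <= level <= 1.0:
        raise ValueError("confidence level must lie between 0 and 1")
    for name, bound in NOTIONAL_LOWER_BOUND.items():
        if level > bound:
            return name
    # Assumption: levels at or below 5% fall outside the named ranges.
    return "unknown"


for level in (0.97, 0.60, 0.10, None):
    print(level, category_for(level))
```

Going in the other direction (from a category to a percentage) is
deliberately not offered, since a category chosen without an analysis does
not justify a specific figure.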

(4) I don't think that @stdDev is useful or necessary if @min, @max, and
@conf are available. The standard deviation of a set of measurements can
be used to construct a confidence interval under certain circumstances.
However, your average punter has no idea what range of possible values a
standard deviation implies or when it is a bad idea to use it because the
sample is too small or not randomly selected or the sampling
distribution is not normal, etc. People who do know these things can
just present a confidence interval as a courtesy to the reader. The
choice is this: Give @stdDev, which is almost useless on its own. ("If I
put a bunch of numbers into my calculator and press SD then I get this
number. I haven't told you how many numbers I used or whether they
represent a random sample so you can't tell what the number really
means. But here it is anyway. It makes this look scientific, doesn't
it."). Alternatively, give the implied confidence interval, which most
people will understand. ("I'm pretty sure the value is between A and
B.")
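
For readers who want to see the step from a standard deviation to an
interval, here is a minimal sketch using only Python's standard library.
It assumes exactly the conditions mentioned above: a randomly selected
sample that is large enough for a normal sampling distribution of the
mean. The function name and the sample figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(values, level=0.95):
    """Approximate confidence interval for the mean of 'values'.

    Assumes a random sample and a normal sampling distribution; for
    small samples this normal approximation is too narrow.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    centre = mean(values)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    half_width = z * stdev(values) / sqrt(n)
    return centre - half_width, centre + half_width


measurements = [10, 12, 11, 13, 9, 11, 12, 10]  # made-up figures
lower, upper = confidence_interval(measurements)
print(f"pretty sure the value is between {lower:.2f} and {upper:.2f}")
```

Note that the interval needs the sample size as well as the standard
deviation, which is why the standard deviation alone tells the reader so
little.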

Tim

On Wed, 2009-04-08 at 14:02 -0400, David Sewell wrote:
> Tim (& Markus),
>
> Gaby Bodard filed a formal proposal for the <precision> element. Council
> is set to approve unless there are objections. See what you think:
>
> https://sourceforge.net/tracker/?func=detail&aid=1933198&group_id=106328&atid=644065
>
> I don't think this would necessarily obviate future improvements to the
> whole date scheme, but is intended to fill an immediate gap in the
> available elements.
>
> I want to come back to the discussion you two have had on dates and
> their relation to our FGEA metadata, but maybe after this week's work on
> TSJN corrections.
>
> David
>
> On Tue, 7 Apr 2009, Tim Finney wrote:
>
> > Hi Markus,
> >
> > On Fri, 2009-04-03 at 09:42 -0400, Markus Flatscher wrote:
> > > Tim, David,
> > >
> > > I've been thinking some more about yesterday's idea of an @RegEx-when
> > > on <date>s, and I guess that introducing a new, flexible data type
> > > > like that is trying to build too much logic into the TEI that would
> > > > be better done at the processing stage. (Kind of like building a spell
> > > checker into HTTP.)
> >
> > Hey, that's a good idea.
> >
> > >
> > > However,
> > >
> > > (1) I realized today that <date> already may contain <date> in the
> > > current version of P5.
> >
> > Wow. That's good to know.
> >
> > >  I think that solves the problem of unevenly
> > > distributed precision or certainty/confidence in conjectural date
> > > ranges of type "c.1753-c.1795". (Sidenote: for those cases, @from and
> > > @to on one single element would have to be considered evil, because
> > > @precision or @cert can't refer to sibling attributes. However, it
> > > should be possible to split up the date range as in case 1 in the
> > > listing below.)
> >
> > As you know (steps onto soap box) I think that uncertain quantities
> > should be able to be expressed with a confidence interval (i.e. a range
> > specified by upper and lower limits) and a confidence level (e.g.
> > "high"). Otherwise, a set of alternatives can be listed with a
> > confidence level. E.g. Many people think that Christ was born <birth
> > bestGuess="-6" notBefore="-7" notAfter="-4" cert="high">between 4 and 7
> > BC, with 6 BC a popular choice</birth>; alternatively, something like
> > <birth cert="high"><choice><date when="-6"/><date when="-5"/><date
> > when="-7"/><date when="-4"/></choice></birth>. (Note that order shows
> > preference in the last example.)
> >
> > >
> > > (2) That being said, in order to do what I referred to as "grouping"
> > > in my last email (referring to specific positions within a date, at
> > > least YYYY or MM or DD), most real-world cases could probably be
> > > handled if tei:year, tei:month and tei:day children were added to
> > > tei:date. See listing below for examples.
> >
> > I like that idea. Maybe a datePart (or just date) element instead, with
> > type saying whether it is a day, month, year, festal season, etc.
> >
> > >
> > > I think this looks, sounds and quacks much more like TEI (and
> > > therefore might have more of a chance of becoming TEI),
> >
> > I wondered what that quacking noise was.
> >
> > >  and the
> > > suggested modification probably could still be handled decently by
> > > applications with a need for higher granularity in spite of the
> > > absence of a straightforward @when for dates or date ranges with mixed
> > > certainty or precision.
> >
> > @when could be replaced with, say, @bestGuess or something similar when
> > the date is uncertain.
> >
> > >
> > > Curious to hear your thoughts,
> >
> > See below...
> >
> > >
> > > Markus
> > >
> > > ---
> > > Listing:
> > >
> > > <!-- case 1: ca.1753--ca.1795 (date range, precision: medium,
> > > tolerance: -5/+2 years) -->
> > > <!-- Note: Tim Finney's example. Argument: "Best guess" is captured in
> > > the @from-iso and @to-iso atts and/or in the element content -->
> > > <date type="range">
> > >     <!-- Note: date containing date is already legal in P5-->
> > >     <date from-iso="1753" precision="circa" notBefore-iso="1748">ca. 1753</date>
> > >     <date to-iso="1795" precision="circa" notAfter-iso="1797">ca. 1795</date>
> > > </date>
> > >
> >
> > The birth and death dates need to be handled separately, I think. So:
> > <birth bestGuess="1753" notBefore="1751" notAfter="1754" cert="high">c.
> > 1753</birth> or <birth bestGuess="1753" precision="circa">c.
> > 1753</birth> and similar death elements.
> >
> >
> >
> > > <!-- case 2: [1 January 1779] (conjectural date, confidence: high) -->
> > > <supplied>
> > >   <date when="1779-01-01">[1 January 1779]</date>
> > > </supplied>
> > >
> >
> > OK.
> >
> > > <!-- case 3: [1? January 1779] (conjectural date, confidence: medium
> > > for day, high for month and year) -->
> > > <supplied>
> > >   <date>
> > >       <supplied>
> > >           <day cert="medium">1?</day>
> > >       </supplied>
> > >       <month>January</month>
> > >       <year>1779</year>
> > >   </date>
> > > </supplied>
> >
> > OK. (Erratum noted.) The header should say what the encoder thinks
> > cert="medium" means, perhaps using categories such as mentioned in my
> > essay on uncertainty. Otherwise, the TEI spec could give some guidance
> > on what high, medium, and low actually mean.
> >
> > >
> > > <!-- case 4: [1? January? 1779] (conjectural date, confidence: medium
> > > for day and month, high for year) -->
> > > <supplied>
> > >   <date>
> > >       <day cert="medium">1?</day>
> > >       <month cert="medium">January?</month>
> > >       <year>1779</year>
> > >   </date>
> > > </supplied>
> > >
> >
> > OK, but maybe <datePart type="day" bestGuess="1"
> > cert="medium">1?</datePart>, or just <date type="day" ...>.
> >
> > > <!-- case 5: 1 [January] 1779 (date, month is conjecture, confidence: high) -->
> > > <date>
> > >   <day>1</day>
> > >   <supplied>
> > >       <month>January</month>
> > >   </supplied>
> > >   <year>1779</year>
> > > </date>
> > >
> >
> > OK.
> >
> > > <!-- case 6: 1 January 17[1 or 7]7 (date, decade is unclear,
> > > confidence: medium for decade value 7, low for decade value 1) -->
> > > <date>
> > >   <day>1</day>
> > >   <month>January</month>
> > >   <year>
> > >       <choice>
> > >           <unclear cert="medium">1777</unclear>
> > >           <unclear cert="low">1717</unclear>
> > >       </choice>
> > >   </year>
> > > </date>
> >
> > OK. There could also be the implication that the order of choices says
> > something about preference.
> >
> >
> >
> >
>