[tei-council] Fwd: bug report for Council, if you like

Sat Oct 2 00:27:21 EDT 2010

Hi there,
Here's a note from Syd complementing a post on SF. Looks like a good  
point, doesn't it?
Laurent

Begin forwarded message:

> From: Syd Bauman <Syd_Bauman at Brown.edu>
> Date: 2 October 2010 05:54:35 GMT+02:00
> To: Laurent Romary <Laurent.Romary at loria.fr>
> Subject: bug report for Council, if you like
> Reply-To: Syd_Bauman at Brown.edu
>
> I've just posted a bug report (3079842) which y'all may find easier
> to discuss in e-mail, since it is somewhat long and uses formatting
> that would be lost in Sourceforge.
>
> If you'd like to forward this to Council, feel free. If you'd prefer
> to leave it to be dealt with only on Sourceforge, that's fine, too.
>
> ---------
>
>     The declaration of the points= attribute in att.coordinated looks
>     like it is probably in error, but since there are no examples of
>     the use of points= anywhere in the Guidelines, and the tagdocs for
>     att.coordinated, <surface>, and <zone> do not have <listRef>s, it
>     is hard to be sure. But the prose of the <desc> does say "a
>     series of pairs of numbers", which gives some help.
>
>     The current declaration is
>
>       attribute points {
>         list {
>           xsd:token { pattern = "[d]+,[\d]+([\s]+[\d]+,[\d]+){2,}" },
>           xsd:token { pattern = "[d]+,[\d]+([\s]+[\d]+,[\d]+){2,}" }*
>         }
>       }
>
>     (Numbers in pointy-brackets refer to productions in
>     http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/)
>     There are quite a few confusing constructs in there. First,
>     putting a single multi-character escape <37> into a character
>     class expression <12> is odd. Since "\s" is the same as
>     "[#x20\t\n\r]", saying "[\s]" seems superfluous, and I think it
>     would match exactly the same set of characters. I may be wrong on
>     this, though, because I can't get it to match *anything* using
>     either `jing` or `rnv`.
>
>     The use of "[\d]" is similarly odd. But here, "\d" is the same as
>     "\p{Nd}", which matches [0-9&#x1D7CE;-&#x1D7FF;&#xFF10;-&#xFF19;],
>     which may not be what TEI wants. Besides the normal range 0-9,
>     this permits the ten FULLWIDTH DIGIT characters, the ten
>     MATHEMATICAL BOLD DIGIT characters, the ten MATHEMATICAL
>     DOUBLE-STRUCK DIGIT characters, the ten MATHEMATICAL SANS-SERIF
>     DIGIT characters, the ten MATHEMATICAL SANS-SERIF BOLD DIGIT, and
>     the ten MATHEMATICAL MONOSPACE DIGIT characters. While Unicode
>     defines these characters as having the appropriate numeric
>     values, there are still many legacy programming languages that do
>     not process them correctly. So TEI may wish to consider using
>     "[0-9]" rather than "\d" (aka "\p{Nd}").
>
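
To illustrate the difference between "\d" and "[0-9]", here is a minimal Python sketch. It assumes Python's `re` module is an acceptable stand-in for the XSD regular-expression dialect, which is only approximately true: for `str` patterns Python's `\d` matches any Unicode decimal digit (category Nd), so it also accepts digits from other scripts, such as ARABIC-INDIC DIGIT ONE (U+0661). The helper name `is_pair` and the sample values are made up for this example.

```python
import re

# Two candidate patterns for a single "x,y" pair.
UNICODE_PAIR = re.compile(r"\d+,\d+")      # any Unicode decimal digit (Nd)
ASCII_PAIR = re.compile(r"[0-9]+,[0-9]+")  # ASCII digits only


def is_pair(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


ascii_pair = "10,20"
fullwidth_pair = "\uFF11\uFF10,\uFF12\uFF10"  # FULLWIDTH DIGIT characters
math_bold_pair = "\U0001D7CF,\U0001D7D0"      # MATHEMATICAL BOLD DIGIT characters
arabic_pair = "\u0661,\u0662"                 # ARABIC-INDIC DIGIT characters

for sample in (ascii_pair, fullwidth_pair, math_bold_pair, arabic_pair):
    # Columns: sample, accepted by \d version, accepted by [0-9] version
    print(ascii(sample), is_pair(UNICODE_PAIR, sample), is_pair(ASCII_PAIR, sample))
```

Only the first sample is accepted by the ASCII-only pattern; all four are accepted by the "\d" version.
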
>     But the first atom of the regular expression is "[d]", which
>     matches the U+0064, LATIN SMALL LETTER D. I'm guessing this is
>     just a typo for "[\d]" which would match numbers as described in
>     the previous paragraph.
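
The effect of the suspected typo can be seen with a short Python sketch. Again this assumes Python's `re` behaves like the XSD dialect for these simple character classes; the constant names are invented for the illustration, and only the first pair of the declared pattern is tested.

```python
import re

AS_DECLARED = re.compile(r"[d]+,[\d]+")   # first atom as it appears in the declaration
AS_INTENDED = re.compile(r"[\d]+,[\d]+")  # first atom with the presumed fix

# As declared, the part before the comma must be one or more letters "d".
print(AS_DECLARED.fullmatch("12,34") is not None)  # False
print(AS_DECLARED.fullmatch("dd,34") is not None)  # True

# With the presumed fix, an ordinary pair of integers is accepted.
print(AS_INTENDED.fullmatch("12,34") is not None)  # True
```
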
>
>     If that first atom is intended to match a digit (and thus the
>     first bit before the comma is supposed to match an integer), then
>     the entire pattern, it would seem, is intended to match a list of
>     3 or more pairs of integers separated by a comma. But the regular
>     expression is itself in a RELAX NG list of itself followed by
>     zero or more of itself, and so can be matched 1 or more times.
>     This duplication -- 1 or more occurrences of 3 or more pairs of
>     integers -- seems silly, since any number of pairs of integers
>     will match so long as there are at least 3.
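
A rough way to see how the list wrapper and the pattern interact is to simulate it. The sketch below assumes that a RELAX NG list splits the attribute value on whitespace and checks each resulting token against the pattern, and it uses Python's `re` in place of the XSD dialect. The function `matches_list` and its parameters are hypothetical names for this illustration, not part of any validator. If the assumption holds, a token can never contain the whitespace that the declared pattern requires, which may be why neither `jing` nor `rnv` matches anything.

```python
import re


def matches_list(value, token_pattern, min_occurs=1, max_occurs=None):
    """Approximate a RELAX NG list: split on whitespace, test each token."""
    tokens = value.split()
    if len(tokens) < min_occurs:
        return False
    if max_occurs is not None and len(tokens) > max_occurs:
        return False
    regex = re.compile(token_pattern)
    return all(regex.fullmatch(tok) is not None for tok in tokens)


# The declared pattern with the presumed "[d]" -> "[\d]" fix applied.
DECLARED_FIXED = r"[\d]+,[\d]+([\s]+[\d]+,[\d]+){2,}"
SINGLE_PAIR = r"[0-9]+,[0-9]+"
value = "0,0 10,0 10,10"

# Inside a list, each token is a single pair, so the 3-pair pattern fails.
print(matches_list(value, DECLARED_FIXED))               # False
# One pair per token, with the "3 or more" moved to the occurrence count.
print(matches_list(value, SINGLE_PAIR, min_occurs=3))    # True
# Without the list wrapper, the 3-pair pattern matches the whole string.
print(re.fullmatch(DECLARED_FIXED, value) is not None)   # True
```
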
>
>     The RELAX NG list construct is expressed using the maxOccurs=
>     attribute of the <datatype> element in the ODD. So assuming that
>     the intent is to match a series of 3 or more pairs of non-
>     negative integers, I think the TEI has four choices, based on two
>     binary possibilities:
>     * express a number using only ASCII digits vs using any Unicode
>       digits
>     * express the "3 or more" using the regular expression or using
>       minOccurs= and maxOccurs=.
>
>     In any case, the description should be clearer. "a series of
>     pairs of numbers" should probably be "a series of 3 or more
>     comma-separated pairs of non-negative integers". (If I have that
>     right: ostensibly the points are measured in pixels, which
>     can't be measured fractionally, and they are always relative to
>     the containing <surface>, so negative values would be off the
>     surface, and are thus out of bounds.) Moreover, it should be made
>     clear whether the numbers are relative to the bounding box, or
>     just must be within it.
>
>      <!-- ASCII digits, TEI occurrence constraint -->
>      <datatype minOccurs="3" maxOccurs="unbounded">
>        <rng:data type="token">
>          <rng:param
>              name="pattern">[0-9]+,[0-9]+</rng:param>
>        </rng:data>
>      </datatype>
>
>      <!-- ASCII digits, XSD occurrence constraint -->
>      <datatype minOccurs="1" maxOccurs="1">
>        <rng:data type="token">
>          <rng:param
>              name="pattern">[0-9]+,[0-9]+(\s+[0-9]+,[0-9]+){2,}</rng:param>
>        </rng:data>
>      </datatype>
>
>      <!-- Unicode digits, TEI occurrence constraint -->
>      <datatype minOccurs="3" maxOccurs="unbounded">
>        <rng:data type="token">
>          <rng:param
>              name="pattern">[\d]+,[\d]+</rng:param>
>        </rng:data>
>      </datatype>
>
>      <!-- Unicode digits, XSD occurrence constraint -->
>      <datatype minOccurs="1" maxOccurs="1">
>        <rng:data type="token">
>          <rng:param
>              name="pattern">[\d]+,[\d]+(\s+[\d]+,[\d]+){2,}</rng:param>
>        </rng:data>
>      </datatype>
>
>      Personally, I think the best way to go is to define a TEI
>      datatype (data.point) as
>         xsd:token { pattern = "[0-9]+,[0-9]+" }
>      and then use
>
>      <datatype minOccurs="3" maxOccurs="unbounded">
>        <rng:ref name="data.point"/>
>      </datatype>
>
>      Besides being more elegant and generalizable, this means that if
>      and when Council decides to support the other Unicode digits it
>      is an easy change.
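
For anyone who wants to try out the proposed constraint before the schema changes, here is a minimal Python sketch of a checker. It assumes the data.point lexical form suggested above (ASCII digits only) and a whitespace-separated list of at least three points. The names `DATA_POINT` and `parse_points` are hypothetical, and this is an approximation of the intended validation, not what any RELAX NG processor actually does.

```python
import re

# Hypothetical rendering of the proposed data.point datatype.
DATA_POINT = re.compile(r"[0-9]+,[0-9]+")


def parse_points(value, min_occurs=3):
    """Split a points= value into (x, y) integer tuples, or raise ValueError."""
    tokens = value.split()
    if len(tokens) < min_occurs:
        raise ValueError(f"expected at least {min_occurs} points, got {len(tokens)}")
    points = []
    for tok in tokens:
        if DATA_POINT.fullmatch(tok) is None:
            raise ValueError(f"not a valid point: {tok!r}")
        x, y = tok.split(",")
        points.append((int(x), int(y)))
    return points


print(parse_points("0,0 10,0 10,10"))  # [(0, 0), (10, 0), (10, 10)]
```

Supporting other Unicode digits later would then only mean changing the one `DATA_POINT` pattern, which mirrors the argument for a separate datatype.
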
>
> ---------

Laurent Romary
INRIA & HUB-IDSL
laurent.romary at inria.fr