[tei-council] Fwd: bug report for Council, if you like
Laurent Romary
laurent.romary at inria.fr
Sat Oct 2 00:27:21 EDT 2010
Hi there,
Here's a note from Syd complementing a post on SF. Looks like a good
point, doesn't it?
Laurent
Begin forwarded message:
> From: Syd Bauman <Syd_Bauman at Brown.edu>
> Date: 2 October 2010 05:54:35 GMT+02:00
> To: Laurent Romary <Laurent.Romary at loria.fr>
> Subject: bug report for Council, if you like
> Reply-To: Syd_Bauman at Brown.edu
>
> I've just posted a bug report (3079842) which y'all may find easier
> to discuss in e-mail, since it is somewhat long and uses formatting
> that would be lost in Sourceforge.
>
> If you'd like to forward this to Council, feel free. If you'd prefer
> to leave it to be dealt with only on Sourceforge, that's fine, too.
>
> ---------
>
> The declaration of the points= attribute in att.coordinated looks
> like it is probably in error, but since there are no examples of
> the use of points= anywhere in the Guidelines, and the tagdocs for
> att.coordinated, <surface>, and <zone> do not have <listRef>s, it
> is hard to be sure. But the prose of the <desc> does say "a
> series of pairs of numbers", which gives some help.
>
> The current declaration is
>
>   attribute points {
>     list {
>       xsd:token { pattern = "[d]+,[\d]+([\s]+[\d]+,[\d]+){2,}" },
>       xsd:token { pattern = "[d]+,[\d]+([\s]+[\d]+,[\d]+){2,}" }*
>     }
>   }
>
> (Numbers in pointy-brackets refer to productions in
> http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/)
> There are quite a few confusing constructs in there. First,
> putting a single multi-character escape <37> into a character
> class expression <12> is odd. Since "\s" is the same as
> "[#x20\t\n\r]", saying "[\s]" seems superfluous, and I think it
> would match exactly the same set of characters. I may be wrong on
> this, though, because I can't get it to match *anything* using
> either `jing` or `rnv`.
>
> The use of "[\d]" is similarly odd. But here, "\d" is the same as
> "\p{Nd}", which matches [0-9𝟎-𝟿０-９],
> which may not be what TEI wants. Besides the normal range 0-9,
> this permits the ten FULLWIDTH DIGIT characters, the ten
> MATHEMATICAL BOLD DIGIT characters, the ten MATHEMATICAL
> DOUBLE-STRUCK DIGIT characters, the ten MATHEMATICAL SANS-SERIF
> DIGIT characters, the ten MATHEMATICAL SANS-SERIF BOLD DIGIT
> characters, and the ten MATHEMATICAL MONOSPACE DIGIT characters.
> While Unicode defines these characters as having the appropriate
> numeric values, there are still many legacy programming languages
> that do not process them correctly. So TEI may wish to consider
> using "[0-9]" rather than "\d" (aka "\p{Nd}").
>
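As a rough illustration of the ASCII-versus-Unicode difference, here is a minimal sketch using Python's `re` module, whose `\d` (for `str` patterns) also matches any character in Unicode category Nd. Python's regular expression flavour is only a stand-in for the XSD one here, and the sample strings and the helper name `accepted_by` are made up for the illustration.

```python
import re
import unicodedata

ASCII_PAIR = re.compile(r"[0-9]+,[0-9]+")   # ASCII digits only
UNICODE_PAIR = re.compile(r"\d+,\d+")       # any Unicode decimal digit (category Nd)

samples = [
    "12,34",                      # ASCII digits
    "\uFF11\uFF12,\uFF13\uFF14",  # FULLWIDTH DIGIT ONE .. FOUR
    "\U0001D7CE,\U0001D7CF",      # MATHEMATICAL BOLD DIGIT ZERO, ONE
]

def accepted_by(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

# One (ascii_ok, unicode_ok) tuple per sample.
results = [(accepted_by(ASCII_PAIR, s), accepted_by(UNICODE_PAIR, s)) for s in samples]

for s, (ascii_ok, unicode_ok) in zip(samples, results):
    categories = {unicodedata.category(ch) for ch in s if ch != ","}
    print(ascii(s), categories, "ASCII-only:", ascii_ok, "Unicode:", unicode_ok)
```

In this sketch only the first sample passes the ASCII-only pattern, while all three pass the `\d` version.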
> But the first atom of the regular expression is "[d]", which
> matches U+0064, LATIN SMALL LETTER D. I'm guessing this is
> just a typo for "[\d]", which would match digits as described in
> the previous paragraph.
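
The effect of the "[d]" atom can be seen with a small sketch. It applies the pattern to a whole string with Python's `re` module (an assumption: Python's flavour is close to, but not the same as, the XSD one), and it ignores the surrounding RELAX NG list for the moment. The sample values and the helper `whole_match` are hypothetical.

```python
import re

# Pattern exactly as it appears in the current declaration (first atom is "[d]").
AS_DECLARED = re.compile(r"[d]+,[\d]+([\s]+[\d]+,[\d]+){2,}")
# Same pattern with the presumed typo corrected to "[\d]".
PRESUMED_INTENT = re.compile(r"[\d]+,[\d]+([\s]+[\d]+,[\d]+){2,}")

with_digits = "0,0 10,0 10,10"   # three integer pairs
with_letter = "d,0 10,0 10,10"   # first "number" is the letter d

def whole_match(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

declared_on_digits = whole_match(AS_DECLARED, with_digits)
declared_on_letter = whole_match(AS_DECLARED, with_letter)
intent_on_digits = whole_match(PRESUMED_INTENT, with_digits)
intent_on_letter = whole_match(PRESUMED_INTENT, with_letter)

print("as declared:    ", declared_on_digits, declared_on_letter)
print("presumed intent:", intent_on_digits, intent_on_letter)
```

As written, the declared pattern accepts the value beginning with the letter d and rejects the all-digit one; the corrected pattern does the reverse.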
>
> If that first atom is intended to match a digit (and thus the
> first bit before the comma is supposed to match an integer), then
> the entire pattern, it would seem, is intended to match a list of
> 3 or more comma-separated pairs of integers. But the regular
> expression is itself in a RELAX NG list of itself followed by
> zero or more of itself, and so can be matched 1 or more times.
> This duplication -- 1 or more occurrences of 3 or more pairs of
> integers -- seems silly, since any number of pairs of integers
> will match so long as it's at least 3.
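
To make the two nested layers concrete, the sketch below emulates them in Python. A RELAX NG list splits the attribute value on whitespace and matches each resulting token against the inner pattern; the function `list_matches` is a hypothetical emulation of that, not what jing or rnv actually run, and the first atom is assumed to be "[\d]". Python's `str.split()` also splits on more whitespace characters than XML defines, so this is an approximation.

```python
import re

# Inner pattern of the current declaration, assuming "[d]" was meant to be "[\d]".
INNER = re.compile(r"[\d]+,[\d]+([\s]+[\d]+,[\d]+){2,}")
# A single pair, for comparison.
ONE_PAIR = re.compile(r"[\d]+,[\d]+")

def list_matches(value, token_pattern, min_tokens=1):
    """Emulate `list { p, p* }`: split on whitespace, match every token against p."""
    tokens = value.split()
    if len(tokens) < min_tokens:
        return False
    return all(token_pattern.fullmatch(tok) is not None for tok in tokens)

value = "0,0 10,0 10,10"

whole_string_ok = INNER.fullmatch(value) is not None             # pattern on the whole value
inside_list_ok = list_matches(value, INNER)                      # pattern on each token
pairs_in_list_ok = list_matches(value, ONE_PAIR, min_tokens=3)   # one pair per token

print(whole_string_ok, inside_list_ok, pairs_in_list_ok)
```

In this emulation the inner pattern matches the value as a whole string but never matches inside the list, because a token produced by whitespace splitting cannot contain the whitespace the pattern requires. Whether that accounts for the trouble with `jing` and `rnv` mentioned earlier is only a guess.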
>
> The RELAX NG list construct is expressed using the maxOccurs=
> attribute of the <datatype> element in the ODD. So assuming that
> the intent is to match a series of 3 or more pairs of
> non-negative integers, I think the TEI has four choices, based on
> two binary possibilities:
> * express a number using only ASCII digits vs using any Unicode
> digits
> * express the "3 or more" using the regular expression or using
> maxOccurs=.
>
> In any case, the description should be clearer. "a series of
> pairs of numbers" should probably be "a series of 3 or more
> comma-separated pairs of non-negative integers". (If I have that
> right -- ostensibly because the points are measured in pixels,
> which can't be measured fractionally, and they are always relative
> to the containing <surface>, so negative values would be off the
> surface, and are thus out of bounds.) Moreover, it should be made
> clear whether the numbers are relative to the bounding box, or
> just must be within it.
>
> <!-- ASCII digits, TEI occurrence constraint -->
> <datatype minOccurs="3" maxOccurs="unbounded">
>   <rng:data type="token">
>     <rng:param name="pattern">[0-9]+,[0-9]+</rng:param>
>   </rng:data>
> </datatype>
>
> <!-- ASCII digits, XSD occurrence constraint -->
> <datatype minOccurs="1" maxOccurs="1">
>   <rng:data type="token">
>     <rng:param name="pattern">[0-9]+,[0-9]+(\s+[0-9]+,[0-9]+){2,}</rng:param>
>   </rng:data>
> </datatype>
>
> <!-- Unicode digits, TEI occurrence constraint -->
> <datatype minOccurs="3" maxOccurs="unbounded">
>   <rng:data type="token">
>     <rng:param name="pattern">[\d]+,[\d]+</rng:param>
>   </rng:data>
> </datatype>
>
> <!-- Unicode digits, XSD occurrence constraint -->
> <datatype minOccurs="1" maxOccurs="1">
>   <rng:data type="token">
>     <rng:param name="pattern">[\d]+,[\d]+(\s+[\d]+,[\d]+){2,}</rng:param>
>   </rng:data>
> </datatype>
>
> Personally, I think the best way to go is to define a TEI
> datatype (data.point) as
>   xsd:token { pattern = "[0-9]+,[0-9]+" }
> and then use
>
>   <datatype minOccurs="3" maxOccurs="unbounded">
>     <rng:ref name="data.point"/>
>   </datatype>
>
> Besides being more elegant and generalizable, this means that if
> and when Council decides to support the other Unicode digits it
> is an easy change.
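
The "easy change" argument can be sketched in Python. Everything here is hypothetical: `DATA_POINT_ASCII` and `DATA_POINT_UNICODE` stand in for two possible definitions of the proposed data.point datatype, `parse_points` plays the role of the `minOccurs="3"` list, and Python's `re` is only an approximation of the XSD regular expression flavour.

```python
import re

# Hypothetical stand-ins for the proposed data.point datatype.
DATA_POINT_ASCII = re.compile(r"[0-9]+,[0-9]+")
DATA_POINT_UNICODE = re.compile(r"\d+,\d+")   # a possible later relaxation

def parse_points(value, data_point=DATA_POINT_ASCII, min_occurs=3):
    """Return a list of (x, y) integer pairs, or raise ValueError."""
    tokens = value.split()   # one token per point, as in a RELAX NG list
    if len(tokens) < min_occurs:
        raise ValueError(f"expected at least {min_occurs} points, got {len(tokens)}")
    points = []
    for tok in tokens:
        if data_point.fullmatch(tok) is None:
            raise ValueError(f"not a valid point: {tok!r}")
        x, y = tok.split(",")
        points.append((int(x), int(y)))   # int() also accepts other Unicode decimal digits
    return points

triangle = parse_points("0,0 10,0 10,10")
print(triangle)
```

Switching digit policy then means passing a different `data_point` pattern, while the occurrence check stays where it is; for example, `parse_points("\uFF11,\uFF12 3,4 5,6", data_point=DATA_POINT_UNICODE)` is accepted in this sketch, whereas the default ASCII pattern rejects it.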
>
> ---------
Laurent Romary
INRIA & HUB-IDSL
laurent.romary at inria.fr