[tei-council] facsimile draft

Thu Aug 9 19:30:04 EDT 2007

On Thu, 2007-08-09 at 10:00 +0100, Lou Burnard wrote:
> Conal Tuohy wrote:

> The problem with the way this was done before (and I may have 
> misunderstood) is that you had to have special rules telling you which 
> <graphic> the @coords supplied for it were relative to (see earlier 
> discussion with Sebastian).  

Well ... which <surface/> to be precise, and from there, to the graphics
which depict that surface. But yes, a "special rule" is what it was. 

I can accept that.

> URL pointing solves this by the concept of 
> xml:base, which might have been a good solution, if only there were an 
> xpointer location scheme for locating boxes within the graphic that we 
> all agreed on. We could define our own but I felt we ought to learn our 
> lesson from last time and wait for someone else to do it!

agreed

>> For instance, imagine a TEI transcript originating in an OCR process,
>> which would have image coordinates assigned to each word by the OCR
>> software. Using your draft markup, if I understand correctly, it
>> would be necessary to create a distinct zone element for each word,
>> essentially a parallel of the transcription, and link each word in
>> the transcript to its corresponding zone. This would be quite an
>> overhead!

> Why? It's an automated process anyway, isn't it? You would indeed need 
> to define a zone for each word, but wouldn't that be precisely what the 
> OCR process outputs anyway?

Sure ... it would be an overhead only in terms of the amount of markup
required, not necessarily in terms of the encoding work involved in
making that markup (which in the OCR case would be fully automated
anyway), or the extra post-processing involved (which is just
de-referencing some URLs).

So I'm not too fussed on this point. 

I think when people are going to do a lot of alignment (such as aligning
each word), then they're more likely to be using some automated tools
anyway, so the extra markup overhead is not an issue. I could be wrong
about that, though. Perhaps you could comment on that, Dot?

As a distinct benefit, dropping @coords (or @box) from textual elements
also makes for a cleaner separation between the facsimile and the text,
in that the text is not polluted with any "facsimilar" data, and the
facsimile is then "pure" standoff.
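
To make the trade-off concrete, here is a rough sketch of what per-word
standoff alignment might look like under the draft. The @box value
syntax (four space-separated numbers), the xml:id values and the file
name are assumptions made up for illustration; they are not taken from
the draft itself.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader omitted for brevity -->
  <facsimile>
    <surface>
      <!-- hypothetical page image -->
      <graphic url="page1.png"/>
      <!-- one zone per OCR'd word; box syntax is an assumption -->
      <zone xml:id="z1" box="100 120 180 150"/>
      <zone xml:id="z2" box="190 120 260 150"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <!-- the transcription carries only pointers, no coordinates -->
      <p><w facs="#z1">Hello</w> <w facs="#z2">world</w></p>
    </body>
  </text>
</TEI>
```

The markup roughly doubles per word, but the text itself holds nothing
except the @facs pointers.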

>> I also think that the value space for @facs is too loose - in the
>> sense that a <p> or a <div> could use a @facs pointer to point to
>> either an image file, to a zone, or to a graphic. I have a feeling
>> this is not going to be so convenient for processing. In the previous
>> draft, the idea was that such links would be ONLY to zones, which
>> were facsimile equivalents of <anchor> elements in a transcription. 

> We can't enforce this kind of rule (even for <anchor>s) -- it's a 
> data.pointer and it can point anywhere it pleases. 

Can't it be enforced with schematron, though? 

In any case, there are already attributes which are supposed to point to
elements of a particular type (even if this constraint is only expressed
in the text of the guidelines). 

e.g. add/@hand, must point to a hand
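
Something along these lines might do it. This is only a sketch: it
assumes the ISO Schematron namespace, assumes that @facs holds a single
local pointer of the form "#id", and does not cover the case where
@facs points directly to an image file.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <!-- check every element in the transcription that carries @facs -->
    <sch:rule context="tei:text//*[@facs]">
      <!-- true only if some <zone> has an xml:id matching the pointer -->
      <sch:assert test="starts-with(@facs, '#') and
                        //tei:zone/@xml:id = substring-after(@facs, '#')">
        @facs should point to a zone element in this document.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```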

> I felt it was useful 
> to spell out what it *means* when it points to different kinds of thing. 
> An application can of course choose not to support a particular class of 
> target, but that's a different issue.

From the point of view of promoting interoperability, any constraints we
can place on it will be a help. If we could say "@facs always points to
a zone", for instance, then this would help to prevent a situation where
some applications support this and other applications support that.

>> You've also allowed <graphic> inside <zone>, and I'm having a hard
>> time understanding the rationale for this change. It seems to be of a
>> piece with the change to remove <graphic> from att.coordinated. Now,
>> since a graphic has no @box of its own, it inherits one from its
>> parent <zone>, is that right? 

> <graphic> inside <zone> means the same as <graphic> inside <surface> 
> (you may recall that I wanted to use <surface> recursively) -- this is 
> an image of the zone/graphic defined here, so yes: the bounding box of 
> the graphic/s inside a <zone> are defined by the parent zone.
> 
> I wanted to avoid change to <graphic>, if at all possible. And I also 
> wanted to separate the co-ordinate information from the graphical 
> pointing information.

OK this makes sense.

>> In my previous draft, a graphic had a @box (or @coords as it was
>> still called) attribute of its own, and hence didn't need to be
>> enclosed in a zone, and I don't see why we'd want to wrap those
>> graphics in zones, when they could just have their own @box. What
>> does that gain us?

> A clearer separation of concepts, imho. Plus the ability to give 
> multiple graphic realisations for the same space in a relatively 
> non-prolix manner.

Fair enough. 

>> Removing graphic from zone (and giving graphic its own @box)
>> would mean that zones would be always empty, and this would simplify
>> processing, too, I believe.

> Because empty elements are easier to process than full ones? I find that 
> hard to believe!

No - I didn't word that sentence well - the simplification would have
come from facsimile graphics always having a <surface/> parent, rather
than sometimes a <surface/> and sometimes a <zone/>.

>> Regarding the "short-cut" which allows facsimile/graphic instead of
>> requiring facsimile/surface/graphic, this seems reasonable, though I
>> wonder if there's much prospect of people using this short-cut, and
>> if not, I think the shortcut should be abolished (to simplify
>> processing). The reason I doubt it would be popular is that if you
>> have a single graphic, you already have the option of linking to it
>> directly from a pb, which is an even shorter short-cut. If you use
>> the facsimile/graphic shortcut (i.e. a graphic as a direct child of
>> facsimile, rather than mediated by a surface), you don't have the
>> option of using zones anyway, so this slightly-longer shortcut
>> doesn't cater for any distinct use case as far as I can see.
>>
> See my comment to Dan before breakfast. It seems a good idea to have a 
> clear distinction between graphics in the text and graphics representing 
> the text. It seems a good idea to have a place where all the information 
> about the latter can be collected together. But I agree it won't seem 
> that short a cut to people who just want to pepper their transcriptions 
> with explicit pointers off into the wild blue yonder with no concern for 
> the morrow... such people are probably beyond help anyway.

Can we agree to drop that feature then? 

If so, we could restrict <facsimile> to only contain <surface> elements
(no <graphic>s), each <surface> could be restricted to only contain
<zone> elements, and facsimile <graphic>s would only ever be contained
in <zone> elements, while @facs could be restricted to point only to a
<zone> or (as a concessionary shortcut, for those people who can't
afford <zone> elements) directly to an image file.
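
As a sketch of what that restricted structure might look like (the
@box syntax, the xml:id values and the file names are again assumptions
for illustration only):

```xml
<facsimile xmlns="http://www.tei-c.org/ns/1.0">
  <!-- <facsimile> contains only <surface> elements -->
  <surface>
    <!-- each <surface> contains only <zone> elements -->
    <zone xml:id="page1" box="0 0 1000 1500">
      <!-- graphics live only inside zones; several renderings
           of the same space can share one bounding box -->
      <graphic url="page1-lowres.png"/>
      <graphic url="page1-hires.tif"/>
    </zone>
    <zone xml:id="page1-title" box="100 80 900 200">
      <graphic url="page1-title.png"/>
    </zone>
  </surface>
  <!-- In the transcription, @facs would then take one of two forms:
         <pb facs="#page1"/>     points to a zone
         <pb facs="page2.png"/>  shortcut: points to an image file -->
</facsimile>
```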

>> In short, I'm a bit flummoxed. I liked the linking better the way it was.

> Well, fair enough. I apologize if you feel I've messed up your ideas 
> completely, and am very grateful for your willingness to engage in the 
> debate. I think there's been a pretty convincing groundswell of approval 
> for the direction things are going so the process can't be all bad.

No, no problem at all! Sorry if I sounded bitter - I'm certainly not! In
fact I'm really pleased with the work you've put into it, Lou,
especially now that I understand it better :-)

I'm happy for you to propose whatever you see fit. I think it's
important, though, to present some of the real reasons for the
differences, so we can evaluate them correctly.

I agree we are making progress on this, so I've got no complaints.

Cheers

Con