[tei-council] google test book

Wed Jun 22 10:01:08 EDT 2011

Something upset the mailing list software when I replied to this, so for the record here is
what I replied:

This seems pretty fine to me, so far as it goes. There are some validation errors:

1. gullivers_travels.tei.xml:14:26: error: value of attribute "when" is invalid. <date when="06-21-2011"/> should be <date when="2011-06-21"/> (it's an ISO standard date). Should be easy to fix.

2. gullivers_travels.tei.xml:22:10: error: element "imprint" not allowed yet; missing required element "title". Should be easy to fix.

3. The major problem starts at line 162, where we go back to <p> elements after the <div> containing the table of contents. This is sadly illegal
in the TEI schema: once you start working with <div>, you have to carry on.
It happens here, of course, because the automated processing is not dividing the book up into chapters as a human would, _except_ for the TOC.
The solution is to drop the <div type="contents">
wrapper and put the @type on the <list>. TEI's <div>, unlike the HTML div, is _not_ an arbitrary container.
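
For points 1 and 3, a minimal sketch in Python of how the fixes might be automated. This is only an illustration, not part of the Google pipeline: the function names are made up, the date fix assumes the bad values are always month-day-year as in the example above, and the <div> fix assumes the contents <div> holds nothing that would be illegal next to the <p> elements once it is unwrapped.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def local(el):
    """Element name without any '{namespace}' prefix."""
    return el.tag.rsplit('}', 1)[-1]


def fix_when_dates(root):
    """Rewrite @when values written as MM-DD-YYYY into ISO YYYY-MM-DD."""
    fixed = 0
    for el in root.iter():
        when = el.get('when')
        if when is None:
            continue
        try:
            iso = datetime.strptime(when, '%m-%d-%Y').date().isoformat()
        except ValueError:
            continue  # already ISO, or some other form: leave it alone
        el.set('when', iso)
        fixed += 1
    return fixed


def unwrap_contents_div(root):
    """Replace each <div type="contents"> by its children, typing any <list>."""
    changed = 0
    for parent in list(root.iter()):
        for div in list(parent):
            if local(div) != 'div' or div.get('type') != 'contents':
                continue
            pos = list(parent).index(div)
            kids = list(div)
            for kid in kids:
                if local(kid) == 'list':
                    kid.set('type', 'contents')  # move @type onto the list
            if kids:
                kids[-1].tail = div.tail  # keep whatever followed the div
            parent.remove(div)
            for offset, kid in enumerate(kids):
                parent.insert(pos + offset, kid)
            changed += 1
    return changed
```

Both functions work on an already parsed tree, e.g. root = ET.parse("gullivers_travels.tei.xml").getroot(), and return the number of changes made.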

If we fix those things, the thing is technically valid TEI. I attach an ePub version
of the book using the default rendering of my TEI to ePub conversion, for a laugh.

The simplest schema to check against for now (until we develop a stricter one for this purpose)
is the RELAX NG schema at http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng, and the simplest tool
for batch processing is the Jing validator (http://www.thaiopensource.com/relaxng/jing.html).
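
As an illustration of the batch check, here is a small Python wrapper around the Jing command line. It is a sketch under assumptions: Java is installed, jing.jar sits in the working directory, and tei_all.rng has been downloaded from the URL above to a local file. The function names are invented for the example.

```python
import subprocess


def jing_command(files, schema='tei_all.rng', jar='jing.jar'):
    """Build the command line: java -jar jing.jar SCHEMA FILE..."""
    return ['java', '-jar', jar, schema, *files]


def validate(files, schema='tei_all.rng', jar='jing.jar'):
    """Run Jing over the files; return (ok, combined output of the run)."""
    proc = subprocess.run(jing_command(files, schema, jar),
                          capture_output=True, text=True)
    # Jing reports problems as file:line:column: error: ... messages
    # and is expected to exit with a non-zero status when any file fails.
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Calling validate(["gullivers_travels.tei.xml"]) would then give back a pass/fail flag plus messages of the kind quoted at the top of this mail.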

This leaves four problems:

   a) identifying the chapter structure by somehow pattern-matching headings.
   b) joining up paragraphs which break across page boundaries.
   c) suppressing or marking up the running headers. I would suggest marking these up
       with <fw> if they are to stay, but on the whole it's probably easier to throw them away.
   d) joining up paragraphs which are broken for no reason (eg just before PAGE_219).
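
For c), if the running headers are thrown away, the pass could look something like the following Python sketch. It assumes the headers have already been tagged as <fw> (which may not be true of the current output), and the function name is made up. Any text following a removed <fw> is kept.

```python
import xml.etree.ElementTree as ET


def local(el):
    """Element name without any '{namespace}' prefix."""
    return el.tag.rsplit('}', 1)[-1]


def drop_running_headers(root):
    """Remove every <fw> element, keeping the text that follows it."""
    removed = 0
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if local(child) != 'fw':
                prev = child
                continue
            tail = child.tail or ''
            if prev is None:
                parent.text = (parent.text or '') + tail
            else:
                prev.tail = (prev.tail or '') + tail
            parent.remove(child)
            removed += 1
    return removed
```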

a) is just plain hard. I agree with Martin that this may be a job which a human
would happily do in an hour or so, if the rest of the book were fairly clean and there
were a way to easily feed back the improved result.

b) is solvable; it just needs a pass over the file in your
favourite XML programming language to detect the sequence <p> ...</p> <fw>...</fw> <pb/> <p> ...</p>
and turn it into <p>... <fw>...</fw> <pb/> ... </p>. When there _should_ be two paragraphs is
less clear. A human would look at indentation and short last lines of the preceding paragraph.
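
The pass described for b) might be sketched in Python as below. This is an illustration only: the function names are invented, it accepts any number of <fw> elements (including none) before the <pb/>, and the test for "should these be joined" is a crude stand-in of my own (the first paragraph does not end in sentence-final punctuation), not the indentation check a human would use.

```python
import xml.etree.ElementTree as ET

SENTENCE_END = ('.', '!', '?', ':', '"', "'", '\u201d', '\u2019')


def local(el):
    """Element name without any '{namespace}' prefix."""
    return el.tag.rsplit('}', 1)[-1]


def looks_unfinished(p):
    """Heuristic guess that a paragraph stops in mid-sentence."""
    text = ''.join(p.itertext()).rstrip()
    return bool(text) and not text.endswith(SENTENCE_END)


def append_text(el, text):
    """Add text after the last child of el, or to el.text if it has none."""
    if len(el):
        el[-1].tail = (el[-1].tail or '') + text
    else:
        el.text = (el.text or '') + text


def join_split_paragraphs(root, should_join=looks_unfinished):
    """Merge <p>..</p> <fw/>* <pb/> <p>..</p> into one <p>; return the count."""
    joined = 0
    for parent in list(root.iter()):
        i = 0
        while i < len(parent):
            kids = list(parent)
            j = i + 1
            while j < len(kids) and local(kids[j]) == 'fw':
                j += 1
            is_match = (local(kids[i]) == 'p'
                        and j + 1 < len(kids)
                        and local(kids[j]) == 'pb'
                        and local(kids[j + 1]) == 'p'
                        and should_join(kids[i]))
            if not is_match:
                i += 1
                continue
            first, second = kids[i], kids[j + 1]
            append_text(first, ' ')           # keep the words apart
            for moved in kids[i + 1:j + 1]:   # the <fw> run and the <pb/>
                parent.remove(moved)
                moved.tail = None
                first.append(moved)
            append_text(first, second.text or '')
            first.extend(list(second))        # carry over any child elements
            first.tail = second.tail
            parent.remove(second)
            second.clear()
            joined += 1                       # stay on i: may span more pages
    return joined
```

A different rule for deciding when to join can be passed in as should_join without touching the rest of the code.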

I am not sure what the scale of d) is after a quick look. 

What are the empty <figure></figure> for?

If page images are available, they can be referenced by the @facs attribute on <pb/>, which is allowed
to be the URL of a graphic file.

Finally, I note the perennial issue of hyphenated words, eg
"I was surgeon suc-cessively in two ships", but I feel weak when
I think of how to fix those. If they are identifiable in the Google post-OCR form,
then we could suggest ways of marking that up.

Hope this helps. It's going to be really great to have this stuff coming off a production line!
--
Sebastian Rahtz      
Head of Information and Support Group, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

I only ask of God
that I not be indifferent to the future