[tei-council] first stab at Google > TEI

Wed Jun 22 00:09:35 EDT 2011

This seems pretty fine to me, so far as it goes. There are some validation errors:

1. gullivers_travels.tei.xml:14:26: error: value of attribute "when" is invalid. <date when="06-21-2011"/> should be <date when="2011-06-21"/> (it's an ISO standard date). Should be easy to fix.

2. gullivers_travels.tei.xml:22:10: error: element "imprint" not allowed yet; missing required element "title". Should be easy to fix.

3. The major problem starts at line 162, where we go back to <p> elements after the <div>
containing the table of contents. This is sadly illegal in the TEI schema: once you start
working with <div>, you have to carry on. It happens here, of course, because the automated
processing is not dividing the book up into chapters as a human would, _except_ for the TOC.
The solution is to drop the <div type="contents"> wrapper and put the @type on the <list>.
TEI's <div> is _not_ an arbitrary container like the HTML div.
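
Fixes 1 and 3 are mechanical enough to script. What follows is only a sketch, using Python's standard xml.etree.ElementTree; the function names are mine, and I am assuming the only bad date format is MM-DD-YYYY and that the contents <div> holds nothing that needs to stay wrapped.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise without ns0: prefixes


def fix_dates(root, source_format="%m-%d-%Y"):
    """Rewrite <date when="MM-DD-YYYY"/> as when="YYYY-MM-DD" (assumed input format)."""
    for date in root.iter(f"{{{TEI_NS}}}date"):
        when = date.get("when")
        if when is None:
            continue
        try:
            parsed = datetime.strptime(when, source_format)
        except ValueError:
            continue  # not in the assumed format (perhaps already ISO): leave alone
        date.set("when", parsed.date().isoformat())


def unwrap_contents_div(root):
    """Replace each <div type="contents"> by its children, moving @type to <list>."""
    div_tag = f"{{{TEI_NS}}}div"
    list_tag = f"{{{TEI_NS}}}list"
    for parent in list(root.iter()):
        wrappers = [c for c in parent
                    if c.tag == div_tag and c.get("type") == "contents"]
        for div in wrappers:
            position = list(parent).index(div)
            kids = list(div)
            if kids:
                kids[-1].tail = div.tail  # keep the whitespace that followed the div
            parent.remove(div)
            for offset, kid in enumerate(kids):
                if kid.tag == list_tag:
                    kid.set("type", "contents")
                parent.insert(position + offset, kid)
```

Anything else inside the wrapper (a <head>, say) is simply hoisted alongside the <list>, so the result still needs to go back through the validator.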

If we fix those things, the thing is technically valid TEI.  I attach an ePub version
of the book using default rendering of my TEI to ePub conversion, for a laugh.

The simplest schema to check against for now (until we develop a stricter one for this purpose)
is the RELAX NG version of tei_all, http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng,
and the simplest tool for batch processing is the Jing validator (http://www.thaiopensource.com/relaxng/jing.html).
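
For batch use, something like the following would do as a starting point. It is a sketch only: it assumes jing.jar and a local copy of tei_all.rng sit in the working directory (both paths are placeholders), and it parses the file:line:column: level: message form that Jing reported for the errors quoted above.

```python
import re
import subprocess
from pathlib import Path

# Matches Jing messages such as
#   gullivers_travels.tei.xml:14:26: error: value of attribute "when" is invalid
JING_LINE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<level>\w+): (?P<message>.*)$"
)


def parse_jing_output(text):
    """Turn Jing's textual report into a list of dicts, one per message."""
    messages = []
    for raw in text.splitlines():
        match = JING_LINE.match(raw.strip())
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])
            entry["col"] = int(entry["col"])
            messages.append(entry)
    return messages


def validate_all(directory, schema="tei_all.rng", jing_jar="jing.jar"):
    """Run Jing over every *.xml file in a directory; yield (path, messages)."""
    for path in sorted(Path(directory).glob("*.xml")):
        result = subprocess.run(
            ["java", "-jar", jing_jar, schema, str(path)],
            capture_output=True,
            text=True,
        )
        # Read both streams so nothing is missed whichever one Jing writes to.
        yield path, parse_jing_output(result.stdout + result.stderr)
```

A file with an empty message list passed validation.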

This leaves four problems:

   a) identifying the chapter structure by somehow pattern-matching headings.
   b) joining up paragraphs which break across page boundaries.
   c) suppressing or marking up the running headers. I would suggest marking these up
       with <fw> if they are to stay, but on the whole it's probably easier to throw them away.
   d) joining up paragraphs which are broken for no reason (eg just before PAGE_219).

a) is just plain hard. I agree with Martin that this may be a job which a human
would happily do in an hour or so, if the rest of the book was fairly clean, and there
was a way to easily feed back the improved result.
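
For what it is worth, the easy cases of a) could be flagged for that human with a pattern like the one below. This is a guess at the heading style: I am assuming headings of the form "CHAPTER IV." standing alone in a paragraph, which will certainly not hold for every book.

```python
import re

# Assumed heading style: the word CHAPTER, a Roman numeral, an optional full stop,
# and nothing else in the paragraph.
CHAPTER_HEAD = re.compile(r"^\s*CHAPTER\s+[IVXLC]+\.?\s*$")


def find_chapter_heads(paragraphs):
    """Return the indexes of paragraphs that look like chapter headings."""
    return [i for i, text in enumerate(paragraphs) if CHAPTER_HEAD.match(text)]
```

The output is a list of candidates to review, not a chapter structure.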

b) is solvable; it just needs a pass over the file in your
favourite XML programming language to detect the sequence <p> ...</p> <fw>...</fw> <pb/> <p> ...</p>
and turn it into <p>... <fw>...</fw> <pb/> ... </p>.  When there _should_ be two paragraphs is
less clear. A human would look at indentation and at a short last line in the preceding paragraph.
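
As a sketch of such a pass, again in Python with ElementTree: the merge test used here (the first paragraph does not end in sentence-final punctuation) is my own crude stand-in for the indentation and line-length clues, and can be swapped for something better via the should_merge argument.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
P, FW, PB = (f"{{{TEI_NS}}}{name}" for name in ("p", "fw", "pb"))


def looks_unfinished(paragraph):
    """Crude assumption: a paragraph not ending in . ? ! or a quote continues."""
    text = "".join(paragraph.itertext()).rstrip()
    return bool(text) and text[-1] not in ".?!\"'\u201d\u2019"


def merge_across_page_breaks(parent, should_merge=looks_unfinished):
    """Turn <p>A</p> <fw/>* <pb/> <p>B</p> into <p>A <fw/>* <pb/> B</p>."""
    children = list(parent)
    i = 0
    while i < len(children):
        first = children[i]
        j = i + 1
        while j < len(children) and children[j].tag == FW:
            j += 1  # skip any running headers before the page break
        is_match = (
            first.tag == P
            and j + 1 < len(children)
            and children[j].tag == PB
            and children[j + 1].tag == P
            and should_merge(first)
        )
        if not is_match:
            i += 1
            continue
        page_break, second = children[j], children[j + 1]
        for node in children[i + 1 : j + 1]:  # the <fw> elements and the <pb/>
            parent.remove(node)
            node.tail = None
            first.append(node)
        # The second paragraph's text now follows the <pb/>; a single space is
        # assumed to be the right separator.
        page_break.tail = " " + (second.text or "").lstrip()
        first.extend(list(second))
        first.tail = second.tail
        parent.remove(second)
        children = list(parent)  # re-read; stay at i in case the next page also continues
```

Call it on each element that directly contains the paragraphs (the <body>, or each <div> once there are some).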

I am not sure what the scale of d) is after a quick look.

What are the empty <figure></figure> for?

If page images are available, they can be referenced by the @facs attribute on <pb/>, which is allowed
to be the URL of a graphic file.
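
Adding the attribute could be as simple as this sketch, which assumes each <pb/> carries a page number in @n and that the images live at some predictable address; the URL template in the usage line is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def add_facs(root, url_template):
    """Set @facs on every <pb/> that has @n and no @facs yet."""
    for pb in root.iter(f"{{{TEI_NS}}}pb"):
        n = pb.get("n")
        if n and pb.get("facs") is None:
            pb.set("facs", url_template.format(n=n))


# Hypothetical usage: add_facs(root, "http://example.org/pages/{n}.png")
```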

Finally, I note the perennial issue of hyphenated words, eg
"I was surgeon suc-cessively in two ships", but I feel weak when
I think of how to fix those. If they are identifiable in the Google post-OCR form,
then we could suggest ways of marking that up.
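
If nothing better is available from the OCR, one conservative approach is a dictionary check: rejoin a hyphenated token only when the joined form is a known word and the two halves are not both words themselves. A sketch, in which the vocabulary is whatever word list you trust (an assumption; none is supplied here):

```python
import re

# A run of letters, a hyphen, then a lower-case continuation.
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([a-z]+)\b")


def join_broken_words(text, vocabulary):
    """Rejoin tokens like 'suc-cessively' when the joined word is in vocabulary."""

    def repair(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        halves_are_words = left.lower() in vocabulary and right.lower() in vocabulary
        if joined.lower() in vocabulary and not halves_are_words:
            return joined
        return match.group(0)  # leave genuine compounds such as 'well-known' alone

    return HYPHENATED.sub(repair, text)
```

This rewrites the text silently; recording the original line break in markup instead would need to be done at the XML level.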

Hope this helps. It's going to be really great to have this stuff coming off a production line!
--
Sebastian Rahtz
Head of Information and Support Group, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

Sólo le pido a Dios
que el futuro no me sea indiferente

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gullivers_travels.tei.epub
Type: application/octet-stream
Size: 110625 bytes
Desc: gullivers_travels.tei.epub
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20110622/238eeae5/attachment.obj