[tei-council] google test book
sebastian.rahtz at oucs.ox.ac.uk
Wed Jun 22 10:01:08 EDT 2011
Something upset the mailing-list software when I replied to this, so for the record here is
what I replied:
This seems pretty fine to me, so far as it goes. There are some validation errors:
1. gullivers_travels.tei.xml:14:26: error: value of attribute "when" is invalid. <date when="06-21-2011"/> should be <date when="2011-06-21"/> (it's an ISO standard date). Should be easy to fix.
2. gullivers_travels.tei.xml:22:10: error: element "imprint" not allowed yet; missing required element "title". Should be easy to fix.
3. The major problem starts at line 162, where we go back to <p> elements after the <div> containing the table of contents. This is sadly illegal
in the TEI schema: once you start working with <div>, you have to carry on.
It happens here, of course, because the automated processing is not dividing the book up into chapters as a human would, _except_ for the TOC.
The solution is to drop the <div type="contents">
wrapper and put the @type on the <list>. TEI's <div>, unlike the HTML div, is _not_ an arbitrary container.
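The fix for point 3 could be automated along these lines. This is only a rough sketch in Python with the standard-library ElementTree, not anything that was run on the Google file; the function name unwrap_contents_div is made up for illustration, and it assumes the document is in the TEI namespace and that contents divs are not nested inside each other.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DIV = f"{{{TEI_NS}}}div"
LIST = f"{{{TEI_NS}}}list"


def unwrap_contents_div(root):
    """Replace each <div type="contents"> by its children, in place.

    Any <list> directly inside the div gets type="contents" instead.
    Returns the number of wrappers removed. (Hypothetical helper.)
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    count = 0
    for div in list(root.iter(DIV)):
        if div.get("type") != "contents" or div not in parent_of:
            continue
        parent = parent_of[div]
        pos = list(parent).index(div)
        children = list(div)
        for child in children:
            if child.tag == LIST:
                child.set("type", "contents")
        parent.remove(div)
        # Put the former children where the wrapper used to be.
        for offset, child in enumerate(children):
            parent.insert(pos + offset, child)
        if children:
            children[-1].tail = div.tail
        count += 1
    return count
```

Anything else inside the wrapper (a heading, say) is hoisted along with the list, which may or may not be what is wanted.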
If we fix those things, the thing is technically valid TEI. I attach an ePub version
of the book using default rendering of my TEI to ePub conversion, for a laugh.
The simplest schema to check against for now (until we develop a stricter one for this purpose)
is the RELAX NG http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng, and the simplest tool
for batch processing is the Jing validator (http://www.thaiopensource.com/relaxng/jing.html).
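For batch use, something like the following Python wrapper might do. It is a sketch under assumptions: that Java is on the path, that jing.jar and a downloaded copy of tei_all.rng sit in the working directory (both file locations are my guesses), and that Jing reports problems in the file:line:column form quoted in the errors above. The function names are invented for the example.

```python
import re
import subprocess

# Matches messages such as:
#   gullivers_travels.tei.xml:14:26: error: value of attribute "when" is invalid
JING_LINE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<level>\w+): (?P<msg>.*)$"
)


def parse_jing_line(line):
    """Turn one Jing message into a dict, or None if the line does not match."""
    m = JING_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "col": int(m.group("col")),
        "level": m.group("level"),
        "msg": m.group("msg"),
    }


def validate(xml_paths, schema="tei_all.rng", jing_jar="jing.jar"):
    """Run Jing over several files and return the parsed messages.

    The default schema and jar paths are assumptions; adjust to taste.
    """
    cmd = ["java", "-jar", jing_jar, schema, *xml_paths]
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout.splitlines() + result.stderr.splitlines()
    return [msg for msg in map(parse_jing_line, output) if msg is not None]
```

Calling validate(["gullivers_travels.tei.xml"]) would then give a list that is easy to count or group by message.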
This leaves four problems:
a) identifying the chapter structure by somehow pattern-matching headings.
b) joining up paragraphs which break across page boundaries
c) suppressing or marking up the running headers. I would suggest marking these up
with <fw> if they are to stay, but on the whole it's probably easier to throw them away.
d) joining up paragraphs which are broken for no reason (eg just before PAGE_219)
a) is just plain hard. I agree with Martin that this may be a job which a human
would happily do in an hour or so, if the rest of the book was fairly clean, and there
was a way to easily feed back the improved result.
b) is solvable; it just needs a pass over the file in your
favourite XML programming language to detect the sequence <p> ...</p> <fw>...</fw> <pb/> <p> ...</p>
and turn it into <p>... <fw>...</fw> <pb/> ... </p>. When there _should_ be two paragraphs is
less clear. A human would look at indentation and at short last lines of the preceding paragraph.
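As an illustration of the pass for b), here is a minimal Python sketch using the standard-library ElementTree. It assumes TEI-namespaced elements and, importantly, it merges unconditionally: it makes no attempt to decide whether there really should be two paragraphs. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
P, FW, PB = (f"{{{TEI_NS}}}{name}" for name in ("p", "fw", "pb"))


def merge_across_page_breaks(parent):
    """Turn <p>A</p><fw/><pb/><p>B</p> into <p>A <fw/><pb/>B</p>, in place.

    Looks only at direct children of `parent`; returns the number of merges.
    """
    merged = 0
    i = 0
    while i < len(parent):
        kids = list(parent)
        first = kids[i]
        j = i + 1
        while j < len(kids) and kids[j].tag in (FW, PB):
            j += 1
        bridge = kids[i + 1:j]  # the <fw>/<pb> run between the paragraphs
        if (first.tag == P and j < len(kids) and kids[j].tag == P
                and any(el.tag == PB for el in bridge)):
            second = kids[j]
            for el in bridge + [second]:
                parent.remove(el)
            # Keep the words on either side of the break apart.
            if len(first):
                first[-1].tail = (first[-1].tail or "") + " "
            else:
                first.text = (first.text or "") + " "
            for el in bridge:
                el.tail = None
                first.append(el)
            bridge[-1].tail = second.text
            first.extend(list(second))
            first.tail = second.tail
            merged += 1
            # Stay at i: the merged paragraph may run over the next page too.
        else:
            i += 1
    return merged
```

A check on the last line of the first paragraph (short line, closing punctuation) could be added before merging, but that is where the guesswork starts.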
I am not sure what the scale of d) is after a quick look.
What are the empty <figure></figure> for?
If page images are available, they can be referenced by the @facs attribute on <pb/>, which is allowed
to be the URL of a graphic file.
Finally, I note the perennial issue of hyphenated words, eg
"I was surgeon suc-cessively in two ships", but I feel weak when
I think of how to fix those. If they are identifiable in the Google post-OCR form,
then we could suggest ways of marking that up.
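One crude way to at least find candidates, sketched in Python: flag any hyphenated token whose two halves, joined, appear in a word list. The word list is an assumption (any reasonable dictionary would do), the function name is invented, and genuine compounds whose joined form is also a word will be flagged wrongly, so this is a finder for human review rather than a fixer.

```python
import re

# A word, a hyphen, another word, with no spaces in between.
HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")


def hyphenation_candidates(text, vocabulary):
    """Return (as_found, joined) pairs where the joined form is a known word.

    `vocabulary` is a set of lower-case words supplied by the caller.
    """
    found = []
    for m in HYPHENATED.finditer(text):
        joined = m.group(1) + m.group(2)
        if joined.lower() in vocabulary:
            found.append((m.group(0), joined))
    return found
```

For the sentence quoted above and a vocabulary containing "successively", this reports the pair ("suc-cessively", "successively").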
Hope this helps. It's going to be really great to have this stuff coming off a production line!
Head of Information and Support Group, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431
Sólo le pido a Dios
que el futuro no me sea indiferente