[tei-council] FW: first stab at Google > TEI

stuart yeates stuart.yeates at vuw.ac.nz
Wed Jun 22 16:20:30 EDT 2011

> - Hyphen-breaks at lines are now treated correctly (when we detect them from
> OCR). So you should now see the words like "succes-sively" replaced with
> "successively" etc.

For some reason this seems to have performed poorly in the introduction.
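
For comparison, here is a minimal Python sketch of the kind of rejoining described above. It is not the converter's actual code: the function name join_hyphen_breaks and the optional known_words list are assumptions made up for illustration. Without a word list it joins every line-end hyphen, which would also mangle genuinely hyphenated words such as "well-known".

```python
import re

# Matches a word fragment ending in a hyphen at a line end, followed by
# the rest of the word at the start of the next line.
_HYPHEN_BREAK = re.compile(r"(\w+)-\n[ \t]*(\w+)")


def join_hyphen_breaks(text, known_words=None):
    """Rejoin words split by a hyphen at a line break (illustrative only).

    known_words is a hypothetical set of lower-case dictionary words. When
    given, fragments are merged only if the merged form is in the set;
    otherwise the hyphen is kept and only the line break is dropped.
    """
    def _join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if known_words is None or joined.lower() in known_words:
            return joined
        # Probably a real hyphenated compound: keep the hyphen.
        return left + "-" + right

    return _HYPHEN_BREAK.sub(_join, text)


print(join_hyphen_breaks("succes-\nsively replaced"))
```

Running the same check separately over front matter such as the introduction might show whether the problem there is in detecting the breaks or in joining them.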

> - Identifying and marking chapter divisions. I believe we have some signals
> for this, but I don't know if we estimate or store them with the OCR output
> explicitly. I'll check up on this, but concur that it is a hard problem.

When I did this on a previous project I leveraged knowledge I'd gleaned
from the table of contents to insert hints and tidied it up manually. It
almost certainly needs to be done before you strip away the headers and

> PS: Apologies if my choice of book to test on was poor. :) It was a purely
> random selection. If you have suggestions of alternate public domain books
> I'm happy to try and convert them and send their TEI files over.

By picking a linear fiction work you made your life easier.

Picking a formally structured non-fiction work (cyclopaedia, almanac,
etc.) will provide a challenge. Linear non-fiction (histories,
biographies, etc.) provides a separate set of challenges
(footnotes, references, etc.).

If you're looking for insights into the English / Western assumptions 
you're making, I suggest that you do something in Chinese, Japanese or 
Korean. Thai is also interesting, because it sits somewhere between 
English and C/J/K in terms of conventions. It may be easier to start
with a non-English language that team members read/write.

Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
