[tei-council] FW: first stab at Google > TEI
stuart.yeates at vuw.ac.nz
Wed Jun 22 16:20:30 EDT 2011
> - Hyphen-breaks at lines are now treated correctly (when we detect them from
> OCR). So you should now see the words like "succes-sively" replaced with
> "successively" etc.
For some reason this seems to have performed poorly in the introduction.
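For reference, here is a minimal sketch of the kind of line-end de-hyphenation being discussed. It is not the converter's actual code: the function name, the regular expression and the optional vocabulary check are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a word fragment, a hyphen at the end of a line,
# then the continuation at the start of the next line.
HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")

def join_hyphen_breaks(text, vocabulary=None):
    """Join words split by a line-end hyphen in OCR text.

    If a vocabulary (a set of lowercase words) is given, the two halves
    are merged only when the merged word is in it; otherwise the hyphen
    is kept, so compounds such as "well-known" survive.
    """
    def repl(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if vocabulary is None or joined.lower() in vocabulary:
            return joined
        return left + "-" + right

    return HYPHEN_BREAK.sub(repl, text)
```

Without a vocabulary every line-end hyphen is treated as a break, which wrongly merges genuine compounds. With one, words missing from the list keep their hyphen.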
> - Identifying and marking chapter divisions. I believe we have some signals
> for this, but I don't know if we estimate or store them with the OCR output
> explicitly. I'll check up on this, but concur that it is a hard problem.
When I did this on a previous project I used what I'd gleaned from
the table of contents to insert hints, then tidied it up manually. It
almost certainly needs to be done before you strip away the headers and
> PS: Apologies if my choice of book to test on was poor. :) It was a purely
> random selection. If you have suggestions of alternate public domain books
> I'm happy to try and convert them and send their TEI files over.
By picking a linear fiction work you made your life easier.
Picking a formally-structured non-fiction work (cyclopaedia, almanac,
etc) will provide a challenge. Linear non-fiction (histories,
biographies, etc) provides a separate set of challenges
(footnotes, references, etc).
If you're looking for insights into the English / Western assumptions
you're making, I suggest that you do something in Chinese, Japanese or
Korean. Thai is also interesting, because it sits somewhere between
English and C/J/K in terms of conventions. It may be easier to start
with a non-English language that team members read/write.
Library Technology Services http://www.victoria.ac.nz/library/