[tei-council] FW: first stab at Google > TEI
stuart yeates
stuart.yeates at vuw.ac.nz
Wed Jun 22 16:20:30 EDT 2011
> - Hyphen-breaks at lines are now treated correctly (when we detect them from
> OCR). So you should now see the words like "succes-sively" replaced with
> "successively" etc.
For some reason this seems to have performed poorly in the introduction.
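For what it's worth, here is a minimal sketch of the conservative approach I'd expect for line-end hyphens: only join the two halves when the joined form is a known word. The function name and the tiny word list are my own placeholders, not anything from your pipeline, and a real run would need a proper dictionary.

```python
import re

def join_hyphen_breaks(text, known_words):
    """Join words split by a hyphen at a line end, but only when the
    joined form appears in known_words (an assumed word list)."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            # Keep the line break, but move it after the rejoined word.
            return joined + "\n"
        # Unknown joined form: leave the text exactly as it was.
        return match.group(0)

    # word-half, hyphen, newline, word-half, optional trailing spaces
    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)[ \t]*", repl, text)


sample = "passed succes-\nsively through a well-\nknown gate"
result = join_hyphen_breaks(sample, {"successively"})
```

Genuine compounds such as "well-known" are left alone because "wellknown" is not in the list. Words the dictionary doesn't know also stay split, which may be part of what went wrong in the introduction.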
> - Identifying and marking chapter divisions. I believe we have some signals
> for this, but I don't know if we estimate or store them with the OCR output
> explicitly. I'll check up on this, but concur that it is a hard problem.
When I did this on a previous project I used what I'd gleaned from
the table of contents to insert hints, then tidied the result up
manually. It almost certainly needs to be done before you strip away
the headers and footers.
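Roughly, the table-of-contents trick can be sketched like this. It is an illustration under my own assumptions (titles already extracted from the contents page, one heading per OCR line, headings appearing in contents order), not the code from that project, and all the names are made up.

```python
import re

def normalise(s):
    # Lowercase and collapse anything that is not a letter or digit.
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def find_chapter_hints(toc_titles, lines):
    """Return (line_index, title) pairs for lines that match the
    table-of-contents titles, taken in order."""
    wanted = [normalise(t) for t in toc_titles]
    hints = []
    next_title = 0
    for i, line in enumerate(lines):
        if next_title >= len(wanted):
            break
        if normalise(line) == wanted[next_title]:
            hints.append((i, toc_titles[next_title]))
            next_title += 1
    return hints


toc = ["I. The Voyage Out", "II. Landfall"]
page_lines = [
    "CONTENTS",
    "I. The Voyage Out .... 1",   # contents entry: page number prevents a match
    "II. Landfall .... 12",
    "I. THE VOYAGE OUT",
    "It was a dark night.",
    "II. LANDFALL",
    "They arrived.",
]
hints = find_chapter_hints(toc, page_lines)
```

The hints are only candidates; OCR errors in the headings will cause misses, which is where the manual tidy-up comes in.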
> PS: Apologies if my choice of book to test on was poor. :) It was a purely
> random selection. If you have suggestions of alternate public domain books
> I'm happy to try and convert them and send their TEI files over.
By picking a linear work of fiction you made your life easier.
Picking a formally-structured non-fiction work (cyclopaedia, almanac,
etc) will provide a challenge. Linear non-fiction (histories,
biographies, etc) provides a separate set of challenges (footnotes,
references, etc).
If you're looking for insights into the English / Western assumptions
you're making, I suggest that you do something in Chinese, Japanese or
Korean. Thai is also interesting, because it sits somewhere between
English and C/J/K in terms of conventions. It may be easier to start
with a non-English language that team members read/write.
cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/