[tei-council] first stab at Google > TEI
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Wed Jun 22 17:47:13 EDT 2011
I think we've fixed the moderation problem, so I'll copy tei-council
this time.
On 6/22/2011 3:16 PM, Ranjith Unnikrishnan wrote:
> Thank you all for the quick and detailed feedback. I've attached a newly
> generated TEI file that incorporates your suggestions. Some notes on
> what has changed since the previous file:
> - Hyphen-breaks at lines are now treated correctly (when we detect them
> from OCR). So you should now see the words like "succes-sively" replaced
> with "successively" etc.
Alternatively, you might handle them as explained at:
http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Hyphenation
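
For comparison, the joining behaviour described in the quoted note can be sketched roughly as below. This is only an illustration under assumptions, not Google's actual pipeline and not the approach documented at the URL above: the function name join_hyphenated_lines and the list-of-strings input are hypothetical, and the code cannot tell a line-end break from a real hyphen (e.g. "well-" / "known"), which is the case where keeping the original hyphen in the markup is safer.

```python
def join_hyphenated_lines(lines):
    """Rejoin words that OCR found split by a hyphen at the end of a line.

    Hypothetical helper: `lines` is a list of text lines from one page.
    The fragment before the hyphen is moved down to the start of the
    next line, so "succes-" / "sively" becomes "successively".
    """
    out = []
    carry = ""  # word fragment carried over from the previous line
    for line in lines:
        if carry:
            line = carry + line.lstrip()
            carry = ""
        stripped = line.rstrip()
        # Treat a trailing hyphen directly after a letter as a line-end break.
        if stripped.endswith("-") and len(stripped) > 1 and stripped[-2].isalpha():
            head, _, fragment = stripped[:-1].rpartition(" ")
            carry = fragment
            out.append(head)
        else:
            out.append(stripped)
    if carry:
        # The last line ended in a hyphen with nothing to join it to.
        out.append(carry + "-")
    return out


if __name__ == "__main__":
    for text in join_hyphenated_lines(["the values succes-", "sively replaced"]):
        print(text)
```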
> What was left out from your suggestions are:
> - Identifying and marking chapter divisions. I believe we have some
> signals for this, but I don't know if we estimate or store them with the
> OCR output explicitly. I'll check up on this, but concur that it is a
> hard problem.
Chapters often (though not always) begin at the top of a page. For
books scanned at libraries, you could use the CHAPTER_START pagetagging
information in your data to insert the chapter boundary just
after the <pb/>. It will be right for many items and close for others.
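
A rough sketch of what I mean, under assumptions: the (page number, tags, text) tuples, the function name pages_to_tei_body, and wrapping each page's text in a single <p> are all hypothetical simplifications, and only the CHAPTER_START tag name comes from your data. The output is a fragment that would still need validating against the TEI schema.

```python
from xml.sax.saxutils import escape, quoteattr


def pages_to_tei_body(pages):
    """Build a TEI <body> fragment, opening a chapter <div> after each
    <pb/> whose page carries the CHAPTER_START tag.

    Hypothetical input: `pages` is a list of (number, tags, text) tuples,
    where `tags` is a set of page-tagging strings.
    """
    parts = ["<body>"]
    div_open = False
    for number, tags, text in pages:
        # The page break comes first; the chapter boundary follows it.
        parts.append(f"<pb n={quoteattr(str(number))}/>")
        if "CHAPTER_START" in tags:
            if div_open:
                parts.append("</div>")  # close the previous chapter
            parts.append('<div type="chapter">')
            div_open = True
        # Simplification: one <p> per page; real paragraphs may span pages.
        parts.append(f"<p>{escape(text)}</p>")
    if div_open:
        parts.append("</div>")
    parts.append("</body>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        (1, set(), "Title page"),
        (2, {"CHAPTER_START"}, "Chapter one begins"),
        (3, set(), "Chapter one continues"),
        (4, {"CHAPTER_START"}, "Chapter two begins"),
    ]
    print(pages_to_tei_body(sample))
```

When a chapter actually starts partway down the tagged page, the boundary lands a little early, which is the "close for others" case.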
Kevin