[tei-council] first stab at Google > TEI

Wed Jun 22 17:47:13 EDT 2011

I think we've fixed the moderation problem, so I'll copy tei-council 
this time.

On 6/22/2011 3:16 PM, Ranjith Unnikrishnan wrote:
> Thank you all for the quick and detailed feedback. I've attached a newly
> generated TEI file that incorporates your suggestions. Some notes on
> what has changed since the previous file:

> - Hyphen-breaks at lines are now treated correctly (when we detect them
> from OCR). So you should now see the words like "succes-sively" replaced
> with "successively" etc.

Alternatively, you might handle them as explained at:

http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Hyphenation
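
For comparison, here is a rough Python sketch of the joining behavior 
described in the quoted note.  It is only my guess at the general idea, 
not your actual pipeline and not the approach at the link above.  The 
function name and the list-of-lines input are assumptions for 
illustration.  It treats any line-final hyphen that follows a letter as 
a soft hyphen, so a genuinely hyphenated compound broken at a line end 
would be joined wrongly without a dictionary check.

```python
def join_line_end_hyphens(lines):
    """Rejoin words split by a hyphen at a line end (naive heuristic).

    `lines` is a list of OCR text lines (hypothetical input shape).
    The split word is moved whole onto the following line.
    """
    out = []
    carry = ""  # word fragment carried over from the previous line
    for line in lines:
        text = (carry + line.lstrip()) if carry else line
        carry = ""
        stripped = text.rstrip()
        # A trailing hyphen directly after a letter is assumed to be soft.
        if len(stripped) >= 2 and stripped[-1] == "-" and stripped[-2].isalpha():
            head, _, carry = stripped[:-1].rpartition(" ")
            stripped = head.rstrip()
        out.append(stripped)
    if carry:
        # Nothing follows the last line, so restore the fragment as it was.
        out[-1] = (out[-1] + " " + carry + "-").strip()
    return out


print(join_line_end_hyphens(["they moved succes-", "sively through"]))
# ['they moved', 'successively through']
```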

> What was left out from your suggestions are:
> - Identifying and marking chapter divisions. I believe we have some
> signals for this, but I don't know if we estimate or store them with the
> OCR output explicitly. I'll check up on this, but concur that it is a
> hard problem.

Chapters often (though not always) begin at the top of a page.  For 
books scanned at libraries, you could use the CHAPTER_START pagetagging 
information in your data to insert the chapter boundary just 
after the <pb/>.  It will be right for many items and close for others.
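
A minimal Python sketch of that idea follows.  The input shape (a 
sequence of page number, tag set, and page text) is my assumption, since 
I don't know how the pagetagging data is actually stored, and the 
function name is made up.  It opens a new <div type="chapter"> just 
after the <pb/> of any page tagged CHAPTER_START and puts each page's 
text in a single <p>, which is a simplification.  The result is 
well-formed but not checked against the TEI schema.

```python
from xml.sax.saxutils import escape, quoteattr


def pages_to_divs(pages):
    """Build a TEI body fragment with chapter divs opened after <pb/>.

    `pages` is an iterable of (page_number, tags, text) tuples; this
    shape is hypothetical, not the real OCR data format.
    """
    parts = ["<div>"]  # pages before the first chapter land here
    for number, tags, text in pages:
        parts.append("<pb n=%s/>" % quoteattr(str(number)))
        if "CHAPTER_START" in tags:
            # Close the running div and start a chapter just after <pb/>.
            parts.append('</div><div type="chapter">')
        parts.append("<p>%s</p>" % escape(text))
    parts.append("</div>")
    return "".join(parts)


pages = [
    (1, set(), "Title page"),
    (2, {"CHAPTER_START"}, "It was a dark & stormy night."),
    (3, set(), "The rain fell in torrents."),
]
print(pages_to_divs(pages))
```

Because the break goes after the <pb/>, the page break itself ends up as 
the last thing in the preceding div, which may or may not be what you 
want downstream.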

Kevin