[tei-council] first stab at Google > TEI

Wed Jun 22 15:43:54 EDT 2011

Dear Ranjith,

Thank you very much for this quick and thorough response.  For the purposes
of this exercise, it doesn't matter what text is chosen, and a randomly
chosen text is just fine. So there is nothing to apologize for.  I merely
wanted to make two interrelated points:

1. TEI is going to be a good format for "some" but not "all" or even "most"
books in Google's collection.
2. If users want a text in TEI format, they may care enough about it to be
happy with an algorithmically created "rough cut" to which they can add
value by hand in some standardized manner and return it to a larger library.

I'm making these points because it may well turn out to be the case that it
is not possible to turn all or most Google books into TEI with an acceptable
error rate, and I want to guard against an "all or nothing" outcome.  If the
best achievable outcome is "some texts in a user-curatable format", that
would be an excellent outcome from the perspective of scholarly and
pedagogical communities all over the world, and it would be a terrific basis
for a framework of collaborative data curation.

MM

From:  Ranjith Unnikrishnan <ranjith at google.com>
Date:  Wed, 22 Jun 2011 12:16:26 -0700
To:  Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
Cc:  Martin Mueller <martinmueller at northwestern.edu>, Peter Gorman
<pgorman at library.wisc.edu>, Ram Subbaroyan <ramram at google.com>, Jon Orwant
<orwant at google.com>, Jeff Breidenbach <jbreiden at google.com>, Salim Virji
<salim at google.com>, "Timothy W. Cole" <t-cole3 at illinois.edu>
Subject:  Re: [tei-council] first stab at Google > TEI

Thank you all for the quick and detailed feedback. I've attached a newly
generated TEI file that incorporates your suggestions. Some notes on what
has changed since the previous file:

- All the validation errors reported by Sebastian have been fixed. The date
format is now corrected, the required <title> has been added to the
<imprint> field, and the <div> around the TOC has been replaced with a
type="contents" attribute on the <list>. The TEI header is populated
primarily using library metadata, as in the last file I sent.

- All page headers and footers have been dropped. The group expressed little
interest in retaining them. Also, the <fw> element raised a question in my
mind: is something like
  <p> This is a para that <pb n="42"/> <fw> Page header</fw> continues to the
  next page </p>
acceptable TEI? The <fw> element has to sit inside the <p> tags for the
merging of paragraph text across page breaks to make sense.

- Paragraphs are now continued across page breaks (wherever we detect the
continuations from OCR output). Detecting them is probably not as easy as
looking for <p>..</p> <fw>..</fw> <p>..</p> patterns, as sometimes page
footers from the first page could be incorrectly detected. This happens, for
example, on page 3 (index 24), where the mysterious "B2" at the bottom of the
page is detected incorrectly as a paragraph. A similar problem occurs on
page 1 (index 22), where the "B" at the bottom of the page was incorrectly
merged with the previous paragraph and so confuses our paragraph
continuation algorithm. Our paragraph continuation algorithm does look at
paragraph margins etc., but there is scope for improving it with some
separate work we're doing that focuses on better paragraph detection within
a page and uses better signals like the actual OCR text. I'll let the people
who are working on these know in case they have the cycles to work on it.
However, correcting the errors we make should be _far_ easier now.

- Hyphen breaks at line ends are now treated correctly (when we detect them
from OCR). So you should now see words like "succes-sively" replaced with
"successively" etc.

- All <figure> elements have been dropped based on Kevin's suggestion. I
really only left them in the previous file as is because I didn't know how to
treat them. Based on your suggestions, I think it would be cool to put them
back at some point with references to cropped figures from the scanned page
image. It's pretty doable.
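
The hyphen-break and paragraph-continuation handling described in the notes
above can be sketched roughly as follows. This is only an illustration in
Python, not the converter discussed in this thread: the function names
(join_hyphen_breaks, continues, merge_pages), the input shape, and the
heuristic (the previous paragraph lacks sentence-final punctuation and the next
one starts with a lowercase letter) are all assumptions. The message above
says the real algorithm looks at paragraph margins and other signals.

```python
import re
from xml.sax.saxutils import escape

# Sentence punctuation, optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")


def join_hyphen_breaks(lines):
    """Join OCR lines into one string, gluing words split by a line-end hyphen.

    Caveat: a genuine compound broken at its hyphen ("well-" / "known")
    would also be glued, so a real system needs a dictionary or similar.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-") and line[0].islower():
            text = text[:-1] + line  # "succes-" + "sively" -> "successively"
        elif text:
            text += " " + line
        else:
            text = line
    return text


def continues(prev_para, next_para):
    """Guess whether next_para continues prev_para (assumed heuristic)."""
    if not prev_para or not next_para:
        return False
    return SENTENCE_END.search(prev_para) is None and next_para[0].islower()


def merge_pages(pages):
    """pages: list of (page_number, [paragraph text, ...]).

    Returns a flat list of '<pb n="..."/>' and '<p>...</p>' strings, with the
    page break placed inside a paragraph that runs across two pages.
    """
    out = []
    for number, paras in pages:
        pb = '<pb n="%s"/>' % number
        rest = [escape(p.strip()) for p in paras if p.strip()]
        prev_is_para = bool(out) and not out[-1].startswith("<pb ")
        if prev_is_para and rest and continues(out[-1], rest[0]):
            out[-1] = "%s %s %s" % (out[-1], pb, rest.pop(0))
        else:
            out.append(pb)
        out.extend(rest)
    return ["<p>%s</p>" % item if not item.startswith("<pb ") else item
            for item in out]
```

A stray signature mark such as the "B2" mentioned above would defeat this
heuristic just as it does the real one, because it arrives as a separate
final paragraph; filtering such fragments would have to happen before the
merge step.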

What was left out from your suggestions:
- Identifying and marking chapter divisions. I believe we have some signals
for this, but I don't know if we estimate or store them with the OCR output
explicitly. I'll check up on this, but concur that it is a hard problem.
- Use of a TEI validator: I feel this is really important for several
reasons, including preserving functionality of our converter code over time.
Salim expressed interest in this and we'll discuss how to proceed and
coordinate our efforts with you in a separate thread.

Thanks again for your time. Looking forward to your next round of comments,
Ranjith

PS: Apologies if my choice of book to test on was poor. :) It was a purely
random selection. If you have suggestions of alternate public domain books
I'm happy to try and convert them and send their TEI files over.

On Wed, Jun 22, 2011 at 9:40 AM, Kevin Hawkins
<kevin.s.hawkins at ultraslavonic.info> wrote:
> Hi,
> 
> As one of the co-editors of "Best Practices for TEI in Libraries" (which we
> are finalizing and for which we will soon produce schemas for validating
> content at levels 1 through 4), let me answer Ranjith's questions:
> 
> 
>> > - The TEI Header: As you mentioned, populating the many fields in the TEI
>> > header automatically from OCR output is a hard problem. However, it should
>> > be possible for us to use the book metadata that is supplied by
>> > libraries/publishers to populate some of the more common fields (eg the
>> > author, title, year and place of publication, publisher name). Would that
>> > be acceptable? The caveat is that the data would be coming from a different
>> > source that, although quite accurate in comparison to OCR, need not be
>> > error-free either.
> 
> I agree that you can do this for books where you have metadata.  It should be
> even easier for books from libraries since you have the full MARC records sent
> by the libraries.
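
Populating the common header fields from library-supplied metadata, as
discussed above, might look something like the following Python sketch. The
dictionary keys (title, author, publisher, pub_place, year) are hypothetical
names chosen for illustration, not fields of any real Google or MARC
interface, and the element layout is one plausible minimal header that has not
been checked against the "Best Practices for TEI in Libraries" schemas.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei_header(meta):
    """Build a small <teiHeader> element from a metadata dict.

    Only "title" is required by this sketch; other keys are optional.
    "year" is assumed to be a four-digit string usable as a when value.
    """
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def add(parent, tag, text=None, **attrs):
        child = ET.SubElement(parent, "{%s}%s" % (TEI_NS, tag), attrs)
        child.text = text
        return child

    header = ET.Element("{%s}teiHeader" % TEI_NS)
    file_desc = add(header, "fileDesc")

    title_stmt = add(file_desc, "titleStmt")
    add(title_stmt, "title", meta["title"])
    if meta.get("author"):
        add(title_stmt, "author", meta["author"])

    pub_stmt = add(file_desc, "publicationStmt")
    add(pub_stmt, "p",
        "Generated automatically from OCR output and library metadata.")

    # Describe the scanned print source; absent fields are simply skipped.
    monogr = add(add(add(file_desc, "sourceDesc"), "biblStruct"), "monogr")
    if meta.get("author"):
        add(monogr, "author", meta["author"])
    add(monogr, "title", meta["title"])
    imprint = add(monogr, "imprint")
    if meta.get("pub_place"):
        add(imprint, "pubPlace", meta["pub_place"])
    if meta.get("publisher"):
        add(imprint, "publisher", meta["publisher"])
    if meta.get("year"):
        add(imprint, "date", meta["year"], when=meta["year"])
    return header


# Placeholder values, not a real catalogue record.
example = build_tei_header(
    {"title": "Example Title", "author": "Example Author", "year": "1850"})
print(ET.tostring(example, encoding="unicode"))
```

Because the metadata "need not be error-free either", a real pipeline would
probably also record where each value came from, so that later hand
correction can tell catalogue data apart from OCR-derived data.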
> 
> 
>> > - The description of the level 2 TEI encoding in the wiki seems to suggest
>> > that raw images of the pages should accompany the TEI file. Is that
>> > correct? If so, are they meant to be referenced in the TEI file in some way
>> > (eg. in the <pb> tags)?
> 
> There are three ways to do this:
> 
> http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Linking_Between_Encoded_Text_and_Images_of_Source_Documents
> 
> 
>> > - The desired encoding level: In practice the quality of OCR data we have
>> > varies widely. Some books that have been processed more recently and/or
>> > have had the benefit of human operators looking at the books are likely to
>> > have more structural information such as paragraph breaks, headers and
>> > footers, lists, and maybe even more details like tables and table of
>> > contents. These details would belong in levels 3 and 4 of the TEI encoding
>> > guidelines. Would it be acceptable to mix these tags from the different
>> > levels in the final TEI file? Or were you thinking of pruning the tags to
>> > discard all information higher than the desired target encoding level?
>> > FWIW, I would take the position of putting as much information in the TEI
>> > output as is available because, as Martin said, it is usually easier to
>> > correct such tags if necessary than start from scratch.
> 
> These levels are guidelines.  While we intend to produce schemas for
> validation at the various levels, we understand that people might have
> something like "level 2+", with some additional structure they are able to
> add.  So this is fine. Furthermore, at this point there are basically no
> software applications that expect content to conform exactly to an encoding
> level.  So it's good to add as much structure as you can and let users figure
> out whether they want to strip some of the markup.
> 
> 
>> > - Validation:  Do you have a tool for validating a TEI file? It would
>> > certainly be beneficial for me to send you some generated TEI output to get
>> > your comments, as Peter suggested. But a validator would be useful as part
>> > of our automated code testing tools, and probably help us scale our efforts
>> > better.
> 
> There is no one tool since TEI is not a finite specification.  You typically
> create a schema for your customization.  If you want to validate against
> everything possible in the TEI, you can validate against:
> 
> 
> http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng
> 
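
For the automated-testing use mentioned in the question above, a cheap check
can run before any full schema validation. The Python sketch below is not a
TEI validator: it only tests well-formedness and the presence of a few
elements that a converted book would be expected to contain. The function
name smoke_check and the list of expected paths are assumptions made for
illustration. Real validation against tei_all.rng still needs a RELAX NG
validator.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Elements this sketch expects in a converted book; far from a full schema.
EXPECTED_PATHS = [
    "t:teiHeader/t:fileDesc/t:titleStmt/t:title",
    "t:teiHeader/t:fileDesc/t:publicationStmt",
    "t:teiHeader/t:fileDesc/t:sourceDesc",
    "t:text",
]


def smoke_check(xml_string):
    """Return a list of problems; an empty list means the cheap checks passed."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as err:
        # Catches errors such as unquoted attribute values or unclosed tags.
        return ["not well-formed: %s" % err]

    problems = []
    if root.tag != "{%s}TEI" % TEI_NS:
        problems.append("root element is not TEI in the TEI namespace")
    for path in EXPECTED_PATHS:
        if root.find(path, {"t": TEI_NS}) is None:
            problems.append("missing " + path.replace("t:", ""))
    return problems
```

A check like this fits naturally into a unit-test suite for converter code,
since it needs nothing beyond the standard library and fails fast on the most
basic regressions.
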
>> > - Page Headers: What is the TEI element for representing page headers (ie.
>> > the region above page text that often contains the page number and chapter
>> > name). Or are these regions not meant to be represented in the TEI output?
> 
> The element is <fw> ("forme work").  We recommend discarding it at level 3
> and above:
> 
> http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Forme_Work
> 
> 
>> > - Figures: How are figures/page graphics to be treated? We rarely have
>> > information about figures - sometimes we have the caption but usually not.
>> > Did you want us to create cropped image regions from the original scan and
>> > insert references to them in the <figure> tags? Or do we drop these
>> > altogether?
> 
> It would be good to include the cropped images as you suggest.  We recommend
> inclusion of figures only at level 4 but not at level 3. (Mentions of it at
> level 3 are a known bug in "Best Practices for TEI in Libraries".)  See:
> 
> http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Level_4_Figures
> 
> Happy to answer further questions!
> 
> Kevin