[tei-council] Fwd: Re: first stab at Google > TEI

Wed Jun 22 12:41:12 EDT 2011

To get around the sender limit (and the confusion with non-members 
trying to send to the list), I sent this to everyone but tei-council. 
Here's your copy.

-------- Original Message --------
Subject: Re: [tei-council] first stab at Google > TEI
Date: Wed, 22 Jun 2011 12:40:23 -0400
From: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
To: Ranjith Unnikrishnan <ranjith at google.com>
CC: Martin Mueller <martinmueller at northwestern.edu>,  Peter Gorman 
<pgorman at library.wisc.edu>, Ram Subbaroyan <ramram at google.com>, Jon 
Orwant <orwant at google.com>,  Jeff Breidenbach <jbreiden at google.com>, 
Salim Virji <salim at google.com>, "Timothy W. Cole" <t-cole3 at illinois.edu>

Hi,

As one of the co-editors of "Best Practices for TEI in Libraries" (which
we are finalizing and for which we will soon produce schemas for
validating content at levels 1 through 4), let me answer Ranjith's
questions:

> - The TEI Header: As you mentioned, populating the many fields in the TEI
> header automatically from OCR output is a hard problem. However, it should be
> possible for us to use the book metadata that is supplied by
> libraries/publishers to populate some of the more common fields (eg the
> author, title, year and place of publication, publisher name). Would that be
> acceptable? The caveat is that the data would be coming from a different
> source that, although quite accurate in comparison to OCR, need not be
> error-free either.

I agree that you can do this for books where you have metadata.  It
should be even easier for books from libraries since you have the full
MARC records sent by the libraries.
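
A minimal Python sketch of this idea, assuming the supplied metadata has
already been reduced to a simple dictionary.  The key names (title, author,
publisher, pub_place, year) and the function name are hypothetical, the
header is far from complete, and a real file would also place these elements
in the TEI namespace:

```python
import xml.etree.ElementTree as ET


def build_tei_header(meta):
    """Build a minimal <teiHeader> from a dict of supplied book metadata.

    The dict keys used here are hypothetical; fields that are missing
    from the metadata are left out of the source description.
    """
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # Title statement for the electronic file.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = meta.get("title", "")
    if meta.get("author"):
        ET.SubElement(title_stmt, "author").text = meta["author"]

    # Placeholder publication statement; the wording is an assumption.
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "p").text = (
        "Generated from OCR output; metadata not verified."
    )

    # Describe the printed source using whatever fields were supplied.
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    bibl = ET.SubElement(source_desc, "bibl")
    fields = (
        ("title", "title"),
        ("author", "author"),
        ("publisher", "publisher"),
        ("pubPlace", "pub_place"),
        ("date", "year"),
    )
    for tag, key in fields:
        if meta.get(key):
            ET.SubElement(bibl, tag).text = str(meta[key])
    return header


example = build_tei_header({"title": "Walden", "author": "Thoreau", "year": 1854})
print(ET.tostring(example, encoding="unicode"))
```

Since the supplied metadata need not be error-free, leaving absent fields out
rather than guessing at them keeps the header honest about what is known.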

> - The description of the level 2 TEI encoding in the wiki seems to suggest
> that raw images of the pages should accompany the TEI file. Is that correct?
> If so, are they meant to be referenced in the TEI file in some way (eg. in
> the <pb> tags)?

There are three ways to do this:

http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Linking_Between_Encoded_Text_and_Images_of_Source_Documents
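
As an illustration only, one TEI mechanism for pointing from a page break to
its image is the facs attribute on <pb>; check the section linked above for
which of the three approaches suits your data.  In this Python sketch the
function name and the image path are hypothetical, and the fragment is built
without the TEI namespace for brevity:

```python
import xml.etree.ElementTree as ET


def page_break(page_number, image_uri):
    """Return a <pb> element that records the page number and page image."""
    pb = ET.Element("pb")
    pb.set("n", str(page_number))   # page number as printed or counted
    pb.set("facs", image_uri)       # pointer to the scanned page image
    return pb


# Hypothetical image path, used only to show the shape of the output.
print(ET.tostring(page_break(5, "images/00000005.tif"), encoding="unicode"))
```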

> - The desired encoding level: In practice the quality of OCR data we have
> varies widely. Some books that have been processed more recently and/or have
> had the benefit of human operators looking at the books are likely to have
> more structural information such as paragraph breaks, headers and footers,
> lists, and maybe even more details like tables and table of contents. These
> details would belong in levels 3 and 4 of the TEI encoding guidelines. Would
> it be acceptable to mix these tags from the different levels in the final TEI
> file? Or were you thinking of pruning the tags to discard all information
> higher than the desired target encoding level?
> FWIW, I would take the position of putting as much information in the TEI
> output as is available because, as Martin said, it is usually easier to
> correct such tags if necessary than start from scratch.

These levels are guidelines.  While we intend to produce schemas for
validation at the various levels, we understand that people might have
something like "level 2+", with some additional structure they are able
to add.  So this is fine. Furthermore, at this point there are basically
no software applications that expect content to conform exactly to an
encoding level.  So it's good to add as much structure as you can and
let users figure out whether they want to strip some of the markup.

> - Validation:  Do you have a tool for validating a TEI file? It would
> certainly be beneficial for me to send you some generated TEI output to get
> your comments, as Peter suggested. But a validator would be useful as part of
> our automated code testing tools, and probably help us scale our efforts
> better.

There is no single tool, since TEI is not a fixed specification.  You
typically create a schema for your own customization and validate against
that.  If you want to validate against everything possible in the TEI, you
can validate against:

http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng

> - Page Headers: What is the TEI element for representing page headers (ie.
> the region above page text that often contains the page number and chapter
> name). Or are these regions not meant to be represented in the TEI output?

The element is <fw> ("forme work").  We recommend discarding it at level 3
and above:

http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Forme_Work
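
If page headers are captured as <fw> and later need to be dropped, the
removal could look something like the Python sketch below.  It assumes
un-namespaced element names (a real TEI file would need the namespace in the
tag comparison), and the function name is hypothetical:

```python
import xml.etree.ElementTree as ET


def strip_forme_work(root):
    """Remove every <fw> element in place, keeping the text that follows it."""
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == "fw":
                # Text after </fw> is stored in child.tail; move it to the
                # preceding sibling (or the parent) so it is not lost.
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root


sample = ET.fromstring(
    '<div><fw type="header">WALDEN 17</fw><p>First paragraph.</p></div>'
)
print(ET.tostring(strip_forme_work(sample), encoding="unicode"))
```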

> - Figures: How are figures/page graphics to be treated? We rarely have
> information about figures - sometimes we have the caption but usually not.
> Did you want us to create cropped image regions from the original scan and
> insert references to them in the <figure> tags? Or do we drop these
> altogether?

It would be good to include the cropped images as you suggest.  We
recommend including figures only at level 4, not at level 3.  (Mentions of
them at level 3 are a known bug in "Best Practices for TEI in Libraries".)
See:

http://www.tei-c.org/SIG/Libraries/teiinlibraries/#Level_4_Figures
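
A rough Python sketch of emitting such a reference, assuming the cropped
region has already been written to an image file.  The function name and
file path are hypothetical, the use of <head> for the caption is one common
choice rather than a requirement, and the fragment omits the TEI namespace:

```python
import xml.etree.ElementTree as ET


def figure_element(image_uri, caption=None):
    """Return a <figure> that points at a cropped image of the scan."""
    figure = ET.Element("figure")
    # The caption is often unavailable, so it is only added when present.
    if caption:
        ET.SubElement(figure, "head").text = caption
    # <graphic> carries the pointer to the cropped image file.
    ET.SubElement(figure, "graphic", {"url": image_uri})
    return figure


# Hypothetical path for a region cropped from page 42 of the scan.
print(ET.tostring(figure_element("figures/p0042_fig1.png"), encoding="unicode"))
```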

Happy to answer further questions!

Kevin