[tei-council] first stab at Google > TEI
martinmueller at northwestern.edu
Tue Jun 21 21:52:03 EDT 2011
It's great to have a first stab at a TEI version of a Google book. I'm
taking the liberty of cc'ing the members of the TEI Council. Many of them
have a keen interest in your project succeeding, and I'm sure they'll give
you excellent advice.
I opened the file in oXygen, which has validation built in and also
supports batch validation of whole directories. Sebastian Rahtz and other
TEI Council gurus can tell you much more about validation at scale.
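Validation can also be scripted. As a minimal stdlib-only sketch (real
schema validation would use oXygen, Jing, or lxml against the tei_all.rng
schema; the function names here are my own), a batch well-formedness check
over a directory of XML files might look like:

```python
import os
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML string parses cleanly."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def batch_check(directory):
    """Map each .xml file under `directory` to a well-formedness flag."""
    results = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".xml"):
                path = os.path.join(root, name)
                with open(path, encoding="utf-8") as fh:
                    results[path] = is_well_formed(fh.read())
    return results
```

Well-formedness is of course a much weaker check than schema validity, but
it is cheap enough to run inside automated tests.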
I don't have a view about running heads in books. I'd probably drop them
from the TEI version, but I'm sure there are ways of keeping them in some
element, which almost certainly would not be <p>.
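For what it's worth, the element the TEI conventionally uses for running
heads and other "forme work" is <fw>. A hypothetical sketch of emitting one
next to a page break, assuming the head text has already been detected (the
function name is my own invention):

```python
import xml.etree.ElementTree as ET

def page_break_with_running_head(page_number, head_text):
    """Build a <pb/> plus an <fw type="header"> element, the TEI
    convention for keeping a running head out of <p> content."""
    pb = ET.Element("pb", {"n": str(page_number)})
    fw = ET.Element("fw", {"type": "header", "place": "top"})
    fw.text = head_text
    return pb, fw
```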
You currently don't have a way of identifying paragraphs that begin on one
page and end on another. Working that out would be the single greatest
improvement. Tim Cole at the University of Illinois (UIUC) and one of his
programmers developed a method for merging paragraphs that span pages; I'm
cc'ing him on this memo. I am sure he will be delighted to share his
algorithms with you. They were very good and made few errors.
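To make the idea concrete (this is only a heuristic sketch of my own, not
Tim Cole's actual algorithm): a paragraph that ends a page without
sentence-final punctuation is assumed to continue into the first paragraph
of the next page.

```python
import re

def merge_page_straddlers(pages):
    """Merge paragraphs that straddle page boundaries.

    `pages` is a list of pages, each a list of paragraph strings.
    A page's last paragraph is assumed to continue onto the next
    page when it lacks sentence-final punctuation.
    """
    merged = []
    carry = ""
    for page in pages:
        paras = list(page)
        if carry:
            if paras:
                paras[0] = carry + " " + paras[0]
            else:
                # Empty page (e.g. a plate): keep carrying forward.
                paras = [carry]
        carry = ""
        if paras and not re.search(r'[.!?]["\')\]]*\s*$', paras[-1]):
            carry = paras.pop()
        merged.extend(paras)
    if carry:
        merged.append(carry)
    return merged
```

The punctuation heuristic will misfire on paragraphs that happen to break
mid-sentence at a full stop, which is why human checking remains part of
the workflow.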
A bigger problem, and one harder to solve across different books, is chapter
divisions. Katrina Fenlon at UIUC worked on this and had some ingenious
regular expressions that looked for strings identifying chapter headings;
you then put </div><div> before each match, or something like it. This
works less reliably and requires more human intervention.
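As a sketch of what I mean (a hypothetical pattern of my own, not Katrina
Fenlon's actual expressions):

```python
import re

# Hypothetical heading pattern: matches "CHAPTER IV", "Chapter 12", etc.
CHAPTER_RE = re.compile(r'^(?:CHAPTER|Chapter)\s+(?:[IVXLC]+|\d+)\b.*$',
                        re.MULTILINE)

def mark_chapter_divisions(text):
    """Insert </div><div> before each detected chapter heading.

    The first match still needs human cleanup (there is no open
    <div> before it), which is exactly where the human
    intervention comes in.
    """
    return CHAPTER_RE.sub(lambda m: '</div><div>' + m.group(0), text)
```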
There is a very general point here about the utility of TEI versions of
Google books. People interested in a TEI version of a Google book would go
for text that they consider high-value documents (high value being very much
in the eye of the beholder). And they might want an algorithmically created
"rough cut" version that they could then lick into shape in a matter of
hours or perhaps days, and contribute back to some archive rather in the way
in which biologists contribute genomes of this or that animal to a large and
growing gene bank. I don't imagine anybody doing this with an 1863 edition
of Gulliver's Travels, but I could easily imagine somebody doing it with a
first edition of Lewis Carroll's Sylvie and Bruno, or, for that matter,
with an 18th-century edition of Swift's novel.
If such people get a text that recognizes paragraphs and can handle
page-straddling with a high degree of accuracy, the human labor of marking
chapter divisions etc. becomes affordable. If I care enough about the text,
hunting down twelve chapter divisions is a pleasant evening's work, and
modern tools like oXygen have drastically lowered the pain threshold for
working with raw XML. Fixing several hundred page-breaking paragraphs is
still a big enough bore to keep you from doing it; checking them and
fixing the occasional one is OK.
This is the kind of calculus that fits into Clay Shirky's recent book
Cognitive Surplus, about how to engage users in various kinds of digital
work that give them pleasure and satisfaction.
Anyhow, I am absolutely delighted with your work. With a few improvements
here and there, I think it has enormous potential for creating highly
curatable "rough cuts." There are a lot of texts that it would be good to
have in this condition, and once they are in that condition I think there
will be a lot of people all over the world, ranging from ambitious high
school students to retired lawyers and similar folks, who would find it
interesting to lick such texts into something like final shape. And it would
be a great project to design a user-friendly framework that would make such
licking into shape a rewarding task.
Chair, TEI Board
From: Ranjith Unnikrishnan <ranjith at google.com>
Date: Tue, 21 Jun 2011 16:51:38 -0700
To: Peter Gorman <pgorman at library.wisc.edu>
Cc: Martin Mueller <martinmueller at northwestern.edu>, Ram Subbaroyan
<ramram at google.com>, Salim Virji <salim at google.com>, Jeff Breidenbach
<jbreiden at google.com>
Subject: Re: additional contact for TEI
Hi Peter and Martin,
To follow up on my previous email, attached is a generated TEI file for your
comments. The file is generated from the book "Gulliver's Travels,"
which is visible in full on the Google Books site.
I'd like to ask you to ignore the quality of OCR output in the file; this
book was processed a long time ago and we've made several advances in our
OCR technology since then (and we'll reprocess books before generating their
final TEI files to publish). Instead, I'd like to draw your attention to how
the document structure extracted from OCR output is translated to the TEI
file, and would appreciate your feedback on that.
On Fri, Jun 17, 2011 at 3:14 PM, Ranjith Unnikrishnan <ranjith at google.com> wrote:
> Hi Martin and Peter,
> Thank you for the informative emails. A couple of comments/questions:
> - The TEI Header: As you mentioned, populating the many fields in the TEI
> header automatically from OCR output is a hard problem. However, it should be
> possible for us to use the book metadata that is supplied by
> libraries/publishers to populate some of the more common fields (e.g., the
> author, title, year and place of publication, publisher name). Would that be
> acceptable? The caveat is that the data would be coming from a different
> source that, although quite accurate in comparison to OCR, need not be
> error-free either.
> - The description of the level 2 TEI encoding in the wiki seems to suggest
> that raw images of the pages should accompany the TEI file. Is that correct?
> If so, are they meant to be referenced in the TEI file in some way (e.g., in the
> <pb> tags)?
> - The desired encoding level: In practice the quality of OCR data we have
> varies widely. Some books that have been processed more recently and/or have
> had the benefit of human operators looking at the books are likely to have
> more structural information such as paragraph breaks, headers and footers,
> lists, and maybe even more details like tables and table of contents. These
> details would belong in levels 3 and 4 of the TEI encoding guidelines. Would it
> be acceptable to mix these tags from the different levels in the final TEI
> file? Or were you thinking of pruning the tags to discard all information
> higher than the desired target encoding level?
> FWIW, I would take the position of putting as much information in the TEI
> output as is available because, as Martin said, it is usually easier to
> correct such tags if necessary than start from scratch.
> - Validation: Do you have a tool for validating a TEI file? It would
> certainly be beneficial for me to send you some generated TEI output to get
> your comments, as Peter suggested. But a validator would be useful as part of
> our automated code testing tools, and probably help us scale our efforts.
> - Page Headers: What is the TEI element for representing page headers (i.e., the
> region above the page text that often contains the page number and chapter name)?
> Or are these regions not meant to be represented in the TEI output?
> - Figures: How are figures/page graphics to be treated? We rarely have
> information about figures - sometimes we have the caption but usually not. Did
> you want us to create cropped image regions from the original scan and insert
> references to them in the <figure> tags? Or do we drop these altogether?
> I'm in the process of making some improvements to my existing code, in
> addition to trying to get richer header tags as described earlier, and am
> aiming to send you a generated TEI file for your comments very soon. Cheers,
> On Thu, Jun 16, 2011 at 2:02 PM, Martin Mueller
> <martinmueller at northwestern.edu> wrote:
>> I'm delighted to see action and enthusiasm for a Google/TEI project and
>> add my two cents' worth, in full awareness that I don't know the landscape
>> all that well. The only Google text I've ever looked at with some care
>> is an e-pub version of Hofmannsthal's Der Schwierige. From it, I gather
>> that the level of granularity in the encoding may be more like Level 3 of
>> the current draft of the Guidelines to Best Practices. That is to say,
>> paragraphs are identified. I don't think the e-pub
>> version can distinguish between verse and prose (and there is no verse in
>> Der Schwierige).
>> So shooting for a level of granularity that captures paragraphs seems to
>> me desirable. I agree, though, with Peter Gorman's view that it is
>> better to aim at a lower level that can be captured with some precision
>> than at a higher level that leads to many errors.
>> On the other hand, if you have a version in which paragraphs and lines of
>> verse are alike coded as <p> in the Google text, it is a lot easier for a human
>> editor to change the relevant <p> elements to <l> than to start from
>> scratch, and coding lines of verse consistently as <p> is not wrong. It's
>> just coarse.
>> I approach the whole business of TEI encoded Google docs from their
>> potential as XML versions that some humans at some subsequent point might
>> want to "upcode," whether manually or through some combination of
>> automatic, semi-automatic, or manual procedures. So it's a matter of
>> striking the balance between avoiding error and maximizing improvability.
>> I don't know enough about Google based encoding to have a clear idea where
>> that balancing point is -- and it may differ between different kinds of
>> texts, which may not be something that the process can attend to -- but I
>> hope this general reflection is of some use.
>> On 6/16/11 2:06 PM, "Peter Gorman" <pgorman at library.wisc.edu> wrote:
>>> >Hi, Ranjith - welcome to the TEI bandwagon!
>>> >I'm including Martin Mueller, Chair and CEO of the TEI Consortium, on
>>> >this message, as this project is very important to the TEI community.
>>> >I wouldn't presume to speak for the entire TEI or library community, but
>>> >it's safe to say that an important goal is to be able to process
>>> >Google-digitized texts along with those coming from other efforts like
>>> >the Text Creation Partnership (TCP), without scholars having to do
>>> >extensive prior conversion from PDF, METS or EPub. Of course, in a
>>> >community this large there is a great diversity of opinion about what
>>> >"good enough" TEI should look like, hence efforts like the TEI in
>>> >Libraries Guidelines. And as I've pointed out to Salim, the level of
>>> >markup you produce is going to be constrained by your inputs.
>>> >Personally, I'd look for a lighter level of markup that's consistent and
>>> >semantically valid (against the TEI Guidelines) rather than a deeper
>>> >level of markup that's often incorrect. Level 2 seemed to me to be a
>>> >reasonable goal to shoot for, though there may be parts (particularly in
>>> >the TEI Header) that may be difficult for you to do. I'd be happy to
>>> >comment on any output you generate.
>>> >Peter C. Gorman
>>> >Head, University of Wisconsin Digital Collections Center
>>> >pgorman at library.wisc.edu
>>> >(608) 265-5291
>>> >On Jun 15, 2011, at 11:46 AM, Ranjith Unnikrishnan wrote:
>>>> >> Thanks for the introduction, Jeff, and thank you Peter for your
>>>> >>willingness to share your expertise with us.
>>>> >> Peter, I have a lot of questions related to TEI, but my primary
>>>> >>question is of what the exact requirements are. An email that Jeff
>>>> >>forwarded me from you titled "Best practices for TEI in libraries" dated
>>>> >>May 10th had a link to the Level 2 (minimal encoding) section of the TEI
>>>> >>spec. Is that the minimal set of requirements that we should be aiming
>>>> >>to satisfy? If not, can you clarify what they are?
>>>> >> Best wishes,
>>>> >> Ranjith
>>>> >> On Tue, Jun 14, 2011 at 5:24 PM, Jeff Breidenbach <jbreiden at google.com> wrote:
>>>> >> Hi Peter,
>>>> >> One of my action items from the call today was to check on our TEI
>>>> >> engineer Salim. Done! In addition I'd like to introduce to another OCR
>>>> >> engineer, Ranjith who is very excited about TEI (well, at least
>>>> >> compared to me!). Have a productive time.
>>>> >> Jeff