[tei-council] Fwd: Re: Google Books > TEI

Tue Mar 6 21:40:03 EST 2012

All,

I'm forwarding two messages from our contact at Google (cc'd here), 
in which he answers the questions that came up during our last conference 
call and asks us for some further information.  He 
accidentally sent the second message only to me rather than to our 
whole committee, so I'm copying Laurent as well since he's no longer on 
the TEI Council.

(Ranjith, to prevent spam, only members can post to tei-council, but I 
can forward messages from you to the group.)

---Kevin

-------- Original Message --------
Subject: 	Re: Google Books > TEI
Date: 	Mon, 5 Mar 2012 15:43:13 -0800
From: 	Ranjith Unnikrishnan <ranjith at google.com>
To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
CC: 	James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr

Hi Kevin,

Responses are inline.

    1. Would Google be willing to release the code used for the
    transformation from Google's internal data format to TEI?  We
    realize the code might not be useful to anyone without the
    underlying data, but it might help users who are trying to
    understand what you produce and what errors are likely.

I personally don't think this is likely because (i) in addition to the 
source material we'd have to open-source our internal data format for 
the code to make any sense, and (ii) during the course of processing, 
the code interacts with many internal storage systems that cannot be 
made publicly accessible.

That said, the significant errors originate not from the process of 
converting our internal data format to TEI, but from poor OCR and/or 
structure analysis. So I'm not convinced that looking at the conversion 
code would serve the purpose of better understanding the errors. The 
bulk of the OCR work is done using the Tesseract OCR engine, which is 
open-source and is updated periodically, and so may be a better place 
for interested parties to look.

    2. Will there be a mechanism for users to report errors in the TEI
    files or submit revised TEI files?  This might relate to the work
    you're doing with James Cummings and others on matching TEI
    documents with scanned editions of the same source document.

This was my hope and intention from the start, but we don't have a 
feedback mechanism for this at the moment and I'm not sure what the best 
scalable solution is. In fact I'm eager to get your ideas on this since, 
as both stewards and users of the format, you are in a better position 
than us to guide and drive this correction process in the community of 
TEI users.

I'm not sure which document-matching work you're referring to, but 
I am working with James to import some public-domain material that he 
can provide. My end of the work requires setting up the infrastructure 
to ingest ePub files directly into our corpus so that it is freely 
available to a broader audience via the Google Books website. In that 
respect, the work could relate to your comment as it should result in a 
way to import whole corrected TEI files (via an intermediate conversion 
to ePub).

On a related note, you may be aware of an early-stage project out of 
Texas A&M University that is aimed at digitizing 18th-century books via 
crowd-sourcing in combination with OCR. I've been trying to motivate 
them to use TEI as an input/output format, for two reasons. First, when 
we eventually provide TEI files for public domain books, they will have 
a free source of input material to correct. Second, if they wish to 
return the corrected TEI files to us to make freely available on Google 
Books, it will benefit everyone and every community involved. I don't know if 
they're completely on board with this vision, but I do know that some of 
the tools they plan to use do accept TEI.

~Ranjith

-------- Original Message --------
Subject: 	Re: scope of Google Books > TEI
Date: 	Tue, 6 Mar 2012 17:36:31 -0800
From: 	Ranjith Unnikrishnan <ranjith at google.com>
To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>

Actually, I'd like to turn that question around if you don't mind. Can 
you give some convincing arguments as to why such a release, if it were 
to happen, would be useful to the community?

The reason I'm asking is that we're still in early days and at this 
point I'm one of the few voices advocating making TEI files available in 
the manner you describe. If you can provide me some arguments, 
preferably with some backing numbers, around the impact you anticipate 
this having, it would help bolster my position and help move this forward.

On Tue, Mar 6, 2012 at 4:26 PM, Kevin Hawkins 
<kevin.s.hawkins at ultraslavonic.info> wrote:

    One more question, Ranjith: will this code, once deployed, make TEI
    available as a download format for all material in the public domain
    in the user's location?  Do you have a sense of how many titles this
    will affect in Google Books?  The TEI Consortium would like to be
    ready to generate publicity around this, so having some stats will help.