[tei-council] Fwd: Re: Google Books > TEI
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Tue Mar 6 21:40:03 EST 2012
All,
I'm forwarding two messages from our contact at Google (cc'd here),
where he answers the questions that came up during our last conference
call with some requests for further information from us. He
accidentally sent the second message only to me rather than to our
whole committee, so I'm copying Laurent as well since he's no longer on
the TEI Council.
(Ranjith, to prevent spam, only members can post to tei-council, but I
can forward messages from you to the group.)
---Kevin
-------- Original Message --------
Subject: Re: Google Books > TEI
Date: Mon, 5 Mar 2012 15:43:13 -0800
From: Ranjith Unnikrishnan <ranjith at google.com>
To: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
CC: James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr
Hi Kevin,
Responses are inline.
1. Would Google be willing to release the code used for the
transformation from Google's internal data format to TEI? We
realize the code wouldn't be useful to anyone without the
underlying data, but it might help users who are trying to
understand what you produce and what errors are likely.
I personally don't think this is likely because (i) in addition to the
source material we'd have to open-source our internal data format for
the code to make any sense, and (ii) during the course of processing,
the code interacts with many internal storage systems that cannot be
made publicly accessible.
That said, the significant errors originate not from the process of
converting our internal data format to TEI, but from poor OCR and/or
structure analysis. So I'm not convinced that looking at the conversion
code would serve the purpose of better understanding the errors. The
bulk of the OCR work is done using the Tesseract OCR engine, which is
open-source and is updated periodically, and so may be a better place
for interested parties to look.
2. Will there be a mechanism for users to report errors in the TEI
files or submit revised TEI files? This might relate to the work
you're doing with James Cummings and others on matching TEI
documents with scanned editions of the same source document.
This was my hope and intention from the start, but we don't have a
feedback mechanism for this at the moment and I'm not sure what the best
scalable solution is. In fact I'm eager to get your ideas on this since,
as both stewards and users of the format, you are in a better position
than us to guide and drive this correction process in the community of
TEI users.
I'm not sure of the document matching work that you're referring to, but
I am working with James to import some public-domain material that he
can provide. My end of the work requires setting up the infrastructure
to ingest ePub files directly into our corpus so that it is freely
available to a broader audience via the Google Books website. In that
respect, the work could relate to your comment as it should result in a
way to import whole corrected TEI files (via an intermediate conversion
to ePub).
On a related note, you may be aware of an early-stage project out of
Texas A&M University that is aimed at digitizing 18th-century books via
crowd-sourcing in combination with OCR. I've been trying to motivate
them to use TEI as an input/output format so that when we eventually
provide TEI files for public domain books, they will have a free source
of input material to correct, and secondly if they wish to return the
corrected TEI files to us to make freely available on Google Books then
it will benefit everyone and every community involved. I don't know if
they're completely on board with this vision, but I do know that some of
the tools they plan to use do accept TEI.
~Ranjith
-------- Original Message --------
Subject: Re: scope of Google Books > TEI
Date: Tue, 6 Mar 2012 17:36:31 -0800
From: Ranjith Unnikrishnan <ranjith at google.com>
To: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
Actually, I'd like to turn that question around if you don't mind. Can
you give some convincing arguments as to why such a release, if it were
to happen, would be useful to the community?
The reason I'm asking is that we're still in early days and at this
point I'm one of the few voices advocating making TEI files available in
the manner you describe. If you can provide me some arguments,
preferably with some backing numbers, around the impact you anticipate
this having, it would help bolster my position and help move this forward.
On Tue, Mar 6, 2012 at 4:26 PM, Kevin Hawkins
<kevin.s.hawkins at ultraslavonic.info> wrote:
One more question, Ranjith: will this code, once deployed, make TEI
available as a download format for all material in the public domain
in the user's location? Do you have a sense of how many titles this
will affect in Google Books? The TEI Consortium would like to be
ready to generate publicity around this, so having some stats will help.