[tei-council] Fwd: Re: Google Books > TEI

Wed Mar 7 09:48:33 EST 2012

Thanks for all of this Kevin....very interesting. More thoughts and
questions below....

Becky

>    2. Will there be a mechanism for users to report errors in the TEI
>    files or submit revised TEI files?  This might relate to the work
>    you're doing with James Cummings and others on matching TEI
>    documents with scanned editions of the same source document.
>
>
> This was my hope and intention from the start, but we don't have a
> feedback mechanism for this at the moment and I'm not sure what the best
> scalable solution is. In fact I'm eager to get your ideas on this since,
> as both stewards and users of the format, you are in a better position
> than us to guide and drive this correction process in the community of
> TEI users.

Ranjith's comment is in line with my (admittedly vague) sense of what
Google currently allows: that you can report errors, but this simply
puts the reported book back on a queue to be re-scanned and re-OCRed.
One challenge here may be that Google doesn't want corrections
inserted in the middle or at the end of their workflow, because if the
book is ever re-processed, by accident or on purpose, those changes
would be lost.
>
> I'm not sure of the document matching work that you're referring to, but
> I am working with James to import some public-domain material that he
> can provide. My end of the work requires setting up the infrastructure
> to ingest ePub files directly into our corpus so that it is freely
> available to a broader audience via the Google Books website. In that
> respect, the work could relate to your comment as it should result in a
> way to import whole corrected TEI files (via an intermediate conversion
> to ePub).

James, these are the ECCO-TCP texts (and perhaps other Oxford Text
Archive texts), right? If matching the texts up to scanned books is
*not* part of this work, does that mean that from Google's perspective
these works would only consist of electronic text, and no page images?
That's the part that seemed to me like it might tie in with the
development of a text feedback mechanism, because it implies a
workflow where the electronic text is the original object, rather than
the output of another process (in contrast with the current scan + OCR
model). If text is 1) going to be accepted from the outside in the
first place and 2) not going to be generated or overwritten by a
Google engine, building a way to make or recommend corrections to the
text starts to look more reasonable.
>
> On a related note, you may be aware of an early stage project out of
> Texas A&M University that is aimed at digitizing 18th century books via
> crowd-sourcing in combination with OCR. I've been trying to motivate
> them to use TEI as an input/output format so that when we eventually
> provide TEI files for public domain books, they will have a free source
> of input material to correct, and secondly if they wish to return the
> corrected TEI files to us to make freely available on Google Books then
> it will benefit everyone and every community involved. I don't know if
> they're completely on board with this vision, but I do know that some of
> the tools they plan to use do accept TEI.

Very interesting! I'm aware of the work at Texas A&M (the TCP has
provided lots of 18th-century texts to them for testing this OCR) and
glad to hear that they're in touch with Google about it. I agree it
would be nice if they could generate TEI as an output, but I'm not
sure I understand what is meant by TEI as an input. Since this work
involves OCR, isn't the desired input format just the page images? My
understanding of the project at Texas A&M is that the goal is to train
OCR to successfully capture characters in books/fonts where this is
currently still too hard to do. If this project is successful, it
would be great if the tool they build could be used to generate more
accurate OCR for older books in Google's corpus. But I'm not sure that
crowdsourced correction of existing electronic text fits into the
mission of their project.

> Actually, I'd like to turn that question around if you don't mind. Can
> you give some convincing arguments as to why such a release, if it were
> to happen, would be useful to the community?
> The reason I'm asking is that we're still in early days and at this
> point I'm one of the few voices advocating making TEI files available in
> the manner you describe. If you can provide me some arguments,
> preferably with some backing numbers, around the impact you anticipate
> this having, it would help bolster my position and help move this forward.
>
>
> On Tue, Mar 6, 2012 at 4:26 PM, Kevin Hawkins
> <kevin.s.hawkins at ultraslavonic.info
> <mailto:kevin.s.hawkins at ultraslavonic.info>> wrote:
>
>    One more question, Ranjith: will this code, once deployed, make TEI
>    available as a download format for all material in the public domain
>    in the user's location?  Do you have a sense of how many titles this
>    will affect in Google Books?  The TEI Consortium would like to be
>    ready to generate publicity around this, so having some stats will help.
>
I'm not sure I understand what Ranjith wants here. Is this a request
for justification of making the TEI available for download, or
justification for doing TEI at all? If the TEI files would not be
available for download, what's the point of producing them?