[tei-council] TEI and Google

Martin Mueller martin.mueller at mac.com
Wed Mar 23 01:12:42 EDT 2011

I'd like to brief the Council and the Board on some conversations that have
recently taken place about Google books being published in TEI.

On March 3, there was a "Corpora Camp" meeting at Maryland for participants
in Bamboo's Corpora Space activities. Jon Orwant from Google attended that
meeting. Orwant appears to be the informal liaison between Google and
humanities scholarship. He mentioned casually that there might be some
interest at Google in exploring this option.

I followed up subsequently with an email. I quote part of it:

> I am following up on our brief exchange about TEI and Google Books at the
> Maryland Corpora Camp.
> From the perspective of humanities scholars there are distinct advantages to
> Google books being available in a simple and tightly controlled TEI P5 format.
> With regard to bibliographical and basic structural data it would mean that
> Google books would be interoperable with the scholarly text archives that have
> been produced by major research libraries over the past decade, whether the
> TCP texts, Documenting the American South, the Wright Fiction Archive, or, for
> that matter, the papers of the US State Department.
> I imagine a scholarly eco-system in which researchers will increasingly work
> with text archives on a mix and match basis and will want texts that are
> easily pried from their silos, whether for plain reading, additional forms of
> curation, or algorithmic manipulation. TEI versions of Google books would be a
> terrific contribution to such an eco-system.

Orwant replied that he had "written it up and advertised it inside Google
for people with spare time.  We'll see who bites!"

Peter Gorman from Wisconsin reports that the same matter was brought up at a
meeting of the Google Library Quality Group. I quote from a memo he sent to
various Bamboo folks:

> Last Wednesday we discussed with Jon Orwant whether Google might be
> willing to provide a TEI-encoded version of the Google Books content.
> Although Jon's no longer formally associated with Google Books, he
> saw the usefulness of the idea and agreed to take the idea back to
> Google. I'm happy to report that he's submitted a feature request for
> this through Google's pipeline. That doesn't necessarily mean they'll
> do it, but it's officially on their radar. During yesterday's quality
> group meeting I gave the Google engineers some background on TEI and
> our Corpora Camp discussion. The other group members (representatives
> from Library Partner institutions and Google staff) thought this was
> an idea worth pursuing. I think we (libraries, scholars, TEI
> Consortium) should try to come to some kind of informal consensus on
> what that TEI should look like, perhaps through a revivified TEI
> Libraries SIG.

This seems to be a ball that is rolling around in some court, and if we want
to kick it in some direction we should think about doing that sooner rather
than later. 

I spent the evening reading Hofmannsthal's Der Schwierige in a Google
facsimile and looked at its epub version. I was reminded of some very
interesting experiments that Tim Cole and various staff people at UIUC have
done with converting what I call white-space XML into TEI. It appears that
you can go pretty far with some combination of algorithmic transformation
and human curation.

I can see many upsides and few downsides in a very basic flavour of TEI
becoming an optional format for some Google books, and I share Peter's sense
that we scholars, librarians, hackers, TEI-ers should find various venues to
discuss the whether, what, and how of all this.

