[tei-council] TEI versions of Gutenberg and Google

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Mon Apr 18 15:20:47 EDT 2011


I think the question of how Project Gutenberg maintains their schema is 
separate from how they derive their TEI texts.

Even though Martin Mueller hasn't directly confirmed my earlier query, I 
believe he uses "white space XML" to mean something quite similar to 
Levels 1 and 2 in the Best Practices for TEI in Libraries, which assume 
output from OCR software in which whitespace has some significance.  I 
think Tim Cole isn't the only one using heuristics to determine the 
structure of a document: such technology was worked out years ago 
at places like Xerox and Hewlett-Packard and put into practice by, say, 
Amazon, which will take your PDF file and convert it into the Kindle 
format, or by Google to produce an EPUB file.

The Kindle and EPUB formats are essentially wrappers (a zip archive, in 
the case of EPUB) for a constrained form of HTML, plus accessory files 
like embedded images. 
The current version of EPUB lets you include DAISY content instead of 
HTML, but EPUB 3 is moving toward HTML5 only.  In either case, the 
structure here is quite minimal.
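
To make the "zip wrapper" point concrete, here is a minimal Python sketch 
using only the standard library.  It builds a tiny EPUB-like archive in 
memory and lists what is inside.  The member names (OEBPS/chapter1.xhtml 
and so on) and the helper function names are my own assumptions for 
illustration, not what Google or anyone else actually ships.

```python
import io
import zipfile


def build_sample_epub():
    """Build a tiny EPUB-like zip archive in memory (hypothetical file names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The EPUB container format expects an uncompressed "mimetype" entry first.
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip")
        zf.writestr("META-INF/container.xml", "<container/>")
        zf.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Text</p></body></html>")
        zf.writestr("OEBPS/images/cover.png", b"")
    return buf.getvalue()


def list_members(epub_bytes):
    """Return the names of all files stored in the archive, in archive order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()


def html_members(epub_bytes):
    """Return only the members that look like (X)HTML content files."""
    return [n for n in list_members(epub_bytes)
            if n.endswith((".html", ".xhtml", ".htm"))]


if __name__ == "__main__":
    data = build_sample_epub()
    for name in list_members(data):
        print(name)
```

With a real file you would pass its path to zipfile.ZipFile instead of 
the in-memory bytes; no renaming to ".zip" is needed for that.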

So you might take a Google EPUB file, unzip it (which on some systems is 
most easily accomplished by giving the file a ".zip" extension), and 
see what the content looks like.  I took a quick look at 
Hofmannsthal's Der Schwierige; turning this into lightly encoded TEI 
shouldn't be difficult.  If you don't want to wait on Google to 
implement TEI export from Google Books, someone could write the XSLT and 
share it on the TEI wiki.
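
As a rough idea of what such a transformation involves, here is a sketch 
in Python rather than XSLT.  It maps a deliberately tiny, hypothetical 
subset of XHTML (h1 and p only) onto TEI head and p elements inside a 
single div.  The mapping table, the function names, and the sample input 
are all assumptions of mine for illustration; the result is not 
guaranteed to be valid TEI, and a real conversion would need to handle 
far more of the markup.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, deliberately small mapping from XHTML names to TEI names.
TAG_MAP = {"h1": "head", "p": "p"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def xhtml_to_tei_body(xhtml_text):
    """Return a TEI <body> holding one <div> with the mapped elements."""
    root = ET.fromstring(xhtml_text)
    body = ET.Element("{%s}body" % TEI_NS)
    div = ET.SubElement(body, "{%s}div" % TEI_NS)
    for el in root.iter():
        tei_name = TAG_MAP.get(local_name(el.tag))
        if tei_name is None:
            continue  # anything outside the mapping is skipped
        new = ET.SubElement(div, "{%s}%s" % (TEI_NS, tei_name))
        # Inline markup is flattened to plain text and whitespace normalized.
        new.text = " ".join("".join(el.itertext()).split())
    return body


# Invented sample input, not taken from any actual Google EPUB file.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Erster Akt</h1>"
    "<p>Ein  <i>kurzer</i> Absatz.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(ET.tostring(xhtml_to_tei_body(SAMPLE), encoding="unicode"))
```

Flattening inline markup keeps the sketch short; a real stylesheet would 
presumably map things like italics to hi elements instead of discarding 
them.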

On 4/18/2011 9:52 AM, Martin Mueller wrote:
> Has there ever been any TEI version of a Gutenberg text produced in this
> manner?  I've written to Matt Jockers to see what they've done with their
> upcoding.
>
> On a related matter, has anybody tried to convert Google ebooks to TEI?
> This would be a sort of reverse engineering and unnecessary if Google has
> a TEI output format. On the other hand, demonstration projects from within
> the TEI community might spur them.
>
> I recently read Hofmannsthal's Der Schwierige in a Google epub version and
> looked at the encoding and OCR.  There have been some interesting
> experiments at UIUC about transforming the "white space XML" of OCR output
> into TEI. It takes only minimal forms of human intervention to correct
> basic structural errors.  One of the attractive aspects of TEI texts that
> originate in OCR is that you end up with a text that is a chain of digital
> surrogates stretching from the page image through the transcription to the
> data derivatives and aggregates that you can construct from a corpus of
> such texts.
>
> Laura Mandell and the 18thConnect Project are very active in that field.
>
> I don't know enough about what goes on inside an epub text to figure out
> how to transform it. But it looks like it's possible, and if we did a few
> proof-of-concept conversions, it might be a way of nudging Google.
>
> MM
> On 4/18/11 12:04 AM, "Laurent Romary"<laurent.romary at inria.fr>  wrote:
>
>> The idea would be to have a way to make those projects maintain their
>> schema in a way which is closer to what the TEI itself does. They could
>> even maintain these on SF. By helping them go this way, we would go in
>> the direction indicated by Martin M. in his talk last week.
>>
>> Le 18 avr. 2011 à 00:25, Lou Burnard a écrit :
>>
>>> We did have quite a bit of discussion with Marcello back in 2006 or 2007
>>> or so, but I haven't heard much from them lately.
>>>
>>> Re-expressing their subset of TEI as an ODD would be a fairly trivial
>>> exercise, but I'm not sure what it would achieve.
>>>
>>> On 17/04/11 17:01, Laurent Romary wrote:
>>>> Would someone already in close contact with PG be eager to take an
>>>> informal contact with the guy maintaining the page and see whether he
>>>> would like having his schema as a P5 ODD, which would also allow him to
>>>> update some tiny features here and there (like using xml:lang)?
>>>>
>>>>
>>>> Le 17 avr. 2011 à 17:48, Piotr Bański a écrit :
>>>>
>>>>> Regarding the need to secure the choice of TEI as the format of choice
>>>>> in (among others) digitization of literary works, should we not pay
>>>>> special attention to developments such as PGTEI? Project Gutenberg is a
>>>>> very serious and popular initiative, and providing support to it will
>>>>> benefit both sides.
>>>>>
>>>>> I can't recall PG being mentioned at TEI-MMs or on TEI-L (may have
>>>>> missed something obvious though). PGTEI appears to be a derived format
>>>>> instead of being an ODD customization -- perhaps all is not lost and we
>>>>> can provide support for PG, in return enlarging the community coverage,
>>>>> with all the related benefits.
>>>>>
>>>>> * http://pgtei.pglaf.org/marcello/0.4/doc/20000-h.html
>>>>> * http://www.gutenbergnews.org/20070402/what-is-pg-tei-and-why-is-it-being-developed/
>>>>> * https://www.stanford.edu/~mjockers/cgi-bin/drupal/node/49
>>>>>
>>>>>   P.
>>>>> _______________________________________________
>>>>> tei-council mailing list
>>>>> tei-council at lists.village.Virginia.EDU
>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>
>>>>> PLEASE NOTE: postings to this list are publicly archived
>>>>
>>>> Laurent Romary
>>>> INRIA & HUB-IDSL
>>>> laurent.romary at inria.fr
>>
>> Laurent Romary
>> INRIA & HUB-IDSL
>> laurent.romary at inria.fr

