[tei-council] Fwd: [tei-board] Report from Google engineer about progress with TEI
mholmes at uvic.ca
Fri Aug 19 11:40:15 EDT 2011
I don't know about anyone else, but personally I find the quality of
these pretty remarkable. The headers look good, the documents validate,
and there's considerable sophistication in the process -- poetry is
identified as such, and encoded with line-groups, as opposed to prose.
I'd like to see an XML declaration at the beginning, and perhaps some
more detailed metadata:
<idno>gA8UAAAAQAAJ</idno> <<< What does this idno mean?
How could it be used to access
It would be nice to see info in the header on how the XML was created,
and whether it has undergone any human proofing or editing.
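For illustration only, here is a minimal Python sketch of the two checks mentioned above: whether a file starts with an XML declaration, and which idno values it carries. The function names and the tiny sample document are hypothetical and are not taken from the Google files; only the idno value is the one quoted above.

```python
import xml.etree.ElementTree as ET


def has_xml_declaration(text):
    """Return True if the document text starts with an XML declaration."""
    # Ignore a leading byte-order mark, if any, before checking.
    return text.lstrip("\ufeff").startswith("<?xml")


def find_idnos(text):
    """Return the text of every idno element, whatever its namespace."""
    root = ET.fromstring(text)
    values = []
    for elem in root.iter():
        # Tags look like "{namespace}idno"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "idno" and elem.text:
            values.append(elem.text.strip())
    return values


# Hypothetical, heavily reduced sample (not a valid TEI document).
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
    '<publicationStmt><idno>gA8UAAAAQAAJ</idno></publicationStmt>'
    '</fileDesc></teiHeader></TEI>'
)

print(has_xml_declaration(sample))
print(find_idnos(sample))
```

This only inspects the file; it says nothing about what the idno identifies or how it resolves.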
I don't know what GRIN is, and I couldn't find much useful info on it --
is anyone familiar with it?
On 11-08-19 12:18 AM, Laurent Romary wrote:
> Council: see the message below, which follows up on some technical feedback from Google that we have already discussed. Please share your views on this, and volunteer if you would like to be the Council contact for this collaboration.
> Begin forwarded message:
>> From: Martin Mueller <martinmueller at northwestern.edu>
>> Date: 11 August 2011 04:04:59 CEST
>> To: "tei-board at lists.village.Virginia.EDU" <tei-board at lists.village.Virginia.EDU>
>> Subject: [tei-board] Report from Google engineer about progress with TEI
>> Reply-To: tei-board at lists.village.Virginia.EDU
>> From: Ranjith Unnikrishnan<ranjith at google.com>
>> Date: Wed, 10 Aug 2011 18:54:20 -0700
>> To:<google-library-quality at googlegroups.com>, Jeff Breidenbach<jbreiden at google.com>, Martin Mueller<martinmueller at northwestern.edu>
>> Subject: TEI samples and open questions
>> Hello everyone,
>> To follow up on our discussion yesterday, I've attached the following generated sample TEI files for your feedback. They are loosely in order of decreasing OCR text quality. The variation comes from a number of factors like image quality, complexity of the book structure, as well as the recency and extent of processing. But I'd like to draw your attention to the generated format rather than the text quality at this stage as there are possibilities for exporting our estimates of text quality that we can discuss separately.
>> dickens.tei (Google books ID i8_u_-YmG4MC)
>> gullivers_travels.tei (Google books ID srVbAAAAQAAJ)
>> shamela_andrews.tei (Google books ID zNsNAAAAQAAJ)
>> scandal.tei (Google books ID i3lbAAAAQAAJ)
>> dunciad.tei (Google books ID gA8UAAAAQAAJ)
>> The files were validated using the latest candidate release RNC schema files that follow the TEI best practices guide for libraries at the "Level 3" encoding. Our intention is to supply generated TEI files for our processed volumes via GRIN or some other interface so that you can then disseminate them as you wish to interested humanities scholars. The TEI users and members of the TEI standards body that we've been corresponding with over the past months seem pleased with the samples they've seen, and from the quality of generated output feel they would make a decent starting point for further manual annotation and enrichment.
>> I'd like to get your feedback on:
>> (i) whether and how to restrict the set of volumes for which we generate TEI files, e.g. restricting by language, applying a quality threshold over the document using something like Ashok's text scorer, or including only public domain books. Or maybe this should be library-specific?
>> (ii) whether to use GRIN as the interface to provide these files, and
>> (iii) whether and how to make an entry in the METS XML file for the generated TEI file to accompany the GRIN package, and what other conventions (e.g. file naming) should be followed for that.
>> tei-board mailing list
>> tei-board at lists.village.Virginia.EDU
> Laurent Romary
> INRIA & HUB-IDSL
> laurent.romary at inria.fr
> tei-council mailing list
> tei-council at lists.village.Virginia.EDU
> PLEASE NOTE: postings to this list are publicly archived
University of Victoria Humanities Computing and Media Centre
(mholmes at uvic.ca)