[tei-council] Fwd: [tei-board] Report from Google engineer about progress with TEI
laurent.romary at inria.fr
Fri Aug 19 03:18:17 EDT 2011
Council. See the message below, which is a follow-up to some technical feedback from Google that we already discussed. Please share your views on this, and volunteer if you would like to be the Council contact for this collaboration.
Begin forwarded message:
> From: Martin Mueller <martinmueller at northwestern.edu>
> Date: 11 August 2011 04:04:59 CEST
> To: "tei-board at lists.village.Virginia.EDU" <tei-board at lists.village.Virginia.EDU>
> Subject: [tei-board] Report from Google engineer about progress with TEI
> Reply-To: tei-board at lists.village.Virginia.EDU
> From: Ranjith Unnikrishnan <ranjith at google.com>
> Date: Wed, 10 Aug 2011 18:54:20 -0700
> To: <google-library-quality at googlegroups.com>, Jeff Breidenbach <jbreiden at google.com>, Martin Mueller <martinmueller at northwestern.edu>
> Subject: TEI samples and open questions
> Hello everyone,
> To follow up on our discussion yesterday, I've attached the following generated sample TEI files for your feedback. They are loosely in order of decreasing OCR text quality. The variation comes from a number of factors like image quality, complexity of the book structure, as well as the recency and extent of processing. But I'd like to draw your attention to the generated format rather than the text quality at this stage as there are possibilities for exporting our estimates of text quality that we can discuss separately.
> dickens.tei (Google books ID i8_u_-YmG4MC)
> gullivers_travels.tei (Google books ID srVbAAAAQAAJ)
> shamela_andrews.tei (Google books ID zNsNAAAAQAAJ)
> scandal.tei (Google books ID i3lbAAAAQAAJ)
> dunciad.tei (Google books ID gA8UAAAAQAAJ)
> The files were validated using the latest candidate release RNC schema files that follow the TEI best practices guide for libraries at the "Level 3" encoding. Our intention is to supply generated TEI files for our processed volumes via GRIN or some other interface so that you can then disseminate them as you wish to interested humanities scholars. The TEI users and members of the TEI standards body that we've been corresponding with over the past months seem pleased with the samples they've seen, and from the quality of generated output feel they would make a decent starting point for further manual annotation and enrichment.
> I'd like to get your feedback on:
> (i) whether and how to restrict the set of volumes for which we generate TEI files, e.g. restriction by language, a quality threshold over the document using something like Ashok's text scorer, only public domain books, etc. Or maybe this should be library-specific?
> (ii) whether to use GRIN as the interface to provide these files, and
> (iii) whether and how to make an entry in the METS XML file for the generated TEI file to accompany the GRIN package, and what other conventions (e.g. file naming) should be followed for that.
INRIA & HUB-IDSL
laurent.romary at inria.fr