[tei-council] Fwd: [tei-board] Report from Google engineer about progress with TEI
kevin.s.hawkins at ultraslavonic.info
Fri Aug 19 20:43:20 EDT 2011
Martin, I'm glad you are impressed. I was too, whereas some colleagues
of mine (who shall remain nameless) found many errors and think the texts
aren't so useful. I, however, agree with you.
The value of <idno> is a unique identifier in Google Books, and you will
find it in the URL of that book online.
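For what it's worth, here is a minimal Python sketch of how one might go from the header to the book online. It assumes the samples use the standard TEI namespace and that <idno> sits inside <publicationStmt>, as the fragment Martin quoted suggests; the sample header and the function name are made up for illustration, and the URL pattern is the one commonly seen for Google Books volume pages rather than anything Google has documented for these files.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard TEI P5 namespace (assumed to be what the generated files use).
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, heavily trimmed header built around the fragment quoted in this thread.
SAMPLE_HEADER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample</title></titleStmt>
      <publicationStmt>
        <publisher>Google Inc.</publisher>
        <idno>gA8UAAAAQAAJ</idno>
        <date when="2011-08-10"/>
      </publicationStmt>
      <sourceDesc><p>Placeholder</p></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def google_books_url(tei_xml: str) -> str:
    """Read <idno> from the publicationStmt and build a Google Books URL from it."""
    root = ET.fromstring(tei_xml)
    idno = root.find(f".//{{{TEI_NS}}}publicationStmt/{{{TEI_NS}}}idno")
    if idno is None or not (idno.text or "").strip():
        raise ValueError("no <idno> found in publicationStmt")
    # Assumed URL pattern: the volume ID is passed as the "id" query parameter.
    return "https://books.google.com/books?" + urlencode({"id": idno.text.strip()})


print(google_books_url(SAMPLE_HEADER))
```

If the real files place <idno> elsewhere in the header, only the path in the find() call would need to change.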
Ranjith will be grateful to hear your additional comments, so I
encourage you to pass them along.
GRIN is the "Google Return Interface" used by Google Books partner
libraries that have the right to (and interest in) retrieving content
digitized from their collection. It's how content scanned by Google
gets into HathiTrust. Here is some outdated information:
On 8/19/11 11:40 AM, Martin Holmes wrote:
> I don't know about anyone else, but personally I find the quality of
> these pretty remarkable. The headers look good, the documents validate,
> and there's considerable sophistication in the process -- poetry is
> identified as such, and encoded with line-groups, as opposed to prose.
> I'd like to see an XML declaration at the beginning, and perhaps some
> more detailed metadata:
> <publisher>Google Inc.</publisher>
> <idno>gA8UAAAAQAAJ</idno> <<< What does this idno mean?
> How could it be used to access
> the document?
> <date when="2011-08-10"/>
> It would be nice to see info in the header on how the XML was created,
> and whether it has undergone any human proofing or editing.
> I don't know what GRIN is, and I couldn't find much useful info on it --
> is anyone familiar with it?
> On 11-08-19 12:18 AM, Laurent Romary wrote:
>> Council: see the message below, which is a follow-up to some technical feedback from Google that we already discussed. Please provide your views on this, and possibly volunteer if you want to be the council contact for this collaboration.
>> Begin forwarded message:
>>> From: Martin Mueller<martinmueller at northwestern.edu>
>>> Date: 11 August 2011 04:04:59 CEST
>>> To: "tei-board at lists.village.Virginia.EDU"<tei-board at lists.village.Virginia.EDU>
>>> Subject: [tei-board] Report from Google engineer about progress with TEI
>>> Reply-To: tei-board at lists.village.Virginia.EDU
>>> From: Ranjith Unnikrishnan<ranjith at google.com>
>>> Date: Wed, 10 Aug 2011 18:54:20 -0700
>>> To:<google-library-quality at googlegroups.com>, Jeff Breidenbach<jbreiden at google.com>, Martin Mueller<martinmueller at northwestern.edu>
>>> Subject: TEI samples and open questions
>>> Hello everyone,
>>> To follow up on our discussion yesterday, I've attached the following generated sample TEI files for your feedback. They are loosely in order of decreasing OCR text quality. The variation comes from a number of factors like image quality, complexity of the book structure, as well as the recency and extent of processing. But I'd like to draw your attention to the generated format rather than the text quality at this stage as there are possibilities for exporting our estimates of text quality that we can discuss separately.
>>> dickens.tei (Google books ID i8_u_-YmG4MC)
>>> gullivers_travels.tei (Google books ID srVbAAAAQAAJ)
>>> shamela_andrews.tei (Google books ID zNsNAAAAQAAJ)
>>> scandal.tei (Google books ID i3lbAAAAQAAJ)
>>> dunciad.tei (Google books ID gA8UAAAAQAAJ)
>>> The files were validated using the latest candidate release RNC schema files that follow the TEI best practices guide for libraries at the "Level 3" encoding. Our intention is to supply generated TEI files for our processed volumes via GRIN or some other interface so that you can then disseminate them as you wish to interested humanities scholars. The TEI users and members of the TEI standards body that we've been corresponding with over the past months seem pleased with the samples they've seen, and from the quality of generated output feel they would make a decent starting point for further manual annotation and enrichment.
>>> I'd like to get your feedback on:
>>> (i) whether and how to restrict the set of volumes for which we generate TEI files, e.g. restriction by language, a quality threshold over the document using something like Ashok's text scorer, only public domain books, etc. Or maybe this should be library-specific?
>>> (ii) whether to use GRIN as the interface to provide these files, and
>>> (iii) whether and how to make an entry in the METS XML file for the generated TEI file to accompany the GRIN package, and what other conventions (e.g. file naming) should be followed for that.
>>> tei-board mailing list
>>> tei-board at lists.village.Virginia.EDU
>> Laurent Romary
>> INRIA & HUB-IDSL
>> laurent.romary at inria.fr
>> tei-council mailing list
>> tei-council at lists.village.Virginia.EDU
>> PLEASE NOTE: postings to this list are publicly archived