[tei-council] Fwd: [tei-board] Report from Google engineer about progress with TEI

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Thu Aug 25 09:31:50 EDT 2011


So far Ranjith has been quite willing to ask questions of, and provide 
steadily improved versions of his sample texts to, anyone who has ever 
responded to his emails.  So direct correspondence with him seems to 
be working.  The only risk I see is that people will give contradictory 
advice, but that same risk exists if we have a list, where we 
might end up debating encoding while this poor engineer looks on haplessly.

--Kevin

On 8/25/2011 8:44 AM, Laurent Romary wrote:
> Thanks,
> I would say: keep checking their output and see how we can help Google provide texts useful to a wide community of scholars.
> Laurent
>
> On 25 August 2011 at 14:41, Martin Holmes wrote:
>
>> I'm tempted to volunteer, but I don't really know the history of this project, nor am I a librarian, or affiliated with any of the institutions who will be receiving the texts. What tasks would the subgroup take on?
>>
>> Cheers,
>> Martin
>>
>> On 11-08-25 12:41 AM, Laurent Romary wrote:
>>> Dear all,
>>> Anyone else with views on this subject? Shall we create a temporary subgroup on this to optimize the communication with Google? Volunteers?
>>> Cheers,
>>> Laurent
>>>
>>> On 20 August 2011 at 02:43, Kevin Hawkins wrote:
>>>
>>>> Martin, I'm glad you are impressed.  I was too, whereas some colleagues
>>>> of mine (who shall remain nameless) found many errors and think the texts
>>>> aren't so useful.  I, however, agree with you.
>>>>
>>>> The value of <idno> is a unique identifier in Google Books, and you will
>>>> find it in the URL of that book online.
>>>>
>>>> Ranjith will be grateful to hear your additional comments, so I
>>>> encourage you to pass them along.
>>>>
>>>> GRIN is the "Google Return Interface" used by Google Books partner
>>>> libraries that have the right to (and interest in) retrieving content
>>>> digitized from their collection.  It's how content scanned by Google
>>>> gets into HathiTrust.  Here is some outdated information:
>>>>
>>>> http://old.diglib.org/forums/fall2006/presentations/powell-2006-11.pdf
>>>>
>>>> On 8/19/11 11:40 AM, Martin Holmes wrote:
>>>>> I don't know about anyone else, but personally I find the quality of
>>>>> these pretty remarkable. The headers look good, the documents validate,
>>>>> and there's considerable sophistication in the process -- poetry is
>>>>> identified as such, and encoded with line-groups, as opposed to prose.
>>>>> I'd like to see an XML declaration at the beginning, and perhaps some
>>>>> more detailed metadata:
>>>>>
>>>>> <publicationStmt>
>>>>> <publisher>Google Inc.</publisher>
>>>>>
>>>>> <idno>gA8UAAAAQAAJ</idno>              <<<    What does this idno mean?
>>>>>                                           How could it be used to access
>>>>>                                           the document?
>>>>> <date when="2011-08-10"/>
>>>>> </publicationStmt>
>>>>>
>>>>> It would be nice to see info in the header on how the XML was created,
>>>>> and whether it has undergone any human proofing or editing.
>>>>>
>>>>> I don't know what GRIN is, and I couldn't find much useful info on it --
>>>>> is anyone familiar with it?
>>>>>
>>>>> Cheers,
>>>>> Martin
>>>>>
>>>>>
>>>>> On 11-08-19 12:18 AM, Laurent Romary wrote:
>>>>>> Council, see the message below, which is a follow-up on some technical feedback from Google that we already discussed. Please provide your views on this, and possibly volunteer if you want to be the council contact for this collaboration.
>>>>>> Laurent
>>>>>>
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>>> From: Martin Mueller <martinmueller at northwestern.edu>
>>>>>>> Date: 11 August 2011 04:04:59 CEST
>>>>>>> To: "tei-board at lists.village.Virginia.EDU" <tei-board at lists.village.Virginia.EDU>
>>>>>>> Subject: [tei-board] Report from Google engineer about progress with TEI
>>>>>>> Reply-To: tei-board at lists.village.Virginia.EDU
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Ranjith Unnikrishnan <ranjith at google.com>
>>>>>>> Date: Wed, 10 Aug 2011 18:54:20 -0700
>>>>>>> To: <google-library-quality at googlegroups.com>, Jeff Breidenbach <jbreiden at google.com>, Martin Mueller <martinmueller at northwestern.edu>
>>>>>>> Subject: TEI samples and open questions
>>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> To follow up on our discussion yesterday, I've attached the following generated sample TEI files for your feedback. They are loosely in order of decreasing OCR text quality. The variation comes from a number of factors like image quality, complexity of the book structure, as well as the recency and extent of processing. But I'd like to draw your attention to the generated format rather than the text quality at this stage as there are possibilities for exporting our estimates of text quality that we can discuss separately.
>>>>>>>
>>>>>>> dickens.tei (Google Books ID i8_u_-YmG4MC)
>>>>>>> gullivers_travels.tei (Google Books ID srVbAAAAQAAJ)
>>>>>>> shamela_andrews.tei (Google Books ID zNsNAAAAQAAJ)
>>>>>>> scandal.tei (Google Books ID i3lbAAAAQAAJ)
>>>>>>> dunciad.tei (Google Books ID gA8UAAAAQAAJ)
>>>>>>>
>>>>>>> The files were validated using the latest candidate release RNC schema files that follow the TEI best practices guide for libraries at the "Level 3" encoding. Our intention is to supply generated TEI files for our processed volumes via GRIN or some other interface so that you can then disseminate them as you wish to interested humanities scholars. The TEI users and members of the TEI standards body that we've been corresponding with over the past months seem pleased with the samples they've seen, and from the quality of generated output feel they would make a decent starting point for further manual annotation and enrichment.
>>>>>>>
>>>>>>> I'd like to get your feedback on:
>>>>>>> (i) whether and how to restrict the set of volumes for which we generate TEI files, e.g. restriction by language, a quality threshold over the document using something like Ashok's text scorer, only public domain books, etc. Or maybe this should be library-specific?
>>>>>>> (ii) whether to use GRIN as the interface to provide these files, and
>>>>>>> (iii) whether and how to make an entry in the METS XML file for the generated TEI file to accompany the GRIN package, and what other conventions (e.g. file naming) should be followed for that.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ranjith
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> tei-board mailing list
>>>>>>> tei-board at lists.village.Virginia.EDU
>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-board
>>>>>>
>>>>>> Laurent Romary
>>>>>> INRIA & HUB-IDSL
>>>>>> laurent.romary at inria.fr
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> tei-council mailing list
>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>
>>>>>> PLEASE NOTE: postings to this list are publicly archived.
>>>>>>
>>>>>
>>>
>>> Laurent Romary
>>> INRIA & HUB-IDSL
>>> laurent.romary at inria.fr
>
> Laurent Romary
> INRIA & HUB-IDSL
> laurent.romary at inria.fr