[tei-council] Fwd: Fwd: Re: Google Books > TEI

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Sun Apr 21 19:20:10 EDT 2013


As requested in Providence, I am forwarding the samples for anyone who 
is interested in reviewing the quality of these.  Below are the IDs for 
each in case you want to compare against the equivalent page images in 
Google Books:

dickens.tei -- i8_u_-YmG4MC
gullivers_travels.tei -- srVbAAAAQAAJ


-------- Original Message --------
Subject: 	Fwd: Re: Google Books > TEI
Date: 	Thu, 28 Jun 2012 00:12:48 -0400
From: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
To: 	tei-council at lists.village.virginia.edu



All,

Now that the latest release is behind us, I'd like to follow up on a few
things I've promised you but which wouldn't have contributed to
getting through bug fixes and feature requests in time for the release.

First of all, in Ann Arbor we agreed that we would ask Ranjith, our
contact at Google, for the latest samples so we can calculate some
statistics on accuracy and encourage Google towards making this format
public.  See the two attachments and the correspondence below.

Our agenda is a bit vague on responsibility here.  I've asked for
samples, but I think others will want to check for accuracy of encoding.

Kevin

-------- Original Message --------
Subject: 	Re: Google Books > TEI
Date: 	Mon, 23 Apr 2012 15:30:38 -0700
From: 	Ranjith Unnikrishnan <ranjith at google.com>
To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
CC: 	James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr



Yes, the last round of feedback I got was around that time frame, and
came from both your group and another working group that included
some of our library partners. I had incorporated the two sets of
feedback into some improvements to the algorithm, but they were mostly
related to style and had nothing to do with producing new output tags or
such. Comments that were not addressed required improvements to existing
text structure analysis algorithms that were at least partly based on
the quality of obtained OCR text. Both of these are active research
topics that are always on our agenda but are not quick fixes.
I've attached the latest Dickens and Gulliver's Travels files, and can
generate TEI files for others if you can send me links to their
corresponding pages on Google Books.



On Mon, Apr 23, 2012 at 2:53 PM, Kevin Hawkins
<kevin.s.hawkins at ultraslavonic.info> wrote:

     The last round of reviews I have is a sample of Dickens from July
     27, 2011.  I have earlier versions of other titles, but they aren't
     worth consulting at this point since you've improved other things.
     Has the algorithm changed since then?  It would be nice to have
     the latest version of not only Dickens but also of some other works
     in the public domain: perhaps an old bound volume of a journal and a
     non-fiction book?  Thanks.


     On 4/23/2012 12:50 PM, Ranjith Unnikrishnan wrote:

         Hi Kevin,

         I have not made any changes to the TEI generation algorithm since
         our last round of reviews within the group. I've since diverted my
         energy towards getting buy-in and making progress on making the TEI
         files available on the Books site. I've had some success but it's
         still early days. Until this is launched, I don't plan to work on
         increasing the TEI markup depth or anything else related to the
         process apart from fixing any bugs that may arise during large-scale
         testing.

         ~R


         On Thu, Apr 19, 2012 at 6:05 PM, Kevin Hawkins
         <kevin.s.hawkins at ultraslavonic.info> wrote:

             Hi Ranjith,

             While we wait on your colleagues to integrate the code for
             generating TEI into your production pipeline (we hope our
             Google Docs brainstorming has helped you make that case), the
             TEI Technical Council is thinking about how we might publicize
             the availability of TEI documents in Google Books -- when that
             day comes -- and what it might mean for our community.  The
             sort of message we would promote depends on the depth and
             consistency of the markup that Google is able to create. Would
             you be able to provide us with some samples generated by the
             latest version of your code?  (We saw early drafts, but I'm not
             sure that I have any of the final versions.)  I'd like to share
             them with the Technical Council.

             Thanks,

             Kevin


             On 2/6/12 12:24 PM, Ranjith Unnikrishnan wrote:

                 Hi Kevin,

                 The code has been reviewed and checked in but we're still
                 working on some questions related to integrating the code
                 in our production pipeline. There's probably not much you
                 can help with at this point, but I might need some input
                 from you guys as we get closer to deploying.

                 Sorry this is not moving as fast as I'd like; this effort
                 is one of my independent "20%" projects, and those have a
                 general tendency of getting pushed down the priority list
                 in light of more urgent tasks. So don't hesitate to check
                 up once in a while.

                 ~Ranjith



-------------- next part --------------
A non-text attachment was scrubbed...
Name: dickens.tei
Type: application/octet-stream
Size: 284908 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20130421/bfa66a99/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gullivers_travels.tei
Type: application/octet-stream
Size: 278754 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20130421/bfa66a99/attachment-0003.obj 

