[tei-council] Fwd: Re: Google Books > TEI

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Thu Jun 28 00:12:48 EDT 2012


All,

Now that the latest release is behind us, I'd like to follow up on a few 
things I promised you but which wouldn't have contributed to 
getting through bug fixes and feature requests in time for the release.

First of all, in Ann Arbor we agreed that we would ask Ranjith, our 
contact at Google, for the latest samples so we can calculate some 
statistics on accuracy and encourage Google towards making this format 
public.  See the two attachments and the correspondence below.

Our agenda is a bit vague on responsibility here.  I've asked for 
samples, but I think others will want to check for accuracy of encoding.

Kevin

-------- Original Message --------
Subject: 	Re: Google Books > TEI
Date: 	Mon, 23 Apr 2012 15:30:38 -0700
From: 	Ranjith Unnikrishnan <ranjith at google.com>
To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
CC: 	James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr



Yes, the last round of feedback I got was around that time frame, and 
came from both your group and another working group that included 
some of our library partners. I had incorporated the two sets of 
feedback into some improvements to the algorithm, but they were mostly 
related to style and had nothing to do with producing new output tags or 
such. Comments that were not addressed required improvements to existing 
text structure analysis algorithms that were at least partly based on 
the quality of obtained OCR text. Both of these are active research 
topics that are always on our agenda but are not quick fixes.
I've attached the latest Dickens and Gulliver's Travels files, and can 
generate TEI files for others if you can send me links to their 
corresponding pages on Google Books.



On Mon, Apr 23, 2012 at 2:53 PM, Kevin Hawkins 
<kevin.s.hawkins at ultraslavonic.info> wrote:

    The last round of reviews I have is a sample of Dickens from July
    27, 2011.  I have earlier versions of other titles, but they aren't
    worth consulting at this point since you've improved other things.
    Has the algorithm changed since then?  It would be nice to have
    the latest version of not only Dickens but also of some other works
    in the public domain: perhaps an old bound volume of a journal and a
    non-fiction book?  Thanks.


    On 4/23/2012 12:50 PM, Ranjith Unnikrishnan wrote:

        Hi Kevin,

        I have not made any changes to the TEI generation algorithm since our
        last round of reviews within the group. I've since diverted my energy
        towards getting buy-in and making progress on making the TEI files
        available on the Books site. I've had some success but it's still early
        days. Until this is launched, I don't plan to work on increasing the TEI
        markup depth or anything else related to the process apart from fixing
        any bugs that may arise during large-scale testing.

        ~R


        On Thu, Apr 19, 2012 at 6:05 PM, Kevin Hawkins
        <kevin.s.hawkins at ultraslavonic.info> wrote:

            Hi Ranjith,

            While we wait on your colleagues to integrate the code for
            generating TEI into your production pipeline (we hope our
            Google Docs brainstorming has helped you make that case), the TEI
            Technical Council is thinking about how we might publicize the
            availability of TEI documents in Google Books -- when that day comes
            -- and what it might mean for our community.  The sort of message we
            would promote depends on the depth and consistency of the markup
            that Google is able to create. Would you be able to provide us with
            some samples generated by the latest version of your code?  (We
            saw early drafts, but I'm not sure that I have any of the final
            versions.)  I'd like to share them with the Technical Council.

            Thanks,

            Kevin


            On 2/6/12 12:24 PM, Ranjith Unnikrishnan wrote:

                Hi Kevin,

                The code has been reviewed and checked in but we're still
                working on some questions related to integrating the code in
                our production pipeline. There's probably not much you can
                help with at this point, but I might need some input from you
                guys as we get closer to deploying.

                Sorry this is not moving as fast as I'd like; this effort is
                one of my independent "20%" projects, and those have a general
                tendency of getting pushed down the priority list in light of
                more urgent tasks. So don't hesitate to check up once in a
                while.

                ~Ranjith

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dickens.tei
Type: application/octet-stream
Size: 284908 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20120628/1d5efce7/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gullivers_travels.tei
Type: application/octet-stream
Size: 278754 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20120628/1d5efce7/attachment-0003.obj 

