[tei-council] Fwd: Re: Google Books > TEI

Thu Jun 28 12:20:11 EDT 2012

On 12-06-27 09:12 PM, Kevin Hawkins wrote:
> All,
>
> Now that the latest release is behind us, I'd like to follow up on a few
> things I've promised you but which wouldn't have helped us get through
> bug fixes and feature requests in time for the release.
>
> First of all, in Ann Arbor we agreed that we would ask Ranjith, our
> contact at Google, for the latest samples so we can calculate some
> statistics on accuracy and encourage Google towards making this format
> public.  See the two attachments and the correspondence below.

Does anyone have any experience in calculating the accuracy of OCR and 
automated markup? Do we do errors-per-page? Is a word either wrong or 
right, or do we count errors inside words? Do we count missing or 
misplaced column or page breaks as errors?
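
One way to make the word-versus-character question concrete is to compute
both rates from an edit distance between the hand-corrected text and the
OCR output. The sketch below is only an illustration: the function names
are made up for this example, whitespace tokenization is an assumption,
and nothing here reflects how Google measures accuracy.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edits per reference unit; can exceed 1.0 if hyp is much longer."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def char_error_rate(ref_text: str, ocr_text: str) -> float:
    # Counts errors inside words: every wrong character is one error.
    return error_rate(ref_text, ocr_text)


def word_error_rate(ref_text: str, ocr_text: str) -> float:
    # A word is either right or wrong (assumes whitespace tokenization).
    return error_rate(ref_text.split(), ocr_text.split())


if __name__ == "__main__":
    ref = "the quick brown fox"
    ocr = "the quick brown fax"
    print(word_error_rate(ref, ocr))  # one wrong word out of four
    print(char_error_rate(ref, ocr))  # one wrong character out of nineteen
```

Because edit_distance works on any sequence, column or page breaks could
be counted by putting a marker token into the word sequence at each
break, so that a missing or misplaced break costs one edit. Whether that
is the right weighting is an open question.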

Presumably we'll need to create "perfect" hand-crafted versions of a set 
of sample pages in order to do the accuracy calculation. How many do we 
need to get a reasonable sample?
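
As a rough starting point for the sample-size question, the textbook
formula for estimating a proportion is n = z^2 * p * (1 - p) / e^2.
The sketch below applies it under strong assumptions: it counts words,
not pages, and treats words as independently sampled. OCR errors tend to
cluster on bad pages, so the real requirement is probably larger. The
function name and example figures are illustrative only.

```python
import math
from statistics import NormalDist


def sample_size(expected_error_rate: float, margin: float,
                confidence: float = 0.95) -> int:
    """Number of independently sampled units (e.g. words) needed to
    estimate an error rate to within +/- margin at the given confidence."""
    if not 0 < expected_error_rate < 1:
        raise ValueError("expected_error_rate must be between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = expected_error_rate
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    # Guessing a 5% word error rate, estimated to within one point.
    print(sample_size(0.05, 0.01))
    # Worst case (p = 0.5) with a five-point margin.
    print(sample_size(0.5, 0.05))
```

Dividing the resulting word count by a typical words-per-page figure
gives a lower bound on the number of pages to correct by hand.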

Cheers,
Martin

> Our agenda is a bit vague on responsibility here.  I've asked for
> samples, but I think others will want to check for accuracy of encoding.
>
> Kevin
>
> -------- Original Message --------
> Subject: 	Re: Google Books > TEI
> Date: 	Mon, 23 Apr 2012 15:30:38 -0700
> From: 	Ranjith Unnikrishnan <ranjith at google.com>
> To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
> CC: 	James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr
>
>
>
> Yes, the last round of feedback I got was around that time frame, and
> came both from your group as well as another working group that included
> some of our library partners. I had incorporated the two sets of
> feedback into some improvements to the algorithm, but they were mostly
> related to style and had nothing to do with producing new output tags or
> such. Comments that were not addressed required improvements to existing
> text structure analysis algorithms that were at least partly based on
> the quality of obtained OCR text. Both of these are active research
> topics that are always on our agenda but are not quick fixes.
> I've attached the latest Dickens and Gulliver's Travels files, and can
> generate TEI files for others if you can send me links to their
> corresponding pages on Google Books.
>
>
>
> On Mon, Apr 23, 2012 at 2:53 PM, Kevin Hawkins
> <kevin.s.hawkins at ultraslavonic.info> wrote:
>
>      The last round of reviews I have is a sample of Dickens from July
>      27, 2011.  I have earlier versions of other titles, but they aren't
>      worth consulting at this point since you've improved other things.
>      Has the algorithm changed since then?  It would be nice to have
>      the latest version of not only Dickens but also of some other works
>      in the public domain: perhaps an old bound volume of a journal and a
>      non-fiction book?  Thanks.
>
>
>      On 4/23/2012 12:50 PM, Ranjith Unnikrishnan wrote:
>
>          Hi Kevin,
>
>          I have not made any changes to the TEI generation algorithm
>          since our last round of reviews within the group. I've since
>          diverted my energy towards getting buy-in and making progress
>          on making the TEI files available on the Books site. I've had
>          some success but it's still early days. Until this is
>          launched, I don't plan to work on increasing the TEI markup
>          depth or anything else related to the process apart from
>          fixing any bugs that may arise during large-scale testing.
>
>          ~R
>
>
>          On Thu, Apr 19, 2012 at 6:05 PM, Kevin Hawkins
>          <kevin.s.hawkins at ultraslavonic.info> wrote:
>
>              Hi Ranjith,
>
>              While we wait on your colleagues to integrate the code for
>              generating TEI into your production pipeline (a case we hope
>              our Google Docs brainstorming has helped you make), the TEI
>              Technical Council is thinking about how we might publicize the
>              availability of TEI documents in Google Books -- when that
>              day comes -- and what it might mean for our community.  The
>              sort of message we would promote depends on the depth and
>              consistency of the markup that Google is able to create.
>              Would you be able to provide us with some of the samples
>              generated by the latest version of your code?  (We saw early
>              drafts, but I'm not sure that I have any of the final
>              versions.)  I'd like to share them with the Technical Council.
>
>              Thanks,
>
>              Kevin
>
>
>              On 2/6/12 12:24 PM, Ranjith Unnikrishnan wrote:
>
>                  Hi Kevin,
>
>                  The code has been reviewed and checked in but we're
>                  still working on some questions related to integrating
>                  the code in our production pipeline. There's probably
>                  not much you can help with at this point, but I might
>                  need some input from you guys as we get closer to
>                  deploying.
>
>                  Sorry this is not moving as fast as I'd like; this
>                  effort is one of my independent "20%" projects, and
>                  those have a general tendency of getting pushed down
>                  the priority list in light of more urgent tasks. So
>                  don't hesitate to check up once in a while.
>
>                  ~Ranjith
>