[tei-council] Fwd: Fwd: Re: Google Books > TEI
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Sun Apr 21 19:20:10 EDT 2013
As requested in Providence, I am forwarding the samples for anyone who
is interested in reviewing the quality of these. Below are the IDs for
each in case you want to compare against the equivalent page images in
Google Books:
dickens.tei -- i8_u_-YmG4MC
gullivers_travels.tei -- srVbAAAAQAAJ
-------- Original Message --------
Subject: Fwd: Re: Google Books > TEI
Date: Thu, 28 Jun 2012 00:12:48 -0400
From: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
To: tei-council at lists.village.virginia.edu
All,
Now that the latest release is behind us, I'd like to follow up on a few
things I've promised you but which wouldn't have contributed to
getting through bug fixes and feature requests in time for the release.
First of all, in Ann Arbor we agreed that we would ask Ranjith, our
contact at Google, for the latest samples so we can calculate some
statistics on accuracy and encourage Google towards making this format
public. See the two attachments and the correspondence below.
Our agenda is a bit vague on responsibility here. I've asked for
samples, but I think others will want to check for accuracy of encoding.
Kevin
-------- Original Message --------
Subject: Re: Google Books > TEI
Date: Mon, 23 Apr 2012 15:30:38 -0700
From: Ranjith Unnikrishnan <ranjith at google.com>
To: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
CC: James.Cummings at oucs.ox.ac.uk, mholmes at uvic.ca, laurent.romary at inria.fr
Yes, the last round of feedback I got was around that time frame, and
came from both your group and another working group that included
some of our library partners. I had incorporated the two sets of
feedback into some improvements to the algorithm, but they were mostly
related to style and had nothing to do with producing new output tags or
such. Comments that were not addressed required improvements to existing
text structure analysis algorithms that were at least partly based on
the quality of obtained OCR text. Both of these are active research
topics that are always on our agenda but are not quick fixes.
I've attached the latest Dickens and Gulliver's Travels files, and can
generate TEI files for others if you can send me links to their
corresponding pages on Google Books.
On Mon, Apr 23, 2012 at 2:53 PM, Kevin Hawkins
<kevin.s.hawkins at ultraslavonic.info> wrote:
The last round of reviews I have is a sample of Dickens from July
27, 2011. I have earlier versions of other titles, but they aren't
worth consulting at this point since you've improved other things.
Has the algorithm changed since then? It would be nice to have
the latest version of not only Dickens but also of some other works
in the public domain: perhaps an old bound volume of a journal and a
non-fiction book? Thanks.
On 4/23/2012 12:50 PM, Ranjith Unnikrishnan wrote:
Hi Kevin,
I have not made any changes to the TEI generation algorithm since our
last round of reviews within the group. I've since diverted my energy
towards getting buy-in and making progress on making the TEI files
available on the Books site. I've had some success but it's still early
days. Until this is launched, I don't plan to work on increasing the TEI
markup depth or anything else related to the process apart from fixing
any bugs that may arise during large-scale testing.
~R
On Thu, Apr 19, 2012 at 6:05 PM, Kevin Hawkins
<kevin.s.hawkins at ultraslavonic.info> wrote:
Hi Ranjith,
While we wait on your colleagues to integrate the code for
generating TEI into your production pipeline (we hope our Google Docs
brainstorming has helped you make that case), the TEI Technical
Council is thinking about how we might publicize the availability of
TEI documents in Google Books -- when that day comes -- and what it
might mean for our community. The sort of message we would promote
depends on the depth and consistency of the markup that Google is able
to create. Would you be able to provide us with some samples generated
by the latest version of your code? (We saw early drafts, but I'm not
sure that I have any of the final versions.) I'd like to share them
with the Technical Council.
Thanks,
Kevin
On 2/6/12 12:24 PM, Ranjith Unnikrishnan wrote:
Hi Kevin,
The code has been reviewed and checked in but we're still working on
some questions related to integrating the code in our production
pipeline. There's probably not much you can help with at this point,
but I might need some input from you guys as we get closer to
deploying.
Sorry this is not moving as fast as I'd like; this effort is one of my
independent "20%" projects, and those have a general tendency of
getting pushed down the priority list in light of more urgent tasks.
So don't hesitate to check up once in a while.
~Ranjith
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dickens.tei
Type: application/octet-stream
Size: 284908 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20130421/bfa66a99/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gullivers_travels.tei
Type: application/octet-stream
Size: 278754 bytes
Desc: not available
Url : http://lists.village.Virginia.EDU/pipermail/tei-council/attachments/20130421/bfa66a99/attachment-0003.obj