[tei-council] FW: first stab at Google > TEI
martinmueller at northwestern.edu
Sat Jun 25 17:50:22 EDT 2011
This seems to me an exemplary statement of rapid progress, and I applaud
you for it.
I particularly like the very clear priorities at the end of your memo,
most particularly the goal of aiming at "sufficiently high quality that
files are not burdensome for the community and enrich." Making significant
progress towards that goal will be a great benefit to scholars and other
users all over the world.
With thanks and hopes for more
Chair, TEI Board of Directors
On 6/24/11 9:55 PM, "Ranjith Unnikrishnan" <ranjith at google.com> wrote:
>Thanks again for your input and suggestions. I'd like to categorize the
>feedback thus far into those related to correctness of the output file with
>respect to the TEI encoding, and those related to the richness of the
>generated output. It sounds like the group is satisfied on the former for
>the initial test book. The current convertor code makes no
>assumptions, but given that page layout analysis and OCR in general tend to
>be harder on books in non-Latin based languages and multi-lingual books
>(particularly if the languages span scripts), there may be possible failure
>modes to be discovered there. I'm in the process of testing other more
>complex books based on your suggestions. There's a small checklist of
>improvements I'd like to make to the code before sending out the next set
>of TEI output files for you to look at. These include things like better
>handling of footnotes and verse (when we can detect them from the OCR
>output). Do bear with me while I make time to work on these.
>Regarding the richness of the output, as the group pointed out, there are
>several areas of improvement like chapter delineation, document structure
>extraction, as well as category-specific improvements like speaker
>identification in plays. These are obviously hard problems to solve for a
>highly varied universe of books, and we're working on them as best as we can,
>using many of the signals you suggested. Google also awards research grants
>to institutions to help us and the larger digital humanities community in
>these efforts. However, the quality of our TEI output for a book will
>reflect the extent to which we are currently able to extract useful OCR
>structure information from it. So for example, failure to delineate the
>section of footnotes in a page could result in loss of paragraph
>continuation, and failure to detect a paragraph of text as a verse would
>result in the output text not respecting line-breaks, and so on.
>These conditions require making a time-quality-coverage tradeoff, as
>suggests. I'd like to make it by defining my current scope of work to be
>providing TEI output that (i) respects the TEI specification and best
>practices guide, (ii) faithfully translates every piece of OCR and
>structure information that we can currently extract from books, and (iii) is
>of sufficiently high quality that the files are not burdensome for the
>community to curate and enrich. The last criterion, in particular, will depend
>heavily on your feedback and judgment.
>I should also mention that the team I work with is in the early stages of
>taking a fresh look at our structure extraction algorithms. We aim to
>steadily improve their quality and coverage, and they should automatically
>translate to richer TEI output over time.
>On Wed, Jun 22, 2011 at 9:14 PM, stuart yeates
><stuart.yeates at vuw.ac.nz>wrote:
>> On 23/06/11 08:20, stuart yeates wrote:
>>> PS: Apologies if my choice of book to test on was poor. :) It was a
>>>> random selection. If you have suggestions of alternate public domain books,
>>>> I'm happy to try and convert them and send their TEI files over.
>>> By picking a linear fiction work you made your life easier.
>>> Picking a formally-structured non-fiction work (cyclopaedia, almanac,
>>> etc) will provide a challenge. Linear non-fiction (histories,
>>> biographies, etc) with footnotes provide a separate set of challenges
>>> (footnotes, references, etc).
>>> If you're looking for insights into the English / Western assumptions
>>> you're making, I suggest that you do something in Chinese, Japanese or
>>> Korean. Thai is also interesting, because it sits somewhere between
>>> English and C/J/K in terms of conventions. It may be easier to start
>>> with a non-English language that team members read/write.
>> Reflection suggests that my previous answer may not have answered the
>> underlying issue.
>> Your choice of book may or may not have been poor, depending on your
>> purposes in choosing it. As a test of the 'first draft' of a technical
>> solution it was probably a good choice.
>> At some point you have to make some decisions about trade-offs between
>> completeness of coverage of the solution and other technical factors
>> (quality, speed, price, etc). The placement of those trade-offs makes sense
>> only after you've sat down and worked out exactly what it is you're trying
>> to do and what your priorities are. Those trade-offs may lead you to choose
>> further books to test.
>> If you are entirely focused on the kinds of linear popular fiction (novels
>> and serialised novel-like works) which have been consumed en masse in the
>> west for the last 300 years then your current test book (and/or other
>> similar ones) may be all you need. You can now focus on quality, speed,
>> price, etc.
>> The TEI community takes a considerably wider, longer and deeper view,
>> however, and the TEI standards are a product of that view. Sebastian,
>> and myself all made suggestions pointing to a larger universe of documents,
>> but I think it's fair to say that most members of the TEI community have
>> personal examples of documents which are substantially further from
>> Gulliver's Travels than any of the examples given so far.
>> The question to answer is not "Is this a poor test case?" but "How big a
>> universe of documents matters to me?"
>> Stuart Yeates
>> Library Technology Services