[tei-council] FW: first stab at Google > TEI

Fri Jun 24 22:55:19 EDT 2011

Hi everyone,

Thanks again for your input and suggestions. I'd like to categorize the
feedback thus far into comments related to correctness of the output file with
respect to the TEI encoding, and comments related to the richness of the
generated output. It sounds like the group is satisfied on the former for
the initial test book. The current converter code makes no language-specific
assumptions, but given that page layout analysis and OCR in general tend to
be harder on books in non-Latin scripts and on multilingual books
(particularly if the languages span scripts), there may be failure
modes still to be discovered there. I'm in the process of testing other more
complex books based on your suggestions. There's a small checklist of
improvements I'd like to make to the code before sending out the next round
of TEI output files for you to look at. These include things like better
handling of footnotes and verse (when we can detect them from the OCR
output). Do bear with me while I make time to work on these.

Regarding the richness of the output, as the group pointed out, there are
several areas for improvement, such as chapter delineation and document
structure extraction, as well as category-specific improvements like speaker
identification in plays. These are obviously hard problems to solve for the
highly varied universe of books, and we're working on them as best as we can
using many of the signals you suggested. Google also awards research grants
to institutions to help us and the larger digital humanities community with
these efforts. However, the quality of our TEI output for a book will
reflect the extent to which we are currently able to extract useful OCR and
structure information from it. So, for example, failure to delineate the
footnote section on a page could result in loss of paragraph
continuation, and failure to detect a paragraph of text as verse would
result in the output text not respecting line breaks, and so on.

These conditions require making a time-quality-coverage tradeoff, as Stuart
suggests. I'd like to make it by defining my current scope of work as
providing TEI output that (i) respects the TEI specification and best
practices guide, (ii) faithfully translates every piece of OCR and document
structure information that we can currently extract from books, and (iii) is
of sufficiently high quality that the files are not burdensome for the
community to curate and enrich. The last criterion, in particular, will rely
heavily on your feedback and judgment.

I should also mention that the team I work with is in the early stages of
taking a fresh look at our structure extraction algorithms. We aim to
steadily improve their quality and coverage, and those improvements should
automatically translate to richer TEI output over time.

Best,
Ranjith

On Wed, Jun 22, 2011 at 9:14 PM, stuart yeates <stuart.yeates at vuw.ac.nz> wrote:

> On 23/06/11 08:20, stuart yeates wrote:
>
>>> PS: Apologies if my choice of book to test on was poor. :) It was a purely
>>> random selection. If you have suggestions of alternate public domain books
>>> I'm happy to try and convert them and send their TEI files over.
>>>
>>
>> By picking a linear fiction work you made your life easier.
>>
>> Picking a formally-structured non-fiction work (cyclopaedia, almanac,
>> etc) will provide a challenge. Linear non-fiction (histories,
>> biographies, etc) with footnotes provide a separate set of challenges
>> (footnotes, references, etc).
>>
>> If you're looking for insights into the English / Western assumptions
>> you're making, I suggest that you do something in Chinese, Japanese or
>> Korean. Thai is also interesting, because it sits somewhere between
>> English and C/J/K in terms of conventions. It may be easier to start
>> with a non-English language that team members read/write.
>>
>
> Reflection suggests that my previous answer may not have answered the
> underlying issue.
>
> Your choice of book may or may not have been poor, depending on your
> purposes in choosing it. As a test of the 'first draft' of a technical
> solution it was probably a good choice.
>
> At some point you have to make some decisions about trade-offs between the
> completeness of coverage of the solution and other technical factors
> (quality, speed, price, etc). The placement of those trade-offs makes sense
> only after you've sat down and worked out exactly what it is you're trying
> to do and what your priorities are. Those trade-offs may lead you to choosing
> further books to test.
>
> If you are entirely focused on the kinds of linear popular fiction (novels
> and serialised novel-like works) which have been consumed en masse in the
> west for the last 300 years then your current test book (and/or other
> similar ones) may be all you need. You can now focus on quality, speed,
> price, etc.
>
> The TEI community takes a considerably wider, longer and deeper view,
> however, and the TEI standards are a product of that view. Sebastian, Martin
> and myself all made suggestions pointing to a larger universe of documents,
> but I think it's fair to say that most members of the TEI community have
> personal examples of documents which are substantially further from
> Gulliver's Travels than any of the examples given so far.
>
> The question to answer is not "Is this a poor test case?" but "How big a
> universe of documents matters to me?"
>
>
> cheers
> stuart
> --
> Stuart Yeates
> Library Technology Services http://www.victoria.ac.nz/library/
>