[tei-council] Fwd: Re: Google Books > TEI
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Thu Aug 9 22:24:54 EDT 2012
We agreed at today's meeting that I would remind you all to consult my
emails from 28 June to see sample TEI books created by Google. A number
of people have expressed interest in reviewing the samples.
I see now that no one answered Martin's question on this thread ...
On 6/28/12 12:20 PM, Martin Holmes wrote:
> Does anyone have any experience in calculating the accuracy of OCR and
> automated markup? Do we do errors-per-page? Is a word either wrong or
> right, or do we count errors inside words? Do we count missing or
> misplaced column or page breaks as errors?
>
> Presumably we'll need to create "perfect" hand-crafted versions of a set
> of sample pages in order to do the accuracy calculation. How many do we
> need to get a reasonable sample?
Paul has extensive documentation on error rates for double-keyboarded
text: http://www.textcreationpartnership.org/docs/errors/errors1.html .
Nothing else comes to mind in that area.
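To make Martin's word-versus-character question concrete, here is a
minimal sketch of how the two counts differ. It assumes a hand-corrected
reference transcription of the same page is available and uses plain
Levenshtein edit distance; the function names are mine, not from any
existing tool, and page or column breaks are not counted at all.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(reference, ocr):
    """Character and word error rates of `ocr` against `reference`.

    Assumption: words are whitespace-separated tokens, and a word counts
    as wrong if it differs from the reference word in any way.
    """
    ref_words = reference.split()
    ocr_words = ocr.split()
    return {
        "char_error_rate": edit_distance(reference, ocr) / max(len(reference), 1),
        "word_error_rate": edit_distance(ref_words, ocr_words) / max(len(ref_words), 1),
    }


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    print(error_rates("the quick brown fox", "the qu1ck brovvn fox"))
```

The same page can look much worse by word than by character, so whichever
measure is chosen needs to be stated alongside any figure.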
Still, I don't think we should worry about the OCR accuracy. There are
a couple of reasons for this:
1) Google keeps reprocessing images as their OCR technology improves, so
they keep generating new OCR.
2) The processes for creating TEI are separate from the OCR processes.
So I suggest focusing instead on whether the TEI markup is applied
correctly: compare the markup with the page images and see which
structures get misidentified. For example, it might turn out that their
heuristics for identifying block quotes assume a wider indentation of
the text than is found in certain books, leading such block quotes to be
missed.
All of this means you basically have to skim markup and look for oddities.
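If someone does hand-correct the markup for a few sample pages, a rough
way to narrow down where to look is to tally element names in the
generated file and in the corrected one and list the differences. This is
only a sketch: I am assuming both versions are well-formed XML, and the
use of <quote> for block quotes in the sample below is my guess, not
something confirmed from Google's output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(tei_xml):
    """Count elements by local name, ignoring the namespace prefix."""
    root = ET.fromstring(tei_xml)
    return Counter(el.tag.split("}")[-1] for el in root.iter())


def markup_diff(generated_xml, checked_xml):
    """Map each element name whose count differs to (generated, checked)."""
    gen = tag_counts(generated_xml)
    chk = tag_counts(checked_xml)
    return {
        tag: (gen[tag], chk[tag])
        for tag in sorted(set(gen) | set(chk))
        if gen[tag] != chk[tag]
    }


if __name__ == "__main__":
    # Invented fragments, for illustration only.
    generated = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<p>Intro</p><p>Quoted passage</p>"
        "</body></text></TEI>"
    )
    checked = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<p>Intro</p><quote>Quoted passage</quote>"
        "</body></text></TEI>"
    )
    print(markup_diff(generated, checked))
```

A difference in counts only says that something was tagged differently,
not where, so it complements skimming against the page images rather than
replacing it.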