[tei-council] Fwd: Re: Google Books > TEI

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Thu Aug 9 22:24:54 EDT 2012


We agreed at today's meeting that I would remind you all to consult my 
emails from 28 June to see sample TEI books created by Google.  A number 
of people have expressed interest in reviewing the samples.

I see now that no one answered Martin's question on this thread ...

On 6/28/12 12:20 PM, Martin Holmes wrote:
> Does anyone have any experience in calculating the accuracy of OCR and
> automated markup? Do we do errors-per-page? Is a word either wrong or
> right, or do we count errors inside words? Do we count missing or
> misplaced column or page breaks as errors?
>
> Presumably we'll need to create "perfect" hand-crafted versions of a set
> of sample pages in order to do the accuracy calculation. How many do we
> need to get a reasonable sample?

Paul has extensive documentation on error rates for double-keyboarded 
text: http://www.textcreationpartnership.org/docs/errors/errors1.html . 
Nothing else comes to mind in that area.
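
To make Martin's "is a word either wrong or right, or do we count errors 
inside words?" concrete, here is a minimal sketch of the two usual ways of 
counting, assuming we have a hand-corrected reference string for a page. 
The function names and the sample strings are made up for illustration; 
this is not anything Google or the TCP actually uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr):
    """Counts errors inside words: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr) / len(reference)


def word_error_rate(reference, ocr):
    """Treats each word as simply right or wrong: word edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, ocr.split()) / len(ref_words)


# Hypothetical sample: two of four words are damaged, by three characters in all.
reference = "the quick brown fox"
ocr = "the qnick brovvn fox"
print(char_error_rate(reference, ocr))  # 3 edits over 19 characters
print(word_error_rate(reference, ocr))  # 2 edits over 4 words
```

The same page looks far worse by the word measure than by the character 
measure, so whichever one we pick needs to be stated with any figure.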

Still, I don't think we should worry about the OCR accuracy.  There are 
a couple of reasons for this:

1) Google keeps reprocessing images as their OCR technology improves, so 
they keep generating new OCR.

2) The processes for creating TEI are separate from the OCR processes.

So instead, focus on whether the TEI markup is applied correctly.  Compare 
the markup with the page images and see whether anything gets 
misidentified.  For example, it might turn out that their heuristics for 
identifying block quotes assume a wider indentation of the text than is 
found in certain books, causing those block quotes to be missed.
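
To illustrate the kind of failure I mean, here is a purely hypothetical 
sketch of an indentation-based heuristic. I don't know how Google's 
process actually works; the function, the threshold, and the pixel values 
are all invented for the example.

```python
def find_block_quotes(lines, body_indent, min_extra_indent):
    """Return indices of lines indented at least min_extra_indent beyond the body text.

    `lines` is a list of (left_indent, text) pairs; all units are hypothetical pixels.
    """
    return [i for i, (indent, _text) in enumerate(lines)
            if indent - body_indent >= min_extra_indent]


# Hypothetical book with generously indented block quotes.
wide_book = [(100, "He wrote as follows:"),
             (140, "First line of the quotation,"),
             (140, "second line of the quotation."),
             (100, "The text then resumes.")]

# Hypothetical book whose block quotes are only slightly indented.
narrow_book = [(100, "He wrote as follows:"),
               (115, "First line of the quotation,"),
               (115, "second line of the quotation."),
               (100, "The text then resumes.")]

# A threshold tuned on the first book finds nothing in the second.
print(find_block_quotes(wide_book, body_indent=100, min_extra_indent=30))    # [1, 2]
print(find_block_quotes(narrow_book, body_indent=100, min_extra_indent=30))  # []
```

A miss like this would show up as ordinary paragraphs in the TEI where the 
page image clearly shows a set-off quotation.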

All of this means you basically have to skim markup and look for oddities.

