optical scanners, cont. (80)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Wed, 5 Apr 89 20:23:44 EDT


Humanist Mailing List, Vol. 2, No. 806. Wednesday, 5 Apr 1989.

Date: Tue, 04 Apr 89 14:47:04 -0800
From: mbb@jessica.Stanford.EDU
Subject: OCR: sorting it out


Jamie Hubbard did a nice job of collecting bits of
information on the current state of OCR. I have only
a few points to add:

> I was told informally by Kurzweil sources that they
have a trainable system under development. This would
be the replacement for the Model 4000, which they phased
out this year. But I'd guess that Terry Erdt has the
inside scoop on this.

> Accuracy is still a significant problem, in my opinion.
Many tests in reviews are done with clean, typeset material.
That's fine, but when I look at the stuff the folks at
Stanford want to scan, it's anything but clean!

99% accuracy, I think, is almost a minimum! Consider that
a typical printed page holds 2,000 characters: at 99% accuracy
this still means 20 errors per page! Spell checking is still
a pretty tedious exercise.
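
To make the arithmetic concrete, here is a minimal sketch in Python. The
function name and the extra accuracy levels (99.5% and 99.9%) are only
illustrative assumptions; the 2,000 characters per page and the 99% figure
are the ones used above.

```python
def expected_errors_per_page(chars_per_page: int, accuracy: float) -> float:
    """Expected number of misrecognized characters on one page.

    `accuracy` is the fraction of characters read correctly (0.99 for 99%).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return chars_per_page * (1.0 - accuracy)


# 2,000 characters per page is the figure assumed in the text above;
# the accuracy levels beyond 99% are hypothetical, for comparison only.
for accuracy in (0.99, 0.995, 0.999):
    errors = expected_errors_per_page(2000, accuracy)
    print(f"{accuracy:.1%} accuracy: about {errors:.0f} errors per page")
```

Even at 99.9% the sketch still gives about two errors on every page, each
of which somebody has to find and correct.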

Hence if an OCR system doesn't score nearly 100% on clean
originals, one can imagine its performance on, say, a 19th
century newspaper, which is full of letters with punctured
counters, broken hairline strokes, and letters that do not
sit accurately on the baseline.

In general, I think it would be useful if we could agree
on some standard benchmark tests that would measure the
performance of an OCR system relative to humanist needs.
This is one of the items I'm proposing be discussed during
the panel discussion on scanning at the Toronto conference.
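
As one possible starting point for such a benchmark, a scanned page could
be scored against a hand-corrected transcription of the same page. The
sketch below is only an assumption about how that might be done, not an
agreed standard: it uses plain edit (Levenshtein) distance, and the
function names and the sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # spurious character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of the reference text recovered, as 1 - distance / length.

    The value can drop below zero if the OCR output is much longer
    than the reference.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Hypothetical sample: a hand-corrected line and an imagined OCR reading of it.
truth = "punctured counters and broken hairlines"
ocr_output = "punctured c0unters and broken hairlines"
print(f"character accuracy: {character_accuracy(truth, ocr_output):.3f}")
```

A real benchmark would also have to fix the test material (clean typeset
pages, old newspapers, text with diacritics), which is exactly what would
need to be agreed on.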

Perhaps at the conference we can do something ad hoc to
set up such a slate of benchmarks. If anyone is interested,
send me mail (gx.mbb@stanford.bitnet) and I'll sneak up
on you at the conference.

> If Jamie really has 250,000 pages to do, I'd be very wary of
OCR programs running on microcomputers and utilizing desktop
scanners. From what I've seen, they are too slow, especially
for such a tremendous volume. (And I thought that the works
of Nietzsche, which I'm scanning now, were a big undertaking...)

> I think Jamie is right to emphasize trainability as a key
feature. Most every company that sells an OCR system has
business use, and not academic use, in mind. An example of
this is Calera, whose system, last time I checked, couldn't
even recognize Latin characters with diacritics.

Malcolm Brown
Stanford

[Editorial footnote: if you don't know what conference Malcolm is
referring to, take notice: it is The Dynamic Text, the first joint
conference of the ACH and ALLC in Toronto, 5-10 June 1989.
Associated events are the Toronto-Oxford Summer School in Humanities
Computing, 29 May - 2 June and 12 - 16 June, and the software and
hardware fair, Tools for Humanists, 6-9 June. Several scanning systems
will be exhibited at the fair, together with approximately 50 other
interesting items. A schedule of events in the fair will be published
soon on Humanist. Further information is contained in the file DYNAMTXT
CONFRNCE, available on the file-server; see your Guide to Humanist for
instructions on how to download this file. For a conference booklet,
containing the registration materials, send a request to
cch@vm.epas.utoronto.ca OR to cch@utorepas. --W.M.]