4.0050 OCR Scanning Errors (197)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Tue, 15 May 90 17:20:30 EDT


Humanist Discussion Group, Vol. 4, No. 0050. Tuesday, 15 May 1990.

(1) Date: Mon, 14 May 90 16:21:39 MDT (26 lines)
From: koontz@alpha (John E. Koontz)
Subject: Re: 4.0042 OCR errors; ...

(2) Date: Tue, 15 May 90 04:26:45 -0400 (43 lines)
From: Mark Rooks <rooks@cs.unc.edu>
Subject: Scan errors

(3) Date: Mon, 14 May 90 19:13:07 CDT (137 lines)
From: Mark Olsen <mark@gide.uchicago.edu>
Subject: Kurzweil errors

(1) --------------------------------------------------------------------
Date: Mon, 14 May 90 16:21:39 MDT
From: koontz@alpha (John E. Koontz)
Subject: Re: 4.0042 OCR errors; ...

The solution to the problem Bob Kraft mentions, correcting OCR
character recognition errors, is to use a spelling checker. The
checker in the PC word processor Nota Bene, for example, will build a
table of automatic corrections (change "wc" to "we", "lne" to "me",
etc.) as you go along. You just add the correction to your personal list by
specifying that the correction is to be made automatically hereafter. I
think that other word processors and stand-alone spelling checkers would
in most cases have similar features.

A problem would arise if a misrecognition resulted in a correct
spelling, but the examples he cites do not seem to result in such
problems.
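
As a rough illustration of the kind of correction table described above, here is a minimal Python sketch. It is not Nota Bene's mechanism; the names AUTO_CORRECTIONS, apply_corrections and learn are hypothetical, and the table holds only the two examples given in the message.

```python
import re

# Hypothetical personal correction list, seeded with the two examples above.
AUTO_CORRECTIONS = {"wc": "we", "lne": "me"}

def apply_corrections(text, table=AUTO_CORRECTIONS):
    """Replace whole words found in the table; leave everything else alone."""
    def fix(match):
        word = match.group(0)
        return table.get(word, word)
    return re.sub(r"[A-Za-z]+", fix, text)

def learn(table, wrong, right):
    """Record a correction so it is applied automatically from now on."""
    table[wrong] = right
```

Because the lookup is by whole word, a misrecognition that happens to be a correctly spelled word passes through untouched unless someone deliberately adds it to the table, which is exactly the case the caveat above is about.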

As an aside, for languages other than English, especially highly
inflected ones, note that SIL has a package of PC tools that can be used
to check spelling in languages that use phonemic or quasi-phonemic
orthographies. These tools can check plausibility of spelling based on
canonical form, as well as on list membership. This package is
available for $3.00 or $4.00 and is called something like Documentation
Aids for Non-Major Languages.
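
To make "plausibility based on canonical form" concrete, here is a small Python sketch. It is only an assumption about what such a check might look like, not a description of the SIL tools: the five-vowel inventory, the (C)V(C) syllable shape, and the function names are all invented for illustration.

```python
import re

VOWELS = set("aeiou")  # assumption: a toy five-vowel orthography

def cv_skeleton(word):
    """Reduce a word to its consonant/vowel pattern, e.g. 'banana' -> 'CVCVCV'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

# Assumed canonical form: one or more syllables shaped (C)V(C).
CANONICAL = re.compile(r"(C?VC?)+")

def is_plausible(word, known_words=frozenset()):
    """Accept a word if it is on the list or fits the canonical form."""
    if word.lower() in known_words:
        return True
    return CANONICAL.fullmatch(cv_skeleton(word)) is not None
```

Under these assumptions a form such as "rnay" is rejected for its initial consonant cluster, while an exceptional word can still be admitted through the word list.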

(2) --------------------------------------------------------------59----
Date: Tue, 15 May 90 04:26:45 -0400
From: Mark Rooks <rooks@cs.unc.edu>
Subject: Scan errors

Regarding Bob Kraft's inquiry concerning lists of common scanner
error-letter combinations:

I too would be interested in such a list; however, I would find more
useful a list of mis-scanned words in which the scanner error results in a
different, correctly spelled word. For example, 'modern' is sometimes
scanned as 'modem,' 'but' as 'hut,' etc. We have begun assembling such
a list, but would certainly be interested in what others have.
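
A list of this kind could be used to flag suspects for checking rather than to change anything. The Python sketch below assumes a hypothetical table REAL_WORD_CONFUSIONS holding only the two pairs mentioned above; the function name is likewise invented.

```python
import re

# Hypothetical seed list: word as scanned -> word(s) it may really be.
REAL_WORD_CONFUSIONS = {"modem": ["modern"], "hut": ["but"]}

def flag_suspects(text, confusions=REAL_WORD_CONFUSIONS):
    """Return (position, scanned_word, alternatives) for each suspect word."""
    suspects = []
    for m in re.finditer(r"[A-Za-z]+", text):
        alts = confusions.get(m.group(0).lower())
        if alts:
            suspects.append((m.start(), m.group(0), alts))
    return suspects
```

Nothing is replaced here: a genuine 'modem' or 'hut' is as likely to be flagged as a mis-scan, so the output is only a worklist for a reader.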

Writing an acceptable program to automate scan error correction would be
very difficult, since it would require a substantial semantic component.
Any ambitious automation would inevitably introduce correctly spelled
words which were not the words scanned, even with a semantic component.
A simple-minded automation would introduce even more. "Impossible" letter
combinations usually are not impossible, particularly in scholarly
materials and abbreviations. Of course such a program would be
acceptable (in my view), if it merely showed the context of the presumed
error to a human with the appropriate alternative, and gave the human
the option of rejecting the alternative. (See below.)
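
The acceptable variant described above, where a human sees each presumed error in context and may reject the alternative, might be sketched in Python as follows. This is an assumption-laden outline: the function name, the (start, end, alternative) proposal format, and the decide callback standing in for the human are all hypothetical, and proposals are assumed not to overlap.

```python
def review(text, proposals, decide, width=20):
    """Apply each (start, end, alternative) proposal only if decide() accepts it.

    decide(context, original, alternative) -> bool stands in for the human.
    Proposals are assumed not to overlap.
    """
    out, last = [], 0
    for start, end, alt in sorted(proposals):
        context = text[max(0, start - width):end + width]
        original = text[start:end]
        out.append(text[last:start])
        out.append(alt if decide(context, original, alt) else original)
        last = end
    out.append(text[last:])
    return "".join(out)
```

In an interactive tool decide would print the context and wait for a keystroke; the point of the design is that no change is made without that answer.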

Although scan errors tend to cluster around certain letter combinations,
this is just a tendency. 'rnay' (e.g.) might be generated by a word
other than 'may,' though it commonly would be. In our experience, a
correctly spelled "incorrect" word is worse than a host of misspelled
"incorrect" words, given that any word error causes us sleepless nights.
At some point a human must look at the remaining errors in the file, and
it is easier to overlook a correctly spelled word than an incorrectly
spelled one.
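
Since the clustering is only a tendency, one cautious approach is to list the possible readings of a scanned form instead of choosing one. The Python sketch below does that under stated assumptions: the CONFUSIONS pairs are illustrative, the lexicon is supplied by the caller, and only one substitution at one position is tried per candidate.

```python
# Assumed confusion pairs (scanned string -> intended string); illustrative only.
CONFUSIONS = [("rn", "m"), ("c", "e"), ("1", "l")]

def candidates(scanned, lexicon, pairs=CONFUSIONS):
    """Return lexicon words reachable by undoing one confusion at one position."""
    found = set()
    for wrong, right in pairs:
        start = scanned.find(wrong)
        while start != -1:
            guess = scanned[:start] + right + scanned[start + len(wrong):]
            if guess in lexicon:
                found.add(guess)
            start = scanned.find(wrong, start + 1)
    return sorted(found)
```

With a lexicon containing 'may', the form 'rnay' yields that single candidate, but a form such as 'cc' can yield more than one, and an empty result leaves the decision entirely to the reader.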

Omnipage has just introduced a spell-checking dictionary (Omnispell),
which we have purchased (but have yet to receive), designed for use with
scanners. Common scan errors (with scan flags (e.g. '~' and '^')) are
anticipated by the spell-checker, but a human must look and (at least)
click a button in each case (or so the advertising goes).

Mark Rooks
InteLex

(3) --------------------------------------------------------------157---
Date: Mon, 14 May 90 19:13:07 CDT
From: Mark Olsen <mark@gide.uchicago.edu>
Subject: Kurzweil errors

As usual, Bob Kraft has the good idea. After writing more routines to
clean up KDM errors in various languages and on various computers than I
care to recall, I think that a general program would be very helpful.
The following is a description of the corrections that we make to
scanned material and the sed script (UNIX Stream EDitor) which performs
those corrections. Unfortunately, most of this example is probably
French specific. I too would be happy to contribute to Bob's program.

Mark

This is a list of the automatic corrections for Kurzweil documents, as of
4/20/88. The caret "^" means at the beginning of a line, and the dollar
sign "$", at the end of a line. Note that spaces are significant.
Besides changing "," to "'", space is changed to "'" in that environment.

characters: changed to:

"dc" "de"
"nI" "M"
"^^I" "M"
"qu," "qu'"
"Qu," "Qu'"
"qu " "qu'"
"Qu " "Qu'"
" v " " y "
"^v " "y "
" ct" " et"
"^ct" "et"
" nc " " ne "
"^nc " "ne "
" nc$" " ne"
" cc " " ce "
"^cc " "ce "
" cc$" " ce"
"quc" "que"
"Quc" "Que"
"unc " "une "
"rnm" "mm"
"mrn" "mm"
" rn" " m"
"^rn" "m"
"rnent " "ment "
"rnent$" "ment"
":." ":"
".:" ":"
";," ";"
"]-" "j"
")-" "j"
"1-" "i"
" -- " " --- "
"^'" " "
"^." " "
" c," " c'"
" C," " C'"
" d," " d'"
" D," " D'"
" j," " j'"
" J," " J'"
" l," " l'"
" L," " L'"
" m," " m'"
" n," " n'"
" t," " t'"
"^c," "c'"
"^C," "C'"
"^d," "d'"
"^D," "D'"
"^j," "j'"
"^J," "J'"
"^l," "l'"
"^L," "L'"
"^m," "m'"
"^M," "M'"
"^n," "n'"
"^N," "N'"
"^s," "s'"
"^S," "S'"
"^t," "t'"
"^T," "T'"
" ll(s) " " Il(s) "
" 1l(s) " " Il(s) "
"^ll(s) " "Il(s) "
"^1l(s) " "Il(s) "
" ll(s)$" " Il(s)"
" 1l(s)$" " Il(s)"
"^ *" " " (tab)

# corrdocs.sed
#
s/dc/de/g
s/nI/M/g
s/\^^I/M/g
s/\([qQ]\)u[,\ ]/\1u'/g
s/ v / y /g
s/^v /y /
s/ ct/ et/g
s/^ct/et/
s/ nc / ne /g
s/^nc /ne /
s/ nc$/ ne/
s/ cc / ce /g
s/^cc /ce /
s/ cc$/ ce/
s/\([qQ]\)uc/\1ue/g
s/unc /une /g
s/rnm/mm/g
s/mrn/mm/g
s/ rn/ m/g
s/^rn/m/
s/rnent /ment /g
s/rnent$/ment/
s/:\./:/g
s/\.:/:/g
s/;,/;/g
s/[])]-/j/g
s/1-/i/g
s/ -- / --- /g
s/^['\.]/ /g
s/ \([cCdDjJlLmMnNsStT]\)[\ ,]/ \1'/g
s/^\([cCdDjJlLmMnNsStT]\)[\ ,]/\1'/g
s/^1'/l'/g
s/ 1'/ l'/g
s/ [1l]l / Il /g
s/^[1l]l /Il /g
s/ [1l]l$/ Il/g
s/ [1l]ls / Ils /g
s/^[1l]ls /Ils /g
s/ [1l]ls$/ Ils/g
s/^ */.P /
s/^\ \ */ /g
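
For readers without sed, a handful of the rules above can be tried out in Python. This is a sketch covering only a subset of the table, not a port of corrdocs.sed; the names RULES and correct_line are hypothetical, and a few start-of-line and mid-line variants have been folded into single patterns.

```python
import re

# A few rules from the table above, as (pattern, replacement) pairs.
# Order matters, as in the sed script: each rule sees the previous rule's output.
RULES = [
    (r"dc", "de"),
    (r"([qQ])u[, ]", r"\1u'"),
    (r" ct", " et"),
    (r"^ct", "et"),
    (r"rnent( |$)", r"ment\1"),
    (r"(^| )([cCdDjJlLmMnNsStT])[ ,]", r"\1\2'"),
]

def correct_line(line, rules=RULES):
    """Apply every rule in order to one line of scanned text."""
    for pattern, replacement in rules:
        line = re.sub(pattern, replacement, line)
    return line
```

For example, correct_line("l,homme dc bien") gives "l'homme de bien". Like the sed script, these rules are blunt: "dc" is rewritten wherever it occurs, so they suit French text of the kind described and little else.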
