Humanist Discussion Group, Vol. 16, No. 16.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>
Date: Fri, 10 May 2002 06:20:16 +0100
From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
Subject: OCRing Handwriting
Very few problems are really insurmountable, but OCR of handwriting
comes close. It is not likely that we will see it any time soon,
particularly for medieval manuscripts. You need to look upon each
hand as a font-type, containing several fonts. For example, I
trained my OCR software to read the font of SUGNL, which contains
the largest collection of Old Norse literature, because I realized
that I needed Old Norse; but that collection is larger than the
output of any medieval scribe. If you remember the problems we had, still
not all solved, with the various fonts in which a modern book is
printed, you can see the problem more clearly. Although
schoolmasters have tried hard in the past to get all their students
to write the same way, they have only rarely been even close to
getting it done (one might cite the Carolingian minuscule).
Remember all those people who claim not to be able to read their
own notes (J. W. Marchand, for example). Of course, we have to
distinguish between `print' and `cursive' (we learn to print
up to about mid-fourth grade, then cursive). This points up the
difficulty of the major move in OCR, pattern recognition. We can
all remember (and still suffer from) the advent of transitional
probabilities and guesses into OCR, and how much it helped out.
Who has not had to remember to turn off `recognition' (of English)
when scanning German? Transitional probabilities and lexicon check
are mainly there for English, though other languages use them, too.
For a Carolingian manuscript, to look at Bill Schipper's problem,
pattern recognition is difficult if not impossible; think how many
scholarly arguments we have over the reading of a letter or two.
Transitional probabilities are not available for Latin, although
God only knows why not. We have only very few Latin lexica
available in electronic form.
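
For readers who have not met the term, here is a minimal sketch of what
transitional probabilities and a lexicon check amount to. It is not how
any particular OCR package does it; the function names, the add-one
smoothing, the alphabet size of 30, and the tiny Latin sample are all my
own assumptions for illustration. The idea is to count how often one
letter follows another in text already transcribed, then prefer the
candidate reading whose letter sequence is most probable.

```python
import math
from collections import defaultdict


def train_bigrams(corpus):
    """Count letter-to-letter transitions in already-transcribed text."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus.lower().split():
        padded = "^" + word + "$"  # mark word start and word end
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def log_prob(word, counts, alphabet_size=30):
    """Add-one smoothed log-probability of a word's letter sequence.

    alphabet_size is an assumed rough figure, not a measured one.
    """
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        seen = row.get(b, 0)
        total += math.log((seen + 1) / (sum(row.values()) + alphabet_size))
    return total


def best_reading(candidates, counts, lexicon=None):
    """Pick the most probable candidate reading.

    If a lexicon is given and at least one candidate is in it, only those
    candidates are considered (the lexicon check); otherwise the choice
    rests on transitional probabilities alone.
    """
    if lexicon:
        known = [c for c in candidates if c.lower() in lexicon]
        if known:
            candidates = known
    return max(candidates, key=lambda w: log_prob(w, counts))


# Hypothetical usage: a few words of Latin stand in for a real corpus.
sample = "in principio erat verbum et verbum erat apud deum et deus erat verbum"
model = train_bigrams(sample)
print(best_reading(["vcrbum", "verbum"], model))
```

A model trained on English will of course prefer English-looking
readings, which is why one turns it off for German; for Latin one would
first need enough transcribed text to count from.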
We might be able to train an OCR program like the old Kurzweil to
read the hand of a single scribe (though, as Wilhelm Braun pointed
out, "wer schreibt an allen Tagen gleich?", i.e. "who writes the
same way every day?"), but à quoi bon? Some
hands are very uniform; Ihre found the Codex Argenteus's Gothic
to be so uniform that he thought Wulfila had invented movable type
(4th C. AD), but even there it is easy to see places where there
is little uniformity, and modern authorities have seen two `hands'.
Of course, there is always the possibility of teaching us to write
more uniformly and with recognizable distinctive features, as in
the case of a hand-held device, but that does not help those of us who
crave an OCR program for those medieval (ancient, foreign, etc.)
manuscripts. Unfortunately, such a program does not seem likely.