4.0282 Multilingual OCR (1/91)
Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 16 Jul 90 18:21:32 EDT
Humanist Discussion Group, Vol. 4, No. 0282. Monday, 16 Jul 1990.
Date: Fri, 13 Jul 90 23:08:31 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1>
Subject: MULTILINGUAL OCR
Multilingual OCR/ICR With Kurzweil 5100/5200
I would appreciate help from anyone who has had experience scanning
multilingual documents or mixed non-Roman scripts using the Kurzweil
Model 5100 (or 5200) scanner. If you have personal experience or can
supply the name of someone who has, I will be deeply grateful: please
include name, email address, postal address and phone number. I need
to make these contacts in the next three weeks.
About a year ago I tested the Kurzweil Model 5100 sufficiently to
determine that its optics (400 dpi) and brain are excellent for
scanning standard Roman and non-Roman scripts; it has several
striking performance advantages over the older Model 4000. For
example, it faithfully scans sub-linear diacritics (like Hebrew
vowels) where the Model 4000 is nearly blind. Unfortunately, I was
unable to make one crucial test. This is now my primary area of
interest: the arbitrary assignment of "special characters" to
4- or 5-byte strings. Trainability is the issue, of course, though
Kurzweil's marketing division (feeling pressure from
Palantir/Calera's "automatic" scanning) forced the term
"verification" to be used rather than "training" for the current
models. Whereas the Model 4000 permitted about 400 of these "MPD"
assignments (as I recall) and allowed mapping to 3-character strings,
the Kurzweil Model 5100 and 5200 support an allegedly "unlimited"
number of MPD's, and mapping to 4-character strings (5100) or
5-character strings (5200). As of a phone conversation today,
Cambridge-Kurzweil stands by this claim of "unlimited" number of
MPD's, but this is still theory as far as I'm concerned. I want to
hear whether someone has tested Kurzweil 5100/5200 trainability in
making 500-2500 or more such arbitrary assignments, and whether
performance is thereby degraded and, if so, in what ways and to what extent.
Theoretically, the technology would allow for millions of
special-character assignments (hi-bit chars are legal with a few
reserved chars) but I confess I'm skeptical. "Unlimited" is clearly
hyperbolic, and "millions" is probably also false, so what's the
truth?
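A back-of-the-envelope count helps separate the two limits in question
here. The sketch below counts only the distinct output strings that 4-
or 5-character mappings could name. The figure of 16 reserved byte
values is a placeholder assumption (all that is known is "a few
reserved chars"), and `code_space` is an illustrative name, not
anything from Kurzweil's software.

```python
# Rough count of the distinct output strings available for
# special-character assignments. RESERVED is a hypothetical figure,
# not a Kurzweil specification.
BYTE_VALUES = 256
RESERVED = 16  # assumption: "a few reserved chars"


def code_space(length: int, reserved: int = RESERVED) -> int:
    """Number of distinct strings of exactly `length` usable byte values."""
    usable = BYTE_VALUES - reserved
    return usable ** length


if __name__ == "__main__":
    for model, length in (("5100", 4), ("5200", 5)):
        count = code_space(length)
        print(f"Model {model}: {count:,} possible {length}-character strings")
```

Under that assumption the string space alone runs to billions, so any
practical ceiling on MPD's would presumably come from the recognition
engine's storage and matching speed rather than from the output
strings. That is the part that needs testing.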
One unfortunate disadvantage of the "omnifont" (general feature
extraction) technology used in the newer Kurzweil Discovery Series is
that the user has lost control over fixed "font" assignments. On the
earlier Kurzweil models ("multiple-font recognition"), trainability
suffered from too limited a number of map-able characters, but at
least the user *could* map characters of a certain set into discrete
"fonts." This permitted a mode or state operation (it takes two
characters in a font to trip the "font sequence"), so that alphabets,
languages or other sets with typographically-distinct print
attributes could be output in delimited formats. Delimited strings
based upon user-defined font sets have apparently been lost in the
"omnifont" technology, at least in Kurzweil's scanners.
The Model 5200's 5-character-string mapping buys back some of the
earlier "font" functionality if one makes creative (if necessarily
painful) mnemonic assignments with four characters, leaving the fifth
(first) byte as a font identifier. (This is just theory too!) For
example, pointed Hebrew could be scanned -- I tried this -- in
block-character units, using one hi-bit character for the Hebrew font
identifier, two characters for the (mnemonic) consonant-name, and two
for the (mnemonic) vowel-name. Training would not be a nightmare,
and subsequent text-processing with string handling utilities could
supplement the standard output with delimiters based upon these named
entities in fonts/alphabets/languages. We all do this for markup
anyway.
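The post-processing step might look something like the sketch below.
It assumes the scanner output is ordinary single characters mixed with
5-character units, each unit being one hi-bit font identifier, a
2-character consonant mnemonic and a 2-character vowel mnemonic. The
identifier byte 0x80, the mnemonics (BT, RS, SW, TS), the tag names and
the function names are all invented for illustration.

```python
# Hypothetical post-processor: wrap runs of 5-character units that
# share a font identifier in named delimiters.
from itertools import groupby

UNIT = 5
FONT_NAMES = {"\x80": "hebrew"}  # assumed hi-bit font identifier


def tokenize(stream):
    """Yield (font, consonant, vowel); font is None for plain characters."""
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch in FONT_NAMES:
            unit = stream[i:i + UNIT]
            if len(unit) < UNIT:
                raise ValueError("truncated 5-character unit at end of stream")
            yield (FONT_NAMES[ch], unit[1:3], unit[3:5])
            i += UNIT
        else:
            yield (None, ch, "")
            i += 1


def delimit(stream):
    """Return the stream with each same-font run wrapped in <font>...</font>."""
    out = []
    for font, run in groupby(tokenize(stream), key=lambda t: t[0]):
        if font is None:
            out.append("".join(c for _, c, _ in run))
        else:
            body = " ".join(f"{c}{v}" for _, c, v in run)
            out.append(f"<{font}>{body}</{font}>")
    return "".join(out)


if __name__ == "__main__":
    print(delimit("In \x80BTSW\x80RSTS."))
```

Keeping the font identifier as the first byte of every unit means the
delimiters can be recovered from the output alone, without any state
carried over from the scanner.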
Questions:
(1) Does the Kurzweil Model 5100/5200 actually support an "unlimited"
number of MPD's -- and at what cost? What *actually* starts
happening with the second, third, fourth, fifth...alphabet?
(2) Are there any other trainable scanners that can compete with
Kurzweil in this arena? Until now, I have not heard of any serious
competition to Kurzweil if you have documents in Greek, Hebrew,
Cyrillic or other mixed non-Roman scripts where training is required
from the ground up. ("Optiram" does not interest me at this point.)
Thanks to anyone who will contact me if you can shed light on
these matters. No flames please: I know many believe scanning will
never be practical for digitizing multilingual texts. But even
industry -- and the EEC moving toward 1992 -- are starting to say
otherwise.
Robin Cover
DTS - Semitics & OT
3909 Swiss Avenue
Dallas, TX 75204
AT&T: (214) 296-1783
FAX: 214-841-3540
BITNET: zrcc1001@smuvm1
INTERNET: robin@txsil.lonestar.org
UUCP: ...texbell!txsil!robin