3.204 scanning and the TEI (40)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Mon, 3 Jul 89 22:22:42 EDT


Humanist Discussion Group, Vol. 3, No. 204. Monday, 3 Jul 1989.

Date: Mon, 3 Jul 89 12:39 BST
From: Lou Burnard <LOU@VAX.OXFORD.AC.UK>
Subject: OCR and the Text Encoding Initiative

I have waited and watched in vain for someone else to jump on
Terry Erdt's astounding comment about OCRs and the TEI of June
14th (3.121) but a fortnight has gone by (an eon on Humanist) so
regretfully I rise to take the bait myself. Terry said (in case
you've forgotten) that given the wonderful capabilities of the
next generation of OCR devices - in particular their ability to
link bit mapped images of an original with the OCR output derived
from it - "the tedious and herculean efforts of the Text Encoding
Initiative may be misplaced or misdirected". Now, I won't argue
with the 'tedious' or even (modesty apart) the 'herculean', but
the 'misdirected' is just plain wrong. Suppose an OCR system were
capable of 100% accuracy in identifying the typeface and layout
of a printed or written page. (Suppose everything they say about
Optiram were true!) Suppose you got a machine-readable text in
which every change of font, every variation of point size, every
detail of inter-letter and inter-word spacing were perfectly
tagged. What use would it be if you couldn't tell the footnotes
from the running titles EXCEPT in terms of their typography?
Reading a text - and encoding the results of that reading - is
not only a matter of identifying what it looks like. It's also an
interpretative act. If the TEI doesn't deliver ways of making
those interpretations explicit then it really will be
misdirected, in much the same way as WYSIWYG word-processors are:
by focussing attention on the medium at the expense of the message.
Let me recommend, yet again, the CACM article on scholarly markup
by Coombs et al. as a reminder of what we are trying to achieve,
for those who have forgotten, or never knew.
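
To make the footnote and running-title point concrete, here is a minimal
sketch. Everything in it is invented for illustration: the sample page,
the field names, and the role labels are assumptions, and none of them
are TEI tags or the output of any real OCR device. It groups the same
three runs of text first by typography alone and then by an explicitly
recorded role.

```python
from collections import defaultdict

# Hypothetical output of a "perfect" OCR pass: typography only.
ocr_runs = [
    {"font": "roman", "size": 8, "text": "CHAPTER III"},
    {"font": "roman", "size": 10, "text": "It was a dark and stormy night."},
    {"font": "roman", "size": 8, "text": "1. See the first edition."},
]

# Hypothetical descriptive encoding of the same page: an editor has
# recorded what each run IS, not only what it looks like.
encoded_runs = [
    {"role": "running-title", "text": "CHAPTER III"},
    {"role": "body", "text": "It was a dark and stormy night."},
    {"role": "footnote", "text": "1. See the first edition."},
]


def group_by(runs, key_fn):
    """Collect the text of each run under the key that key_fn returns."""
    groups = defaultdict(list)
    for run in runs:
        groups[key_fn(run)].append(run["text"])
    return dict(groups)


# Grouping by appearance puts the running title and the footnote in
# the same bucket, because both are set in 8-point roman.
by_typography = group_by(ocr_runs, lambda r: (r["font"], r["size"]))

# Grouping by recorded role keeps them apart.
by_role = group_by(encoded_runs, lambda r: r["role"])

print(by_typography[("roman", 8)])
print(by_role["footnote"])
```

Under these assumptions the typographic grouping cannot separate the
running title from the footnote, while the descriptive encoding can,
because the distinction was recorded as an act of interpretation and
not inferred from appearance.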