3.1195 super scanning, cont. (158)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Wed, 21 Mar 90 21:58:18 EST

Humanist Discussion Group, Vol. 3, No. 1195. Wednesday, 21 Mar 1990.


(1) Date: Tue, 20 Mar 90 20:23:57 EST (13 lines)
From: cbf@faulhaber.Berkeley.EDU (Charles Faulhaber)
Subject: Re: 3.1191 super scanning (77)

(2) Date: Tue, 20 Mar 90 23:16:57 EST (28 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1191 super scanning (77)

(3) Date: Wed, 21 Mar 90 11:01 EDT (15 lines)
From: DURAND@pip.cc.brandeis.edu
Subject: Scanning the Library of Congress

(4) Date: Wed, 21 Mar 90 11:45:04 EST (8 lines)
From: Peter Shillingsburg <SHILL@MSSTATE>
Subject: Re: 3.1187 paper vs. e-documents (74)

(5) Date: Wed, 21 Mar 90 12:01:40 EST (59 lines)
From: "Steven J. DeRose" <IR400011@BROWNVM>
Subject: More specifics on scanning the LOC

(1) --------------------------------------------------------------------
Date: Tue, 20 Mar 90 20:23:57 EST
From: cbf@faulhaber.Berkeley.EDU (Charles Faulhaber)
Subject: Re: 3.1191 super scanning (77)

In re Bob Kraft's query about moving from
scanned microfilm to OCR. That's exactly
what we're going to try to do with the
Spanish-language incunabula in Madrid's
Biblioteca Nacional. We'll let you know
whether it worked, in a couple of years.

Charles Faulhaber
UC Berkeley
(2) --------------------------------------------------------------37----
Date: Tue, 20 Mar 90 23:16:57 EST
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1191 super scanning (77)

I would prefer to think of a grass roots operation in which the 1,000
scanners would be located throughout the library systems and operator
fees would be eliminated by the thought that if each library turned a
page a minute, we could create such an electronic library which could
be either shared by the entire library system or thrown into a public
domain collection as each copyrighted item reaches the end of term, a
royalty could be charged till that time in accordance with current $$
charged for printed matter.


In fact, Project Gutenberg would be happy to provide a few of each of
both public domain and copyrighted materials as a beginning, and will
encourage other etexts providers to do the same. As a first step, we
would like to annouce the release of a Shareware edition of "Alice in
Wonderland" which we hope to post for FTP availability.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

BITNET: HART@UIUCVMD INTERNET: HART@VMD.CSO.UIUC.EDU
(*ADDRESS CHANGE FROM *VME* TO *VMD* AS OF DECEMBER 18!!**)
(THE GUTNBERG SERVER IS LOCATED AT GUTNBERG@UIUCVMD.BITNET)
(3) --------------------------------------------------------------22----
Date: Wed, 21 Mar 90 11:01 EDT
From: DURAND@pip.cc.brandeis.edu
Subject: Scanning the Library of Congress

I think that a number of the repsonses to Steve's note were rather too
literal in their interpretation of the estimate -- I think it was intended to
indicate that the scale might more within grasp then we generally assume,
given a real commitment by the Federal government. If we assume that Steve's
estimate is off by 2 orders of magnitude (which is above where I would put it,
even after factoring in Murphy's law), then we have a figure of 4.1 (US)
billion dollars. This seems like a lot of money, but according to the April
Harper's Index (derived from congressional budget office figures) the Defense
department was able to save 3 (US) Billion dollars by moving its last pay
period back to the preceeding fiscal year. That's the price (as someone noted)
of one Stealth Bomber.
(4) --------------------------------------------------------------18----
Date: Wed, 21 Mar 90 11:45:04 EST
From: Peter Shillingsburg <SHILL@MSSTATE>
Subject: Re: 3.1187 paper vs. e-documents (74)

Does Steve DeRose assume that scanners will provide accurate texts, or
did I misread his note? My experience has been that scanned texts
require rather a lot of editing to be readable and then are probably
not accurate.
(5) --------------------------------------------------------------67----
Date: Wed, 21 Mar 90 12:01:40 EST
From: "Steven J. DeRose" <IR400011@BROWNVM>
Subject: More specifics on scanning the LOC


Ahem! I thought I stated I was giving an "idealistic" estimate:

>granted it's an idealistic construction, but it's not entirely fanciful.

As for a 10 page/minute scanner, I saw an article in the last *MacWeek*
about a new board for ATs, containing 10 sets of its OCR hardware,
which can dispatch bitmaps from physical scanner to its 10 OCR units
as available (e.g., if one page is especially time-consuming, the system
can still keep up with the other 9 units). Physical scanning is
easily up to 10 pages/minute with readily available scanners; this new
AT board claims 10ppm for OCR, too; price is about $10K. Perhaps the
device has some problems, I don't know; but they're selling it.

Bob Kraft is right that photocopying everything for a sheet feeder
would be expensive (both in toner and in labor). That's one reason
I mentioned the need for hardware to flip pages, which isn't all that
hard an engineering problem once sheet feeding has been solved.

In answer to Bob Hollander, yes I have dealt with scanners.
One project I consulted for has scanned something like 15,000 pages,
including a few Greek/English lexica and other hard cases (fine print,
mixed fonts, accented scripts,...). I also have close friends doing
similar work, so I think I am reasonably in touch with reality.

Proofreading is a serious issue for primary texts. However, for most
material it can be deferred, because (a) a moderate character error
rate does not impede human readability (fr xmpl, Shannon shovved thet
Iglich omly hac obavt l bit per leHer of infomnatiori--scanners generally
can do much better than this); (b) spelling correctors, smarter
retrieval software, and other technology yield decent retrieval accuracy
even in the face of a moderate error rate; and (c) the usefulness of
such files has been proven in (for example) the Lexis and Nexis dbs
(recently mentioned), which I understand to have quite a high typo rate.
Also, it's fairly easy to estimate how bad documents are (e.g., using
Markov-based character-sequence ratings), in order to single out the
worst for human-supervised spelling correction. We tend to think of
needing extreme accuracy, because we do for primary texts; but it just
isn't as critical for 10-year-old journal articles. I'd be quite
pleased to have readable copies of all of *Language* for the last
several decades on my desk. I'd even pay quite a bit for it (whereas
I'd seldom if ever buy back issues of the paper form even if available).

If my guesstimate is off by an order of magnitude, the point
remains, the more so because the technology continually gets
cheaper and more accurate. Also, I was figuring for the entire LOC -
which contains a lot of material either (a) online already, like legal
records and many recent publications; or (b) not of much interest.

The point I'm making is not that we're going to scan the LOC today.
Rather, I'm asking what the more fundamental issues are in comparing
paper and electronic media, and claiming that the cost of conversion,
the need to recopy, etc. (!), are not among those fundamental issues.

S