Wed Mar 23 10:48:34 EDT 2011

By white-space XML I mean the various forms of XML generated by OCR. I've
seen some of these as well as experiments done with them by Tim Cole,
Katrina Fenlon, and others at UIUC. I think that Brian Pytlik Zillig at
Nebraska has done some experiments with this as well.

White-space XML gives you pretty good hints about where words, lines,
paragraphs, and pages break.  There are problems with inferring paragraphs
that straddle pages. Tim Cole's group did a pretty good job on those,
marking the beginning and ending paragraphs on separate pages as such. From
what I've seen, the results require some human editing at the end.
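
As a rough illustration of the page-straddling problem, here is a minimal
Python sketch. It is not the method Tim Cole's group used. The input shape
(a list of pages, each a list of paragraph strings), the function names, and
the heuristic itself are assumptions made for the example: a paragraph is
treated as continuing onto the next page when it does not end in
sentence-final punctuation and the next page opens with a lowercase letter.
Words hyphenated across the page break are not handled.

```python
import re

# Assumed heuristic: sentence-final punctuation, optionally followed by
# closing quotes or brackets, marks a paragraph that really ends.
TERMINAL = re.compile(r'''[.!?]["')\]]*$''')


def straddles(last_par, first_par):
    """Guess whether last_par (end of one page) continues in first_par."""
    last = last_par.rstrip()
    first = first_par.lstrip()
    if not last or not first:
        return False
    return TERMINAL.search(last) is None and first[0].islower()


def merge_across_pages(pages):
    """Flatten pages into one paragraph list, joining straddling paragraphs."""
    merged = []
    for page in pages:
        paras = list(page)
        if merged and paras and straddles(merged[-1], paras[0]):
            # Join the two halves with a single space.
            merged[-1] = merged[-1].rstrip() + " " + paras[0].lstrip()
            paras = paras[1:]
        merged.extend(paras)
    return merged
```

A heuristic like this will misfire on headings, verse, and abbreviations,
which is one reason the output would still need a human pass.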

Another problem is "non-line" lines in text, such as running headers,
signatures, and page numbers. Katrina did some really good work with that.

All this would fit into some mixed model of algorithmic and human data
curation. That seems to be what Google is doing with their (so far)
internal curation tools Goodoctor (?) and Agora. And from conversations
with younger colleagues in Computer Science, I gather that it fits into
new ways of conceptualizing the relationship between machine learning and
human labor. Machines are very good at some things and very bad at others.
Can we build frameworks that maximize complementary powers?

That would give new strength to old proverbs like "Many hands make light
work," as Rose Holley recently argued in her discussion of crowdsourcing
(http://www.dlib.org/dlib/march10/holley/03holley.html).  Katherine
Rowe at Bryn Mawr has drawn my attention to the remarkable work of Robert
Binkley who was responsible for the WPA local history project. His essay
"New Tools for Men of Letters," from the Yale Review of 1935
(http://www.wallandbinkley.com/rcb/articles/newtools-output.html) is a
fascinating reflection on the relationships of technology, media, and
culture, alternately pessimistic about the powers of Big Media and
romantically idealistic about new technologies offering counterbalancing
ways of "working the other way -- as implements for a more decentralized
and less professionalized culture, a culture of local literature and
amateur scholarship."

It is well worth reading and perhaps not accidentally a close contemporary
of Walter Benjamin's famous 1935(?) article about the work of art in an
age of mechanical reproduction.

Google and TEI are a little bit like agribusiness and organic farming.
Are there ways of combining the virtues of the very large or "très grand"
(why do things always sound better in French?) with the virtues of the
quite small?

It is my hunch that imaginative proposals coming out of the environment
towards which I'm gesturing with these remarks will find a friendly
reception with the Mellon Foundation and that there are quite a few ears
inside Google that like to hear such things.

