[tei-council] TEI and Google
julianne.nyhan at gmail.com
Thu Apr 21 17:39:38 EDT 2011
Dear Martin, dear all,
If you have gathered a bibliography on 'white-space XML' I'd be really
grateful if you could send it on to me.
Also, is the term 'white-space XML' often used?
On Wed, Mar 23, 2011 at 2:48 PM, Martin Mueller <martin.mueller at mac.com> wrote:
> By white-space XML I mean the various forms of XML generated by OCR. I've
> seen some of these as well as experiments done with them by Tim Cole,
> Katrina Fenlon, and others at UIUC. I think that Brian Pytlik Zillig at
> Nebraska has done some experiments with this as well.
> White-space XML gives you pretty good hints where words, lines,
> paragraphs, and pages break. There are problems with inferring paragraphs
> that straddle pages. Tim Cole's group did a pretty good job on those,
> marking beginning and ending paragraphs on separate pages as such. From what
> I've seen the results require some human editing at the end.
> Another problem is "non-line" lines in text, such as running headers,
> signatures, and page numbers. Katrina did some really good work with that.
> All this would fit into some mixed model of algorithmic and human data
> curation. That seems to be what Google is doing with their (so far)
> internal curation tools Goodoctor (?) and Agora. And from conversations
> with younger colleagues in Computer Science, I gather that it fits into
> new ways of conceptualizing the relationship between machine learning and
> human labor. Machines are very good at some things and very bad at others.
> Can we build frameworks that maximize complementary powers?
> That would give new strength to old proverbs like "Many hands make light
> work," as Rose Holley recently argued in her discussion of crowdsourcing
> in (http://www.dlib.org/dlib/march10/holley/03holley.html). Katherine
> Rowe at Bryn Mawr has drawn my attention to the remarkable work of Robert
> Binkley who was responsible for the WPA local history project. His essay
> "New Tools for Men of Letters" from the Yale Review of 1935
> (http://www.wallandbinkley.com/rcb/articles/newtools-output.html) is a
> fascinating reflection on the relationships of technology, media, and
> culture, alternately pessimistic about the powers of Big Media and
> romantically idealistic about new technologies offering counterbalancing
> ways of "working the other way -- as implements for a more decentralized
> and less professionalized culture, a culture of local literature and
> amateur scholarship."
> It is well worth reading and perhaps not accidentally a close contemporary
> of Walter Benjamin's famous 1935(?) article about the work of art in an
> age of mechanical reproduction.
> Google and TEI is a little bit like agribusiness and organic farming.
> Are there ways of combining the virtues of the very large or "très grand"
> (why do things always sound better in French?) with the virtues of the
> quite small?
> It is my hunch that imaginative proposals coming out of the environment
> towards which I'm gesturing with these remarks will find a friendly
> reception with the Mellon Foundation and that there are quite a few ears
> inside Google that like to hear such things.
> On 3/23/11 8:35 AM, "Kevin Hawkins" <kevin.s.hawkins at ultraslavonic.info> wrote:
> >Replying only to Council since I can't post to the Board list ...
> >On 3/23/2011 1:12 AM, Martin Mueller wrote:
> >> I spent the evening reading Hofmannsthal's Der Schwierige in a Google
> >> facsimile and looked at its epub version. I was reminded of some very
> >> interesting experiments that Tim Cole and various staff people at UIUC
> >> have done with converting what I call white-space XML into TEI. It appears
> >> you can go pretty far with some combination of algorithmic
> >> and human curation.
> >To clarify, by "white-space XML" are you thinking of something like text
> >produced according to Level 1 or Level 2 of the Best Practices for TEI
> >in Libraries?
> >tei-council mailing list
> >tei-council at lists.village.Virginia.EDU
> >PLEASE NOTE: postings to this list are publicly archived
Dr Julianne Nyhan,
(UCL & Universitaet Trier)
*Direct Line:* +44 (0)20 7679 7206
*Fax:* +44 (0)20 7383 0557
*Office:* G15a, Department of Information Studies, Foster Court, University
College London, WC1E 6BT, U.K.