12.0605 gadfly notes on XML & TEI

Humanist Discussion Group (humanist@kcl.ac.uk)
Sat, 1 May 1999 08:23:56 +0100 (BST)

Humanist Discussion Group, Vol. 12, No. 605.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

Date: Sat, 01 May 1999 07:39:21 +0100
From: Mark Olsen <mark@barkov.uchicago.edu>
Subject: XML and TEI: Gadfly Notes

I read with interest the announcement of the TEI Consortium.
As a supporter of the intellectual aims and most of the
results of the TEI, I can only wish the Consortium the greatest
success. But -- of course there is a but :-) -- the near
simultaneous appearance of this announcement and an article in
this month's Scientific American by Jon Bosak and Tim Bray,
both of whom played crucial roles in the development of XML,
["XML and the Second Generation Web", SciAm May 99
http://www.sciam.com/1999/0599issue/0599bosak.html]
discussing XML may have interesting implications for humanities
computing.

The TEI Consortium sees a large part of its mission and future
implicated in the early deployment of XML:

> With the rise of XML, some observers predict that the TEI Consortium
> has a window of opportunity to make TEI a much more widely used method
> of text encoding than ever before. "Now that common off-the-shelf
> browsers are beginning to support XML and style sheets," says David
> Chesnutt (who represents the ACH on the TEI consortium transition
> team), "we have something to give users that they can use, in their
> current environments, today. We no longer have to tell them about pie
> in the sky by and by -- there's software they can use right now.
> We'll always need more software for scholarly work, sure -- but it is
> a big improvement to be able to use Internet Explorer or Netscape to
> look at a historical documentary edition in its SGML form, instead of
> having to translate it into HTML or use a piece of software people
> wouldn't otherwise need to get." "When people see what XML and style
> sheets can do," said Allen Renear of Brown, "who is going to want to
> continue using HTML? They'll be looking for good DTDs to use -- and
> the TEI is going to be RIGHT THERE and ready for them."

Such enthusiasm among members of the TEI is certainly warranted. A common
problem raised with TEI in particular (and SGML in general) is that there
is a dearth of software. The few candidates for general software
solutions which have emerged from the academic community have developed
into large companies whose primary market is the management and e-publication
of corporate information systems, either abandoning the academic
market completely or charging very high prices for total solutions that
more often than not fail to handle the kinds of problems posed by
scholars in the humanities. So, the enthusiasm of the TEI consortium
for XML is certainly understandable.

A critical assumption here is that the primary use of TEI and other
encoded texts is READING documents online. Rather than read TEI
documents in either rendered HTML or using an SGML browser, XML will
allow us to read TEI encoded documents in a browser. The implication
is that the primary focus of humanities computing should be electronic
publication. It is not clear to me that this is the case. Renear and
Chesnutt are only addressing reading here, but their own work shows
that searching and analysis are crucial elements of textual research.
Searching and analysis, which might be combined with e-text "publication",
seem to be more consistent with the long tradition of humanities computing
scholarship. But I digress (tho' this is an important digression).

It is NOT clear to me that XML is going to be the solution that the
TEI Consortium hopes because XML may require dramatic increases
in costs. Bosak and Bray conclude the SciAm article, writing:

>> Web site designers, on the other hand, will find
>> it more demanding. Battalions of programmers will be needed to exploit
>> new XML languages to their fullest. And although the day of the
>> self-trained Web hacker is not yet over, the species is endangered.
>> Tomorrow's Web designers will need to be versed not just in the
>> production of words and graphics but also in the construction of
>> multilayered, interdependent systems of DTDs, data trees, hyperlink
>> structures, metadata and stylesheets--a more robust infrastructure for
>> the Web's second generation. [Bosak and Bray, p. 93]

This may be a worrisome development for humanities computing. Even the
largest digital library operations, e-text centers, and humanities
computing centers are notoriously underfunded, scraping along to
move ahead in an opportunistic fashion -- grants, soft money, finding
a computer savvy student, etc. ARTFL does NOT have "battalions
of programmers" to be certain. Even worse, when Bosak and Bray
suggest the demise of "the self-trained Web hacker", we had all better
take notice. Humanities computing is based on the work of precisely
"self-trained hackers" [Web or programming], since most of us are
researchers and scholars in different substantive disciplines.
The primary labor force for most humanities computing efforts is
students, often graduate students, working and studying in the disciplines.
The ease of learning HTML (and other simple schemes like Dublin Core),
so noted by the authors, makes it possible for students to begin
working effectively with a minimum of training. This is
of particular importance given the rapid turn-over of students in
many humanities computing efforts -- they *ARE* supposed to graduate
and move on.

Why might XML require so much more effort? XML is a simplified
meta-language. A better SGML. No debate from me here. It is.
But the devil is in the details: namely the DTDs and required software
development. They can be as simple as HTML or as complex as TEI.
Bosak and Bray offer a word of caution here:

>> XML does allow anyone to design a new, custom-built language [MVO
>> notes: I presume this is the equivalent of a DTD and style sheet],
>> but designing good languages is a challenge that should not be
>> undertaken lightly. And the design is just the beginning: the
>> meanings of your tags are not going to be obvious to other people
>> unless you write some prose to explain them, nor to computers
>> unless you write some software to process them. [p. 92]
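To make the point about software concrete, here is a minimal sketch in Python. It assumes an invented tag language: the element names letter, sender and place, the sample text, and the helper collect() are all hypothetical and are not taken from the TEI or from the SciAm article. A parser will happily read the tags, but they only become useful once someone writes code that knows what to do with them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in an invented tag language; the element names
# <letter>, <sender> and <place> are illustrative only, not TEI.
SAMPLE = """<letter>
  <sender>Voltaire</sender> wrote from <place>Ferney</place>
  to a friend in <place>Paris</place>.
</letter>"""


def collect(markup, tag):
    """Return the text of every element named `tag`, in document order."""
    root = ET.fromstring(markup)
    return [element.text for element in root.iter(tag)]


if __name__ == "__main__":
    # The parser knows nothing about what a "place" is; this line is the
    # "software to process them" that gives the tag a use.
    print(collect(SAMPLE, "place"))
```

Even this toy case needs a decision about what counts as a place and a program that acts on it, and a scheme as rich as the TEI multiplies both.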

In fact, XML does not really resolve the hard problems at all:

>> What XML does is less magical but quite effective nonetheless. It lays
>> down ground rules that clear away a layer of programming details so
>> that people with similar interests can concentrate on the hard
>> part--agreeing on how they want to represent the information they
>> commonly exchange. This is not an easy problem to solve, but it is
>> not a new one, either. [p. 92]

So, we're back to discussing DTDs and hoping that **SOMEONE** will
develop effective, inexpensive, and flexible software to handle complex
DTDs like the TEI. The potential benefits of XML must be balanced against
potential huge cost increases, costs which most humanities computing
efforts can ill afford.

One of the most important selling points of XML is that it will
require "well formed" tagging in order to guarantee proper rendering
and interoperability. This is, without doubt, an important goal.
Unfortunately, the laudable desire to encourage "wellformedness"
(is that a word?? :-) seems to be almost an end in itself, rather
than an operational goal that should be balanced by other considerations.
Tim Bray writes in another article that there is a lot of BAD HTML out
there, because most people simply code to the relatively permissive
standards of contemporary WWW browsers (such as Netscape and Explorer).
"Fortunately," Bray continues,

>> XML comes with a built-in solution.
>>
>> The XML spec says, very clearly, that if a document is supposed to
>> be XML but isn't well-formed, then it's toast. That is to say, no
>> conformant XML processor is allowed to recover, to go on and try to guess
>> what the author meant. The idea is, basically, that it's pretty easy
>> to make documents well-formed, the rewards for doing so are very high,
>> and anybody who doesn't bother is a bozo whose material should be
>> ignored anyway.
[http://developer.netscape.com/viewsource/bray_xml.html
MVO: sorry, no hard copy citation]
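The difference Bray describes can be sketched with the two parsers in the Python standard library. This is only an illustration under stated assumptions: the sample markup and the helper names is_well_formed() and html_tags() are hypothetical, and Python's HTML parser stands in for the permissive behaviour of a browser rather than reproducing Netscape or Explorer.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical samples: in SLOPPY the <b> element is never closed.
SLOPPY = "<p>Some <b>bold text</p>"
CLEAN = "<p>Some <b>bold</b> text</p>"


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup, False if it refuses."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The XML parser stops here and makes no attempt to guess.
        return False


class TagCollector(HTMLParser):
    """Permissive HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Feed markup to the lenient HTML parser and return the start tags."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(is_well_formed(CLEAN), is_well_formed(SLOPPY))
    print(html_tags(SLOPPY))
```

The XML parser rejects the sloppy fragment outright, while the HTML parser reads the same bytes without complaint, which is the trade-off between strictness and low overhead at issue here.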

Most people tend to use browsers as the way to verify HTML and
have produced huge amounts of very valuable material. Bray continues:

>> This was a controversial decision, but it was one that both
>> Netscape and Microsoft demanded of the XML committee. HTML means
>> never having to say you're sorry, which is just fine for lightweight
>> low-overhead publishing, but a really lousy basis for trying to
>> automate the Web. This decision won't change the way people work --
>> authors will continue to publish any old thing, no matter how bad, as
>> long as it looks good in Navigator -- but when you're publishing
>> XML, XML's error-handling rules will guarantee that once Navigator
>> displays something, you can be sure it's well-formed.

Bray introduces the distinction between "lightweight low-overhead"
Web development/publication -- presumably less expensive and time
consuming -- and development under XML, which he warns can be
very expensive -- dare I say "heavyweight high-overhead" :-) --
development. And he belittles those who, for whatever reasons,
continue to work in ways that are certainly proven to be effective.
The very important goal of Web automation needs to be balanced
against the utility of such automation in humanities computing
and the resources required to get there.

I am an historian and prefer doing post-mortems to predictions :-).
The computer bizness does require us to try to guess which new
things are going to pan out, and what the implications of said
developments might be. So, it is my prediction that extensive
use of the capabilities of XML will be reserved for commercial
organizations that have the resources and markets to make the
required investment feasible, and possibly some demonstration projects
in the humanities. HTML (or an XML look-alike) and other simple,
effective schemes, however, will continue to work on the Web and
continue to be the lingua franca of the scholarly community for
reasons of cost, installed base, and simplicity.

Many people are pinning great hopes on XML. But XML is still
largely in the early stages of deployment and we have yet to
see or digest all of the implications. As one of little faith --
a doubting Thomas to be sure -- I will wait to see what actually
happens. I do hope that the TEI Consortium and scholars in
humanities computing will not get too caught up in the hype and will exercise
their critical faculties, not just on the technical merits, but
also including the variety of other factors that impact real research
and development in the discipline.

Mark

Mark Olsen
ARTFL Project
University of Chicago
WWW: http://humanities.uchicago.edu/ARTFL/
