12.0608 XML & TEI

Humanist Discussion Group (humanist@kcl.ac.uk)
Sun, 2 May 1999 21:35:31 +0100 (BST)

Humanist Discussion Group, Vol. 12, No. 608.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: C M Sperberg-McQueen <cmsmcq@acm.org> (93)
Subject: Re: 12.0605 gadfly notes on XML & TEI

[2] From: Patrick Durusau <pdurusau@emory.edu> (160)
Subject: Re: 12.0605 gadfly notes on XML & TEI

[3] From: aimeefreak <ahm@gpu.srv.ualberta.ca> (37)
Subject: Re: 12.0605 gadfly notes on XML & TEI

--[1]------------------------------------------------------------------
Date: Sun, 02 May 1999 21:31:37 +0100
From: C M Sperberg-McQueen <cmsmcq@acm.org>
Subject: Re: 12.0605 gadfly notes on XML & TEI

On Sat, 1 May 1999 08:23:56 +0100 (BST), Mark Olsen wrote (in Humanist
12.0605):

>The TEI Consortium sees a large part of its mission and future
>implicated in the early deployment of XML:

Future, yes; mission, perhaps. XML is certainly an important
opportunity for everyone with any interest in being able to use
off-the-shelf software for humanities computing.

>A critical assumption here is that the primary use of TEI and other
>encoded texts is READING documents online. Rather than read TEI
>documents in either rendered HTML or using an SGML browser, XML will
>allow us to read TEI encoded documents in a browser. The implication
>is that the primary focus of humanities computing should be electronic
>publication. It is not clear to me that this is the case.

On this, Mark and I agree. (Does that mean he has failed as a
gadfly? Hmmm.) But it seems to me that the situation is more
complex than Mark seems to think.

- It may or may not be the case that electronic publishing should
be the primary focus of humanities computing. Mark and I both think
it should not.
- Mark and I, however, don't make the rules in humanities computing,
and humanists will continue to focus on whatever they jolly well
please, whatever Mark and I may say. It is an empirical fact that a
lot of people in humanities have a keen interest in electronic
publication of their work, whether in the form of primary texts or as
secondary works. There are whole branches of literature, history,
theology, and other disciplines devoted to publication of literary
works and documents; it would be surprising and scandalous if
scholarly editors who use computers were NOT interested in electronic
publication.
- It is also possible to be keenly interested in electronic
publication without believing that it is, or should be, the primary
focus of humanities computing. (In much the same way, it is possible
to be keenly interested in publishing books and articles while still
believing that the primary focus of, say, literary study is or should
be the understanding of literature, and without confusing
understanding with publication.)
- David Chesnutt and Allen Renear do not, in any case, argue that
publication should be the primary use of TEI-encoded texts. Chesnutt
speaks of the increased amount of markup-aware software. He
explicitly mentions browsers such as Internet Explorer 5 and Mozilla,
but other off-the-shelf software is equally important. There are XML
editors (XMetal from SoftQuad, Documentor from Excosoft, XED from the
Language Technology Group in Edinburgh, and David Megginson's XML
adaptation of Lennart Staflin's psgml.el for emacs are the ones that
come to my mind off the bat, but I have not been monitoring the market
very actively lately), search tools (e.g. sggrep from the LTG in
Edinburgh, or sgrep from the Document Management Group at the
University of Helsinki), and other tools (libraries in Perl, Python,
and who knows how many other languages). All will be useful to
those in humanities computing. Allen Renear suggests that users
will move to TEI from HTML when they see what off-the-shelf software
can do with TEI. Style-sheets are one obvious application here,
but not the only one.

>Renear and
>Chesnutt are only addressing reading here, but their own work shows
>that searching and analysis are crucial elements of textual research.
>Searching and analysis, that might be combined with e-text "publication",
>seems to be more consistent with the long tradition of humanities computing
>scholarship. But I digress (tho' this is an important digression).

It is indeed an important digression: the ability to search documents
using the markup to help control the search is one of the most
important advantages of heavy markup over light markup. That
advantage is sometimes large enough to be worth the cost of the
heavier markup, and sometimes not. But it's nice to be able to make
the choice, instead of having it taken out of one's hands by
systems that cannot understand the markup that is present.
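
A minimal sketch in Python may make the point concrete. The sample text
and the helper name lines_spoken_by are assumptions made up for
illustration; the element names (sp, speaker, l, stage) follow TEI
conventions for drama, but the fragment is not a complete TEI document.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; not a complete or valid TEI document.
SAMPLE = """
<div type="scene">
  <sp><speaker>Hamlet</speaker>
    <l>To be, or not to be, that is the question:</l>
    <l>Whether 'tis nobler in the mind to suffer</l>
  </sp>
  <sp><speaker>Ophelia</speaker>
    <l>Good my lord, how does your honour for this many a day?</l>
  </sp>
  <stage>Enter a messenger, not to be trusted.</stage>
</div>
"""

def lines_spoken_by(root, speaker, phrase):
    """Return verse lines containing `phrase`, restricted to one speaker."""
    hits = []
    for sp in root.iter("sp"):
        if sp.findtext("speaker") != speaker:
            continue
        for line in sp.iter("l"):
            text = "".join(line.itertext())
            if phrase.lower() in text.lower():
                hits.append(text)
    return hits

root = ET.fromstring(SAMPLE)
# A plain string search for "to be" would also match the stage direction;
# the markup lets the search ask only for Hamlet's verse lines.
hamlet_hits = lines_spoken_by(root, "Hamlet", "to be")
```

With light markup (or none), the stage direction and the speech are
indistinguishable to the search tool; with the heavier markup, the
restriction is a few lines of code.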

>It is NOT clear to me that XML is going to be the solution that the
>TEI Consortium hopes because XML may require dramatic increases
>in costs. Bosak and Bray conclude the SciAm article, writing:

XML in itself does not require any increase in cost. Since it
offers more opportunities, exploiting it fully may well cost more
than exploiting HTML fully. That's where commercial enterprises
will need battalions of programmers.

>In fact, XML does not really resolve the hard problems at all:

No more than the TEI does. Scholars still have to understand
their texts, and still have to decide what they care about.
Neither TEI nor XML changes that, and neither should be thought to
be a silver bullet. They are merely good ways of helping make it
possible to *mark up* what you care about, in order to make software
do what you want it to do.

> ... I do hope that the TEI Consortium and scholars in
>humanities computing not get too caught up in the hype and exercise
>their critical faculties, not just on the technical merits, but
>also including the variety of other factors that impact real research
>and development in the discipline.

Sound advice: exercise your critical faculties. Enlightenment
is the liberation of the mind from self-imposed incapabilities.

-C. M. Sperberg-McQueen
Co-chair, W3C XML Schema Work Group
Senior Research Programmer, University of Illinois at Chicago
Editor, ACH/ACL/ALLC Text Encoding Initiative
Co-coordinator, Model Editions Partnership

N.B. My remarks represent my own opinions, not necessarily those of
W3C, UIC, the TEI Consortium, or MEP.

--[2]------------------------------------------------------------------
Date: Sun, 02 May 1999 21:32:12 +0100
From: Patrick Durusau <pdurusau@emory.edu>
Subject: Re: 12.0605 gadfly notes on XML & TEI

In his role as "gadfly" Mark Olsen makes a number of explicit (and
implicit) claims about the importance of XML for the use of the TEI
Guidelines by humanities scholars. Some of those claims warrant further
discussion while others are factual errors.

1. TEI focuses on electronic publication

Olsen quotes Allen Renear (Brown University) as saying that the advent of
XML will mean that scholars can be shown the usefulness of the level of
encoding available from standards such as the TEI Guidelines and not
just assured that when tools become available the texts will be useful.
That does not necessarily lead to Olsen's conclusion:

>A critical assumption here is that the primary use of TEI and other
>encoded texts is READING documents online. Rather than read TEI
>documents in either rendered HTML or using an SGML browser, XML will
>allow us to read TEI encoded documents in a browser. The implication
>is that the primary focus of humanities computing should be electronic
>publication. It is not clear to me that this is the case.

I cannot speak for Allen Renear or anyone else working with TEI but I can
report from personal experience that it is easier to demonstrate the
utility of TEI encoding if I have a display mechanism for the encoded
text. Linguistic corpora experts have the computer background to
appreciate encoding schemes while textual critics usually do not.
(Apologies in advance to all the text critics who have such a
background.) The proper encoding of a text lends itself to a number of
uses, only one of which is electronic publication, and to suggest that
TEI primarily has an "electronic publication" focus is simply incorrect.

2. XML is as complex as SGML

Olsen quotes Bosak and Bray to illustrate the point that developing an
encoding standard will be just as complex in XML as it was in SGML.

>> XML does allow anyone to design a new, custom-built language [MVO
>> notes: I presume this is the equivalent of a DTD and style sheet],
>> but designing good languages is a challenge that should not be
>> undertaken lightly. And the design is just the beginning: the
>> meanings of your tags are not going to be obvious to other people
>> unless you write some prose to explain them, nor to computers
>> unless you write some software to process them. [p. 92]

>In fact, XML does not really resolve the hard problems at all:

>> What XML does is less magical but quite effective nonetheless. It
>> lays down ground rules that clear away a layer of programming details so
>> that people with similar interests can concentrate on the hard
>> part--agreeing on how they want to represent the information they
>> commonly exchange. This is not an easy problem to solve, but it is
>> not a new one, either. [p. 92]

The point is correct but irrelevant since the TEI Guidelines have already
developed the encoding standard for a substantial body of materials and
only await use and extension by scholars working in the various
humanities disciplines. The TEI Guidelines will probably be extended by a
minority of scholars for use by a larger community but then my home
computer architecture has been advanced in a similar fashion. Not being
able to design a motherboard does not disqualify me from using the
computer to my advantage. Scholars will need to be trained in applying TEI
elements to properly encode a text and not in the arcane rules of DTD
construction to make use of the TEI Guidelines.

3. Cost to humanities projects

Olsen quotes from the recent Bosak and Bray article in Scientific American
("XML and the Second Generation Web", SciAm May 99,
http://www.sciam.com/1999/0599issue/0599bosak.html) to support his claim
that the use of XML will dramatically increase costs in the humanities,
because, in the authors' words:

>>designers will need to be versed not just in the
>> production of words and graphics but also in the construction of
>> multilayered, interdependent systems of DTDs, data trees, hyperlink
>> structures, metadata and stylesheets--a more robust infrastructure
>> for the Web's second generation.

One component of the cost of a technology is the cost of the software to
use it. A quick visit to Robin Cover's excellent SGML/XML Web page:
http://www.oasis-open.org/cover/, will reveal that while some software
is available for SGML, there is a wealth of software for XML on a
variety of platforms and in a number of programming languages. There are
a number of sources of robust software packages available for free for
academic projects ranging from the LT XML package (Henry Thompson,
Language Technology Group, Edinburgh) to "IBM XML for Java" (Kent
Tamura, Tokyo Research Laboratory, IBM Japan) and numerous Perl and
Python software packages. From the software standpoint there is little
evidence to support the envisioned increased costs posed by Olsen for
the humanities.

Olsen's claim that the use of XML will lead to "potential huge cost
increases,...." seems to consist of two components: first, the cost of
software for using the TEI Guidelines (answered above), and second, the
cost of training humanists to use a more complex encoding scheme than
HTML.

On the training of humanists Olsen notes:

>Humanities computing is based on the work of precisely
>"self-trained hackers" [Web or programming], since most of us are
>researchers and scholars in different substantive disciplines.
>The primary labor force for most humanities computing efforts are
>students, often graduate students, working and studying in the
>disciplines.
>The ease of learning HTML (and other simple schemes like Dublin Core),
>so noted by the authors, makes it possible for students to begin
>working effectively with a minimum of training.

Viewed in their entirety, the TEI Guidelines are an imposing 1200+ page
set of complex and sometimes obscure encoding guidelines. But any
particular project may require the graduate students working on it to use
only 30-40 elements (a guess; I have done no research on this point). One
would assume that all graduate students could be taught by tutorials and
examples to effectively apply a small set of elements for a given
project. The person preparing the list of elements for use by the
graduate students would have to have a greater degree of knowledge than
the students but that should be true for the project in general.
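
As a rough sketch of how a project might keep its encoders within such a
subset, the Python below lists any elements in a document that fall
outside an agreed list. The element list, the sample document, and the
function name elements_outside_subset are all hypothetical; a real
project would more likely enforce its subset with a customized DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical project subset; a real project would choose its own list.
ALLOWED = {"text", "body", "div", "head", "p", "lg", "l",
           "name", "date", "note"}

def elements_outside_subset(xml_text, allowed=ALLOWED):
    """Return the sorted names of elements used but not in the subset."""
    root = ET.fromstring(xml_text)
    used = {el.tag for el in root.iter()}
    return sorted(used - allowed)

# Made-up sample document for illustration.
DOC = """<text><body><div><head>Letter 1</head>
<p>Written by <name>A. Correspondent</name> on <date>1776</date>,
with a <hi rend="italic">flourish</hi>.</p></div></body></text>"""

# Any name reported here is one the students were not trained to use.
unexpected = elements_outside_subset(DOC)
```

A check of this kind gives students immediate feedback without requiring
them to read the full Guidelines or to understand DTD construction.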

The point here is that neither the TEI Guidelines nor XML should be
criticized as too complex when there has been no effort to train any
graduate students in their use or in the use of a TEI subset. Like the
perennial criticism that there are some textual structures that TEI
cannot properly encode (even with extensions to the Guidelines) the
defeatist view of encoding complexity is more an article of faith than
fact. I am hopeful that generalized tutorial materials will be one of
the first deliverables to appear from the TEI Consortium as those will
go a long way to dispelling the aura of complexity that surrounds the
TEI Guidelines.

4. Continued use of HTML

Olsen predicts that "HTML (or an XML look-alike) will continue to work on
the Web and continue to be the lingua franca of the scholarly community
for reasons of cost, installed base, and simplicity." It was not so many
years ago that HTML and web based resources were not an accepted part of
the scholarly community. There is no reason why academic communities
cannot develop powerful yet simple encoding subsets of the TEI Guidelines
to become an even more powerful lingua franca for their area of study.
HTML encoding is better than nothing but to remain within its limitations
simply because it is now a common skill and there is cheap labor for its
use seems contrary to any effort to produce long lasting scholarly texts
for present and future research.

Funding agencies will need a vast education on the expenses and
requirements of proper encoding schemes but I am hopeful that will be
one of the activities of the TEI Consortium. I foresee a time when any
critical edition text will be required by the granting agency to be
prepared, if not published, in SGML/XML encoding so its contents will
not be buried in the confines of hard cover editions. (Here you can
insert whatever "texts of the past" type of project comes to mind.)

Conclusion

Mark Olsen and the ARTFL Project should be commended for their work
using HTML markup (and the PhiloLogic database engine and other
scholarly contributions too numerous to list here). Those not familiar
with the project should take the time necessary to become familiar with
it. But I disagree with some of his predictions on the use of the TEI
Guidelines, the impact of XML on their acceptance and the utility of
retaining HTML as a markup standard.

There are a number of positive steps that scholars can take to advance
the use of TEI and XML in their respective fields of study which
include:

1. Lobby your institution to join the TEI Consortium.

2. Volunteer to participate in the creation of training materials for
TEI in encoding scholarly materials.

3. Develop software as part of university sponsored projects to
facilitate the use of TEI and XML in encoding projects.

4. Urge graduate students to develop some degree of competence in the
use of markup languages as part of their basic skill set in your
discipline. (I saw a statistical analysis presented at a scholarly
conference that asserted certain factors were significant because the
manual for the statistical software said they were significant. I assume
we want to avoid a similar situation with markup languages.)

Scholars should take a critical view of new technologies but not assume
that the unfamiliar is too complex or too costly to be useful. SGML has
been used to provide access unimagined by prior generations of scholars to
works such as the Patrologia Latina. The widespread use of XML may lead to
better tools and more texts for scholars but only if scholars participate
in that process.

Patrick

Patrick Durusau
Information Technology
Scholars Press
Pdurusau@emory.edu

--[3]------------------------------------------------------------------
Date: Sun, 02 May 1999 21:31:54 +0100
From: aimeefreak <ahm@gpu.srv.ualberta.ca>
Subject: Re: 12.0605 gadfly notes on XML & TEI

as one of the 'computer savvy students' referred to in the xml email, i
must loudly assert my eagerness for xml ubiquity. i just wrote, actually,
a paper on how impractical html is for those of us who want to use the
machines to do lit work -- the language is so dumb, all you can do with it
is format, and not really encode at all. which is like typing when you can
word-process. maybe i'm just spoiled from extreme exposure to sgml and its
authoring/display tools as an ra on the orlando project, but as a student
of literature, i just can't see myself wasting my time on 'web page'
projects when i could actually be tagging smart, tagging critically,
tagging meaning instead of format.

the process of web-site-building is unnecessarily dumbed down by its tools,
and while it is great that my work thus published becomes accessible to
many, and that i am able to produce a scholarly 'thing' i would not
otherwise be able to, these benefits -- for us student types with no time,
quick project turnaround, and crappy financial resources -- are really
outweighed by the sheer brain-killing nature of having to tag everything
over again if you want to display it in another way, if you want to move it
from place to place, or repeat it. also, the whole html slapdash mentality
makes for a very dependent product, a one-shot deal that needs to be
rewritten for another platform.
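
a rough sketch of the alternative, in python: tag the text once for what
it is, and derive each display from that single source. the element
names (poem, title, l), the sample text, and both function names are
made up for illustration, not taken from any real project.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up descriptive markup: it says what things are, not how they look.
SOURCE = ("<poem><title>Untitled</title>"
          "<l>first line</l><l>second line</l></poem>")

def to_html(poem):
    """Render the poem for a web browser."""
    title = escape(poem.findtext("title", default=""))
    lines = "<br/>".join(escape(l.text or "") for l in poem.iter("l"))
    return f"<h2>{title}</h2><p>{lines}</p>"

def to_plain(poem):
    """Render the same poem as indented plain text."""
    title = poem.findtext("title", default="")
    body = "\n".join("  " + (l.text or "") for l in poem.iter("l"))
    return f"{title.upper()}\n{body}"

poem = ET.fromstring(SOURCE)
html_view = to_html(poem)    # one display of the source
plain_view = to_plain(poem)  # another display, with no re-tagging
```

changing the display means changing a small function, not tagging
everything over again.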

the time you spend articulating a dtd is brain work, is scholarly and
intellectual (maybe not so much the style sheet ;> ) while the time you
spend 'marking up' for web display in html is typing, is like a fancy cover
page on a plain old essay.

i built a site for a course at the u of a with susan hockey, an
experimental student scholarly site, which made me declare i would never do
this type of humanities computing for a literature course if i had to use
the same tools. i concluded there that widespread xml support (via
browsers and cheap-o authoring tools [well, on this last i admit i am
dreaming the big dream]) was the only way to fly, from the student
perspective. you can look -

http://www.humanities.ualberta.ca/aimee_morrison/framed

have lovely weekends ... aimeefreak
--------------------
aimee morrison
phd program, dept of english
university of alberta
edmonton, ab
