10.0895 snowing on the parade

WILLARD MCCARTY (willard.mccarty@kcl.ac.uk)
Fri, 25 Apr 1997 23:40:20 +0100 (BST)

Humanist Discussion Group, Vol. 10, No. 895.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
Centre for Computing in the Humanities, King's College London
Information at http://www.princeton.edu/~mccarty/humanist/

Date: Fri, 25 Apr 1997 11:57:00 -0500
From: Mark Olsen <mark@barkov.uchicago.edu>
Subject: Hi there

[The reference to "mid April" in this note was timely when Mark Olsen sent
it to me but isn't now. Apparently my fault, for which all apologies.
Sometimes the difference between intending to do something (like sending
out a message) and actually doing it escapes my notice.... WM]

Hi there,

No, it's not raining. I'm in Chicago. It's mid-April. So, it's
snowing. :-( Rather than try to comment on 500 lines of reactions to
my last post, I'd like to reply to some general points, which
popped up in the replies.

Claim: TEI is NOT a standard, but recommendations/guidelines.

I suspect that Michael and others considerably under-estimate the AUTHORITY
that comes from a well organized project, with considerable
international institutional support, now backed up with significant
educational efforts and high visibility in funding agencies.
The TEI has become an authoritative standard by virtue of the impressive
activity of the TEI editors, sub-committee members, and supporters.
Further, the authority is well deserved, because the TEI community
HAS done groundbreaking work in systematically analyzing the complexities
of text in all of its various forms (a task for which the editors
and committee members should be highly commended). With authority,
however, come responsibilities, particularly when you leave the realm
of research to teach others the conclusions. Michael argues that
"waiting to use or teach the TEI until everyone in the community
subscribes to it is such a silly tomfool idea". I would counter that
until the specification has been tested in the real world as BOTH
a discussion of textual complexity AND an interchange standard, one
should proceed cautiously, particularly when teaching it to others.

Claim: TEI is being used to build large and small databases

This is true. And before TEI, we had a number of specifications
which we used to develop databases of various sizes, typically
based on formats required by popular software or developed by
particular projects. The real test of an INTERCHANGE format, however,
is not that one can build a large or small database, but that the
format can be automatically converted TO and FROM any number of systems
with a minimum of effort. My principal objection to TEI is that
it is by far the most difficult representation to convert into
something else, because of its expressive power. The more tightly
constrained a specification, the easier it is to write converters.
It is a BALANCING act, which I do not believe the TEI community has --
because of its make-up and structure -- really tried to perform.

My great fear is that we are creating a Tower of TEI Babel, where
automatic conversions will be practically impossible. At Bergen, I
learned that the developer of TACT -- probably the most widely used
analytical system in the discipline -- wound up writing a small programming
language around TACT because he could not predict the encoding of features
that he needs for internal system purposes. At ARTFL, we have encountered
the same problem. This means that users of TEI conformant documents may
have to write programs/filters/etc to convert particular TEI documents into
whatever their analytic systems require. Hardly a strong recommendation
for an interchange standard. The real field test is to attempt to gather TEI
conformant documents from many sources, large and small, and
attempt to convert them to another representation automatically.
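
A minimal sketch of such a field test, in Python, may make the point
concrete. Everything in it is an assumption made for illustration: the
three fragments are invented (not taken from any real project), the
RULES table and the underscore marker stand in for whatever a target
system requires, and the names convert and field_test are hypothetical.
It converts each fragment with one fixed rule table and reports the
encodings the table does not cover.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: (element name, rend value) -> marker that the
# imaginary target format uses for an italicized title.
RULES = {
    ("title", None): "_",
    ("hi", "italic"): "_",
}

# Invented fragments, one per imaginary project, each marking up the
# same book title in a different but plausible way.
SAMPLES = {
    "project_a": "<p>He read <title>Candide</title> twice.</p>",
    "project_b": '<p>He read <hi rend="italic">Candide</hi> twice.</p>',
    "project_c": '<p>He read <emph rend="it">Candide</emph> twice.</p>',
}


def convert(xml_text):
    """Return (converted_text, unhandled), where unhandled lists the
    (tag, rend) pairs for which the rule table has no entry."""
    root = ET.fromstring(xml_text)
    unhandled = []

    def walk(node):
        out = node.text or ""
        for child in node:
            key = (child.tag, child.get("rend"))
            marker = RULES.get(key)
            if marker is None:
                # No rule: record the gap and pass the text through unmarked.
                unhandled.append(key)
                marker = ""
            out += marker + walk(child) + marker + (child.tail or "")
        return out

    return walk(root), unhandled


def field_test(samples):
    """Convert every sample; map each source to its unhandled encodings."""
    return {name: convert(text)[1] for name, text in samples.items()}


if __name__ == "__main__":
    for name, missing in field_test(SAMPLES).items():
        status = "OK" if not missing else f"needs new rules for {missing}"
        print(name, status)
```

In this toy case two rules already cover two of the three sources, and
the third silently loses its markup until someone writes another rule.
Each additional permitted way of encoding the same feature adds to the
table, which is the sense in which a tighter specification makes
converters easier to write.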

Claim: The TEI has been peer reviewed.

This is an odd claim, and may be a matter of semantics. The TEI
has been, both to its credit and in some ways to its detriment,
an extraordinarily open effort. The editors and committees took
into account views from an extraordinarily wide set of interests.
A good thing. But it has led to a specification that is so
compliant, so flexible, as to be almost completely open-ended.
But openness to comment is not a review process. A review process
would require some organized field testing, some ways of analyzing
the performance of the specification as a whole. As soon as even
initial drafts were completed, TEI was launched by running workshops
and seminars on how to use it. This is an odd procedure, since
there was really no effort to test the specification in general.
If there had been, I think the kinds of problems that are encountered
now in dealing with TEI documents could have been avoided. The
real problem is, in my opinion, a failure to recognize the human
element in building textbases (or writing computer programs). In the
face of any number of ways to do something, humans will wind up
using them all. Twenty-five years ago, we went thru "structured
programming" because the flexibility of earlier programming languages
produced workable, but undecipherable, programs. Even a simple
field test -- which I tried to do for the Bergen paper -- of taking
documents encoded separately at different locations for different
projects would have revealed the problem.

Claim: Encoding teaches us about text and/or an individual text.

True at one level. The TEI has taught ME that text is or can be
hugely complex. And they have inventoried/described an impressive
array of possibilities. But I look at computers as LABOR SAVING
devices first and foremost. I get a sense that in many ways text
ENCODING has become an end in itself, rather than a means to do two
things: 1) perform relatively short-term oriented analytical research
by an individual or team of researchers and 2) allow inexpensive
and easy INTERCHANGE of these raw materials between individuals or
teams of researchers. My own approach has been to perform extensive
analytical tagging automatically from clues in typescripts. Text
encoding is, at best, drudge work that we now employ graduate students
to do (what a horrifying waste of talent if you ask me). If we want
to teach graduate students about texts, have them read 'em a couple
of times rather than send them thru picking out features.

And now to my TEI FAQ, based on questions I get all the time, given
my politically incorrect views:

-- If TEI is not good, what should I do?

Right now, I'm telling people to use the native encoding mechanisms
for their target software, applications, and research/publication
processes. I fully expect that the TEI community, or someone else,
will write a workable interchange format, almost certainly based on
TEI. Patience may be required since we're now waiting for XML. TEI
is a great context in which to think about the encoding of a text,
but given the fact that there is little software out there that
works with the specification and it's hard to convert to other
representations, go with a short-term solution.

-- Is TEI dead?

No, it's not only very much alive, but vital to the entire discipline
(if it were dead or unimportant, I wouldn't bother toasting it so hard :-).
Data specification standards are what make the Internet, library
card catalogue systems, delimited field databases, and so on... possible.
They facilitate the ready and automatic transfer and reformatting
of data. It is my belief that real progress can be made by
establishing specific and exact standards. I use the MARC record
as an example, but there are many others. TEI (or more broadly,
SGML/XML) is the logical place to start.

-- Should TEI be taught in workshops/seminars?

Not at this time. I have sat in on several of these and the
basic model is to have the student sit down with an SGML editor
and tag a small sample of text to make sure that it is TEI
conformant. The implicit assumption is that whatever is tagged,
however it is encoded, WILL BE USABLE! This is almost certainly
incorrect, given the lack of software or reliable ways to export
the information to other, particularly non-SGML, systems. Until
TEI produces a reliable export model, it is premature to teach it.

Well, that's it for now. It's Chicago, and more snow is forecast.
I suspect that I can bet on more TEI commentary too. Just remember,
guys, I'm blasting away at TEI in order to make it work, 'cause I
NEED TEI to work! We all do!

Mark

Mark Olsen
Assistant Director
ARTFL Project
University of Chicago
(773) 702-8687
WWW: http://humanities.uchicago.edu/ARTFL/ARTFL.html

Nothing will ever be attempted if all possible objections
must first be overcome. --- Samuel Johnson