13.0014 TEI & more from the gadfly

Humanist Discussion Group (humanist@kcl.ac.uk)
Thu, 13 May 1999 20:35:44 +0100 (BST)

Humanist Discussion Group, Vol. 13, No. 14.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

Date: Thu, 13 May 1999 20:36:09 +0100
From: Mark Olsen <mark@barkov.uchicago.edu>
Subject: TEI & the Gadfly's buzz

In his reply to my post (Humanist 13.0002) of May 7, Michael
suggests that while he did not have me in mind, my position
is a perfect example of what puzzles him about the cool reaction
of software developers in humanities computing to the TEI. I
have to admit that I am both glad to oblige :-) -- since I think
the discussion might be useful -- and very happy to find that I
am not the only developer to inform Michael that they are having
difficulties with the TEI. It may be that developers in humanities
computing are running into a set of problems which need to be
examined with some care.

Michael points out that there are lots of free SGML parsers
out there. I agree. We use one EXTENSIVELY! And looked at others.
But an SGML parser/verifier is a rather limited piece of
software, hardly the kind of system that will allow humanists
to accomplish much substantive work. He concludes:
>> The myth of the incredible expense needed just to acquire
>> SGML software is merely a myth. That companies charge high prices for
>> systems is a consequence of (a) the high utility of those systems to
>> users who can afford those prices, and (b) the laws of the market
>> economy. Welcome to capitalism.

I believe he might have forgotten the "rational agent" assumption of
market capitalism. The software that can be acquired at low cost
clearly does not serve the needs of many users, who, if they can
afford it, purchase very expensive systems. Are all these people
irrational agents, simply wanting to spend a lot of money for things
that can be acquired freely or for little cost? I doubt it. And
if such usable tools are going to be very expensive -- as Tim
Bray warns us when talking about the "battalions of programmers"
required to make XML effective -- should not that consideration
be raised when examining the TEI?

So why the disconnect between the availability of free SGML
parsers and the acknowledged expense and scarcity of useful
tools? That was the point of my post. By outlining my
considerations in the context of development of PhiloLogic at
ARTFL I was hoping to shed some light on the issue that
puzzles Michael, namely the cool reception of TEI among developers
in humanities computing.

I must have missed the boat since Michael suggests that my
>> reasoning is identical to that of the person who does
>> not want to have to think about the rational numbers, since there are
>> so many of them, and who decides instead to limit discussion to the
>> integers, so as to have fewer to think about.
Yes, both are infinite sets. However, the integers between 1 and 10 are
finite in number, if I recall my math from high school ;-), while the
rational numbers between 1 and 10 are infinite. I KNEW I should
have gone to math class more often....

So, let me try again. Michael writes that
>> Users are not required to use every element type in the TEI encoding
>> scheme. I don't see why software developers are required to do
>> anything clever with every element type either.
and goes on to point out that "many applications can legitimately
ignore lots of them". Agreed. Developers are, at the very
least, going to pick and choose what they are going to handle
and how they might handle it. Currently, PhiloLogic simply
ignores any SGML/HTML tagging that it is not programmed to
recognize, passing it all to the client software (tho' we might
do something to render it on output if it is appropriate).
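
To make the pass-through behaviour described above concrete, here is a
minimal sketch in Python. It is not PhiloLogic's code: the RECOGNIZED
set, the load() function and the token names are all assumptions made
up for illustration, and the regular expression ignores SGML features
such as omitted end tags or attribute values that contain ">".

```python
import re

# Hypothetical subset of element names this loader is programmed to
# recognize. Everything else is passed through untouched.
RECOGNIZED = {"div", "p", "head"}

# Matches a start or end tag; captures the optional slash and the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def load(text):
    """Split text into (kind, value) tokens.

    Recognized tags become ("open", name) or ("close", name). Character
    data and tags the loader does not know are emitted as ("raw", text),
    so that client software can decide what to do with them.
    """
    tokens = []
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            tokens.append(("raw", text[pos:m.start()]))
        name = m.group(2).lower()
        if name in RECOGNIZED:
            tokens.append(("close" if m.group(1) else "open", name))
        else:
            tokens.append(("raw", m.group(0)))  # unknown tag: pass through
        pos = m.end()
    if pos < len(text):
        tokens.append(("raw", text[pos:]))
    return tokens

sample = "<p>The <rdg wit='A'>reading</rdg> stands.</p>"
tokens = load(sample)
```

In this sketch the P element is recognized, while the RDG start and
end tags survive only as raw strings: they are carried along but
nothing is done with them.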

This is a vital point which I do not believe has been adequately
advertised: encoding something that you can see in Panorama or detect
in an SGML parser in no way assures that there will ever be
software to let you do something further with it! Now, Michael takes
this to mean that I am
>> arguing that, because he does not know what his
>> software should do with an APP or a LEM or a RDG element, I should
>> not use them. Why on earth not?
I am not saying don't use them, or any other encoding for that matter.
It's your dime. ;-) What Michael is making very clear is that he
expects developers in humanities computing to be picking and
choosing subsets, possibly very small subsets, of TEI encoding
to develop software for. TEI conformance in no way warrants
that there will be software to make use of whatever is tagged.

But I digress (tho' it is an important digression). We have now
established that developers will, at least for some time, be
picking and choosing the encoding subsets that they will develop
software for. The rest of Michael's post, which I think is
both very informative and intelligent, offers his approach
to picking and choosing encoding to process. He wonders why I
would think that we would have to effectively write an SGML parser
in order to load text databases. Simply put, one cannot pick and choose
the encoding one cares to develop software for without writing
a recognizer that effectively mimics much of what an SGML parser
does. This is particularly true of TEI, which permits so many
variations in even the most basic encodings, a point which I
touched on in a 1996 talk at the ALLC/ACH meeting in Bergen, Norway.

Michael's discussion assumes the use of an SGML parser to
facilitate recognition. It is our experience, however,
that the use of these parsers is considerably less automatic
than one would hope. In fact, Michael has a good sense of this
as he indicates in points 1-3 of his post, since it is akin to
writing style sheets.
>> You have to make these decisions, or analogous ones, for every element
>> type, pretty much no matter what your software is doing and pretty
>> much no matter what your markup system looks like.
You have to make many decisions on all levels to identify what you
want to process and convert it into something that your software can
handle.

Since we are agreed that you have to identify everything that
you want to process and figure out what you want to do with it,
the real question then boils down to where you want to hang the
SGML parser? Inside or outside? That decision comes down to money.
It is certainly technically feasible to marry an existing SGML
parser to a full text loader -- in fact, the ARTFL mafia revisited that
issue over many martinis last Friday :-) -- but I decided that the
process would be difficult, expensive, and not add very much at all
to the final system. So I adopted a two-step process: conversion
of SGML -- EAD, TEI, C-H, etc. -- to ATE, which is then loaded.
Since I wanted to come
up with a simple encoding, I decided finally to adopt a scheme that
is widely used, is very simple, stays within the skills of humanities
students for *internal* database development -- another very important
consideration -- and can be processed without an SGML parser. ATE
can, of course, be used directly bypassing the SGML step completely,
as we have done for some of our internal data entry projects.
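
The two-step idea can be sketched roughly as follows, again in Python.
This is an illustration under stated assumptions, not the ARTFL
converter: the SIMPLE mapping and its target tag names are invented
here and are not the actual ATE tag set, and the regular expressions
do not cope with SGML tag minimization, entities, or attribute values
containing ">".

```python
import re

# Hypothetical mapping from a few source element names to a simpler
# target vocabulary. The target names are invented for illustration.
SIMPLE = {"div1": "div", "div2": "div", "head": "head", "p": "p", "l": "line"}

# Matches a start or end tag; captures the optional slash and the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def simplify(text):
    """Step 1: rewrite mapped tags in simplified form, leave the rest."""
    def rewrite(m):
        target = SIMPLE.get(m.group(2).lower())
        if target is None:
            return m.group(0)                   # unmapped tag: unchanged
        return "<" + m.group(1) + target + ">"  # attributes are dropped
    return TAG.sub(rewrite, text)

def load_lines(simple_text):
    """Step 2: read <line> contents with no SGML parser involved."""
    return re.findall(r"<line>(.*?)</line>", simple_text, flags=re.S)

source = ('<DIV1 TYPE="poem"><HEAD>Song</HEAD>'
          '<L N="1">First <HI>line</HI></L><L>Second line</L></DIV1>')
simple = simplify(source)
lines = load_lines(simple)
```

The second step only ever sees the small simplified vocabulary, which
is what keeps it within reach of a loader that has no SGML parser. The
cost is that every variation in the source encoding has to be
anticipated in the first step.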

Over last Friday's martinis :-), we decided that there were other,
more pressing and interesting problems to tackle given our
limited resources than integration of an SGML parser into PhiloLogic,
such as implementing Unicode in order to handle MANY languages and
querying across multiple databases. There are many issues in text
computing that have very little to do with encoding or even SGML/XML and
the TEI. I suspect that part of what puzzles Michael about the cool
reception of the TEI among humanities computing developers is that some
of us may not want to invest significant effort and money in handling
text encoding because we have other projects that are of greater
interest. I had hoped, when the TEI started, that a text encoding
standard would allow me to **REDUCE** the cost of importing and
exporting large numbers of texts, freeing up scarce resources for
work on other efforts.

M

[P.S. I would love to carry this on, but am looking forward to some
well deserved time away from computers, phones, etc., so I guess
it'll have to wait until the ACH-ALLC meetings at UVa. I'm rather
fatigued myself, Michael ;-).]

Mark Olsen
ARTFL Project
University of Chicago
