13.0003 TEI, gadfly & the individual scholar

Humanist Discussion Group (humanist@kcl.ac.uk)
Sun, 9 May 1999 22:43:18 +0100 (BST)

Humanist Discussion Group, Vol. 13, No. 3.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: C M Sperberg-McQueen <cmsmcq@acm.org> (198)
Subject: Re: 13.0002 TEI & the Gadfly's buzz

[2] From: Wendell Piez <wapiez@mulberrytech.com> (105)
Subject: Re: 13.0002 TEI & the Gadfly's buzz

[3] From: <cbf@socrates.berkeley.edu>
Subject: Re: 12.0616 TEI & the individual scholar; research = display

--[1]------------------------------------------------------------------
Date: Sun, 09 May 1999 22:42:01 +0100
From: C M Sperberg-McQueen <cmsmcq@acm.org>
Subject: Re: 13.0002 TEI & the Gadfly's buzz

In Humanist 13.0002, on 7 May 1999, Mark Olsen surmises that he is one
of those I had in mind when I said (Humanist 12.0610) that the
approach taken by software developers has puzzled me, and continues to
puzzle me. I didn't actually have Mark in mind when I wrote that
sentence, but his note is indeed a perfect example of what I meant.

He writes, for example,

> I decided that any attempt
>to build recognizers that would handle all of the possibilities of
>TEI (or other SGML) DTDs would require a very significant development
>effort all by itself. So, rather than include what would effectively
>be a full SGML parser into the system, I decided that we would use
>existing SGML parsers (such as Jim Clark's) to reduce all of the
>variations to a small subset.

The only meaning I am able to attach to this is that Mark believes
that supporting TEI markup would have required him to develop his own
SGML parser, whereas if he limits himself to HTML with "a few
extensions", he can use James Clark's public-domain SGML parser
instead.

If this paraphrase is correct, I can only say that Mark's logic is, to
put it bluntly, breathtaking.

He continues by observing
> ... we have found that one must handle a considerable amount of variation
>from database to database, and even text to text when dealing with
>TEI encoded documents. Trying to build that capability into a
>large scale text search and navigation engine would, I fear, be far
>beyond ARTFL's means.

There is certainly a lot of variation in the ways people use TEI;
that's part of the design. The TEI is not intended as a Procrustean
system, and it is not very useful as a method of enforcing any kind of
hermeneutic orthodoxy. (It would, indeed, be useful to have a more
prescriptive TEI header; a good goal for TEI P4.)

I would have thought, however, that HTML usage is also fairly various
-- actually somewhat more various, since a large percentage of
documents served today as text/html are not even well-formed. How
many HTML documents have you seen this week that had full Dublin
Core headers? Hmmmm.

So Mark's reasoning is identical to that of the person who does
not want to have to think about the rational numbers, since there are
so many of them, and who decides instead to limit discussion to the
integers, so as to have fewer to think about. (For those who have not
been reading Cantor lately, be reminded that the sets of rational
numbers and integers are both infinite, and both the same size.)

> I am
>also not surprised that Michael has found a generally cool reception
>amongst other developers in humanities computing. This stuff is really
>hard and expensive to do and, so far, I have not seen the possibility of
>radically extended functionality that would warrant the investment.

As one who has been using SGML encoding for every document I've
written for the last twelve years or so, I can report authoritatively
that it is not 'really hard and expensive' to provide the kinds of
functionality that Mark describes, relying on a subset of the TEI
encoding scheme. I use James Clark's free parser, I use OmniMark
Technologies' free transformation engine, I use emacs and psgml.el to
edit, and I use Panorama Pro (the only one of these items which cost
any money) to provide clean onscreen formatting. On the occasions
when I have needed typeset output I have used Waterloo GML and TeX to
produce the pages; nowadays I would use Jade to translate into TeX or
into RTF. The notion that acquiring SGML software requires incredible
expense is merely a myth. That companies charge high prices for
systems is a consequence of (a) the high utility of those systems to
users who can afford those prices, and (b) the laws of the market
economy. Welcome to capitalism.

>With all of the discussion of text *tagging*, little thought has been
>given to development of systems much beyond how to render individual
>documents in a browser. At a certain point, when writing a system,
>every tag, every attribute, every variation has to be handled
>or ignored. It is easy to develop very extensive tagsets that can
>be demonstrated to balance using an SGML/XML parser/verifier. It is
>MUCH, MUCH harder to develop systems that know what to do with each
>and every tag/value/attribute/whatever.

It is also pretty much wholly unnecessary. If you are building, say,
a full-text retrieval system, your software will need to take special
action on, say, the element types that mark important text structures
(text, body, div, p, possibly s if you want to run things through a
sentence recognizer so you can have one-sentence contexts in results).

You'll probably also want to pay attention to some crucial parts of
the TEI header (title of the work, author, date of publication of the
original, language of the work, and so on; the Dublin Core provides a
good reminder of kinds of information you might want to look for in
the document). To simplify life, a software developer might plausibly
say "We take the title of the document from the first TITLE element
encountered in the TITLESTMT of the TEIHEADER element. We take the
date of first publication from the first DATE element within the
CREATION element (in PROFILEDESC). ... If you want the title, date,
and other bibliographic descriptions to be picked up correctly, put
them in those places." Yes, Virginia, all the information units of
the Dublin Core can occur in the TEI header; some can occur in more
than one place, others in only one.
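
To make the convention concrete, here is a minimal sketch of that kind
of header harvesting. It assumes an XML serialization, unprefixed
element names (teiHeader, fileDesc, titleStmt, profileDesc, creation),
and Python's standard ElementTree; the paths and field names are
illustrative, not a prescription.

    import xml.etree.ElementTree as ET

    def tei_metadata(path):
        """Pull a few Dublin-Core-like fields out of a TEI header.

        Illustrative only: takes the first element matching each path,
        as in the convention described above; a real system would add
        namespace handling and further fallbacks.
        """
        root = ET.parse(path).getroot()
        title  = root.find('./teiHeader/fileDesc/titleStmt/title')
        author = root.find('./teiHeader/fileDesc/titleStmt/author')
        date   = root.find('./teiHeader/profileDesc/creation/date')
        return {
            'title':   title.text  if title  is not None else None,
            'creator': author.text if author is not None else None,
            'created': date.text   if date   is not None else None,
        }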

For other element types, there are several possibilities:

(1) In finished software, I would want, as a user, to be able to
choose at index time among something like the following possibilities;
depending on the facilities available, some systems won't be in a
position to make all of these options available.

* Suppress this element and its contents; do not index, and
do not send to the user in a result.
* Do not index this element or its contents, but retain in
the document and send to the user if it occurs within a result.
(E.g. if you are sending whole paragraphs to the user, and
this INTERP element occurs in the paragraph, send it along.
Treat it, that is, as a special kind of comment.)
* Index the contents of the element, but ignore the start- and
end-tags. Send (or don't send) the start- and end-tags to
the user in a result.
* Index the occurrence of this element (if you maintain an
index of elements at all).
* Index the contents of this element and record, in the index,
that they occurred within this element. (This is a common
approach to SGML indexing; it allows searches by context.)

In other words, for each element a system needs to make a few
decisions: Do I index this element as an element? Do I index the
contents of this element or not? If something in the source document
is not indexed, should it be exposed to the user at all, or suppressed
entirely?

You have to make these decisions, or analogous ones, for every element
type, pretty much no matter what your software is doing and pretty
much no matter what your markup system looks like. One obvious
implementation technique is to use a table lookup to decide, when an
element is encountered, what to do with it. This table can be
hard-coded in by the programmer, at compile time, or it can be loaded
at run time, which means the programmer can punt on a few questions,
and make the user decide. If the users rebel at the prospect of
answering questions like this for every element type they use, set the
defaults one way or another, and allow the users to override the
defaults when and as they choose. (A sketch of such a lookup table
follows item (3) below.)

(2) You could decide you don't want to have to decide what to do on a
case by case basis, and you can make a single rule that applies to all
element types (e.g. index them and their contents), or one rule that
applies to the element types you know you are interested in (div, p,
those ones) and a second rule that applies to everything else
(e.g. suppress the element entirely, or index the contents while
pretending the element's start- and end-tags aren't there).

(3) During development, when deciding what to do with the TEI
dictionary tag set's oVar element just seems like more work than you
want to worry with, you can make the simplifying assumption that it
won't occur in your input. (I write style sheets this way all the
time: the style sheet handles the element types I actually use. When
I write a new document that uses an element type not handled in the
style sheet, it tends to look ugly, so I tend to fix it.) That
assumption is one you can actually guarantee during development and
testing, so you only have to get around to implementing default rules
and lookup tables later on.
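
Here, for concreteness, is a minimal sketch of the lookup table
mentioned under (1); its catch-all default also covers the single-rule
strategy of (2). It is written in Python, and the element names and
policy assignments are illustrative only.

    from enum import Enum

    class Policy(Enum):
        SUPPRESS      = 1   # drop the element and its contents entirely
        PASS_THROUGH  = 2   # keep in results, but do not index
        INDEX_TEXT    = 3   # index the contents, ignore the tags
        INDEX_CONTEXT = 4   # index the contents and record the element

    # Hard-coded here for the sketch; a finished system would load this
    # table at run time and let the user override individual entries.
    POLICIES = {
        'div':    Policy.INDEX_CONTEXT,
        'p':      Policy.INDEX_CONTEXT,
        's':      Policy.INDEX_CONTEXT,
        'interp': Policy.PASS_THROUGH,
        'fs':     Policy.SUPPRESS,
    }

    DEFAULT_POLICY = Policy.INDEX_TEXT  # the rule for everything else

    def policy_for(element_name):
        """Table lookup: decide what to do with an element when met."""
        return POLICIES.get(element_name.lower(), DEFAULT_POLICY)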

>The burden and cost of doing
>*SOMETHING* with all of the possibilities has been passed to the
>developer. It's hard enough to do this as is, particularly within
>the limited resources of most humanities computing outfits, so a cool
>reaction to a specification that entails a lot more effort is to be
>expected.

Users are not required to use every element type in the TEI encoding
scheme. I don't see why software developers are required to do
anything clever with every element type either.

It *is* fair to expect software that claims to handle TEI documents
not to roll over every time I use a 'resp' element or something else.
But the TEI DTD is big in part because lots of the element types are
specialized. That means many applications can legitimately ignore
lots of them -- a full-text system might legitimately default to the
no-index rule on, say, the feature structure elements. It might even
refuse to allow the user to override the default. That might
disappoint someone hoping to use your system to do sophisticated
search and retrieval on the feature structure analysis they have put
into their text. But in developing software, the developer has not
(as far as I know) entered into any solemn promise to solve all the
world's problems.

Given that you are not, in fact, obligated to do clever things
with every element type in the DTD, where is the problem?

>Now, the next point in the discussion is that if nobody in humanities
>computing can afford to develop software to handle all of the
>potential variations in richly encoded documents, then you gotta ask, why
>encode that heavily?

You encode that heavily when you care about the information you are
encoding. You acquire or develop software to handle what you need to
handle.

What's the mystery here? I have a set of Panorama style sheets --
soon I hope to have equivalents in XSL that can be used with IE5 --
that allow me to read Walther von der Vogelweide with the text of MS
A, or of MS B, or of MS C, or according to Maurer's edition. They
could be adapted to other texts (though the method used gets unwieldy
for more than three witnesses -- I'd want to automate the production
of the stylesheet for large numbers of MSS). Does that stylesheet
handle RESP and UNCERTAIN and FS elements? No -- they don't occur in
the data I'm working with. Why should the stylesheet handle them?
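
For what it's worth, the core of such a witness selection fits in a
few lines. Here is a minimal sketch in Python (not the Panorama or XSL
stylesheets themselves), assuming XML-serialized apparatus markup in
which each RDG carries a wit attribute listing sigla such as "A B C";
the convention is illustrative.

    import xml.etree.ElementTree as ET

    def reading_for(app, witness):
        """Return the reading of one app entry for the chosen witness,
        falling back on the lem (base) reading if that witness has no
        variant recorded here. Illustrative only.
        """
        for rdg in app.findall('rdg'):
            sigla = (rdg.get('wit') or '').split()
            if witness in sigla:
                return ''.join(rdg.itertext())
        lem = app.find('lem')
        return ''.join(lem.itertext()) if lem is not None else ''

Producing one view per witness is then just a matter of applying this
to every APP element for each siglum, which is the sort of thing one
would automate for a large number of MSS.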

Mark appears to be arguing that, because he does not know what his
software should do with an APP or a LEM or a RDG element, I should
not use them. Why on earth not?

>>> Perhaps the recent discussions about the need for a new generation
>>> of software for text analysis will lead to an improvement of this
>>> situation. Let us all hope so.
>
>No, Michael. Let us all hack... ;-)

Amen. Let us all hack.

-C. M. Sperberg-McQueen
Co-chair, W3C XML Schema Work Group
Senior Research Programmer, University of Illinois at Chicago
Editor, ACH/ACL/ALLC Text Encoding Initiative
Co-coordinator, Model Editions Partnership

cmsmcq@uic.edu, cmsmcq@acm.org
(Note that the address U35395@UICVM.uic.edu now just forwards mail
to cmsmcq@uic.edu and will eventually go away. Beat the rush; go
ahead and change your address book now!)
+1 (312) 413-0317, fax +1 (312) 996-6834

--[2]------------------------------------------------------------------
Date: Sun, 09 May 1999 22:44:31 +0100
From: Wendell Piez <wapiez@mulberrytech.com>
Subject: Re: 13.0002 TEI & the Gadfly's buzz

HUMANIST readers (1150 words):

Mark Olsen's most recent post targets what is sure to emerge as a key
issue as the possibilities of generalized markup (in the form of XML)
become widely felt on the Internet. Scholars who work with the TEI and
who wrestle with the kinds of problems that Prof. Vanhoutte recently
wrote about are pioneers in dealing with these issues.

The difference is between text encoding that is generalized, and which
is supposed to serve "any potential application" by virtue of serving
*no particular* application (such as the TEI aspires to be), and text
encoding that deploys a specific scheme for a specific purpose or range
of purposes (such as ARTFL's PhiloLogic or -- gads! -- HTML for display
in IE5.0, Netscape 2.1 or Lynx).

Keep in mind the truism that converting "up" into an abstract,
generalized, powerfully descriptive form from an application encoding
that fails to denote the text's own structures and features (or denotes
them only in an inaccessible way) is hardly better than taking the text
as blank from the start, while converting "down" from rich, descriptive
encoding into any application encoding should be a mere exercise in
programming. Hence the long-term and rhetorical benefits:
platform independence, longevity of data, and so on.
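
As one illustration of how mechanical the "down" direction can be,
here is a minimal sketch, assuming XML-serialized input, Python's
ElementTree, and a purely illustrative mapping of descriptive element
names onto HTML ones:

    import xml.etree.ElementTree as ET

    # The whole "down" conversion is a mapping table plus a walk of the
    # tree; the names here are illustrative, not a real TEI-to-HTML
    # profile.
    DOWN = {'p': 'p', 'head': 'h2', 'quote': 'blockquote',
            'hi': 'em', 'list': 'ul', 'item': 'li'}

    def to_html(elem):
        """Recursively rewrite a descriptive element as an HTML one."""
        out = ET.Element(DOWN.get(elem.tag, 'span'))
        out.text = elem.text
        for child in elem:
            html_child = to_html(child)
            html_child.tail = child.tail   # keep text after the child
            out.append(html_child)
        return out

Going the other way, there is no such table to consult: the
information it would need is simply not recorded in the HTML.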

But the reason for descriptive encoding is more than rhetorical, even
more than merely practical. We all know that a text is more than a
string of alphanumeric characters. But what is the difference, the
"more"? At least in part, or for some, the romance is in an ancient
aspiration, if always newly felt, to fathom the work "as in itself it
really is." To the scholar, descriptive markup is not just a code for
driving an automated process: it becomes in itself a heuristic and an
interpretive technique. The elusive "more" of the text emerges in a
relationship between the text, and the encoding scheme that traces it.
Surely this is worth our work, we say to ourselves, even while nervously
casting our eye at our need to "share our results." Our excited
supposing that having captured the essence of that more (or even "an"
essence, for those of us who are not Platonists), we should then be able
to provide it with any expression we please, may come as just a
super-added, if saving, grace.

On the other hand, if we describe only an expression or representation
of the text, we are left with that alone. What have we learned from it?
Like having the mask of the Noh dancer, without the movement.

The problem remains that until it takes "bodily form" as application
encoding (whether in the embrace of a particular software package, API,
or transformation engine), any text encoded generically or descriptively
remains exiled in the outer vastness of the empyrean. It may be
beautiful to behold (to those that have the eyes to see) but it is
strangely sterile, an uncommitted, unmeasured potency. It does nothing,
it only abides. (We have wondered why the gods, at ease in their
celestial seats, are jealous of us poor, whiny, grimy mortals. Is this
why?)

Much of our experience with the TEI, it seems to me, has been in
discovering just how wide the gulf can be. We want both to design,
configure and deploy our e-texts in a way consistent with the long-term
vision of the TEI's originators -- and yet also to do "simple" things
like print, search or display texts on screen. So we find -- even within
ourselves, as individual developers -- two camps, puzzled and sometimes
frustrated, looking at each other across the divide. On the one side, we
walk with our heads in the clouds and our feet off the ground,
frequently proclaiming, and fervently praying for, "support for open
standards!" and "better applications from the vendors!" (and no one has
proved they are *not* coming, any day now). On the other, we are tempted
to dismiss the high-minded abstractions as so much academic
pointlessness, and grimly roll up our sleeves, assessing our options of
the moment. Thankfully, as is apparent on this list and elsewhere, we
also have a real, intelligent dialogue between the views, both between
and within the posts and counter-posts.

We should acknowledge what Mark Olsen has reminded us of: that
descriptive, content-oriented encoding may not always be practical or
cost-effective when we need to show real-world results, especially if
we are not particularly concerned with document interchange between
unspecified disparate systems, which was part of the TEI's original
mandate. Time frames are
real, the right moment arrives and passes, and hard-nosed practical
decisions have sometimes to be made.

Nevertheless, the gap is closing. The main development in this area is
the emergence of more accessible transformation tools, which are making
it easier to provide for programmatic conversions between encoding
schemes (earlier this week I mentioned Jade, a DSSSL engine; but many
new tools are emerging under the aegis of XML/XSL). Seeing how much of
this work is free on the net and even open source, it might be hoped
that expertise in such techniques will find its way into the academy.

Given such capabilities, the issue is no longer the mere letter of the
scheme. Rather, it is how well the scheme maps, abstractly, to
application requirements (whether those are seen over an immediate
period, or in the eternal view of things) -- which might be to say, the
scheme's actual spirit -- in combination with an ability to transform
the encoding into (or express it as) whatever the software of the moment
can chew on.

So rather than focus on "TEI" or "something else" in our efforts to be
practical, we should be focusing on developing our requirements
analysis, our strategies and methods for use, transformation, reuse and
repurposing of our encoding schemes, and our understanding of the range
of tools and techniques we can deploy -- while recognizing the
constraints and tradeoffs -- to realize the tangible along with the
intangible rewards. Seen in this light, ARTFL, in making its choices, is
leading the way just as the TEI projects are in making theirs. And the
TEI itself (as its proponents have often argued) is only a means to an
end.

Nor should we finally forget the intangible rewards. An ad hoc scheme
will remain useful as long as the software is maintained to take
advantage of it. For archiving purposes, many such schemes (at least if
the developers have assured some kind of formal validation or
consistency of encoding) might well be cross-converted into TEI (or
other descriptive) markup. As for the TEI texts at Indiana, Michigan,
Brown, UVA, UNC, Oxford, Cork, Alberta, Berkeley, name your repository,
there is no telling how long they will be good for. It should be for a
very long time. Good for what? First, for whatever the projects are
already giving us. Beyond that? The book is still open.

Respectfully,
Wendell Piez

--[3]------------------------------------------------------------------
Date: Sun, 09 May 1999 22:42:41 +0100
From: <cbf@socrates.berkeley.edu>
Subject: Re: 12.0616 TEI & the individual scholar; research = display

John and Michael's answers to my question make the point that I was
trying to elicit: Unless you're at a major institution with an active
electronic text project you're unlikely to use TEI for the purposes of
preparing a scholarly text, either on paper or electronic.

I think that the reason for this is the lack of solid, reliable,
reasonably priced tools that will work with the TEI in a reasonably
transparent way. This was a subject of some discussion at the
ACH-ALLC conference in Debrecen last year.

I think that we need to start to try to make those tools available in
some sort of organized fashion, just as the Summer Institute of
Linguistics has developed a whole suite of tools for linguistic analysis.

Charles Faulhaber Department of Spanish UC Berkeley, CA 94720-2590
(510) 642-3781 FAX (510) 642-7589 cbf@socrates.berkeley.edu

-------------------------------------------------------------------------
Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>
<http://www.princeton.edu/~mccarty/humanist/>
=========================================================================