16.390 thinking with the technologies / XML vs. RDM

From: Humanist Discussion Group (by way of Willard McCarty willard.mccarty@kcl.ac.uk)
Date: Thu Dec 26 2002 - 05:49:34 EST

               Humanist Discussion Group, Vol. 16, No. 390.
       Centre for Computing in the Humanities, King's College London
                   www.kcl.ac.uk/humanities/cch/humanist/
                     Submit to: humanist@princeton.edu

         Date: Thu, 26 Dec 2002 10:41:58 +0000
         From: Manfred Thaller <manfred.thaller@uni-koeln.de>
         Subject: Thinking with the technologies / XML vs. RDM

For obvious reasons I have been very intrigued by the recent discussion
on "Data Bases v. XML" in Humanist, and had indeed intended to
contribute to it. The usual information / work overload has so far
prevented me from reacting in a more appropriate way: weighing all the
arguments in this most recent Humanist incarnation of a discussion in
which I have taken part outside of Humanist for quite some time, and
formulating specifically addressed responses to the individual
contributions.

Christmas being here, and me becoming incommunicado for the next two and
a half weeks, I either participate now or never.

So let me formulate a few general theses, not addressed particularly at
any one contribution so far, but rather in the medieval tradition of a
"thesis" summing up one's own point of view.

(1) Humanities' sources contain texts which have some properties, which
as such are different from the types of information handled by computer
science at large, totally INDEPENDENTLY of the sub-field of computer
science we think of, or of the tools this particular field of computer
science has eventually developed. Certain types of ambiguity are always
a good starting example.

(2) Data Bases. Please remember, as far as Comp.Sc. is concerned,
relational data bases are a special case of a far more general
phenomenon. Albeit a case which (a) by virtue of the ease with which it
can be handled by fairly conventional mathematical tools has intrigued
the theoretical Comp.Sc. community for some time and (b) because of the
cornucopia of implemented and easily available tools has dominated
training in the software engineering branches of Comp.Sc. to such a
degree that even computer scientists tend to forget that RDBMSs are
only a very special case of a broader phenomenon.

In my opinion, Humanities practitioners of Computing tend to obscure the
situation further, by mixing up the need for an "information system",
which is a rather special animal in the herds of DBMS applications, with
the implications of a specific data model. Any information system can
rather easily be described as a set of objects out of which any subset
pertaining to a specific query can be extracted and displayed - possibly
processed a bit before displaying - by way of an access mechanism
which usually involves some kind of index structure.

In that sense, a Humanities information system "bridging the gap
between RDBMSs and XML" might be construed as a solution which extracts
some kind of information from an XML encoded text - say words - indexes
them with the help of a relational table for which the word or its
lemma becomes the primary key, maintains a link to the textual document
from which the word or wordform has been taken, and fires up a suitable
XML editor when a specific document selected via that index becomes due
for display.
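
A minimal sketch of such a set-up, in Python, may make the point
concrete. Everything in it is an assumption made for illustration: the
sample documents, the element names, the table name word_index and the
two helper functions are all hypothetical. Note also that the key here
is the pair (word, document) rather than the word alone, so that one
word can point to several documents.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents; the element names are invented.
DOCUMENTS = {
    "doc1.xml": "<text><p>the king gave the charter</p></text>",
    "doc2.xml": "<text><p>the charter was lost</p></text>",
}


def build_word_index(documents):
    """Index every word of every document in one relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_index ("
        " word TEXT NOT NULL,"
        " document TEXT NOT NULL,"
        " PRIMARY KEY (word, document))"
    )
    for name, xml_source in documents.items():
        root = ET.fromstring(xml_source)
        # itertext() yields the character data only; all markup is dropped.
        words = " ".join(root.itertext()).lower().split()
        conn.executemany(
            "INSERT OR IGNORE INTO word_index (word, document) VALUES (?, ?)",
            [(word, name) for word in words],
        )
    return conn


def documents_containing(conn, word):
    """Return the names of all documents in which the word occurs."""
    rows = conn.execute(
        "SELECT document FROM word_index WHERE word = ? ORDER BY document",
        (word.lower(),),
    )
    return [row[0] for row in rows]


conn = build_word_index(DOCUMENTS)
print(documents_containing(conn, "charter"))
```

The document names returned by the lookup are what would be handed to
the XML editor for display; that step is left out here.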

Well, technically that is not exactly brilliant: a simple B-Tree
indexer, avoiding the overhead of a full blown RDBMS, would be
incomparably more effective; but it would be a solution.
In both cases however - indexing via data base or indexing via a more
directly controlled tree - the genuinely Humanistic problem remains: the
indexed term might contain examples of ambiguity (doubtful readings,
e.g.) which could easily be marked up in XML, but that quality would be
lost, as the tools used for handling the indexing problem, be it RDBMS
or plain indexer, would have no concept of a "textual atom with embedded
ambiguity".
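
The loss can be illustrated with a small Python sketch. The markup is
invented for the purpose (the elements "choice" and "reading" and the
attribute "cert" are assumptions, not taken from any particular DTD),
and flat_index_keys is a hypothetical stand-in for whatever produces
the keys of a conventional index.

```python
import xml.etree.ElementTree as ET

# One doubtful name, encoded as two alternative readings (invented markup).
SOURCE = (
    "<line>sold to "
    "<choice><reading cert='high'>Mayer</reading>"
    "<reading cert='low'>Meyer</reading></choice>"
    " in Cologne</line>"
)


def flat_index_keys(xml_source):
    """Produce index keys the way a plain word indexer would."""
    root = ET.fromstring(xml_source)
    # Only character data survives; the markup around it is discarded.
    return " ".join(root.itertext()).split()


print(flat_index_keys(SOURCE))
# ['sold', 'to', 'Mayer', 'Meyer', 'in', 'Cologne']
```

In the resulting keys "Mayer" and "Meyer" look like two consecutive
words of the text. That they are alternatives for one and the same
position, and how certain each of them is, can no longer be recovered
from the index.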

(3) XML. Well - here the problem is that "XML" as such does not say
very much. As most of us are aware, the more recent versions of
StarOffice / OpenOffice use XML for encoding the texts they are
processing. A truly XML based wordprocessing package! Unfortunately that
still does NOT mean that StarOffice or OpenOffice would be able to
handle textual variance, because the underlying abstract data model -
what StarOffice / OpenOffice understand "a text" to be - does not
support that notion. Which is exactly why it is irrelevant for fast
searching purposes whether textual ambiguity can be coded in XML or
not, as long as the tool, be it derived from RDBMS technology or from
somewhere else, does not rest on an underlying data model supporting
that concept.

(4) Metaphorically: A human being born in North America is not able to
read "War and Peace" in Russian just because the Russian has been
transcribed from Cyrillic into Latin characters. Only if (s)he has
learned Russian - acquired a "Russian Processing Capability" in Comp.Sc.
terms - will (s)he be able to understand it. Unwinding the metaphor: An
RDBMS will only support such things as the RDBMS provides for -
efficient processing of regular structures. "XML" will only support such
things as "XML" provides for - formally representing everything that
can be expressed as a tree structure ("tree" here not in the sense of a
B-Tree or an offshoot of one). An RDBMS able to handle a data type "text
with embedded markup" may provide support for irregular structures; an
XML encoded text targeted at a processing module which supports a
concept of "textual ambiguity carried through to a search engine" may be
able to support data base like applications on XML-encoded texts.

(5) I am afraid 99.5 % of the current plethora of XML applications and
DBMS developments are totally irrelevant for this. When most people
speak about "XML enabled databases" they talk about RDBMSs using XML as
a vehicle for transporting RDBMS content from one vendor to the next
(transcribing Cyrillic into Latin, as that is the more widely spread
alphabet). Sigh, even the "native XML data bases" - such as Tamino -
usually have as the primary goal for their tools the closest possible
proximity to SQL, as that is what all the Comp.Sc. students have learned
in first year to be the "natural" data base language (cf. 2 above);
which is another way of saying that they fall far short of supporting
meaningful processing of structures easily expressed in XML but alien to
RDM thinking.

(6) Ceterum censeo: This will change only if Humanities' practitioners
think less about the SURFACE of an IT application - relational table v.
XML encoding, which software product to use - and more about the
underlying data model / data type / knowledge representation of
Humanities' information in a Comp.Sc. definition. Let us first see what
a FORMAL representation of a "Humanities Text" means, before we write a
DTD for it or throw it at an innocent and naive RDBMS engine.

Merry Christmas and Happy 2003 to all Humanists,
Manfred
