16.390 thinking with the technologies / XML vs. RDM

From: Humanist Discussion Group (by way of Willard McCarty willard.mccarty@kcl.ac.uk)
Date: Thu Dec 26 2002 - 05:49:34 EST

                   Humanist Discussion Group, Vol. 16, No. 390.
           Centre for Computing in the Humanities, King's College London
                       www.kcl.ac.uk/humanities/cch/humanist/
                         Submit to: humanist@princeton.edu

             Date: Thu, 26 Dec 2002 10:41:58 +0000
             From: Manfred Thaller <manfred.thaller@uni-koeln.de>
             Subject: Thinking with the technologies / XML vs. RDM

    For obvious reasons I have been very intrigued by the recent discussion
    on "Data Bases v. XML" in Humanist, and had indeed the intention of
    contributing to it. The usual information / work overload has so far
    prevented me from reacting to it in a more appropriate way, weighing all
    the arguments in the most recent Humanist-incarnation of a discussion in
    which, outside of Humanist, I have taken part for quite some time, and
    formulating specifically addressed responses to the individual
    contributions.

    Christmas being here, and me becoming incommunicado for the next two and
    a half weeks, I either participate now or never.

    So let me formulate a few general theses, not addressed particularly at
    any one contribution so far, rather in the medieval tradition of a
    "thesis" summing up one's own point of view.

    (1) Humanities' sources contain texts which have some properties, which
    as such are different from the types of information handled by computer
    science at large, totally INDEPENDENTLY of the sub-field of computer
    science we think of, or of the tools this particular field of computer
    science has eventually developed. Certain types of ambiguity are always
    a good starting example.

    (2) Data Bases. Please remember, as far as Comp.Sc. is concerned,
    relational data bases are a special case of a far more general
    phenomenon. Albeit a case which (a) by virtue of the ease with which it
    can be handled by fairly conventional mathematical tools has intrigued
    the theoretical Comp.Sc. community for some time and (b) because of the
    cornucopia of implemented and easily available tools has dominated
    training in the software engineering branches of Comp.Sc. to a degree
    where even computer scientists tend to forget that RDBMSs are only a
    very special case of a broader phenomenon.

    In my opinion, Humanities practitioners of Computing tend to obscure the
    situation further, by mixing up the need for an "information system",
    which is a rather special animal in the herds of DBMS applications, with
    the implications of a specific data model. Any information system can
    rather easily be described as a set of objects out of which any subset
    pertaining to a specific query can be extracted and displayed - possibly
    processed a bit before displaying - by ways of an access mechanism
    which usually involves some kind of index structure.

    In that sense, a Humanities information system "bridging the gap between
    RDBMSs and XML" might be construed as a solution which extracts some
    kind of information from an XML-encoded text - say words - indexes them
    with the help of a relational table for which the word or its lemma
    becomes the primary key, maintains a link to the textual document from
    which the word or wordform has been taken, and fires up a suitable XML
    editor when a specific document selected via that index becomes due for
    display.
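
    A minimal sketch of such an arrangement, under stated assumptions: the
    two sample documents, the table name word_index and the helper names
    build_index and documents_for are all hypothetical, SQLite stands in
    for "an RDBMS", and the key is the pair (word, document) rather than
    the word alone, so that one word can point to several documents.
    Opening the XML editor is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML-encoded documents; element names are illustrative only.
DOCUMENTS = {
    "doc1.xml": "<text><p>The king rode to the city</p></text>",
    "doc2.xml": "<text><p>The city fell to the king</p></text>",
}


def build_index(documents):
    """Extract the words of each document and index them in a relational table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE word_index ("
        " word TEXT NOT NULL,"
        " doc TEXT NOT NULL,"  # link back to the source document
        " PRIMARY KEY (word, doc))"
    )
    for name, xml in documents.items():
        root = ET.fromstring(xml)
        # Flatten the markup to a character stream and split it into words.
        words = {w.lower() for w in "".join(root.itertext()).split()}
        con.executemany(
            "INSERT INTO word_index (word, doc) VALUES (?, ?)",
            [(w, name) for w in words],
        )
    return con


def documents_for(con, word):
    """Return the names of all documents in which the word occurs."""
    rows = con.execute(
        "SELECT doc FROM word_index WHERE word = ? ORDER BY doc",
        (word.lower(),),
    )
    return [doc for (doc,) in rows]


con = build_index(DOCUMENTS)
print(documents_for(con, "king"))
```

    A query for a word returns the documents to be handed to whatever
    viewer or editor is in use; the table itself knows nothing about the
    markup the words came from.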

    Well, technically that is not exactly brilliant: a simple B-Tree
    indexer, avoiding the overhead of a full-blown RDBMS, would be
    incomparably more effective; but it would be a solution.
    In both cases however - indexing via data base or indexing via a more
    directly controlled tree - the genuinely Humanistic problem remains: the
    indexed term might contain examples of ambiguity (doubtful readings,
    e.g.) which could easily be marked up in XML, but that quality would be
    lost, as the tools used for handling the indexing problem, be it RDBMS
    or plain indexer, would have no concept of a "textual atom with embedded
    ambiguity".
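
    To make the loss concrete, here is a small sketch, not a proposal for
    an actual data model. The markup (an <alt> element wrapping competing
    <reading> elements), the Token class and all function names are
    hypothetical. A markup-blind indexer sees only the character stream
    and fuses the two readings into one meaningless term, whereas an index
    built over tokens that carry their readings can be searched under
    either of them and still knows that the reading is doubtful.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical markup: <alt> wraps the competing <reading>s of one doubtful word.
SOURCE = (
    "<line>signed by <alt><reading>Carl</reading>"
    "<reading>Earl</reading></alt> Olsen</line>"
)


@dataclass(frozen=True)
class Token:
    """A textual atom that may carry several competing readings."""
    readings: tuple


def flat_terms(xml):
    """What a markup-blind indexer sees: the bare character stream."""
    return "".join(ET.fromstring(xml).itertext()).split()


def tokens(xml):
    """Turn one line into tokens, keeping the alternatives of <alt> together."""
    root = ET.fromstring(xml)
    out = [Token((w,)) for w in (root.text or "").split()]
    for child in root:
        if child.tag == "alt":
            out.append(Token(tuple(r.text for r in child.findall("reading"))))
        else:
            # Any other element is simply flattened to its words.
            out.extend(Token((w,)) for w in "".join(child.itertext()).split())
        out.extend(Token((w,)) for w in (child.tail or "").split())
    return out


def build_index(toks):
    """Index every reading of every token under its position in the line."""
    index = {}
    for pos, tok in enumerate(toks):
        for reading in tok.readings:
            index.setdefault(reading.lower(), []).append(pos)
    return index


toks = tokens(SOURCE)
index = build_index(toks)
print(flat_terms(SOURCE))            # the two readings are fused into one term
print(index["carl"], index["earl"])  # both readings lead to the same token
print(toks[2].readings)              # and the token still knows it is ambiguous
```

    The point is not this particular class, but that the ambiguity has to
    exist in the data model the index is built on; where the index itself
    is stored is secondary.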

    (3) XML. Well - here the problem is that "XML" as such does not say
    very much. As most of us are aware, the more recent versions of
    StarOffice / OpenOffice use XML for encoding the texts they are
    processing. A truly XML-based wordprocessing package! Unfortunately that
    still does NOT mean that StarOffice or OpenOffice would be able to
    handle textual variance, because the underlying abstract data model -
    what StarOffice / OpenOffice understand "a text" to be - does not
    support that notion. Which is exactly why it is irrelevant for fast
    searching purposes whether textual ambiguity can be coded in XML or
    not, as long as the tool, be it derived from RDBMS technology or from
    somewhere else, does not rest on an underlying data model supporting
    that concept.

    (4) Metaphorically: A human being born in North America is not able to
    read "War and Peace" in Russian just because the Russian has been
    transcribed from Cyrillic into Latin characters. Only if (s)he has
    learned Russian - acquired a "Russian Processing Capability" in Comp.Sc.
    terms - will (s)he be able to understand it. Unwinding the metaphor: An
    RDBMS will only support such things as the RDBMS provides for -
    efficient processing of regular structures. "XML" will only support such
    things as "XML" provides for - formally representing everything that
    can be expressed as a tree structure ("tree" here not in the sense of a
    B-Tree or an offshoot of one). An RDBMS able to handle a data type "text
    with embedded markup" may provide support for irregular structures; an
    XML-encoded text targeted at a processing module which supports a
    concept of "textual ambiguity carried through to a search engine" may be
    able to support data base like applications on XML-encoded texts.

    (5) I am afraid 99.5 % of the current plethora of XML applications or
    DBMS developments are totally irrelevant for this. When most people
    speak about "XML enabled databases" they talk about RDBMSs using XML as
    a vehicle transporting RDBMS content from one vendor to the next
    (transcribing Cyrillic into Latin, as that is the more widely spread
    alphabet). Sigh, even the "native XML data bases" - such as Tamino -
    usually have as the primary goal for their tools the closest possible
    proximity to SQL, as that is what all the Comp.Sc. students have learned
    in first year to be the "natural" data base language (cf. 2 above);
    which is another way of saying that they fall far short of supporting
    meaningful processing of structures easily expressed in XML but alien to
    RDM thinking.

    (6) Ceterum censeo: This will change only if Humanities' practitioners
    think less about the SURFACE of an IT application - relational table v.
    XML encoding, which software product to use - and more about the
    underlying data model / data type / knowledge representation of
    Humanities' information in a Comp.Sc. definition. Let us see first what
    a FORMAL representation of a "Humanities Text" means, before we write a
    DTD for it or throw it at an innocent and naive RDBMS engine.

    Merry Christmas and Happy 2003 to all Humanists,
    Manfred



    This archive was generated by hypermail 2b30 : Thu Dec 26 2002 - 05:54:03 EST