19.458 relational database and TEI

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Wed, 30 Nov 2005 07:09:33 +0000

               Humanist Discussion Group, Vol. 19, No. 458.
       Centre for Computing in the Humanities, King's College London
                   www.kcl.ac.uk/humanities/cch/humanist/
                        www.princeton.edu/humanist/
                     Submit to: humanist_at_princeton.edu

   [1] From: Mark Olsen <mark_at_barkov.uchicago.edu> (25)
         Subject: Re: 19.454 relational database and TEI

   [2] From: James Cummings <James.Cummings_at_ota.ahds.ac.uk> (49)
         Subject: Re: 19.454 relational database and TEI

   [3] From: Joris van Zundert <joris.van.zundert_at_gmail.com> (94)
         Subject: Re: 19.454 relational database and TEI

--[1]------------------------------------------------------------------
         Date: Wed, 30 Nov 2005 06:25:32 +0000
         From: Mark Olsen <mark_at_barkov.uchicago.edu>
         Subject: Re: 19.454 relational database and TEI

Hi,

It could be that David's assertion that Slashdot readers are notoriously
anti-XML -- and even anti-humanities computing -- is correct and that they
should be disregarded as such. I have, however, been seeing a lot of
heated discussion about XML and database theory. Have a peek at Fabian
Pascal's voluminous rants as one example:
     http://www.dbdebunk.com/index.html
I suspect that there is some substance to the complaints on both theoretical
and practical grounds.

PhiloLogic (the open source system for TEI and other document collections
the ARTFL project is working on, http://philologic.uchicago.edu) is based
on a mixed mode processing model. We use a relational database engine for
object management and an indexing engine that includes a limited (X)path
notation. We have so far avoided using a number of XML tools or
specifications for large systems because of very significant performance
issues, not to mention prohibitive cost if one looks at commercial
packages. I do expect this to change over time, but some of the
theoretical objections raised by Pascal and others may provide some
outside limits to optimization.
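
The mixed-mode idea described above can be illustrated with a toy sketch.
This is not PhiloLogic's actual code or API: the table name, the index
structure, the sample data and the search() function are all hypothetical,
and a simple path-prefix test stands in for the "limited (X)path notation".
It only shows a relational engine holding object metadata while a separate
word index answers path-restricted queries.

```python
import sqlite3
from collections import defaultdict

# Relational side: one row of metadata per document (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(1, "Author A", "First Text"), (2, "Author B", "Second Text")],
)

# Index side: word -> list of (doc_id, element path) postings.
index = defaultdict(list)


def add_occurrence(word, doc_id, path):
    """Record that `word` occurs in document `doc_id` at element `path`."""
    index[word.lower()].append((doc_id, path))


add_occurrence("liberty", 1, "/text/body/div/p")
add_occurrence("liberty", 1, "/text/body/div/note")
add_occurrence("liberty", 2, "/text/front/div/p")


def search(word, path_prefix="", author=None):
    """Return (title, path) hits for `word`.

    The index narrows hits by path prefix (an assumed stand-in for a
    limited path notation); the relational table supplies and filters
    the document metadata.
    """
    results = []
    for doc_id, path in index.get(word.lower(), []):
        if not path.startswith(path_prefix):
            continue
        row = conn.execute(
            "SELECT author, title FROM docs WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if author is None or row[0] == author:
            results.append((row[1], path))
    return results


print(search("liberty", "/text/body"))
print(search("liberty", author="Author B"))
```

In this sketch the word index never has to parse XML at query time, which
is one way to read the performance argument made above.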

We *are* using XML database tools (eXist and Berkeley) for smaller, specific
applications, most notably for highly dynamic and repurposable individual
documents. But intensive processing of small numbers of documents, with
the database serving as a content management system, would not really
qualify as a "database" application in the way that XML critics would
frame it.

Mark

--[2]------------------------------------------------------------------
         Date: Wed, 30 Nov 2005 06:26:12 +0000
         From: James Cummings <James.Cummings_at_ota.ahds.ac.uk>
         Subject: Re: 19.454 relational database and TEI

> [quoting from a Slashdot post]
>> >>That's probably because an XML database is NOT a decent idea. XML is NOT
>> >>meant to be used as a way to store data! Rather, it's a way to
>> >>communicate data between entities.
>
> Slashdot readers, in my experience, are notoriously anti-XML (as well as
> anti humanities computing--or maybe not so much "anti" as clueless about
> its existence).

I think, as the Slashdot quote evinces, that their underlying problem is that
they don't conceive of XML as an appropriate way to store data. Moreover, when
they think of data they mean fairly limited tabular data that is barely nested,
if at all. It doesn't occur to them that people might be using it to encode
full-length documents, or that this is really the area from which it has
arisen. I
wonder what they would see as an appropriate storage format for the kinds of
documents we deal with? XML (as I'm sure many here will agree) certainly has
its flaws, and is not necessarily the solution which everyone is looking for,
but definitely has its benefits for those who can get useful answers from
interrogating documents with an arbitrary depth of nested information.

> Now that XQuery 1.0 is a Candidate Recommendation from
> the W3C, we are likely to see increased commitment to native XML
> database software.
<snip/>
> ...the open-source offerings are still far
> from being production-quality. But that situation should be changing.
> In any case I would argue that XML is an appropriate data storage format
> for many archival and publishing projects in humanities computing.

I'd agree, the range of support for XQuery is increasing. Some of the open
source offerings, eXist being perhaps one of the best, are reaching significant
levels of maturity and are production-quality to some degree. (i.e. dependent
on your amount of XML or your anticipated frequency of querying.) I think it
is an area people should experiment in more, if they have the
time/energy/patience.

I think Orietta's original question stems from the project's dependence upon
the local IT support who understands RDBMS/SQL-based solutions and has no
interest in XML, much less querying it. I think her desire to be able to
eventually export from this database to TEI P5 manuscript descriptions is a
laudable one, and her concern is to design a database that won't be too hard
to convert.
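
As a rough illustration of what "not too hard to convert" could mean, here
is a minimal sketch that reads rows from a relational table and builds a
skeletal msDesc element for each. The table name, column names and sample
row are hypothetical, only a few msIdentifier children and a head are
produced, and the TEI namespace is omitted for brevity, so the output is
not a complete or validated TEI P5 description.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Clark notation for the xml:id attribute; ElementTree serialises it as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def row_to_msdesc(row):
    """Build a skeletal msDesc element from one row of the (assumed) table."""
    ms_desc = ET.Element("msDesc", {XML_ID: row["ms_id"]})
    identifier = ET.SubElement(ms_desc, "msIdentifier")
    ET.SubElement(identifier, "settlement").text = row["settlement"]
    ET.SubElement(identifier, "repository").text = row["repository"]
    ET.SubElement(identifier, "idno").text = row["shelfmark"]
    ET.SubElement(ms_desc, "head").text = row["title"]
    return ms_desc


def export_msdescs(conn):
    """Return one msDesc element per row of the hypothetical manuscripts table."""
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    rows = conn.execute(
        "SELECT ms_id, settlement, repository, shelfmark, title "
        "FROM manuscripts ORDER BY ms_id"
    )
    return [row_to_msdesc(row) for row in rows]


# Demonstration with an in-memory database and invented sample data.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE manuscripts (ms_id TEXT PRIMARY KEY, settlement TEXT, "
    "repository TEXT, shelfmark TEXT, title TEXT)"
)
demo.execute(
    "INSERT INTO manuscripts VALUES (?, ?, ?, ?, ?)",
    ("ms001", "Oxford", "Example Library", "MS. Example 1", "A sample miscellany"),
)
for element in export_msdescs(demo):
    print(ET.tostring(element, encoding="unicode"))
```

Keeping one column per TEI element, as assumed here, is what makes the
export a simple one-to-one mapping; free-text fields that mix several
kinds of information would be harder to convert later.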

Those interested in this area might also want to look at middleware such as:
http://xquare.objectweb.org/bridge/index.html
or at an XML database built on top of a relational database, for example
http://gborg.postgresql.org/project/xpsql/projdisplay.php built on top of
PostgreSQL.

-James

-- 
Dr James Cummings, Oxford Text Archive, University of Oxford
James dot Cummings at oucs dot ox dot ac dot uk
--[3]------------------------------------------------------------------
         Date: Wed, 30 Nov 2005 06:33:03 +0000
         From: Joris van Zundert <joris.van.zundert_at_gmail.com>
         Subject: Re: 19.454 relational database and TEI

> Slashdot readers, in my experience, are notoriously anti-XML (as well as
> anti humanities computing--or maybe not so much "anti" as clueless about
> its existence). Now that XQuery 1.0 is a Candidate Recommendation from
> the W3C, we are likely to see increased commitment to native XML
> database software. And there are already several commercial products
> that are robust and powerful (e.g. MarkLogic Server, which we use at UVa
> Press and which Oxford and Elsevier also use). It's true that they are
> quite expensive (we couldn't possibly have acquired it without
> dedicated grant money) and that the open-source offerings are still far
> from being production-quality. But that situation should be changing.
> In any case I would argue that XML is an appropriate data storage format
> for many archival and publishing projects in humanities computing.
>
> David Sewell

Hi all,

I hesitate to follow up on this post, because this has all the promise of
ending up in the all too familiar and seemingly irreconcilable
do-or-don't-put-your-XML-in-a-database dichotomy.

In my experience the choice depends more on a solid commitment to XML, for
whatever reason, than on whether it would actually be a wise choice to burden
your database with XML. It seems that people who have a lot of XML like XML
databases and that people who don't... don't. The former tend to author XML
and store it for future use, the latter tend to generate it when it's really
needed. And in all probability they're all quite right given their particular
goals, needs and context. In any case, for the record I can't resist adding
the following remarks...

* I don't think I really care whether Slashdotters like me as a humanities
  computing person or not. But I do value their opinion about the merit of
  technical solutions. Their relative shortsightedness about our particular
  domain doesn't disqualify their judgments on the technical aptness,
  applicability and quality of engineered approaches.

* A W3C recommendation by itself is not proof of the technical soundness of a
  proposed solution. Rather it states that it's one of a number of possible
  solutions to a particular problem, being the preferred choice of a certain
  group of people. That doesn't imply that the proposed solution is a good
  solution, a solid one or even a nice one. XSLT 1.0 is a W3C recommendation.
  However, any engineer will tell you that XSLT mixes the characteristics of
  a templating language with those of a procedural language. This has caused
  XSLT to be a limping hybrid that's messy in nature and induces messy code
  (which is hard to sustain). Unfortunately the XSLT 2.0 candidate makes
  things worse. So yes, a candidate recommendation will propagate certain
  solutions, but that might not necessarily be a good thing.

* I do very much agree that XML is an appropriate data storage format for
  archival purposes. But when it's archiving you're after, why put the XML
  files in a database that may in future prove to be the obsolete proprietary
  tool of some commercial vendor? What happens if your grant runs out and you
  experience vendor lock-in? If it's archiving you're after, put your files
  in an open source and open standards compliant repository. A search engine
  will do for discovery and retrieval. Less costly, sustainable maintenance.

* Putting XML in a database (any database) burdens the database with
  redundant information. In essence any XML Schema is a structured data
  model. From a purely engineering viewpoint one should directly map such
  data models to tables, fields and relations in a database. Putting XML into
  the fields of a database is downgrading the database to a file-system-like
  storage environment. Moreover, putting XML in the fields of a database
  implies a certain amount of overhead of logic to retrieve the information
  you want from the database. In addition to querying the database for the
  field which contains the appropriate piece of XML, you'll have to run an
  XPath or XQuery to extract the relevant information from the XML found in
  the field.
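
The two-step retrieval described in the last remark, and the alternative
of mapping the data model directly to tables, can be sketched as follows.
This is only an illustration under stated assumptions: the table names,
column names, element name and sample text are all hypothetical, and the
sketch says nothing about real performance.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample document; persName is used only as an example element.
SAMPLE = (
    "<text><p>A letter from <persName>Alice</persName> "
    "to <persName>Bob</persName>.</p></text>"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body_xml TEXT)")
conn.execute("CREATE TABLE person_mentions (doc_id INTEGER, name TEXT)")
conn.execute("INSERT INTO documents VALUES (1, ?)", (SAMPLE,))


def names_from_xml_field(conn, doc_id):
    """Two steps: SQL fetches the XML field, then a path query runs on it."""
    (body_xml,) = conn.execute(
        "SELECT body_xml FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    root = ET.fromstring(body_xml)
    return [element.text for element in root.findall(".//persName")]


def shred_names(conn, doc_id):
    """Map the names onto rows once, so later lookups need SQL only."""
    names = names_from_xml_field(conn, doc_id)
    conn.executemany(
        "INSERT INTO person_mentions VALUES (?, ?)",
        [(doc_id, name) for name in names],
    )


def names_from_table(conn, doc_id):
    """One step: the relational table answers the question directly."""
    rows = conn.execute(
        "SELECT name FROM person_mentions WHERE doc_id = ? ORDER BY rowid",
        (doc_id,),
    )
    return [name for (name,) in rows]


shred_names(conn, 1)
print(names_from_xml_field(conn, 1))
print(names_from_table(conn, 1))
```

Both functions return the same names here; the difference is that the
first needs an XML parse and a path query on every call, while the second
pays that cost once when the document is loaded.
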

In the end the question that remains is whether putting XML in a database
adds enough value to merit the downsides of possible vendor lock-in and
performance loss due to the additional logic needed. In my view you either
want to archive XML or you want to query XML. In the first case: tend to your
storage hardware and don't mind the database. In the latter case an open
source parsing solution will do.

So when exactly do you need that database?

y.s.,
Joris

Received on Wed Nov 30 2005 - 02:34:50 EST