[tei-council] building a robust and flexible platform for the collaborative exploration and curation of TEI data?

Martin Mueller martinmueller at northwestern.edu
Thu May 12 14:05:31 EDT 2011


Dear Colleague on the Board or Council,

Over the course of the last year I have been in a number of formal or
informal conversations about the lack of a robust and flexible platform for
the collaborative exploration and curation of textual data in XML form.
There are individual projects here and there in which various aspects of
such a platform have been realized for a specific purpose. But there does
not appear to be anything "out there" that provides a customizable framework
that allows projects to get off the ground in days or weeks rather than
weeks or months. From experience and observation I conclude that the lack of
such a framework is a continuing obstacle to the use of TEI. Add the still
common perception of the TEI as a strange and difficult animal to the lack
of software based on familiar technologies, and you have a very high entry
barrier for TEI-based projects. This is in marked contrast to the ubiquity
of Ruby-on-Rails or similar solutions that couple some MVC framework with a
SQL database.

Is there anything that can or should be done about this? From looking
around, I observe that pieces of a solution exist here and there in various
individual projects. I suspect that the folks in those projects usually have
their hands full finishing their project and lack the time or resources to
abstract from their particulars to a general solution. I am aware of the
German TextGrid project, which will release version 1.0 this summer. From
what I have seen it is a general and powerful tool for building scholarly
editions. I am less sure whether it will support or can be extended to
support data exploration in a sufficiently user-friendly way.

Below I sketch some requirements for a text-centric, but image and
media-friendly environment for the collaborative analysis, annotation, and
curation of cultural heritage data. I call it Hermes since he was the god of
writing (among other things) and lent his name not only to very elegant
French scarves but also to the best portable manual typewriter ever made.
Hermes should scale gracefully from small to moderately large projects
(texts in the low thousands, with an upper limit of ~100 million words; more
is of course better, but many useful scholarly and pedagogical projects
could operate within that scale).
 
Come to think of it, the Hermes typewriter was Swiss, with all the virtues
(and prices) of a high-end Swiss product. Can we think of a digital Hermes
as a Swiss army knife for TEI purposes, as mechanically solid and perhaps a
little more affordable than the Hermes typewriters of my youth?

I would be grateful for your advice on whether some of this is feasible and
whether the TEI should take a direct or indirect interest in making it
happen. On rereading an earlier draft of this memo, I noticed that little of
what I say has to do with the first stage of encoding the texts, and most of
it has to do with what you do with them once you have them. So in some ways
all this is out of scope for an initiative that has "encoding" in its name.
But why encode unless there are plausible pathways (with reasonable learning
curves) to doing things with the stuff you have encoded?

Here are my requirements under the headings of
1. User management and permissions
2. A text base that can manage data in XML and relational formats
3. Text analysis
4. Annotation
5. Data curation
6. Hermes and media

1. User management and permissions:

If its data are in the public domain, everybody may browse and search the
collections.
If the data are not in the public domain, they may be browsed and searched
by any member of an institution that has the appropriate licenses.

Users may create user accounts with three different levels of privileges:

Any user who is logged in may create annotations or make curatorial
suggestions. Users may decide to see only things that have been "approved"
(see below), which is the default setting, or everything. Reviewers, who are
given those privileges by editors, may review and edit the contributions of
users, but do not have final approval privileges. Editors have reviewer
privileges and in addition are authorized to approve things for public
display. To the first-time visitor, Hermes will always display only its
approved materials.
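The three-tier model above can be sketched in a few lines. This is a minimal
illustration, not a proposed implementation; all names (Role, Contribution,
visible, may_approve) are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum, auto


class Role(Enum):
    """The three privilege levels described above (names are invented)."""
    USER = auto()      # may annotate and make curatorial suggestions
    REVIEWER = auto()  # may review and edit user contributions
    EDITOR = auto()    # may approve material for public display


@dataclass
class Contribution:
    text: str
    approved: bool = False


def visible(contribution, role=None, show_unapproved=False):
    """First-time (anonymous) visitors see approved material only;
    logged-in users may opt in to seeing everything."""
    if contribution.approved:
        return True
    return role is not None and show_unapproved


def may_approve(role):
    """Only editors have final approval privileges."""
    return role is Role.EDITOR
```

The point of the sketch is that the whole policy reduces to two small
predicates, which keeps the permission logic easy to audit.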

2. Hermes as a text base that can manage data in XML and relational formats

The primary purpose of Hermes is to support the analysis, annotation, and
curation of XML documents, and in particular of TEI documents. At the same
time, many data derivatives are managed more efficiently, more conveniently,
or in a more familiar manner in SQL-based environments. Thus the Hermes
framework should be able to "talk" with equal ease to XML and relational
data. There are several possible implementation strategies:
1. Hermes is agnostic about what XML it is given.
2. Hermes is restricted to the ~60 TEI elements (not counting the
teiHeader) that are used in the TCP texts and other library-based projects
that account for the large majority of publicly available and respectably
transcribed texts.
3. Hermes is agnostic about what XML it is given but has default settings
that support the element set described in #2.

Hermes keeps its display and layout functionalities deliberately simple, in
something like the blue buckram style of the Oxford Classical Texts.

Hermes supports linguistically annotated texts and the kinds of analysis
that are enabled by such annotation. Hermes will include a "dynamic lexicon"
that is an organized set of derivatives from the linguistic annotation.
Queries and visualizations based on those data are one example of a
situation in which developers may be more familiar with generating outputs
from familiar relational data.
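A "dynamic lexicon" of this kind is, at bottom, an aggregation step: token-level
annotation in, a relational-style table out. A minimal sketch, with invented
sample tokens and column names:

```python
from collections import Counter

# Hypothetical annotated tokens: (surface form, lemma, part of speech),
# as they might be extracted from a linguistically annotated TEI text.
tokens = [
    ("loved", "love", "verb"),
    ("loves", "love", "verb"),
    ("love", "love", "noun"),
    ("hearts", "heart", "noun"),
]


def dynamic_lexicon(tokens):
    """Aggregate token-level annotation into a relational-style table:
    one row per (lemma, pos) pair, with a frequency column."""
    counts = Counter((lemma, pos) for _, lemma, pos in tokens)
    return [
        {"lemma": lemma, "pos": pos, "freq": freq}
        for (lemma, pos), freq in sorted(counts.items())
    ]
```

Rows of this shape load directly into a SQL table, which is exactly the
situation where developers may prefer familiar relational tooling.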

3. Text analysis in Hermes
Hermes should include basic corpus query tools that support what Lou Burnard
has called "robust searching." Add to these simple statistical routines,
such as collocation analysis or the G-test, as well as simple
visualizations that show, for instance, the two-dimensional distribution
of lexical data along the axes of time and genre.
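One of the routines mentioned above is small enough to sketch here. The
following computes Dunning's log-likelihood ratio, the form of the G-test
commonly used for collocation analysis, from a 2x2 contingency table of
co-occurrence counts:

```python
from math import log


def xlogx(x):
    return x * log(x) if x > 0 else 0.0


def g_test(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = windows containing both word A and word B, k12 = A without B,
    k21 = B without A, k22 = neither. A value near 0 means statistical
    independence; larger values mean stronger association."""
    def entropy(*xs):
        return xlogx(sum(xs)) - sum(xlogx(x) for x in xs)

    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

Unlike chi-square, this statistic remains reliable for the rare events that
dominate word-frequency data, which is why it is the usual choice for
collocation work.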

4. Annotation in Hermes
Most of the annotation needs in Hermes could be supported through the Atom
protocol, which, if I understand this correctly, supports arbitrary XML
"payloads" that can be attached to any XML node as a target. Hermes should
support simple free-form annotation, but also more structured,
template-based approaches, so that annotations provided by different
users observe common conventions for paragraphing, bibliographical citation,
quotation, and some nesting of notes within notes.
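To make the "payload" idea concrete, here is one way such an annotation might
look as an Atom entry carrying a TEI note. This is an illustration under
assumptions, not an actual protocol binding: the fragment-identifier targeting
scheme and the helper name are invented.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
TEI = "http://www.tei-c.org/ns/1.0"


def make_annotation_entry(target_id, note_text):
    """Build an Atom entry whose content is an arbitrary XML payload
    (here a TEI <note>), linked to a target node by its xml:id.
    The link/@href addressing scheme is an assumption."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    title = ET.SubElement(entry, f"{{{ATOM}}}title")
    title.text = f"Annotation on {target_id}"
    link = ET.SubElement(entry, f"{{{ATOM}}}link")
    link.set("rel", "related")
    link.set("href", f"#{target_id}")
    content = ET.SubElement(entry, f"{{{ATOM}}}content")
    content.set("type", "application/xml")
    note = ET.SubElement(content, f"{{{TEI}}}note")
    note.text = note_text
    return entry
```

Because the payload is just XML, the same envelope works for free-form notes
and for the structured, template-based annotations described above.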

Hermes should include a set of canned and customizable XQueries that deal
with common XML-aware search scenarios. These should be subject to revision
based on user experience and demands.

5. Data curation in Hermes
A great deal of curation can be modeled as special forms of annotation, with
form-based templates whose content is fed back to a database for review by
the editors (with or without the help of algorithms). This is especially
true of "curation en passant," when in working with a text you spot an error
and take the time (it should take less than 30 seconds) to report it. In
many cases the mere act of flagging that something is wrong will be of great
help (user X reported that there is an error at location Y).

There are two "behind the scenes" aspects to data curation. First, user
feedback needs to be machine actionable so that editors can get at it from a
variety of perspectives and find the most time-effective ways of approving
or modifying user suggestions. Second, there is a need for flexible CRUD
capabilities (create, retrieve, update, delete). Ideally, a change, once
approved, should lead to an immediate update. Batch updates on a weekly or
monthly basis are OK, but users need to be able to see the results of their
contribution within a relatively short time frame.
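A minimal in-memory sketch of this workflow, with all names invented: a user
flags an error (the thirty-second report), the report lands in a
machine-actionable store with the basic CRUD operations, and an editor's
approval takes effect immediately rather than waiting on a batch update:

```python
import itertools


class CurationQueue:
    """Machine-actionable store of user-reported corrections,
    with basic CRUD operations and an editorial approval step."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._reports = {}

    def create(self, user, location, suggestion):
        report_id = next(self._ids)
        self._reports[report_id] = {
            "user": user, "location": location,
            "suggestion": suggestion, "status": "pending",
        }
        return report_id

    def retrieve(self, report_id):
        return self._reports[report_id]

    def update(self, report_id, **changes):
        self._reports[report_id].update(changes)

    def delete(self, report_id):
        del self._reports[report_id]

    def approve(self, report_id):
        # An approved change becomes visible at once,
        # without waiting for a batch update.
        self.update(report_id, status="approved")

    def pending(self):
        return [r for r in self._reports.values() if r["status"] == "pending"]
```

The pending() view is what lets editors "get at" the feedback from different
perspectives and triage it in a time-effective way.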

It will also be important to make the batch updating of data an event over
which the editors have control so that it can be done "without the help of
10 IT people" as Marjorie Burghart recently put it on the TEI list. More
importantly, if it can only be done when both the editors and the developers
have time, it will be done very rarely. So the developers' and editors'
timetables need to be decoupled.

6. Hermes and media
Hermes is text-centric but media-friendly.

At the most basic level, access to high-quality and easily manipulable page
images of textual sources is a prime desideratum for effective data
curation. Side-by-side display of page image and OCR/transcription with good
support for textual alignment is critical.

Other basic affordances involve image, audio, or video files that are used
as annotations for very granular text elements. For instance, an audio file
could be attached to a text line by line or word by word, which would be
useful in teaching Chaucer, or in a prosody site, where different forms of
metrical annotation might be accompanied by different performances.
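At its simplest, line-by-line audio attachment is an alignment table mapping
line identifiers to audio segments. A sketch with invented identifiers and
timings:

```python
def audio_for_line(alignment, line_id):
    """Look up the audio segment attached to a text line.
    `alignment` maps a line identifier to (audio file, start, end)
    in seconds; identifiers and timings below are invented examples."""
    return alignment.get(line_id)


# E.g., the opening lines of a Chaucer text read aloud, with
# hypothetical file name and timings:
chaucer = {
    "GP.1": ("prologue.mp3", 0.0, 3.2),
    "GP.2": ("prologue.mp3", 3.2, 6.5),
}
```

Keeping multiple such tables per text would allow the prosody-site scenario,
where each form of metrical annotation points to a different performance.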


How hard would it be to build a version of Hermes that could be useful at an
early stage of limited scope and complexity, but become more powerful over
time? I can see a variety of projects here and there that implement aspects
of it. Omeka does something like it, but it is not text-centric. To repeat a
point made above, several TEI projects implement important functionalities,
but do not seem to have been built with the idea of a reusable toolkit.

Is it a realistic project to share expertise and experience in a common
project that might attract external grant funding? To be specific, members
of the current Council in New Zealand, Nebraska, Vancouver Island, and
London, to mention only a few, have done superb individual projects that
leverage XML database technology in one way or another, using the eXist
database. Can we or should we take a leaf from the Lexus project at the
Nijmegen Max Planck Institute for Psycholinguistics, which happens to use
BaseX, a recent eXist competitor from the University of Konstanz? Here is a
statement of their ambition:

> Lexicography in general is a domain where uniformity and interoperability have
> never been the operative words: depending on the purpose and tools used
> different formats, structures and terminologies are being adopted, which makes
> cross lexica search, merging, linking and comparison an extremely difficult
> task. LEXUS is also an attempt at putting an end to this problem. Being based
> on the Lexical Markup Framework (LMF), an abstract model for the creation of
> customized lexicons defined following the recommendations of the ISO/TC 37/SC
> 4 group on the standardization of linguistic terminology, LEXUS allows on the
> one hand to create purpose-specific and tailor-made lexica, and at the same
> time assures their comparability and interoperability with other resources.

This is a very hard problem to solve in linguistics, and it may be even
harder to solve in the humanities, where quot homines tot sententiae is not
a complaint but a celebration. But I'm inclined to think that we would be
better off if we took some steps towards solving or at least managing it.
And while encoding is clearly the primary objective of the TEI, sometimes
the pursuit of a primary objective is best advanced by removing secondary
obstacles. I think that the perceived lack of tools, templates, and
well-documented examples is a significant obstacle to the use of TEI texts
in a wide variety of projects, and that there would be considerable payoff
in focusing for a little while on how to move TEI encoded data into
appropriate Web frameworks, how to mix and match them with other data types,
and how to build the analysis tools and routines that can take full
advantage of the query potential of TEI-encoded data.

I would be delighted to learn that I am quite wrong on all this and that the
appropriate tools and templates are not only fully understood but are also
widely used in many scholarly or pedagogical projects and at different
levels of scope and complexity.  But if there is some truth to my arguments,
can or should we do something about it?

Martin Mueller