[tei-council] What's the appropriate attribute for this?

Conal Tuohy conal.tuohy at vuw.ac.nz
Thu Dec 20 22:23:58 EST 2007


On Thu, 2007-12-20 at 09:39 -0800, Martin Holmes wrote:
> That's good enough for me: @facs it is!

Hi Martin!

It's now obvious that the semantics of @facs haven't been defined
sufficiently clearly, and it looks to me like the TEI Council is going
to have to return to this point and clarify the guidelines in the near
future. In the meantime, given the lack of consensus, I really think you
should keep your options open with regard to the IMT software.

I've made my position clear already (that I think @facs should be
reserved for linking images with transcriptions), but I'd like to
explain why (pragmatically) I think this is a good idea...

Increasingly the TEI community has to face up to the challenges of scale
and interoperability. I think projects like the German "TextGrid" (in
which TEI-encoded texts are aggregated and exposed to multiple uses) are
a clear sign of (good) things to come. The fact that P5 includes strict
guidelines on using XML namespaces is another sign of this
"globalisation" in the TEI, because although the use of namespaces is an
extra overhead to individual projects, it's also an important factor in
facilitating interchangeability of TEI documents (I am a bit of a
"namespace Nazi" for this reason). In short, as a community we need to
consider what's going on in the wider environment in which our texts
will persist, and we need a broad consensus on the semantics of the TEI.

The plan all along with the new TEI facsimile markup was to support a
number of different use cases and encoding practices. We canvassed a
number of projects (not just in the TEI community) and tried to produce
something that was simple but still covered most common cases. We tried
to provide markup that was adequate for IMT users as well as some other
use cases. 

Let me talk about one of these other use cases which interests me:
automated TEI encoding from OCR. I think it's possible that
OCR-generated TEI may turn out to be the most common application of the
facsimile TEI markup, simply because OCR (for all its lack of "quality")
is automatable, and hence can be done fairly cheaply on a large scale.
Such OCR-generated TEI documents will have facsimile elements containing
a large number of zone elements, each representing just a single word.
(A lot of OCR engines already do this, though they typically use
HTML+CSS or ALTO/METS rather than facsimile TEI.)
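
As a rough illustration of the kind of document described above, here
is a minimal Python sketch that parses a tiny word-per-zone TEI sample
and looks up the image box for each transcribed word. Everything in it
is an assumption made for illustration: the sample document (which
omits the teiHeader for brevity), the file name page1.png, the zone ids
and coordinates, and the function name words_with_boxes. The coordinate
attribute names (ulx, uly, lrx, lry) follow current P5 and should be
checked against whatever schema is actually in use.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical OCR output: one <zone> per word, one <w> pointing at each zone.
OCR_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="page1.png"/>
      <zone xml:id="z1" ulx="10" uly="20" lrx="60" lry="40"/>
      <zone xml:id="z2" ulx="70" uly="20" lrx="130" lry="40"/>
    </surface>
  </facsimile>
  <text><body><p>
    <w facs="#z1">Happy</w> <w facs="#z2">holidays</w>
  </p></body></text>
</TEI>"""


def words_with_boxes(tei_xml):
    """Pair each element carrying @facs with the box of the zone it points to.

    Assumes each @facs holds a single local pointer such as "#z1".
    """
    root = ET.fromstring(tei_xml)
    # Index the zones by their xml:id so pointers can be resolved.
    zones = {z.get(XML_ID): z for z in root.iter(f"{{{TEI_NS}}}zone")}
    result = []
    for el in root.iter():
        facs = el.get("facs")
        if facs is None:
            continue
        zone = zones[facs.lstrip("#")]
        box = tuple(int(zone.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
        result.append(("".join(el.itertext()).strip(), box))
    return result
```

Note that the function silently treats every @facs as "this text is a
transcription of that zone", which is exactly the assumption an
OCR-oriented tool would make.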

Anyway, the crux of the problem is that these TEI documents will of
course use @facs purely to indicate that a piece of TEI is a
transcription of a zone of the scanned image (rather than, say, a
commentary or scholarly analysis of that image, because OCR software
doesn't do literary scholarship), and so the software written to handle
such documents will naturally make the same assumption. If IMT-authored
documents don't share that assumption then we will have an
interoperability problem. We will have two classes of TEI facsimiles:
OCR-facsimiles (which use @facs purely for transcription) and other
facsimiles (which use @facs for other purposes as well), and each class
will have its own software tools, incompatible with the other's.

How might software tools handle these two classes of facsimile texts?
They will need to be able to distinguish a transcription in the strict
sense from something which is a commentary or analysis or whatever
(because it will not be feasible to treat such things in the same way as
transcriptions). If the @facs attribute is used for both types of
relationships, then this is clearly impossible. 
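
The ambiguity can be sketched in a few lines of Python. The sample
below is hypothetical (it is not taken from IMT or from any real
project): it puts @facs on a transcribed word and also on a note that
comments on the same zone. The function name naive_transcription, the
ids, and the note text are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed document: @facs appears on a transcribed word
# and on a commentary note about the same zone.
MIXED_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="page1.png"/>
      <zone xml:id="z1" ulx="10" uly="20" lrx="60" lry="40"/>
    </surface>
  </facsimile>
  <text><body>
    <p><w facs="#z1">Happy</w></p>
    <note facs="#z1">The ink here is badly faded.</note>
  </body></text>
</TEI>"""


def naive_transcription(tei_xml, zone_id):
    """Collect the text of every element whose @facs points at zone_id.

    Like an OCR-oriented tool, this assumes all such text is transcription.
    """
    root = ET.fromstring(tei_xml)
    target = "#" + zone_id
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        # @facs may hold several space-separated pointers.
        if target in el.get("facs", "").split()
    ]
```

Called on the sample with zone "z1", the function returns the note's
text alongside the word, because nothing in the attribute itself says
which of the two is the transcription.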

So I conclude we need two linking attributes: one which is purely to
indicate a facsimile-transcription relation, and another with a more
general meaning (i.e. any kind of relation). Personally I don't
particularly care what these attributes are called, but I'd always
thought that people would want to use @ana or @corresp for relationships
other than pure transcription. On the other hand, if @facs is to be used
with a broad meaning, I think it will be necessary to define a new
attribute to cover the more specific meaning (e.g. "transcriptionOf"). 
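
A minimal Python sketch of the first option, assuming @facs is kept
for strict transcription and @corresp is used for any other relation
to a zone. The sample document, the ids, and the function name
linked_text are hypothetical, and the choice of @corresp here is just
one of the possibilities mentioned above, not a settled convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: @facs marks transcription only,
# @corresp marks a looser relation (here, a commentary note).
SPLIT_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="page1.png"/>
      <zone xml:id="z1" ulx="10" uly="20" lrx="60" lry="40"/>
    </surface>
  </facsimile>
  <text><body>
    <p><w facs="#z1">Happy</w></p>
    <note corresp="#z1">The ink here is badly faded.</note>
  </body></text>
</TEI>"""


def linked_text(tei_xml, zone_id):
    """Group text linked to zone_id by the attribute that makes the link."""
    root = ET.fromstring(tei_xml)
    target = "#" + zone_id
    links = {"facs": [], "corresp": []}
    for el in root.iter():
        for attr in links:
            # Both attributes may hold several space-separated pointers.
            if target in el.get(attr, "").split():
                links[attr].append("".join(el.itertext()).strip())
    return links
```

With the two relations carried by different attributes, a tool that
only cares about transcription can read the "facs" list and ignore the
rest, and it never has to guess.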

Phew!

Happy holidays, especially to all you northern-hemisphereans! I'm off to
the beach! ;-)

Con





