[tei-council] PH-PrimarySources (including facsimile markup)

Conal Tuohy Conal.Tuohy at vuw.ac.nz
Sat Sep 15 08:38:09 EDT 2007


I have sent a couple of edited versions of PH-PrimarySources to Lou.

The first version was a simple copy edit which fixed a few minor errors without altering the substance.

The second version additionally addressed some of the substantive questions remaining about the semantics of the facsimile markup, by describing how *I* think the facsimile markup *should* work. I've also produced altered spec files for the elements concerned: facsimile, surface, and zone.

== schema simplicity ==

My draft schema is simpler and therefore allows much less flexibility in how to mark up a facsimile (in fact, it's almost a strict subset of the existing draft). For instance, whereas the current draft allows a <facsimile> to contain <graphic> elements directly, this would not be possible in my version, which would require every graphic to be nested within a zone, which in turn would always be nested within a surface. Why should we prefer such rigid markup? I feel that it's important that the facsimile schema itself be kept simple (even at the expense of some verbosity in instance documents), because a simpler schema is easier to understand, and especially because it makes it easier to write processing software that can handle all allowable markup.
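
To make the nesting rule concrete, here is a minimal sketch in Python; it is not one of the attached files. The element names facsimile, surface, zone, and graphic and the @box attribute come from the draft. The sample coordinates, the file name page1.jpg, the whitespace-separated form of @box, and the helper name stray_graphics are assumptions made for illustration, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample: one surface, one zone covering it, one graphic in the zone.
SAMPLE = """
<facsimile>
  <surface box="0 0 2000 3000">
    <zone box="0 0 2000 3000">
      <graphic url="page1.jpg"/>
    </zone>
  </surface>
</facsimile>
"""


def stray_graphics(facsimile):
    """Return every <graphic> that is not inside a <zone> inside a <surface>.

    Hypothetical helper: it only illustrates the stricter nesting rule
    and is no substitute for validating against the real schema.
    """
    # Map each element to its parent so the ancestry can be checked.
    parent = {child: p for p in facsimile.iter() for child in p}
    strays = []
    for graphic in facsimile.iter("graphic"):
        zone = parent.get(graphic)
        surface = parent.get(zone) if zone is not None else None
        nested = (
            zone is not None and zone.tag == "zone"
            and surface is not None and surface.tag == "surface"
        )
        if not nested:
            strays.append(graphic)
    return strays


root = ET.fromstring(SAMPLE)
print(len(stray_graphics(root)))
```

Because only one nesting pattern is allowed, a processor needs just this one path to find every image; markup with a <graphic> placed directly in <facsimile> would be reported as a stray.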

Dot and I had the goal from early on of offering just 2 levels of complexity - an "entry-level" option for the simplest cases, and a second option for all the other cases. The "simplest case" ended up being thrown out, but we still have a "simple case" which is accommodated by the ability to say:
<pb facs="http://upload.wikimedia.org/wikipedia/commons/5/50/Handschrift.karlsruhe.blb.jpg"/>
I think that if this shortcut is too simple for a particular encoding project, then the project may as well use the full markup, and that there's no real value in offering a broader range of options.

== semantics of <gi>surface</gi> ==

If I understand it correctly, in the existing draft, the <surface> element has one <graphic> child which is specially treated as the "canonical" image of that surface. In my draft, all the graphics representing a surface are treated in the same way. 

In my draft, the <surface> element represents a physical page or other inscribed surface. The surface's @box attribute gives the bounding box of that <surface>, expressed in a chosen coordinate space. The <surface> then contains a number of <zone> elements, each of which represents some rectangular area, using a @box attribute whose values are from the same coordinate space. 

NB just as in the existing draft, the coordinate space may be chosen to suit the encoder. If the encoder has a full image of the page, at sufficient resolution for the measurements they wish to make, then the obvious option is to use the pixel coordinates of that image as their coordinate space. Alternatively, they might measure the physical artifact in mm or some other physical unit. The important point is that the unit of measurement is arbitrary and only needs to be consistent among the surface and its zones, since its function is only to align those zones and the surface, for which only relative measurements are needed.
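
The point about relative measurements can be sketched in a few lines of Python. This is an illustration only: the function name relative_box, the (left, top, right, bottom) ordering of the tuples, and all the figures are assumptions, not taken from the draft.

```python
def relative_box(zone_box, surface_box):
    """Express a zone's box as fractions of its surface's box.

    Both boxes are (left, top, right, bottom) tuples in the same
    arbitrary unit (pixels, mm, ...). The ordering is an assumption.
    The result is unit-free.
    """
    s_left, s_top, s_right, s_bottom = surface_box
    width = s_right - s_left
    height = s_bottom - s_top
    left, top, right, bottom = zone_box
    return (
        (left - s_left) / width,
        (top - s_top) / height,
        (right - s_left) / width,
        (bottom - s_top) / height,
    )


# The same page and zone measured in pixels and in millimetres (invented figures).
in_pixels = relative_box((500, 750, 1500, 2250), (0, 0, 2000, 3000))
in_mm = relative_box((50, 75, 150, 225), (0, 0, 200, 300))

print(in_pixels)           # (0.25, 0.25, 0.75, 0.75)
print(in_pixels == in_mm)  # True
```

Whichever unit the encoder picks, the zone ends up in the same place relative to the surface, which is all that is needed to align the two.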

== examples ==

I have converted Sebastian's example file (which includes a manuscript page and a gravestone) to use my preferred schema, and I've written an XSLT to convert the example into XHTML in which the images and analytical zones are overlaid, with the transcribed text presented using @title attributes. The XSLT is fairly well commented, and the most complicated parts of it by far are the bits which parse the 4 components of the @box attributes (i.e. the positions of the left, top, bottom, and right edges of the box). If the @box attribute were replaced by 4 distinct attributes (as Sebastian suggested recently, and as was the case in an earlier version of the proposal), the stylesheet would be about half as long. :-)
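
As a rough illustration of the two designs, the sketch below reads a box from a single packed attribute and from four separate attributes. It is written in Python rather than XSLT and is not the attached stylesheet. The whitespace-separated form of @box, the order left, top, right, bottom, and the four attribute names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def box_from_single_attribute(element):
    """Read a packed box="left top right bottom" attribute (assumed order)."""
    parts = element.get("box", "").split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 numbers in @box, got {len(parts)}")
    return tuple(float(p) for p in parts)


def box_from_four_attributes(element):
    """Read four separate attributes; the attribute names are hypothetical."""
    names = ("left", "top", "right", "bottom")
    return tuple(float(element.get(name)) for name in names)


packed = ET.fromstring('<zone box="10 20 110 220"/>')
split = ET.fromstring('<zone left="10" top="20" right="110" bottom="220"/>')

print(box_from_single_attribute(packed) == box_from_four_attributes(split))
```

In Python the difference is small because str.split does the tokenizing. XSLT 1.0 has no built-in tokenizing function, so there the packed form typically has to be taken apart with chained substring-before and substring-after calls, whereas separate attributes can be read directly.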

== remaining issues? ==

I have 2 other items to put forward: the first and more important being to split the @box attribute into four attributes (as above), the other being to possibly extend the content model of <zone> to allow it to include <note>, and perhaps other things?

== files ==

I have attached the files which I have updated to my Trac ticket:

The spec files for the facsimile, surface, and zone elements:
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/facsimile.xml
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/surface.xml
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/zone.xml

The "Transcription of Primary Sources" chapter:
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/PH-PrimarySources.xml

An example XML file and an XSLT for rendering it as HTML:
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/testtranscr2.xml
http://tei.oucs.ox.ac.uk/trac/TEIP5/attachment/ticket/291/transcr2.xsl

NB the example can be experienced by dropping these last 2 files into the "P5\Test" folder from the SVN repository. The XML file uses an XML processing instruction to invoke the XSLT, so it should be enough to open the XML file in your browser.

Regards to all

Con












