[tei-council] Report from Berlin

Lou Burnard lou.burnard at retired.ox.ac.uk
Tue Oct 23 16:07:08 EDT 2012


*EIT MMI Meeting, Berlin, 22 Oct 2012*

As noted at the last FTF, Laurent Romary in his capacity as ISO TC7 WG3 
chair has proposed a new ISO/TEI joint activity in the area of speech 
transcription, which comes with the slightly obscure label of EIT MMI: 
the last part of which is short for “multimodal interaction”, although 
it seems the activity is really only concerned with speech 
transcription. I was invited to attend the third EIT MMI workshop, held 
at the DIN's offices in Berlin. Prime movers in the activity, apart from 
Laurent, appear to be Thomas Schmidt and Andreas Witt from the Institut 
für Deutsche Sprache in Mannheim, but a number of other European 
research labs, mostly concerned with analysis of corpora of human 
computer interaction, were also represented; specifically: Nadia Mana 
from FBK (Trento, Italy); Tatjana Scheffler (DFKI, Germany); Khiet 
Truong (Univ of Twente); Benjamin Weiss (TU Berlin); Mathias Wilhelm 
(DAI Labor); Bertrand Gaiffe (ATILF, Nancy). This being an ISO activity, 
the real world of commerce and industry was also represented by Felix 
Burkhardt from Deutsche Telekom's Innovation Lab.
Related ISO activities mentioned by Laurent included the work on Discourse 
Relations led by Harry Bunt and the long-awaited MAF (morpho-syntactic 
annotation framework), both due to appear Real Soon Now. A 
quick tour de table confirmed my impression that most of the attendees 
were primarily researchers in Human Computer Interaction with little 
direct experience of the construction or encoding of spoken corpora, but 
Thomas Schmidt more than made up for that. The main business of the day 
was to go through his preliminary draft working document, the objective 
of which is to confer ISO authority on a subset of the existing TEI 
proposals for spoken text transcription, with some possible 
modification. The underlying work is well described in Schmidt's recent 
excellent article in TEIJ, so I won't repeat it: essentially, it 
consists of a close look at the majority of transcription formats used 
by the relevant research community/ies and tools, a synthesis of what 
they have in common, and suggestions of how that synthesis maps to TEI. 
This is to a large extent motivated by concerns about preservation and 
migration of data in “legacy” formats.

The discussion began by establishing boundaries: despite my proposal to 
the contrary, it seems there was little appetite to extend the work into 
the area of truly multimodal transcriptions, which was still generally 
felt to be insufficiently understood for a practice-based standard to be 
appropriate. Concern was expressed that we should not make premature, 
ad hoc suggestions. So the document really only concerns transcribed 
speech. There was no disagreement with the general approach, which is to 
distinguish a small number of macro-structural features and to provide 
guidelines about how to mark up specific units of analysis at the 
micro-structural level, using a subset of the TEI.
I was also much cheered by two further remarks he made:
- the graph-based “annotation framework” formalisation proposed by Bird 
and Liberman was theoretically complete but so generic as to be 
practically useless (I paraphrase);
- at the micro level, “everything you need is there in the TEI” (I quote).

Discussion focussed on the following points raised by the working document:

*Tiers*

Many existing tools organise transcriptions into “tiers” of annotation. 
These seem to be purely technical artefacts, which can be addressed more 
exactly by use of XML markup. Unlike “levels” of annotation, they have 
no semantics. It's doubtful that we need a <tier> element.
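
To illustrate (the identifiers and the type value here are my invention, 
not anything agreed at the meeting): what a tool stores as a separate 
“pos” tier can already be expressed as a <spanGrp> pointing back at 
tokens in the transcription, with no need for a dedicated <tier> element:

    <u who="#SPK0">
      <w xml:id="w1">so</w>
      <w xml:id="w2">okay</w>
    </u>
    <!-- one annotation "tier", kept as standoff spans over the tokens -->
    <spanGrp type="pos">
      <span from="#w1" to="#w1">ADV</span>
      <span from="#w2" to="#w2">ITJ</span>
    </spanGrp>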

*Metadata -1*

How many of the (very rich) TEI proposals should be included, or 
mentioned? And how should the three things Thomas had found missing be 
supplied? I suggested that <appInfo> was an appropriate way to record 
information about the transcription tool used; that the definition of 
the transcription system used belonged in the <encodingDesc>; and agreed 
that there was nothing specifically provided for recording pointers or 
links to the original video or audio transcribed. In the meeting, I 
speculated that maybe there was scope for extending (or misusing) 
<facsimile> for this last purpose; another possibility which occurs to 
me as I type these notes is that one could also extend <recordingDesc>.
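
A rough sketch of how the first two of those might look in the header 
(the tool name, version, filename and duration are only examples, and the 
placement of the pointer to the source recording is one possibility among 
those mentioned above):

    <!-- within <teiHeader>/<encodingDesc>: -->
    <appInfo>
      <!-- transcription tool used -->
      <application ident="EXMARaLDA" version="1.5">
        <label>EXMARaLDA Partitur-Editor</label>
      </application>
    </appInfo>

    <!-- within <teiHeader>/<fileDesc>/<sourceDesc>: -->
    <recordingStmt>
      <!-- pointer to the original audio transcribed -->
      <recording type="audio" dur="PT30M">
        <media url="session01.wav" mimeType="audio/x-wav"/>
      </recording>
    </recordingStmt>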

*Timing*

The timeline is fundamental to the macrostructure of a transcript. 
Thomas' examples all used absolute times for its <when>s, but I 
suggested that relative ones might be easier. The document ordering both 
of <when>s and of transcribed speech should reflect the temporal order 
as far as possible; this would allegedly facilitate interoperability.
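
For instance (the times are invented), a timeline anchored to one 
absolute time, with subsequent <when>s expressed as relative offsets and 
given in temporal order:

    <timeline unit="s" origin="#T0">
      <when xml:id="T0" absolute="2012-10-22T10:00:00"/>
      <when xml:id="T1" interval="1.5" since="#T0"/>
      <when xml:id="T2" interval="0.8" since="#T1"/>
    </timeline>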

*Metadata-2*

What metadata was needed, required, recommended for the description of 
participants? (@sex raised its ugly head here). Could we use <person> to 
refer to artificial respondents in MMI experiments? (yes, if they have 
person-like characteristics; no otherwise)
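
By way of illustration only (names, identifiers and values invented), a 
<particDesc> listing one human participant and one artificial respondent 
treated as a <person> because it has person-like characteristics:

    <particDesc>
      <listPerson>
        <person xml:id="SPK0" sex="2" age="34">
          <!-- @sex using the ISO 5218 code for female -->
          <persName>Human caller</persName>
        </person>
        <!-- dialogue system treated as a person-like participant -->
        <person xml:id="SYS0" role="dialogueSystem">
          <persName>Booking assistant</persName>
        </person>
      </listPerson>
    </particDesc>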

It was noted that almost any personal trait or state might be crucial to 
the analysis of some corpora. We noted that CMDI now recommended using 
the ISOCAT data category registry as an independent way of defining 
metadata terminology; also that ISOCAT was now available within the TEI 
scheme (though whether it fits into personal metadata I am less sure). 
There was (I think) general agreement that we'd reference the various 
options available in the TEI but not incorporate all of them.

We agreed that the principles underlying a given transcription should be 
clearly documented, either in associated articles, in the formal 
specification for an encoding, or in the header of individual documents.
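
In header terms, the last of those options need amount to no more than a 
brief <editorialDecl> (the wording here is invented):

    <encodingDesc>
      <editorialDecl>
        <p>Transcription follows the conventions documented in the
           project's transcription manual.</p>
      </editorialDecl>
    </encodingDesc>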

*Utterances*

Several people disliked the expansion of the element name <u> 
(“utterance”) and its definition, for various theoretical reasons. Its 
definition should be 
modified to remove the implication that it necessarily followed a 
silence, though we seemed to agree that a <u> could only contain a 
stretch of speech from a single speaker.

The temporal alignment of a <u> can be indicated either by @start and 
@end or by nested <anchor/>s: the standard should probably recommend 
use of one or the other method, but not both. We discussed whether or 
not the fact that existing tools did not support the (even simpler) use 
of @trans to indicate overlap should lead us not to recommend it.
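
Sketches of the options (speakers, wording and timing references are 
invented; #T0–#T2 refer back to the timeline example above):

    <!-- method 1: alignment via @start and @end -->
    <u who="#SPK0" start="#T0" end="#T1">so what do you think</u>

    <!-- method 2: alignment via nested anchors -->
    <u who="#SPK1">
      <anchor synch="#T1"/>well <anchor synch="#T2"/>let me check
    </u>

    <!-- the still simpler @trans, which existing tools do not support -->
    <u who="#SPK0" trans="overlap">mm okay</u>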

*U-plus*

Thomas wanted some method of associating with a <u> the whole block of 
annotations made on it (represented as one or more <interpGrp>s). His 
document suggested using <div> for this purpose. A lighter-weight 
solution might be to include <interpGrp> within <u>, or to propose a new 
wrapper <annotatedU> element.
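
A sketch of two of the options (the content is invented, and <annotatedU> 
is only a proposed name, not an existing TEI element):

    <!-- the working document's suggestion: a wrapping <div> -->
    <div>
      <u xml:id="u1" who="#SPK0">okay</u>
      <interpGrp type="dialogueAct">
        <interp inst="#u1">acknowledgement</interp>
      </interpGrp>
    </div>

    <!-- lighter-weight alternative: annotations carried inside the <u> -->
    <u xml:id="u2" who="#SPK0">mm
      <interpGrp type="dialogueAct"><interp inst="#u2">backchannel</interp></interpGrp>
    </u>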

*Tokenization*

Laurent noted that MAF recommended use of <w> for individual tokens; we 
didn't need to take a stand on the definition of “word” but could simply 
refer to MAF. We needed some way of signalling the things that older 
transcription formats had found important, e.g. words considered 
incomplete, false starts, repetitions, abbreviations etc. so we needed 
to choose an appropriate TEI construct for them, even if we thought the 
concept was not useful or ill-defined. The general purpose <seg> element 
might be the simplest solution, but some diplomacy would be needed about 
how to define its application and its possible @type or @function values.
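
For example (the @type values are placeholders, not an agreed 
vocabulary), the phenomena could simply be wrapped in typed <seg>s around 
MAF-style <w> tokens:

    <u who="#SPK0">
      <seg type="repetition"><w>I</w> <w>I</w></seg>
      <w>think</w>
      <seg type="incomplete"><w>defin</w></seg>
      <w>definitely</w>
    </u>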

*Conclusions*

This workgroup will probably produce a useful document describing an 
important use case for the TEI recommendations on spoken language. It is 
currently a Google Doc which the group has agreed to share with the 
Council. I undertook to help turn this into an ODD, which could 
eventually become one of our Exemplars. Work on standardising other 
aspects of transcribed multimodal interactions probably needs to be 
deferred to a later stage.
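
As a first approximation (the ident and the exact module selection are my 
guesses, not anything agreed), such an ODD might start from a 
<schemaSpec> pulling in the spoken module alongside the usual 
infrastructure:

    <schemaSpec ident="iso-spoken-transcription" start="TEI">
      <moduleRef key="tei"/>
      <moduleRef key="header"/>
      <moduleRef key="core"/>
      <moduleRef key="textstructure"/>
      <moduleRef key="spoken"/>
      <moduleRef key="linking"/>
      <moduleRef key="analysis"/>
    </schemaSpec>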



