[tei-council] Fwd: Re: s vs seg, ticket 578

Lou Burnard lou.burnard at retired.ox.ac.uk
Thu Jun 6 05:54:24 EDT 2013




-------- Original Message --------
Subject: 	Re: s vs seg, ticket 578
Date: 	Thu, 6 Jun 2013 00:14:25 +0000
From: 	Piotr Bański <bansp at o2.pl>
To: 	Lou Burnard <lou.burnard at retired.ox.ac.uk>



Hi Lou, (and, via Lou, Hi Council)

On 06/05/2013 07:51 PM, Lou Burnard wrote:
> The reason we have both <s> and <seg> is that an eminent corpus linguist
> (now sadly deceased) opined very strongly that there should be a TEI
> element which enabled users to divide a text into smaller units (as is
> commonly done in many corpora) which did not nest and which tessellated
> the text completely. That element is <s>.

Understandable approach, on a span-based view of the text, very useful.

It's easy to look at <s>, however, as both a span and a node in the
syntactic constituent analysis, and this may be the entry point to the
"controversy".

> It was pointed out at the time
> that a more general kind of segment which could self nest and which was
> not required to tessellate the entire text would also be very useful.

Sure, though this is more of a constituent-based look, or at least it
more readily provokes such a perspective.

> That element is <seg>. I don't understand why this distinction, which
> is pretty clearly stated in the Guidelines (see e.g.
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AILCW),
> seems to have become problematic all of a sudden.

Isn't it good to notice a problem that has been lurking there for
years? My remark concerned <s> seen from a syntactic constituent
perspective -- in many cases, you want to make sure that it can be
self-nesting:

[S [S Jim likes wine ] but [S Jenny prefers beer] ]

You don't want this in typical sentence-boundary annotation, where you
want <s> elements exhaustively covering the entire text. But when you
try to make it perform both duties (it's a bit like that sketch of a
cube where you can't tell whether a particular corner is sticking out
towards you or away from you), problems may pop up.
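
To make the two readings concrete, here is a minimal Python sketch (mine,
not anything from the Guidelines). It assumes un-namespaced <p> and <s>
markup, and the helper names nested_s and uncovered_text are made up for
illustration. The first sample is a flat sentence-boundary encoding; the
second is one possible XML rendering of the bracketed example above.

```python
import xml.etree.ElementTree as ET


def nested_s(root):
    """Return every <s> element that has another <s> somewhere inside it."""
    return [s for s in root.iter("s") if any(d is not s for d in s.iter("s"))]


def uncovered_text(container):
    """Return non-whitespace text sitting directly in `container`,
    i.e. outside all of its child elements."""
    chunks = [container.text or ""] + [child.tail or "" for child in container]
    return [c.strip() for c in chunks if c.strip()]


# Span-based view: <s> elements do not nest and cover the paragraph.
flat = ET.fromstring("<p><s>Jim likes wine.</s> <s>Jenny prefers beer.</s></p>")

# Constituent-based view: an assumed encoding of
# [S [S Jim likes wine ] but [S Jenny prefers beer] ]
tree = ET.fromstring(
    "<p><s><s>Jim likes wine</s> but <s>Jenny prefers beer</s></s></p>"
)

print(len(nested_s(flat)), uncovered_text(flat))
print(len(nested_s(tree)), uncovered_text(tree[0]))
```

In the second sample the outer <s> is reported as self-nesting, and the
conjunction "but" is left outside the inner <s> elements, so the inner
ones no longer tessellate their parent. That is the sense in which the
two duties pull in different directions.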

> Do people think the distinction is not
> useful? Do we want to abolish one or other of these elements? (no point
> in keeping both if they are to be used in the same way)?

Is this TEI-talk? ;-) <gram> vs. <gen>, etc., <seg> vs. many others.

One quick solution seems to be to abandon the recommendation to use <s>
in the syntactic constituency context, but how to do that, other than by
cruelly never permitting

<s> <phr><w/><w/></phr> <w/> <phr> <w/> </phr> </s>

or, in the very same vein,

<s> <s/> <w/> <s/> </s>  (see above for an example)

... I don't know. (What's above seems an attractive way to quickly
annotate syntactic structure, so why not permit it?)

Maybe conditionally, by saying in one place in the Guidelines that on
the span-based perspective, you don't typically want <s> to self-nest,
and in the chapter on syntactic structure, by allowing it.

Another solution is to bite the bullet and allow <s> to self-nest
across the board, and to delegate the introduction of a possible ban on
self-nesting of <s> to the particular implementer -- a clean
customization, wouldn't it be?

There is a precedent, from another corpus linguist, in a rather
well-tested format:

http://www.cs.vassar.edu/CES/dtd2html/cesDoc/s.html

Best,

   P.


> Do we want to
> swop the names over? Obviously <s> is a special case of <seg>,
> so we could remove it, but that seems a bit unkind.




