[tei-council] [Fwd: Re: recording multiword expression in lemma attribute]

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Wed May 2 12:56:05 EDT 2007


This is really a datatype problem. Is there any desire on Council to 
review the datatype of the @lemma attribute so as to address the issue?

Making it xsd:token rather than data.word would help with the specific 
case Elena raises, at the expense of making this attribute inconsistent 
with all the other cases of "texty" attributes.
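
A minimal sketch of what such a change might look like in an ODD
customization. It assumes the attribute in question is @lemma on <w>, as in
the attDef quoted further down, and that <w> is declared in the "analysis"
module; the module name and the enclosing elementSpec are assumptions, not
something stated in this thread. The data.word reference is swapped for the
W3C Schema token datatype, which permits internal spaces:

```xml
<!-- Hypothetical ODD fragment: relax the datatype of @lemma on <w>. -->
<!-- module="analysis" is an assumption about where <w> is declared. -->
<elementSpec ident="w" module="analysis" mode="change">
  <attList>
    <attDef ident="lemma" mode="change">
      <datatype minOccurs="1" maxOccurs="1">
        <!-- xsd:token instead of data.word: whitespace is collapsed,
             but single internal spaces are allowed -->
        <rng:data xmlns:rng="http://relaxng.org/ns/structure/1.0"
                  type="token"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

With a definition along these lines, <w lemma="in primis">in primis</w>
should validate, at the cost of the inconsistency noted above.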


-------- Original Message --------
Subject: 	Re: recording multiword expression in lemma attribute
Date: 	Wed, 02 May 2007 17:03:41 +0100
From: 	Elena Pierazzo <elena.pierazzo at kcl.ac.uk>
To: 	Lou Burnard <lou.burnard at COMPUTING-SERVICES.OXFORD.AC.UK>
CC: 	TEI-L at listserv.brown.edu
References: 	<20070502152350.59BF2EB04D at webmail221.herald.ox.ac.uk>



Dear Lou,

thanks for your example: I'll think about it.

I would just argue that, from a linguistic point of view, a lemma is not
necessarily a single word (certainly not in the Romance languages).

As things stand, it seems that any project trying to lemmatize a text
in a language that has multiword expressions cannot use the <w> element
as it is and needs to customize it, either by modifying the class or by
creating new elements.

Furthermore, if the attribute approach is suitable for simple cases, why
should TEI not support complex cases as well? In many other modules we have
the opportunity to choose which granularity to adopt in the encoding, while
here it seems that projects adopting a complex linguistic approach have to
decide on their own how to customize.

Cheers,

Elena

Lou Burnard wrote:
> In my opinion, you would do better to put the lemma value into an element of its
> own. The attribute value approach is really only suitable for simple cases.
>
> So if it was me, I would define new elements <form> and <lem> as specialised
> kinds of <seg> (i.e. as synonyms for <seg type="form"> and <seg type="lem">) and
> then mark it up thusly:
>
>
> <w>
> <lem>in primis</lem>
> <form>in prrrrrimmmmissss</form>
> </w>
>
> This means you can put markup into the <lem> as well as spaces
>
> Alternatively, you could adopt a simple convention like this:
>
> <w lemma="in_primis">....</w>
>
> Redefining the datatype of the @lemma attribute to accept spaces as you propose
> would be a bit problematic since that changes the definition. Of course, you
> could also argue that it *shouldn't* be defined as data.word... but it currently is!
>
>
>
> In message <200705021450.l42CiBmN008989 at listserv.brown.edu> Elena Pierazzo
> <elena.pierazzo at KCL.AC.UK> writes:
>   
>>
>> Dear all,
>>
>> I'm working on a project with a strong lexicographical component, so we
>> are lemmatizing all the words. For this purpose we are using:
>>
>> <w lemma="">word</w>
>>
>> but we have trouble with multiword expressions (e.g. "in primis").
>> From a lexicographical point of view this is a single entry
>> (separating the expression into "in" and "primis" is simply nonsensical).
>> The problem is that
>>
>> <w lemma="in primis">in primis</w>
>>
>> is not valid as the lemma definition is
>>
>> <attList>
>>      <attDef ident="lemma" mode="change">
>>         <desc>identifies the word's lemma (dictionary entry form).</desc>
>>         <datatype minOccurs="1" maxOccurs="1">
>>            <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0" name="data.word"/>
>>         </datatype>
>>      ...
>>      </attDef>
>> </attList>
>>
>>
>> I can modify the definition, but I suspect that my problem is
>> rather common (for instance, the Italian language contains thousands of
>> multiword expressions...), so I would like to put the question to
>> everybody.
>>
>> Best
>>
>> Elena
>>
>>
>>
>> -- 
>> Elena Pierazzo
>> Associate Researcher
>> Centre for Computing in the Humanities
>> King's College London
>> Kay House 7 Arundel St
>> London WC2R 3DX
>>
>> Phone: 0207-848-1949
>> Fax: 0207-848-2980
>>


