[tei-council] [Fwd: Re: recording multiword expression in lemma attribute]

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Thu May 3 04:39:43 EDT 2007


There is one example in Elena's mail. As she points out, there are plenty
of times when it is convenient to regard as a single "word" something
which is conventionally written as several orthographically distinct
strings. Examples include "of course", "n'est-ce pas", "che bella", "et
caetera", etc. Now, whether or not you regard the use of <w> to tag
such things as legitimate, you have the problem that at present we
recommend using an attribute @lemma to carry the lemmatized version of
whatever it is (it somehow survived the war on attributes). And the
lemma corresponding to a multiword may very well be a multiword itself
(for example, you might decide to mark phrasal verbs in this way and want
to show that "putting up with" should have the lemma "put up with").

So yes, @lemma is the issue. Should it be abolished or should it have 
its datatype changed?
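
To make the datatype point concrete, here is a rough sketch in Python of
the difference between a word-like datatype and xsd:token. It is only an
illustration: data.word is approximated as "non-empty, no whitespace"
(an assumption; the real pattern is stricter about permitted characters),
and the function names are hypothetical, not part of any TEI tool.

```python
import re

# Assumption: data.word is approximated as "one or more characters, none of
# them whitespace". The real TEI pattern is stricter about which characters
# are allowed, but the whitespace restriction is what matters here.
WORD_LIKE = re.compile(r"[^ \t\r\n]+")


def fits_word_datatype(value: str) -> bool:
    """True if the value would pass as a single whitespace-free word."""
    return WORD_LIKE.fullmatch(value) is not None


def as_xsd_token(value: str) -> str:
    """Apply xsd:token whitespace handling: collapse runs, trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


if __name__ == "__main__":
    for lemma in ("primis", "in primis", "in_primis", "put up with"):
        print(f"{lemma!r}: word={fits_word_datatype(lemma)}, "
              f"token={as_xsd_token(lemma)!r}")
```

Under these assumptions "in primis" and "put up with" fail the word-like
check, while an underscore convention such as "in_primis" passes it, at
the cost of leaving each project to decode the convention for itself.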


Daniel O'Donnell wrote:
> Before we tamper with something like this, I'd like to see some real
> examples of multiword lemmas... and, how does that affect w? Isn't it
> lemma that's affected?
>
> On Wed, 2007-05-02 at 17:56 +0100, Lou Burnard wrote:
>   
>> This is really a datatype problem. Is there any desire on Council to 
>> review the datatype of the @lemma attribute so as to address the issue?
>>
>> Making it xsd:token rather than data.word would help with the specific 
>> case Elena raises, at the expense of making this attribute inconsistent 
>> with all the other cases of "texty" attributes.
>>
>>
>> -------- Original Message --------
>> Subject: 	Re: recording multiword expression in lemma attribute
>> Date: 	Wed, 02 May 2007 17:03:41 +0100
>> From: 	Elena Pierazzo <elena.pierazzo at kcl.ac.uk>
>> To: 	Lou Burnard <lou.burnard at COMPUTING-SERVICES.OXFORD.AC.UK>
>> CC: 	TEI-L at listserv.brown.edu
>> References: 	<20070502152350.59BF2EB04D at webmail221.herald.ox.ac.uk>
>>
>>
>>
>> Dear Lou,
>>
>> thanks for your example: I'll think about it.
>>
>> I would just argue that, from a linguistic point of view, a lemma is not
>> necessarily a single word (certainly not in the Romance languages).
>>
>> As it stands, it seems that any project trying to lemmatize a text
>> in a language that has multiword expressions cannot use the <w> element
>> as it is, and needs to customize it either by modifying the class or by
>> creating new elements.
>>
>> Furthermore, if the attribute approach is suitable for simple cases, why
>> should TEI not support complex cases? In many other modules we have the
>> opportunity to choose which granularity to adopt in the encoding, while
>> here it seems that complex cases, and projects that adopt a
>> complex linguistic approach, have to decide on their own how to customize.
>>
>> Cheers,
>>
>> Elena
>>
>> Lou Burnard wrote:
>>     
>>> In my opinion, you would do better to put the lemma value into an element of its
>>> own. The attribute value approach is really only suitable for simple cases.
>>>
>>> So if it was me, I would define new elements <form> and <lem> as specialised
>>> kinds of <seg> (i.e. as synonyms for <seg type="form"> and <seg type="lem">) and
>>> then mark it up thusly:
>>>
>>>
>>> <w>
>>> <lem>in primis</lem>
>>> <form>in prrrrrimmmmissss</form>
>>> </w>
>>>
>>> This means you can put markup into the <lem>, as well as spaces.
>>>
>>> Alternatively, you could adopt a simple convention like this:
>>>
>>> <w lemma="in_primis">....</w>
>>>
>>> Redefining the datatype of the @lemma attribute to accept spaces, as you propose,
>>> would be a bit problematic since that changes the definition. Of course, you
>>> could also argue that it *shouldn't* be defined as data.word... but it currently is!
>>>
>>>
>>>
>>> message <200705021450.l42CiBmN008989 at listserv.brown.edu> Elena Pierazzo
>>> <elena.pierazzo at KCL.AC.UK> writes:
>>>   
>>>       
>>>>
>>>> Dear all,
>>>>
>>>> I'm working on a project with a strong lexicographical component, so we
>>>> are lemmatizing all the words. For this purpose we are using:
>>>>
>>>> <w lemma="">word</w>
>>>>
>>>> but we are in trouble with multiword expressions (e.g. "in primis").
>>>> From a lexicographical point of view it is a matter of a single entry
>>>> (separating the expression into "in" and "primis" is simply nonsensical).
>>>> The problem is that
>>>>
>>>> <w lemma="in primis">in primis</w>
>>>>
>>>> is not valid as the lemma definition is
>>>>
>>>> <attList>
>>>>      <attDef ident="lemma" mode="change">
>>>>         <desc>identifies the word's lemma (dictionary entry form).</desc>
>>>>         <datatype minOccurs="1" maxOccurs="1">
>>>>            <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0"
>>>>                     name="data.word"/>
>>>>         </datatype>
>>>>      ...
>>>>      </attDef>
>>>> </attList>
>>>>
>>>>
>>>> I can modify the definition, but I think my problem may be
>>>> rather common (for instance, the Italian language contains thousands of
>>>> multiword expressions...), so I would like to put the question to
>>>> everybody.
>>>>
>>>> Best,
>>>>
>>>> Elena
>>>>
>>>>
>>>>
>>>> -- 
>>>> Elena Pierazzo
>>>> Associate Researcher
>>>> Centre for Computing in the Humanities
>>>> King's College London
>>>> Kay House 7 Arundel St
>>>> London WC2R 3DX
>>>>
>>>> Phone: 0207-848-1949
>>>> Fax: 0207-848-2980
>>>>



