[tei-council] [Fwd: Re: recording multiword expression in lemma attribute]

Arianna Ciula arianna.ciula at kcl.ac.uk
Thu May 3 05:08:59 EDT 2007


As I said to Elena face to face, I think her point is quite right and I 
would vote for changing the datatype of the @lemma attribute to xsd:token.
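
A minimal sketch of how that change might look in an ODD customization, following the attDef Elena quoted. The enclosing elementSpec for <w> in the analysis module is an assumption about where the attribute is declared, and the rng:data pattern assumes the XSD datatype library is in scope, as it normally is in a generated TEI schema:

```xml
<!-- Hypothetical customization fragment: the elementSpec wrapper and
     module name are assumptions, not taken from the thread. -->
<elementSpec ident="w" module="analysis" mode="change">
  <attList>
    <attDef ident="lemma" mode="change">
      <datatype minOccurs="1" maxOccurs="1">
        <!-- xsd:token instead of data.word: internal spaces are allowed -->
        <rng:data xmlns:rng="http://relaxng.org/ns/structure/1.0"
                  type="token"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

With xsd:token, leading and trailing whitespace is dropped and internal runs of whitespace collapse to single spaces, so <w lemma="in primis">in primis</w> would validate.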

Arianna

Lou Burnard wrote:
> This is really a datatype problem. Is there any desire on Council to 
> review the datatype of the @lemma attribute so as to address the issue?
> 
> Making it xsd:token rather than data.word would help with the specific 
> case Elena raises, at the expense of making this attribute inconsistent 
> with all the other cases of "texty" attributes.
> 
> 
> -------- Original Message --------
> Subject:     Re: recording multiword expression in lemma attribute
> Date:     Wed, 02 May 2007 17:03:41 +0100
> From:     Elena Pierazzo <elena.pierazzo at kcl.ac.uk>
> To:     Lou Burnard <lou.burnard at COMPUTING-SERVICES.OXFORD.AC.UK>
> CC:     TEI-L at listserv.brown.edu
> References:     <20070502152350.59BF2EB04D at webmail221.herald.ox.ac.uk>
> 
> 
> 
> Dear Lou,
> 
> thanks for your example: I'll think about it.
> 
> I would just argue that, from a linguistic point of view, a lemma is not
> necessarily a single word (certainly not in the Romance languages).
> 
> As it stands, it seems that any project trying to lemmatize a text in a
> language that has multiword expressions cannot use the <w> element as it
> is, and needs to customize it, either by modifying the class or by creating
> new elements.
> 
> Furthermore, if the attribute approach is suitable for simple cases, why
> should TEI not support complex cases as well? In many other modules we can
> choose which granularity to adopt in the encoding, while here it seems that
> projects adopting a more complex linguistic approach have to decide on
> their own how to customize.
> 
> Cheers,
> 
> Elena
> 
> Lou Burnard wrote:
>> In my opinion, you would do better to put the lemma value into an element
>> of its own. The attribute value approach is really only suitable for simple
>> cases.
>>
>> So if it was me, I would define new elements <form> and <lem> as specialised
>> kinds of <seg> (i.e. as synonyms for <seg type="form"> and <seg type="lem">)
>> and then mark it up thusly:
>>
>>
>> <w>
>> <lem>in primis</lem>
>> <form>in prrrrrimmmmissss</form>
>> </w>
>>
>> This means you can put markup into the <lem>, as well as spaces.
>>
>> Alternatively, you could adopt a simple convention like this:
>>
>> <w lemma="in_primis">....</w>
>>
>> Redefining the datatype of the @lemma attribute to accept spaces, as you
>> propose, would be a bit problematic since that changes the definition. Of
>> course, you could also argue that it *shouldn't* be defined as data.word...
>> but it currently is!
>>
>>
>>
>> message <200705021450.l42CiBmN008989 at listserv.brown.edu> Elena Pierazzo
>> <elena.pierazzo at KCL.AC.UK> writes:
>>  
>>>
>>> Dear all,
>>>
>>> I'm working on a project with a strong lexicographical component, so
>>> we are lemmatizing all the words. For this purpose we are using:
>>>
>>> <w lemma="">word</w>
>>>
>>> but we are having trouble with multiword expressions (e.g. "in primis").
>>> From a lexicographical point of view it is a single entry (separating
>>> the expression into "in" and "primis" is simply nonsensical). The problem
>>> is that
>>>
>>> <w lemma="in primis">in primis</w>
>>>
>>> is not valid, because the lemma definition is
>>>
>>> <attList>
>>>      <attDef ident="lemma" mode="change">
>>>         <desc>identifies the word's lemma (dictionary entry form).</desc>
>>>         <datatype minOccurs="1" maxOccurs="1">
>>>            <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0" name="data.word"/>
>>>         </datatype>
>>>      ...
>>>      </attDef>
>>> </attList>
>>>
>>>
>>> I can modify the definition, but I think my problem may be rather
>>> common (for instance, the Italian language contains thousands of
>>> multiword expressions...), so I would like to put the question to
>>> everybody.
>>>
>>> Best
>>>
>>> Elena
>>>
>>>
>>>
>>> -- 
>>> Elena Pierazzo
>>> Associate Researcher
>>> Centre for Computing in the Humanities
>>> King's College London
>>> Kay House 7 Arundel St
>>> London WC2R 3DX
>>>
>>> Phone: 0207-848-1949
>>> Fax: 0207-848-2980
>>>
> _______________________________________________
> tei-council mailing list
> tei-council at lists.village.Virginia.EDU
> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council

-- 
Dr Arianna Ciula
Research Associate
Centre for Computing in the Humanities
King's College London
Strand
London WC2R 2LS (UK)
Tel: +44 (0)20 78481945
http://www.kcl.ac.uk/cch


