[tei-council] [Fwd: Re: recording multiword expression in lemma attribute]
Arianna Ciula
arianna.ciula at kcl.ac.uk
Thu May 3 05:08:59 EDT 2007
As I said to Elena face to face, I think her point is quite right and I
would vote for changing the datatype of the @lemma attribute to xsd:token.
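
For concreteness, here is a minimal sketch of how that change might look in an ODD customization, reusing the attDef that Elena quotes further down. The enclosing elementSpec and its module value are my assumptions, as is the use of rng:data to reach the XML Schema token type, so please treat this as an illustration rather than the agreed fix:

```xml
<!-- Sketch only: the elementSpec wrapper and module="analysis" are assumed -->
<elementSpec ident="w" module="analysis" mode="change">
  <attList>
    <attDef ident="lemma" mode="change">
      <datatype minOccurs="1" maxOccurs="1">
        <!-- xsd:token collapses whitespace runs but keeps single
             internal spaces, so "in primis" would be accepted -->
        <rng:data xmlns:rng="http://relaxng.org/ns/structure/1.0"
                  type="token"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

With something like this in place, <w lemma="in primis">in primis</w> should validate, whereas data.word rejects any value containing a space.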
Arianna
Lou Burnard wrote:
> This is really a datatype problem. Is there any desire on Council to
> review the datatype of the @lemma attribute so as to address the issue?
>
> Making it xsd:token rather than data.word would help with the specific
> case Elena raises, at the expense of making this attribute inconsistent
> with all the other cases of "texty" attributes.
>
>
> -------- Original Message --------
> Subject: Re: recording multiword expression in lemma attribute
> Date: Wed, 02 May 2007 17:03:41 +0100
> From: Elena Pierazzo <elena.pierazzo at kcl.ac.uk>
> To: Lou Burnard <lou.burnard at COMPUTING-SERVICES.OXFORD.AC.UK>
> CC: TEI-L at listserv.brown.edu
> References: <20070502152350.59BF2EB04D at webmail221.herald.ox.ac.uk>
>
>
>
> Dear Lou,
>
> thanks for your example: I'll think about it.
>
> I would just argue that, from a linguistic point of view, a lemma is
> not necessarily a single word (certainly not in the Romance languages).
>
> As it stands, it seems that any project trying to lemmatize a text in a
> language that has multiword expressions cannot use the <w> element as
> it is, and needs to customize it, either by modifying the class or by
> creating new elements.
>
> Furthermore, if the attribute approach is suitable for simple cases, why
> should TEI not support complex cases as well? In many other modules we
> can choose which granularity to adopt in the encoding, while here it
> seems that projects adopting a more complex linguistic approach have to
> decide on their own how to customize.
>
> Cheers,
>
> Elena
>
> Lou Burnard wrote:
>> In my opinion, you would do better to put the lemma value into an
>> element of its own. The attribute value approach is really only
>> suitable for simple cases.
>>
>> So if it was me, I would define new elements <form> and <lem> as
>> specialised kinds of <seg> (i.e. as synonyms for <seg type="form">
>> and <seg type="lem">) and then mark it up thusly:
>>
>>
>> <w>
>>   <lem>in primis</lem>
>>   <form>in prrrrrimmmmissss</form>
>> </w>
>>
>> This means you can put markup, as well as spaces, into the <lem>.
>>
>> Alternatively, you could adopt a simple convention like this:
>>
>> <w lemma="in_primis">....</w>
>>
>> Redefining the datatype of the @lemma attribute to accept spaces, as
>> you propose, would be a bit problematic since that changes the
>> definition. Of course, you could also argue that it *shouldn't* be
>> defined as data.word... but it currently is!
>>
>>
>>
>> message <200705021450.l42CiBmN008989 at listserv.brown.edu> Elena Pierazzo
>> <elena.pierazzo at KCL.AC.UK> writes:
>>
>>>
>>> Dear all,
>>>
>>> I'm working on a project with a strong lexicographical component, so
>>> we are lemmatizing all the words. For this purpose we are using:
>>>
>>> <w lemma="">word</w>
>>>
>>> but we are having trouble with multiword expressions (e.g. "in primis").
>>> From a lexicographical point of view, such an expression is a single
>>> entry (separating it into "in" and "primis" is simply nonsensical).
>>> The problem is that
>>>
>>> <w lemma="in primis">in primis</w>
>>>
>>> is not valid, as the lemma definition is
>>>
>>> <attList>
>>>   <attDef ident="lemma" mode="change">
>>>     <desc>identifies the word's lemma (dictionary entry form).</desc>
>>>     <datatype minOccurs="1" maxOccurs="1">
>>>       <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0"
>>>                name="data.word"/>
>>>     </datatype>
>>>     ...
>>>   </attDef>
>>> </attList>
>>>
>>>
>>> I can modify the definition, but I think my problem may be rather
>>> common (for instance, the Italian language contains thousands of
>>> multiword expressions...), so I would like to put the question to
>>> everybody.
>>>
>>> Best,
>>>
>>> Elena
>>>
>>>
>>>
>>> --
>>> Elena Pierazzo
>>> Associate Researcher
>>> Centre for Computing in the Humanities
>>> King's College London
>>> Kay House 7 Arundel St
>>> London WC2R 3DX
>>>
>>> Phone: 0207-848-1949
>>> Fax: 0207-848-2980
>>>
> _______________________________________________
> tei-council mailing list
> tei-council at lists.village.Virginia.EDU
> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
--
Dr Arianna Ciula
Research Associate
Centre for Computing in the Humanities
King's College London
Strand
London WC2R 2LS (UK)
Tel: +44 (0)20 78481945
http://www.kcl.ac.uk/cch