[tei-council] [Fwd: Re: recording multiword expression in lemma attribute]

Dot Porter dporter at uky.edu
Thu May 3 15:57:27 EDT 2007


On 5/3/07, Lou Burnard <lou.burnard at computing-services.oxford.ac.uk> wrote:
> So yes, @lemma is the issue. Should it be abolished or should it have
> its datatype changed?
>
My first choice is to have the datatype changed to a pointer; my
second preference is for @lemma to become <lemma>.
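
To illustrate the pointer option: a minimal sketch, assuming @lemma were
redefined as data.pointer and the lemma itself lived in a dictionary
entry elsewhere. The xml:id value "in_primis" and the <entry> shown are
hypothetical, made up for illustration only.

   <w lemma="#in_primis">in primis</w>

   <!-- elsewhere, e.g. in a lexicon section or a separate file -->
   <entry xml:id="in_primis">
      <form type="lemma"><orth>in primis</orth></form>
   </entry>

The multiword form then sits in element content, where spaces and
markup are allowed, and the attribute value stays a single token.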

Dot

>
> Daniel O'Donnell wrote:
> > Before we tamper with something like this, I'd like to see some real
> > examples of multiword lemmas... and, how does that affect w? Isn't it
> > lemma that's affected?
> >
> > On Wed, 2007-05-02 at 17:56 +0100, Lou Burnard wrote:
> >
> >> This is really a datatype problem. Is there any desire on Council to
> >> review the datatype of the @lemma attribute so as to address the issue?
> >>
> >> Making it xsd:token rather than data.word would help with the specific
> >> case Elena raises, at the expense of making this attribute inconsistent
> >> with all the other cases of "texty" attributes.
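> >>
> >> A minimal sketch of that change as an ODD customization, assuming
> >> @lemma is declared on <w> in the analysis module (adjust the
> >> elementSpec if it is declared elsewhere in your version of P5):
> >>
> >> <elementSpec ident="w" module="analysis" mode="change">
> >>    <attList>
> >>       <attDef ident="lemma" mode="change">
> >>          <datatype minOccurs="1" maxOccurs="1">
> >>             <!-- xsd:token allows spaces between words
> >>                  (whitespace is collapsed) -->
> >>             <rng:data xmlns:rng="http://relaxng.org/ns/structure/1.0"
> >>                       type="token"/>
> >>          </datatype>
> >>       </attDef>
> >>    </attList>
> >> </elementSpec>
> >>
> >> With this in place, <w lemma="in primis">in primis</w> should validate.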
> >>
> >>
> >> -------- Original Message --------
> >> Subject:     Re: recording multiword expression in lemma attribute
> >> Date:        Wed, 02 May 2007 17:03:41 +0100
> >> From:        Elena Pierazzo <elena.pierazzo at kcl.ac.uk>
> >> To:  Lou Burnard <lou.burnard at COMPUTING-SERVICES.OXFORD.AC.UK>
> >> CC:  TEI-L at listserv.brown.edu
> >> References:  <20070502152350.59BF2EB04D at webmail221.herald.ox.ac.uk>
> >>
> >>
> >>
> >> Dear Lou,
> >>
> >> thanks for your example: I'll think about it.
> >>
> >> I would just argue that, from a linguistic point of view, a lemma is not
> >> necessarily a single word (certainly not in the Romance languages).
> >>
> >> As it stands, it seems that any project trying to lemmatize a text
> >> in a language that has multiword expressions cannot use the <w> element
> >> as it is and needs to customize it, either by modifying the class or by
> >> creating new elements.
> >>
> >> Furthermore, if the attribute approach is suitable for simple cases, why
> >> should TEI not support complex cases too? In many other modules we can
> >> choose which granularity to adopt in the encoding, while here it seems
> >> that projects taking a complex linguistic approach have to decide on
> >> their own how to customize.
> >>
> >> Cheers,
> >>
> >> Elena
> >>
> >> Lou Burnard wrote:
> >>
> >>> In my opinion, you would do better to put the lemma value into an element of its
> >>> own. The attribute value approach is really only suitable for simple cases.
> >>>
> >>> So if it was me, I would define new elements <form> and <lem> as specialised
> >>> kinds of <seg> (i.e. as synonyms for <seg type="form"> and <seg type="lem">) and
> >>> then mark it up thusly:
> >>>
> >>>
> >>> <w>
> >>> <lem>in primis</lem>
> >>> <form>in prrrrrimmmmissss</form>
> >>> </w>
> >>>
> >>> This means you can put markup into the <lem>, as well as spaces.
> >>>
> >>> Alternatively, you could adopt a simple convention like this:
> >>>
> >>> <w lemma="in_primis">....</w>
> >>>
> >>> Redefining the datatype of the @lemma attribute to accept spaces, as you
> >>> propose, would be a bit problematic since that changes the definition. Of course,
> >>> you could also argue that it *shouldn't* be defined as data.word... but it currently is!
> >>>
> >>>
> >>>
> >>> In message <200705021450.l42CiBmN008989 at listserv.brown.edu> Elena Pierazzo
> >>> <elena.pierazzo at KCL.AC.UK> writes:
> >>>
> >>>
> >>>>
> >>>> Dear all,
> >>>>
> >>>> I'm working on a project with a strong lexicographical component, so we
> >>>> are lemmatizing all the words. For this purpose we are using:
> >>>>
> >>>> <w lemma="">word</w>
> >>>>
> >>>> but we are in trouble with multiword expressions (e.g. "in primis").
> >>>> From a lexicographical point of view it is a matter of a single entry
> >>>> (separating the expression into "in" and "primis" is simply nonsensical).
> >>>> The problem is that
> >>>>
> >>>> <w lemma="in primis">in primis</w>
> >>>>
> >>>> is not valid as the lemma definition is
> >>>>
> >>>> <attList>
> >>>>      <attDef ident="lemma" mode="change">
> >>>>         <desc>identifies the word's lemma (dictionary entry form).</desc>
> >>>>         <datatype minOccurs="1" maxOccurs="1">
> >>>>            <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0" name="data.word"/>
> >>>>         </datatype>
> >>>>      ...
> >>>>      </attDef>
> >>>> </attList>
> >>>>
> >>>>
> >>>> I can modify the definition, but I suspect that my problem is
> >>>> rather common (for instance, Italian contains thousands of
> >>>> multiword expressions...), so I would like to put the question to
> >>>> everybody.
> >>>>
> >>>> Best,
> >>>>
> >>>> Elena
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Elena Pierazzo
> >>>> Associate Researcher
> >>>> Centre for Computing in the Humanities
> >>>> King's College London
> >>>> Kay House 7 Arundel St
> >>>> London WC2R 3DX
> >>>>
> >>>> Phone: 0207-848-1949
> >>>> Fax: 0207-848-2980
> >>>>
> >> _______________________________________________
> >> tei-council mailing list
> >> tei-council at lists.village.Virginia.EDU
> >> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
> >>
>


-- 
***************************************
Dot Porter, University of Kentucky
#####
Program Coordinator
Collaboratory for Research in Computing for Humanities
dporter at uky.edu          859-257-9549
#####
Editorial Assistant, REVEAL Project
Center for Visualization and Virtual Environments
porter at vis.uky.edu
***************************************


