4.0171 Lemmatization (2/99)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Wed, 13 Jun 90 17:46:05 EDT

Humanist Discussion Group, Vol. 4, No. 0171. Wednesday, 13 Jun 1990.


(1) Date: Sat, 02 Jun 90 16:36:00 CST (63 lines)
From: "Robin C. Cover" <ZRCC1001@SMUVM1>
Subject: LEMMATIZATION

(2) Date: 13 JUN 90 16:40 CET (36 lines)
From: A400101@DM0LRZ01
Subject: 4.163 lemmatization

(1) --------------------------------------------------------------------
Date: Sat, 02 Jun 90 16:36:00 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1>
Subject: LEMMATIZATION

Re: 4.0163, Faulhaber's query on lemmatization programs

I'm sure lemmatization programs must be available for English (beyond
those that may be mentioned in the <cit>Humanities Computing
Yearbook</cit>). For a more generalized approach (other natural
languages), you might examine the interlinear text (IT) processing
program developed at SIL. It's not a dedicated lemmatizer, but a tool
for developing a corpus of annotated interlinear text. Among other
things, it maintains lexical mappings between multi-line interlinear
aligned fields, two of which could certainly be your base text and
lemma. Much of the annotation process is done automatically. They also
have a related program (something like ITF, "interlinear text
formatter"), a (La)TeX based tool that supports typesetting of texts in
interesting interlinear text formats -- though this package may not be
complete yet. The IT program does just about what you asked for, though
a lot more.

The IT program (quoting an older brochure):

* keeps word and morpheme annotations aligned vertically with base form
* saves word and morpheme annotations in on-line lexical database
* retrieves previous annotations to ensure consistency
* inserts annotations automatically when unambiguous
* asks for user input when annotation is unknown or has multiple options
* adds new annotations automatically to lexical database
* allows user to specify organization and type of annotations
* up to 22 kinds of annotations per analyzed text

Word or morpheme annotations might include: phonetic transcription,
traditional orthography, allomorphic transcription, morphemic
representation, lexemic representation, morpheme glosses, word glosses,
grammatical categories, syntactic bracketing, functional labels,
semantic case roles, semantic subcategorization, participant indexing,
intonation, and so forth. In addition to word- and morpheme-level
glosses, freeform (clause, sentence) annotations are also possible.
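
For what it's worth, here is a minimal sketch (in Python, my own toy
code, not SIL's) of the annotate-or-ask cycle the brochure describes;
the function name and the tiny lexicon are invented for illustration:

  # Toy lexical database: base form -> list of known annotations.
  lexicon = {"arma": ["arma <noun>"], "cano": ["cano <verb>", "canus <adj>"]}

  def annotate(words, lexicon):
      """Insert annotations automatically when unambiguous, ask otherwise."""
      result = []
      for word in words:
          options = lexicon.setdefault(word.lower(), [])
          if len(options) == 1:
              choice = options[0]              # unambiguous: no user input
          else:
              print(word, "->", options or "no entry yet")
              choice = input("annotation? ")   # unknown or multiple options
              if choice not in options:
                  options.append(choice)       # new annotation saved to database
          result.append((word, choice))
      return result

  print(annotate("Arma virumque cano".split(), lexicon))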

There might be a description of IT in the 1988 HCY, but it would be
out-of-date by now. You want the description of the Mac program.
The earliest DOS versions of IT were (in my judgment) a bit hard to use.
With the improved documentation and interactive interface in the
Macintosh incarnation, I think IT is quite a usable tool. The Mac
version is clearly superior to the IBM version, especially in guiding the
user through the process of creating a text model. The program
currently has only minimal facilities for database manipulation (though
it can generate dictionaries and mappings of various kinds). For full
database handling of these interlinear text databases, one can use LBase,
a PC-based program mentioned previously on this forum.

For availability of IT, contact:

The Academic Book Center
Summer Institute of Linguistics
7500 W. Camp Wisdom Road
Dallas, TX 75236 USA
(214) 709-2404

Robin Cover
BITNET: zrcc1001@smuvm1
INTERNET: robin@txsil.lonestar.org
(2) --------------------------------------------------------------------
Date: 13 JUN 90 16:40 CET
From: A400101@DM0LRZ01
Subject: 4.163 lemmatization

As regards Charles Faulhaber's query on lemmatization, I know of no
generally available, _language-independent_ programs. But I am doubtful
about lemmatizing interactively at all. It really is preferable, in our
experience (we have just produced a lemmatized concordance to Gratian's
Decretum, a text of about 4 MB with 420,000 words), to be able to
look at a whole set of lexical forms with information about whether the
system thinks they are unambiguous or not. If you arrange your text
into a table of forms (case-insensitive) and a table of instances giving
information about position in text, how to convert back for
upper/lower-case, etc., then you only need to lemmatize the _forms_ in a
text (in medieval Latin, the ratio of forms to occurrences is about 1:10
in a large text - I don't know how this compares with other languages
and much would probably depend on whether the orthography is
standardized or not). A good morphological analyser will be able to
tell you that most forms are unambiguous; you can review its decisions
and in most cases forget about hand intervention altogether. For the
seemingly ambiguous forms you can in many cases (again, generalising
here from medieval Latin) say that the ambiguity is only theoretical,
which leaves only a very small number of forms that need to be
disambiguated by hand.
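
To make the forms/instances split concrete, here is a minimal sketch
(Python, my own illustration rather than the MGH code): one table of
distinct case-folded forms and one table of instances recording position
and original capitalization, so that only the forms need to be lemmatized:

  import re

  def index_text(text):
      forms = {}        # case-folded form -> form number
      instances = []    # (character position, form number, was capitalized?)
      for m in re.finditer(r"[A-Za-z]+", text):
          word = m.group(0)
          form_id = forms.setdefault(word.lower(), len(forms))
          instances.append((m.start(), form_id, word[0].isupper()))
      return forms, instances

  # Opening words of Gratian's Decretum, dist. 1:
  text = "Ius naturae est, quod in lege et evangelio continetur."
  forms, instances = index_text(text)
  print(len(forms), "forms for", len(instances), "occurrences")
  # A lemma need only be assigned once per entry in `forms`; the
  # instances table maps the result back to positions and casing.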

If instead you take the forms as they come up interactively and decide on
the fly, the risk of mistakes is much greater, and in a large text it is
hard to stay consistent.

None of this answers the question of what programs will do this for you
- but if your preferred system has a relational DBMS and you have a
reasonably well organized machine-readable dictionary in the language
of your choice, it ought to be possible to set something up
without too much difficulty ...
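
As one possible shape for such a set-up, a hedged sketch using SQLite
standing in for whatever relational DBMS is to hand; the table and
column names and the tiny dictionary are invented for illustration:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
      CREATE TABLE forms(form TEXT PRIMARY KEY);
      CREATE TABLE dictionary(form TEXT, lemma TEXT);
  """)
  con.executemany("INSERT INTO forms VALUES (?)",
                  [("legibus",), ("canones",), ("cano",)])
  con.executemany("INSERT INTO dictionary VALUES (?, ?)",
                  [("legibus", "lex"), ("canones", "canon"),
                   ("cano", "cano"), ("cano", "canus")])

  # One candidate lemma: accept it; none or several: flag for hand checking.
  query = """SELECT f.form, COUNT(d.lemma), GROUP_CONCAT(d.lemma)
             FROM forms f LEFT JOIN dictionary d ON d.form = f.form
             GROUP BY f.form"""
  for form, n, lemmas in con.execute(query):
      status = "unambiguous" if n == 1 else "check by hand"
      print(form, "->", lemmas, "(" + status + ")")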

Timothy Reuter, Monumenta Germaniae Historica, Munich