4.1171 CELEX: Dutch/German/English Lexical Databases (1/207)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Tue, 26 Mar 91 00:02:56 EST

Humanist Discussion Group, Vol. 4, No. 1171. Tuesday, 26 Mar 1991.

Date: Thu, 21 Mar 91 16:56 (Nederland)
From: Gavin Burnage <GAVIN@CELEX.KUN.NL>
Subject: CELEX -- Centre for Lexical Information


Since 1986, the Dutch national Expertise Centre CELEX (Centre for
Lexical Information) has been constructing large electronic
databases containing various types of lexical data on present-day
Dutch, English and German. CELEX makes this information available
to institutes and companies engaged in language and speech
research and in the development of language and speech oriented
technological systems. Using the specially-developed program FLEX,
you can access the databases with ease -- no technical expertise
is necessary -- and extract information which matches the detailed
requirements you have. In addition, CELEX can offer assistance
with respect to related research and development projects.

The lexical data are stored in three separate databases. The Dutch
database is now complete, and contains information on
approximately 400,000 present-day Dutch wordforms. The English
database currently contains 100,000 wordforms and will soon be
extended with another 50,000 wordforms. The first version of the
German database was made available in August 1990 and contains
51,000 wordforms. New information on translation equivalency is
currently being developed, along with additional syntactic and
semantic subcategorizations to establish semantic links between
the three databases. The results of these extensions will be made
available in 1991.

The information contained in all three databases has been derived
from various sources. For the most part, dictionary information
has been combined with frequency data taken from large text
corpora. By means of various manual and automatic procedures,
CELEX has checked, improved and extended the information. On offer
now is detailed information on the orthography (spelling),
phonology (pronunciation), morphology (word structure:
inflectional and derivational), syntax (grammar) and frequency of
words. An important feature of the CELEX databases is that all the
information in them has been represented to meet the formal and
strict requirements of computational applications. The data are
contained in a relational DBMS (database management system), a
highly flexible tool for storing, updating and manipulating the
data; it also allows users to make individual selections from the
vast quantities of data included. The CELEX user interface FLEX
was specially designed to make it easy for non-technical people to
use the databases. Researchers can log in to CELEX, create their
own particular lexicons using FLEX, and extract the information
for their own use. By selecting specific items from the numerous
possibilities presented in the FLEX menus, and by specifying
restrictions on the selection of words from the databases, you can
define and control the contents of your lexicon.
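
As a rough illustration of what selecting columns and restricting the
selection amounts to in a relational database, here is a minimal Python
sketch using the standard sqlite3 module. It is not FLEX and says
nothing about how CELEX is implemented: the table name 'lemmas', the
column names and the helper 'build_lexicon' are hypothetical, and the
few rows are taken from the sample lemma lexicon shown later in this
announcement.

```python
import sqlite3

# Hypothetical schema for illustration only; the real CELEX tables and
# the FLEX menus are not described in this announcement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemmas (headword TEXT, word_class TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO lemmas VALUES (?, ?, ?)",
    [
        ("celebrant", "N", 6),
        ("cell", "N", 1210),
        ("cello", "N", 25),
        ("cellular", "A", 21),
    ],
)


def build_lexicon(conn, columns, min_freq):
    """Return the chosen columns for all lemmas with freq >= min_freq."""
    allowed = {"headword", "word_class", "freq"}
    if not set(columns) <= allowed:
        raise ValueError("unknown column requested")
    # Column names are checked against the whitelist above; the
    # restriction value is passed as a bound parameter.
    sql = f"SELECT {', '.join(columns)} FROM lemmas WHERE freq >= ? ORDER BY headword"
    return conn.execute(sql, (min_freq,)).fetchall()


# Two columns, restricted to lemmas with a frequency of at least 20.
lexicon = build_lexicon(conn, ["headword", "freq"], 20)
for row in lexicon:
    print(row)
```

The column list plays the part of the menu selections and the frequency
threshold the part of a restriction; together they determine the
contents of the resulting lexicon.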

LEXICON TYPES

When you begin work with any of the databases, you can normally
choose between two so-called `lexicon types': either a LEMMA
LEXICON or a WORDFORM LEXICON.

Each lexicon type is based on a specific kind of main entry, a
lemma or a wordform. The lemma lexicon is the one most similar to
an ordinary dictionary since each entry refers to a full set of
inflected words, dealt with together under some convenient
heading. Dictionaries normally represent lemmas as headwords: the
verb lemma `call' represents all the verbal forms which `call' can
appear as. In the CELEX English database, the lemma is represented
by the conventional dictionary-type headword, while in the Dutch
and German databases you can choose between the conventional
headword form and the stem form. In contrast, entries in a
wordform lexicon deal with each individual flection -- this is
where you find `call', `calls', `called', and `calling'.

INFORMATION AVAILABLE

For either lexicon type you can select any number of the 150
columns available for that type. The table below
summarizes the sort of information you could include in an English
lexicon:

-------------------------------------------------------------------
Orthography       - with or without diacritics
(spelling)        - with or without word division positions
                  - alternative spellings
                  - number of letters/syllables

Phonology         - phonetic transcriptions (using SAMPA notation or
(pronunciation)     Computer Phonetic Alphabet (CPA) notation) with:
                      - syllable boundaries
                      - primary and secondary stress markers
                  - consonant-vowel patterns
                  - number of phonemes/syllables
                  - alternative pronunciations

Morphology        - Derivational/compositional:
(word structure)      - division into stems and affixes
                      - flat or hierarchical representations
                  - Inflectional:
                      - stems and their inflections

Syntax            - word class
(grammar)         - subcategorizations per word class

Frequency         - COBUILD frequency*
-------------------------------------------------------------------
*These frequency data are based on the COBUILD corpus (sized 18
 million words) built up by the University of Birmingham, UK.
AN EXAMPLE

If you create a small English lemma lexicon (that is, one with
only a few columns), you might extract information like this from
it:

-----------------------------------------------------------------
Headword    Pronunciation    Morphology:         Mor:  Class Freq
                             Structure           Class
----------- ---------------- ------------------- ----- ----- ----
celebrant   "sE-lI-br@nt     ((celebrate),(ant)) Vx    N        6
celebration %sE-lI-"breI-Sn, ((celebrate),(ion)) Vx    N      201
cell        "sEl             (cell)              N     N     1210
cellar      "sE-l@r*         (cellar)            N     N      228
cellarage   "sE-l@-rIdZ      ((cellar),(age))    Nx    N        0
cellist     "tSE-lIst        ((cello),(ist))     Nx    N        5
cello       "tSE-l@U         (cello)             N     N       25
cellular    "sEl-jU-l@r*     ((cell),(ular))     Nx    A       21
celluloid   "sEl-jU-lOId     ((cellulose),(oid)) Nx    N       29
-----------------------------------------------------------------
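
The transcriptions in this sample can be taken apart mechanically. The
Python sketch below assumes what the samples suggest: that '-'
separates syllables, and that the SAMPA marks '"' and '%' at the start
of a syllable indicate primary and secondary stress. The function name
is hypothetical, and other symbols (such as ',' and '*') are left
untouched because this announcement does not explain them.

```python
def syllable_stress(transcription):
    """Split a transcription into (syllable, stress) pairs.

    Assumes '-' is the syllable boundary and that a leading '"' or '%'
    marks primary or secondary stress on that syllable.
    """
    result = []
    for syllable in transcription.split("-"):
        if syllable.startswith('"'):
            result.append((syllable[1:], "primary"))
        elif syllable.startswith("%"):
            result.append((syllable[1:], "secondary"))
        else:
            result.append((syllable, "none"))
    return result


# The transcription given for 'celebration' in the sample above.
parsed = syllable_stress('%sE-lI-"breI-Sn,')
for syllable, stress in parsed:
    print(syllable, stress)
```

Counting the pairs gives the number of syllables, and the position of
the "primary" entry gives the stressed syllable.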


Similarly, a small English wordform lexicon giving the flections
associated with the lemmas above might look like this:

--------------------------------------------------------------
Word         Word division   Pronunciation     Class Type Freq
------------ --------------- ----------------- ----- ---- ----
celebrant    cel-e-brant     "sE-lI-br@nt      N     sing    2
celebrants   cel-e-brants    "sE-lI-br@nts     N     plu     4
celebration  cel-e-bra-tion  %sE-lI-"breI-Sn,  N     sing  144
celebrations cel-e-bra-tions %sE-lI-"breI-Sn,z N     plu    57
cell         cell            "sEl              N     sing  655
cells        cells           "sElz             N     plu   555
cellar       cel-lar         "sE-l@r*          N     sing  187
cellars      cel-lars        "sE-l@z           N     plu    41
cellarage    cel-lar-age     "sE-l@-rIdZ       N     sing    0
cellarages   cel-lar-ag-es   "sE-l@-rI-dZIz    N     plu     0
cellist      cel-list        "tSE-lIst         N     sing    5
cellists     cel-lists       "tSE-lIsts        N     plu     0
cello        cel-lo          "tSE-l@U          N     sing   24
cellos       cel-los         "tSE-l@Uz         N     plu     1
cellular     cel-lu-lar      "sEl-jU-l@r*      A     pos    21
celluloid    cel-lu-loid     "sEl-jU-lOId      N     sing   29
--------------------------------------------------------------
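
In the two sample tables, each lemma frequency equals the sum of the
frequencies of its wordforms (for instance 144 + 57 = 201 for
`celebration'). The Python sketch below checks this for a few rows. The
lemma column and the function name are added here for illustration, and
this announcement does not say whether the relation holds throughout
the databases.

```python
from collections import defaultdict

# (wordform, lemma, frequency) rows copied from the wordform sample.
# The lemma column is an assumption added for this illustration.
rows = [
    ("celebrant", "celebrant", 2),
    ("celebrants", "celebrant", 4),
    ("celebration", "celebration", 144),
    ("celebrations", "celebration", 57),
    ("cell", "cell", 655),
    ("cells", "cell", 555),
    ("cellar", "cellar", 187),
    ("cellars", "cellar", 41),
    ("cello", "cello", 24),
    ("cellos", "cello", 1),
]


def lemma_frequencies(rows):
    """Sum wordform frequencies under their lemma."""
    totals = defaultdict(int)
    for _wordform, lemma, freq in rows:
        totals[lemma] += freq
    return dict(totals)


totals = lemma_frequencies(rows)
for lemma, freq in totals.items():
    print(lemma, freq)
```

The totals can be compared by eye with the Freq column of the lemma
sample given earlier.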

GETTING AT THE DATABASES

People in the Netherlands can log in to CELEX using SURFnet, the
Dutch academic network. People elsewhere can use the available PSDNs
(Packet Switching Data Networks). In the UK, JANET users connect
first to the PSS gateways in London and Manchester, and then log in
to CELEX. Some locations in the UK and the rest of Europe currently
have direct IXI connections free of charge. In the US, any of the
public PSDNs (TYMNET, AUTONET, or UNINET to name just a few) can
provide direct access to the CELEX machine. In Germany the national
PSDN is called DATEX-P. Most countries have a PSDN which can provide
a connection to let you log in and work with the CELEX databases,
and several users outside the Netherlands have been able to do it --
there are CELEX users in the USA, the UK, Germany, Belgium and
Austria. If, however, the network connections aren't sufficient,
then CELEX can prepare the information you require and send it on
tape.

COSTS AND CONDITIONS

Before access to the databases is provided, a licence agreement
between the user (usually the user's institution) and CELEX is
drawn up, which settles the conditions and rights concerning
access to and use of the databases. In most cases, charges are
levied for the use of the database. Since the mention of money
usually causes alarm in academic circles, it's worth stressing
that this is purely a cost-covering exercise to ensure that the
system can be maintained, and that more information can be
developed. This is Dutch government policy at present: state funds
enable a central resource to be set up for the general good in the
hope that others who need such resources will not waste time and
money in constructing similar facilities. Once set up, those
facilities are available at a price far lower than the cost of new
development would be. For academic and research purposes, the fees
asked are modest. Naturally when commercial use is made of the
data, higher fees are appropriate.

MORE INFORMATION

If you are interested in finding out more about CELEX, then please
get in touch with us. We can send you copies of our introductory
booklet, plus back issues of the five newsletters so far
published, and answer any specific questions you might have. In
many cases a `trial' account can be set up to let you look round
the databases before making any financial commitment.

You can send email:

CELEX@CELEX.KUN.NL (Internet)
CELEX@HNYMPI52 (EARN/BITNET),

or write to the following address:

CELEX -- Centre for Lexical Information
University of Nijmegen
Wundtlaan 1
6525 XD NIJMEGEN
The Netherlands