4.1053 Database for Syntactic Analysis (1/118)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 18 Feb 91 18:15:14 EST

Humanist Discussion Group, Vol. 4, No. 1053. Monday, 18 Feb 1991.

Date: Fri, 15 Feb 91 14:28 MET
From: COR_HVH@KUNRC1.URC.KUN.NL
Subject: Database for/with (syntactic analysis) trees

- Also new is a freely copyable demo version for MS-DOS.

See below for details and for a general introduction to the LDB.
-------------------------

The Linguistic DataBase (LDB)

The LDB is a database system developed by the TOSCA group at Nijmegen
University which allows linguists who are not experts in computing
to access syntactically analyzed corpora. The data in the database
comprises `syntactic analysis trees' of the contiguous utterances
in a natural-language text. Since these trees are built from a
continuous text, they give a good representation of actual
language use and can thus provide a testing ground for linguistic
hypotheses. The range of information that can be extracted from such a
database depends mainly on the degree to which the text has
been prepared. Formerly, studies of corpora were restricted to the
level of words or word-classes, but with the Linguistic DataBase
it becomes possible to extend these studies to the level of
syntax, so that larger constituents can be analyzed.

Unlike currently available database packages, the LDB has
been created specifically to handle the type of data linguists
need to analyze - a labelled tree structure with a variable
number of branches at each node and the possibility of recursion.
The LDB can be used to examine the trees on the terminal
screen, search for utterances with given properties, and handle
database-wide queries about constructs in the utterances.
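
As an illustration of the kind of data described above, here is a minimal
Python sketch of a labelled tree whose nodes may have any number of
branches and may contain constituents of their own category. The class
name, field names and labels (NP, PP, ART, ...) are assumptions made for
this example; they do not reproduce the LDB's internal representation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One constituent: a label, an optional word, and any number of children."""
    label: str
    word: Optional[str] = None          # set only on leaves in this sketch
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def depth(self) -> int:
        """Number of levels in the subtree rooted here (a leaf has depth 1)."""
        return 1 + max((c.depth() for c in self.children), default=0)

    def words(self) -> List[str]:
        """The words at the leaves, in left-to-right order."""
        return [n.word for n in self.walk() if n.word is not None]


# Hypothetical analysis of "the cat on the mat": an NP that contains
# another NP, showing a variable number of branches and recursion.
tree = Node("NP", children=[
    Node("ART", "the"),
    Node("N", "cat"),
    Node("PP", children=[
        Node("PREP", "on"),
        Node("NP", children=[Node("ART", "the"), Node("N", "mat")]),
    ]),
])

print(" ".join(tree.words()))
print(tree.depth())
```

A fixed-width record of the kind used by general-purpose database packages
cannot hold such a structure directly, because neither the number of
children nor the depth of nesting is known in advance.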

The LDB does not presume special graphics hardware. For
this reason it has been implemented for common machines (VAX and
IBM PC/AT) and common terminals (VT100, ADM3, etc.).
Where possible, special terminal features are used,
such as highlighting and graphics characters, but even on the so-
called `dumb' ADM3A the trees are represented by an
acceptable imitation of graphics. Terminal types not already
provided for can be easily installed by the user.
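
To give an idea of how a tree can be shown with ordinary characters only,
the following Python sketch draws a tree using `|', `-' and ``'. The
layout and the function name render are assumptions made for this
example; the LDB's own screen layout is not described in this announcement
and may well differ.

```python
from typing import List, Tuple

# A tree is a (label, children) pair; this shape is chosen for the sketch only.
Tree = Tuple[str, list]


def render(tree: Tree, prefix: str = "", is_last: bool = True,
           is_root: bool = True) -> List[str]:
    """Return the lines of a plain-ASCII drawing of the tree."""
    label, children = tree
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        # "`-- " marks the last branch of a node, "|-- " any earlier branch.
        lines = [prefix + ("`-- " if is_last else "|-- ") + label]
        child_prefix = prefix + ("    " if is_last else "|   ")
    for i, child in enumerate(children):
        last = i == len(children) - 1
        lines.extend(render(child, child_prefix, last, False))
    return lines


# Hypothetical analysis of "the cat sleeps"; the labels are illustrative.
example = ("S", [
    ("NP", [("ART the", []), ("N cat", [])]),
    ("VP", [("V sleeps", [])]),
])

print("\n".join(render(example)))
# S
# |-- NP
# |   |-- ART the
# |   `-- N cat
# `-- VP
#     `-- V sleeps
```

On a terminal that has graphics characters, the same routine could emit
line-drawing characters instead of `|' and `-' without changing the logic.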

The LDB also does not presume a computationally expert
user. Thus control of the program is designed to be simple and
clear. The overall control is handled by a menu system, which
displays short descriptions of the choices, each of which can be
activated by a single keystroke. In the Tree Viewer, which is
used to examine an analysis tree on the terminal screen, there is
not enough room left on the screen to display these
descriptions, so commands (mostly a single keystroke) are
listed in abbreviated form. A description of all commands is
nevertheless available through a `help' command.

For queries going beyond a single tree, the Exploration Scheme
formalism has been developed. An Exploration Scheme consists of a
search pattern, itself a tree much like the analysis trees, and a
specification of the operations to be performed on the
information the pattern discovers. Exploration Schemes serve many
purposes, ranging from a simple search
for a tree, in order to examine it with the Tree Viewer, to the
creation of frequency tables. The formalism is designed in such a
way that the novice can start exploring immediately. From there,
he can gradually expand his knowledge to the more complex
features. In order to facilitate formulating Exploration Schemes
the LDB has a special scheme editor.
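
The idea of a search pattern that is itself a tree, combined with an
operation on whatever it finds, can be sketched in Python as follows.
This is not the Exploration Scheme formalism itself, whose syntax is not
given in this announcement; the names Node, Pattern, matches and explore,
and the labels SU, OD, NP etc., are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Hashable, Iterable, Iterator, List, Optional


@dataclass
class Node:
    """A constituent with a function label and a category label."""
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Pattern:
    """A search pattern shaped like a tree; None means 'any label'."""
    function: Optional[str] = None
    category: Optional[str] = None
    children: List["Pattern"] = field(default_factory=list)


def matches(pattern: Pattern, node: Node) -> bool:
    """True if the node fits the pattern. Child patterns must be matched,
    in order, by distinct children; other children may intervene."""
    if pattern.function is not None and pattern.function != node.function:
        return False
    if pattern.category is not None and pattern.category != node.category:
        return False
    position = 0
    for sub in pattern.children:
        while (position < len(node.children)
               and not matches(sub, node.children[position])):
            position += 1
        if position == len(node.children):
            return False
        position += 1
    return True


def explore(pattern: Pattern, corpus: Iterable[Node],
            key: Callable[[Node], Hashable]) -> Counter:
    """Build a frequency table of key(node) over all matching nodes."""
    table: Counter = Counter()
    for tree in corpus:
        for node in tree.walk():
            if matches(pattern, node):
                table[key(node)] += 1
    return table


# Two hypothetical analysis trees with invented labels.
corpus = [
    Node("UTT", "S", [
        Node("SU", "NP", [Node("DT", "ART"), Node("HD", "N")]),
        Node("V", "VP", [Node("HD", "V")]),
    ]),
    Node("UTT", "S", [
        Node("SU", "NP", [Node("HD", "N")]),
        Node("V", "VP", [
            Node("HD", "V"),
            Node("OD", "NP", [Node("DT", "ART"), Node("MOD", "ADJ"),
                              Node("HD", "N")]),
        ]),
    ]),
]

# In which functions do noun phrases occur?
by_function = explore(Pattern(category="NP"), corpus, lambda n: n.function)

# How many branches do noun phrases that contain an article have?
with_article = Pattern(category="NP", children=[Pattern(category="ART")])
by_size = explore(with_article, corpus, lambda n: len(n.children))

print(by_function)
print(by_size)
```

The first query counts noun phrases per function (here two subjects and
one object); the second restricts the pattern with a child node and
tabulates the number of branches instead.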

The LDB package comes with the Nijmegen Corpus, a 130,000-word
collection of modern British English with a full syntactic
analysis of each utterance. Each node in the tree (i.e. each
constituent in the utterance) carries a function label and a
category label. In the future more corpora will become available.
Furthermore, since the database system is independent of both
formalism and language, it is possible to use it for any other
kind of analyzed corpus.

The LDB package requires one of the following:
(1) a VAX with VMS;
(2) an IBM PC (AT preferred) with 640K RAM, a hard disk, at least one
    1.2 Mb high-capacity diskette drive and MS-DOS (no special graphics
    hardware is needed); or
(3) any UNIX machine with a competent C compiler, plus enough knowledge
    of terminal and file I/O to configure the program for the system.
The package is not copy protected. The source code (ca. 25,000 lines
of CDL2) is not available.

The package costs Hfl. 100 for academic institutions and Hfl. 5000 for others.
[As of Jan. 1991, Hfl. 1 is about $0.60.]
A user manual is not included in the academic distribution;
the book Linguistic Exploitation of Syntactic Databases (see
Publications) contains all necessary information and is priced at Hfl. 70.

A fully functional demonstration version for any MS-DOS machine with a
hard disk is available:
- on a 5.25" 360K diskette from the address below
- by ftp at phoibos.cs.kun.nl in the directory pub/LDB
- by listserv from LISTSERV@HEARN as files
LDBDEMOC INF TOSCA-L
LDBDEMOC UUE TOSCA-L


For more information contact
Hans van Halteren
TOSCA Group
Department of English
University of Nijmegen
P.O. Box 9103
6500 HD Nijmegen
The Netherlands
tel: (+31)-080-512836
e-mail: cor_hvh@kunrc1.urc.kun.nl

Publications

van Halteren, Hans and Nelleke Oostdijk. ``Using an Analyzed
Corpus as a Linguistic Database'', in Computers in Literary
and Linguistic Computing, Proceedings of the
XIIIth ALLC Conference (Norwich 1986),
John Roper (vol. ed.), J. Hamesse and A. Zampolli (series eds.)

van Halteren, Hans and Theo van den Heuvel. Linguistic
Exploitation of Syntactic Databases. (Rodopi, Amsterdam 1990).

de Haan, Pieter. ``Exploring the Linguistic Database: Noun Phrase
Complexity and Language Variation'', in Corpus Linguistics
and Beyond, Willem Meijs, ed. (Rodopi, Amsterdam 1987).