6.0330 Susanne Corpus and Access Information (1/161)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Tue, 27 Oct 1992 10:42:56 EST

Humanist Discussion Group, Vol. 6, No. 0330. Tuesday, 27 Oct 1992.

Date: Tue, 27 Oct 92 12:16:35 +0100
From: ide@grtc.cnrs-mrs.fr (Nancy Ide)
Subject: susanne corpus

From: Geoffrey Sampson <geoffs@cogs.susx.ac.uk>

THE SUSANNE CORPUS

[Revised announcement including modified access instructions]

26 October 1992

Geoffrey Sampson

School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England

geoffs@uk.ac.susx.cogs

Colleagues needing the use of a grammatically-analysed corpus of English
may like to know that Release 1 of the SUSANNE Corpus is now complete, and
is freely available from the Oxford Text Archive via anonymous ftp to any
machine connected to the Internet. Instructions for retrieving a copy of
the Corpus are given at the end of this announcement.

The SUSANNE Corpus has been created, with the sponsorship of the Economic
and Social Research Council (UK), as part of the process of developing
a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical
and surface) grammar of English. The SUSANNE scheme attempts to provide
a method of representing all aspects of English grammar which are sufficiently
definite to be susceptible of formal annotation, with the categories and
boundaries between categories specified in sufficient detail that, ideally,
two analysts independently annotating the same text and referring to the
same scheme must produce the same structural analysis. The SUSANNE scheme
may be likened to a "Linnaean taxonomy" of the grammatical domain: its
aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the
domain of botany) is not to identify categories which are theoretically
optimal or which necessarily reflect the psychological organization of speakers'
linguistic competence, but simply to offer a scheme of categories and ways
of applying them that make it practical for NLP researchers to register
everything that occurs in real-life usage systematically and unambiguously,
and for researchers at different sites to exchange empirical grammatical data
without misunderstandings over local uses of analytic terminology.

The SUSANNE Corpus comprises an approximately 128,000-word subset of the
Brown Corpus of American English, annotated in accordance with the SUSANNE
scheme. The SUSANNE analytic scheme is defined in detail in a book by myself,
ENGLISH FOR THE COMPUTER, forthcoming from Oxford University Press,
and briefly in a documentation file which accompanies the Corpus. The Chairman
of the Analysis and Interpretation Working Group of the US/EC-sponsored
Text Encoding Initiative has proposed the adoption of the scheme as a
recognised TEI standard. The SUSANNE scheme aims to specify annotation norms
for the modern English language; it does not cover other languages, although it
is hoped that the general principles of the SUSANNE scheme may prove
helpful in developing comparable taxonomies for these.

Regrettably, Release 1 of the SUSANNE Corpus is not a "TEI-conformant"
resource, though aspects of the annotation scheme have been decided in
such a way as to facilitate a move to TEI conformance in later releases.
The working timetable of the Initiative meant that relevant aspects
of the TEI Guidelines were not yet complete at the point when
the SUSANNE Corpus was ready for initial release; delaying this release
would have been unfortunate.

Although the SUSANNE analytic scheme is by now rather tightly defined,
Release 1 of the SUSANNE Corpus undoubtedly still contains errors despite
considerable proof-checking. It is intended to correct these in later
releases; I should be extremely grateful if users discovering errors
would notify me, preferably by post rather than e-mail.

The SUSANNE Corpus consists of 64 data files (each comprising an annotated
version of one Brown text), together with a documentation file. However,
the versions held by the Oxford Text Archive are compressed, in order
to reduce file transfer time, into single files in two alternative formats: one
suitable for Unix users, the other for users who have access only to a PC.
The procedure for retrieving a copy of the Corpus in either case is
as follows:

From a machine on the Internet, type either:

ftp black.ox.ac.uk

or, since the Archive is not yet in many official name tables:

ftp 129.67.1.165

When connected, you will be prompted for an account name, to which
you should respond:

ftp

or:

anonymous

You will be asked to supply a password, in response to which you should type
your e-mail address. After this is accepted, your first command should
be to move to the directory containing the Text Archive files, by typing:

cd ota

To see a list of the files and directories currently available, type:

ls

All files relating to the SUSANNE Corpus are kept in the directory "susanne",
so your next command should be:

cd susanne

Apart from a README file containing the instructions which you are
currently reading, this directory contains the two alternative compressed
versions of the SUSANNE Corpus. To retrieve a copy of the corpus,
if you are a Unix user, type:

get susanne.tar.Z

Having successfully transferred a copy of "susanne.tar.Z" to your home system,
get the material into a usable state by the successive commands:

uncompress susanne.tar.Z

and:

tar -xf susanne.tar
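
For readers who prefer to script the unpacking step, the following is a minimal
Python sketch of the equivalent of "tar -xf susanne.tar". It assumes that
"uncompress" has already been run, since Python's standard library does not, as
far as I know, decode the ".Z" format. The function name unpack_tar and its
parameters are hypothetical and are not part of the Corpus distribution.

```python
import tarfile


def unpack_tar(tar_path="susanne.tar", dest_dir="."):
    """Extract every member of an already-uncompressed tar archive.

    Returns the list of member names found in the archive.
    """
    # "r:" opens the file as a plain tar archive with no compression.
    with tarfile.open(tar_path, "r:") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Calling unpack_tar() with no arguments in the directory holding "susanne.tar"
unpacks the files into that same directory.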

If you are not a Unix user, you need to retrieve the other version of the
Corpus, which is to be uncompressed with the PKUNZIP software on an IBM PC.
First, set ftp transfer mode to binary by typing the command:

bin

at the ftp prompt. Then retrieve the appropriate version of the Corpus by
typing:

get susanne.zip

Having transferred a copy of the Corpus to your home machine, uncompress it
with the command:

pkunzip -x susanne.zip
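
Where PKUNZIP is not to hand, the same archive can probably be unpacked with
Python's standard zipfile module. This is a sketch only: it assumes that
"susanne.zip" uses a compression method which zipfile supports, which has not
been checked against the actual file. The function name unpack_zip is
hypothetical.

```python
import zipfile


def unpack_zip(zip_path="susanne.zip", dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```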

In either case (whether you have followed the Unix or the non-Unix instructions)
you should now have the Corpus split up into its 65 files, one of which,
"SUSANNE.doc", is a text file describing the format and contents of the
64 data files.
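
A quick way to confirm that unpacking succeeded is to count the files. The
Python sketch below checks for the 65 files and the documentation file mentioned
above. It assumes that the Corpus files sit alone in one directory (with the
downloaded archive kept elsewhere), and it compares the documentation file name
without regard to case, since a PC may report it in capitals. The function name
check_corpus is hypothetical.

```python
from pathlib import Path

EXPECTED_TOTAL = 65        # 64 data files plus 1 documentation file
DOC_NAME = "SUSANNE.doc"   # documentation file named in the announcement


def check_corpus(directory="."):
    """Report how many files are present and whether the set looks complete."""
    files = [p.name for p in Path(directory).iterdir() if p.is_file()]
    # Case-insensitive match: an assumption to cope with PC file naming.
    has_doc = any(name.lower() == DOC_NAME.lower() for name in files)
    return {
        "total": len(files),
        "has_doc": has_doc,
        "complete": has_doc and len(files) == EXPECTED_TOTAL,
    }
```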

To log out of the ftp connexion, type:

bye
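
The whole ftp session described above can also be scripted. The following is a
minimal Python sketch using the standard ftplib module; it mirrors the manual
steps (anonymous login, the two "cd" commands, a binary-mode "get", then "bye").
The host name is the one given in this announcement and may no longer be
reachable. The function name fetch_archive, its parameters, and the example
e-mail address are hypothetical.

```python
from ftplib import FTP

HOST = "black.ox.ac.uk"      # host named in the announcement
ARCHIVE = "susanne.tar.Z"    # use "susanne.zip" for the PC version


def fetch_archive(ftp, email, archive=ARCHIVE, dest=None):
    """Log in anonymously and download one archive in binary mode.

    `ftp` is a connected ftplib.FTP object (or anything with the same methods).
    """
    dest = dest or archive
    ftp.login("anonymous", email)   # the password is your e-mail address
    ftp.cwd("ota")                  # directory holding the Text Archive files
    ftp.cwd("susanne")              # directory holding the SUSANNE files
    with open(dest, "wb") as out:
        # retrbinary always transfers in binary mode, as "bin" does by hand.
        ftp.retrbinary("RETR " + archive, out.write)
    ftp.quit()                      # the equivalent of "bye"
    return dest


# Example call (requires network access to the host):
#     fetch_archive(FTP(HOST), "you@example.org")
```

Passing the connection in as an argument keeps the function easy to try out
against a stand-in object before it is pointed at a real server.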

If you encounter any problems, please send an e-mail message to
archive@black.ox.ac.uk or archive@uk.ac.oxford.vax.
