5.0780 European Corpus Initiative (1/159)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Mon, 23 Mar 1992 21:39:50 EST

Humanist Discussion Group, Vol. 5, No. 0780. Monday, 23 Mar 1992.

Date: Sat, 21 Mar 92 10:13:34 GMT
From: Henry S. Thompson <ht@cogsci.edinburgh.ac.uk>
Subject: European Corpus Initiative: Call for Contributions

European Corpus Initiative

Call For Contributions

March, 1992

The European Corpus Initiative was founded to oversee the acquisition
and preparation of a large multi-lingual corpus to be made available
in digital form for scientific research at cost and without royalties.
We believe that widespread easy access to such material would be a
great stimulus to scientific research and technology development as
regards language and language technology. We support existing and
projected national and international efforts to carefully design,
collect and publish large-scale multi-lingual written and spoken
corpora, but also believe it will be some time before the scientific
and material resources necessary to bring these projects to fruition
will be found. In the interim, a small and rapid effort to collect and
distribute existing material can serve to show the way. No amount of
abstract argument as to the value of corpus material is as powerful as
the experience of actually having access to some in one's laboratory.
We aim to make that experience possible very soon, at a very low cost.

The ECI is carrying out the first phase of this activity on a purely
voluntary basis, under the guidance of an ad-hoc steering committee,
using facilities donated by the Human Communication Research Centre at
the University of Edinburgh and a small sum for expenses and
production costs provided by the European Network for Language and
Speech under its Linguistic Resources programme together with the
Network of European Reference Corpora.

Our present goal is to produce in short order (we're currently aiming
for October 1992) a multi-lingual corpus covering as many as possible
of the major European languages, in a consistent format, with
standardised (TEI-conformant) markup, insofar as resources allow. Our
primary focus in this first effort is on textual material of all
kinds, including transcriptions of spoken material, but if space and
resources permit we may be able to include some sampled speech data as
well. If in doubt as to the appropriateness of a contribution, please
contact us before assuming we won't want it.

As our main method of distribution for this corpus, we will produce a
CD-ROM, possibly two if enough material can be collected and prepared
in time. We estimate that we should be able to make the results
available for around 25 ECU.

Because so few resources are available for this effort, we are
entirely dependent on the goodwill of those members of the research
community who hold appropriate corpus material and are willing to make
it available to us for wide distribution. PLEASE SEND US YOUR DATA. We have
promises of material for many, but by no means all, of the languages
we would like to cover, and in only one or two cases do we have as
much as we would like. We can't guarantee to use everything which is
offered, but please, let us judge whether it would be useful. If you
know of someone with material which might be appropriate, who may not
have received this notice, please pass it on to them.

To contribute data, please send electronic or paper mail to one of the
addresses given below, describing the data, its current format and the
medium it is stored in, and the restrictions on its use, if any, which
you would have to impose in making it available to us.

Although we hope to make the bulk of the data available with as few
restrictions on use as possible, we understand that some may be
required for various reasons, including conditions imposed by the
original providers of material on those who now hold it.
Accordingly, researchers who acquire our data will be required to sign
a statement along the following lines:


ECI User Agreement

This statement describes the terms of an agreement between the person
whose signature is affixed below (hereafter called "the user") and the
European Network for Language and Speech ("ELSNET") in which the user
will receive material, as specified below, from the European Corpus
Initiative ("ECI").

The ECI is an activity which collects machine-readable language
material for the purpose of scientific and humanistic research, and
distributes it at cost and without royalties.

Under this agreement, the user will receive a machine-readable copy of
the material specified below. The user agrees that the material
received under this agreement will be used only for research purposes
within the user's own research group. The user further agrees not to
re-distribute the material to others outside of the user's research
group, and that all members of the group will respect the terms of
this agreement.

The user acknowledges that some of the material, as specified below,
is subject to copyright restrictions, and that violations of such
restrictions may result in legal liability. The user agrees to abide
by the copyright restrictions, and to notify all associates who access
the material of the copyright restrictions.

<a listing of the material, with copyright notices and additional
specific restrictions, if any>

Copyright for format modifications to any of the materials on this
CD-ROM is assigned to ELSNET.


We interpret the aim of the ECI User Agreement, and of our efforts in
providing this data, as follows:

The aim of the European Corpus Initiative is to oversee the
acquisition and preparation of a large multi-lingual corpus, to be
made available for scientific research without royalties. All
copyrighted materials submitted for inclusion in the collection remain
the exclusive property of the copyright holders for all other
purposes. You should not redistribute the data that you get from us,
nor should you sell it, or charge for access to it, or otherwise put
it to any direct commercial use. However, commercial application of
"analytical materials" derived from the text, such as statistical
tables or grammar rules, is explicitly permitted, as long as copyright
law is observed.

Copyright holders who agree to make material available are being very
generous. Their contributions will make possible a resource of great
general utility for research and development in language technology
and linguistics. It is not our intent to deprive them of any revenues
that they should receive in the ordinary course of their business.
Thus it would be a violation of trust, as well as a violation of
copyright law, for you to republish a dictionary or other work to be
distributed under this agreement, whether in print or electronic form.

European Corpus Initiative Steering Committee

The current members of the Steering Committee are Nicoletta Calzolari
(University of Pisa), Robert Dale (ELSNET), Mark Liberman (University
of Pennsylvania), Wolf Paprotte (University of Munster), Henry
Thompson (University of Edinburgh) and Susan Warwick-Armstrong (ISSCO,

Addresses for further information and offers of material for

Henry S. Thompson (ECI)
2 Buccleuch Place
Edinburgh EH8 9LW

Fax: +44 31 650-4587


Susan Warwick-Armstrong (ECI)
54 route des Acacias
CH-1227 Geneve

Fax: +41 22 300 1086