11.0682 ELRA news

Humanist Discussion Group (humanist@kcl.ac.uk)
Tue, 7 Apr 1998 20:39:36 +0100 (BST)


Humanist Discussion Group, Vol. 11, No. 682.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: "David L. Gants" <dgants@english.uga.edu> (92)
Subject: ELRA Focus - MLCC Multilingual Corpora for Co-operation

[2] From: "David L. Gants" <dgants@english.uga.edu> (67)
Subject: ELRA News - New speech resources 1/2

[3] From: "David L. Gants" <dgants@english.uga.edu> (57)
Subject: ELRA News - New speech resources 2/2

--[1]------------------------------------------------------------------
Date: Tue, 7 Apr 1998 08:51:33 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: ELRA Focus - MLCC Multilingual Corpora for Co-operation

>> From: info-elra@calva.net (Valerie Mapelli)
EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA Focus
MLCC Multilingual Corpora for Co-operation

A collection of newspaper articles from financial newspapers
in 6 languages (Dutch, English, French, German, Italian and Spanish)
and a set of parallel texts in the 9 European Union
official languages (as of 1993)

The current ELRA catalogue contains more than 500 language resources (!)
for speech, written-language and terminology work. This message is a
reminder that one of them, the MLCC Multilingual Corpora for Co-operation,
is available.

The MLCC text corpus has two main components - one set to allow comparable
studies to be carried out in different languages and one set as the basis for
translation studies.

The first set is referred to as the Polylingual Document Collection (ELRA-W0006),
a collection of newspaper articles from financial newspapers in 6 languages
(Dutch, English, French, German, Italian and Spanish). It consists of the
following sub-corpora:

Dutch - "Het Financieele Dagblad" - 1992-1993
The corpus contains articles from the Dutch financial newspaper "Het
Financieele Dagblad" editions of 2nd January 1992 through to 24th December
1993. It contains around 8.5 million words of text.

English - "The Financial Times" - 1993
The corpus contains articles from the British financial newspaper "The
Financial Times" editions from the year 1993. The corpus contains around
30 million words.

French - "Le Monde" - 1992-1993
A corpus of articles from the French newspaper "Le Monde", consisting of
two years' worth (1992-1993) of articles on financial subjects,
approximately 10 million words.

German - "Handelsblatt" - 1986-1988
This subcorpus consists of articles from the period 02.01.1986 to
15.06.1988. It contains some 33 million words. It may be possible to
obtain more recent articles from "Handelsblatt".

Italian - "Il Sole 24 Ore" - 1992-1993
The corpus described here contains articles from the Italian financial
newspaper "Il Sole 24 Ore" from the year 1992. This corpus contains some
1.88 million words. The SGML-markup was done by the University of
Edinburgh.

Spanish - "Expansion" - 1994
This subcorpus contains articles from the Spanish financial newspaper
"Expansion" editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to
27.12.1994. It contains some 10 million words.

Price for ELRA members:
for research use: 360 ECU
for commercial use: 1500 ECU

Price for non-members:
for research use: 750 ECU
for commercial use: 3200 ECU

The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of
translated data in nine European languages: Danish, Dutch, English, French,
German, Greek, Italian, Portuguese and Spanish. The parallel data, provided
by the European Commission, comprises two sub-corpora from the Official
Journal of the European Communities:

Official Journal of the European Commission, C Series:
Written Questions 1993
Records of questions and answers regarding European Community matters.
The data is regularly published as one section of the C Series of the
Official Journal of the European Community in all official languages
(previously nine). This corpus contains written questions asked by
members of the European Parliament and corresponding answers from
the European Commission in 9 parallel versions. The total size of the
corpus is approximately 10.2 million words (ca. 1.1 million words
per language).

Official Journal of the European Commission, Annex: Debates of the European
Parliament 1992-1994
This parallel corpus consists of the records of parliamentary sittings published
as an annex to the Official Journal of the European Community: Debates
of the European Parliament. The Parliamentary Debates are a record of
what was said by members of the meeting as well as written input provided
to the meeting. The original data from which the translations are produced
consist of a transcript of the sittings, each member speaking in the
language of his choice. The final version consists of nine parallel
versions of the material. The texts delivered comprise the Debates of
Parliament from January 1992 to July 1994. This sub-corpus contains some
5 to 8 million words per language.

Price for ELRA members:
for research use: 120 ECU
for commercial use: 480 ECU
Price for non-members:
for research use: 200 ECU
for commercial use: 800 ECU

********************************************
For more information, please contact:
ELRA/ELDA
55-57 rue Brillat Savarin
75013 PARIS
Tel: +33 1 43 13 33 33
Fax: +33 1 43 13 33 30
E-mail: info-elra@calva.net
http://www.icp.grenet.fr/ELRA/home.html
********************************************

--[2]------------------------------------------------------------------
Date: Tue, 7 Apr 1998 08:52:48 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: ELRA News - New speech resources 1/2

>> From: info-elra@calva.net (Valerie Mapelli)

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA News

*** NEW SPEECH RESOURCES - Part 1 ***

The ELRA catalogue is growing. Since our last announcement on this list,
the following resources have been added to our catalogue.

****************************
* ELRA-S0046 PolyVar *
****************************

PolyVar is a speaker verification database comprising native and
non-native speakers of French, mainly from Switzerland but also
from other European countries. It consists of read and spontaneous
speech recorded by 143 speakers (85 male and 58 female) amounting
to 160 hours of speech. Each speaker recorded from 1 to 229 sessions,
giving a total of 3,600 recorded sessions. The data are provided with
orthographic annotation.

The number of calls per speaker is as follows:
· 13 speakers called 100 times
· 9 speakers called from 51 to 100 times
· 16 speakers called from 21 to 50 times
· 3 speakers called from 11 to 20 times
· 31 speakers called from 2 to 10 times
· 71 speakers called only once

Each speaker uttered up to 53 different items per session, including:
· 3 sequences of digits (1 ID number, 1 credit card number and 1 sequence
  of 6 digits)
· 24 application words (17 words about tourism - Martigny)
· 10 read sentences
· 4 numbers (2 natural numbers, 2 amounts)
· 2 items with dates (1 read/1 spontaneous)
· 2 items with hours (1 read/1 spontaneous)
· 2 spelled words
· 3 spontaneous answers (questions about their gender, native language and
  the weather)
· 1 comment
· 1 telephone enquiry

File format: 8-bit a-law
Standard in use: NIST
Sampling rate: 8 kHz
Medium: 8 CD-ROMs
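
As a rough plausibility check on the figures above, the raw audio volume can be estimated from the stated duration, sampling rate and sample size. This is only a sketch: the mono channel count and the nominal CD-ROM capacity are assumptions not given in the announcement, and file headers and annotation files are ignored.

```python
import math

# Figures stated in the announcement.
HOURS_OF_SPEECH = 160
SAMPLE_RATE_HZ = 8_000
BYTES_PER_SAMPLE = 1               # 8-bit a-law

# Assumptions, not stated in the announcement.
CHANNELS = 1                       # assumed mono telephone speech
CD_CAPACITY_BYTES = 650 * 10**6    # assumed nominal CD-ROM capacity


def raw_audio_bytes(hours, sample_rate_hz, bytes_per_sample, channels):
    """Size of uncompressed sample data, excluding headers and annotations."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


raw_bytes = raw_audio_bytes(
    HOURS_OF_SPEECH, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS
)
discs_needed = math.ceil(raw_bytes / CD_CAPACITY_BYTES)

print(f"raw audio: {raw_bytes / 1e9:.3f} GB, discs needed: {discs_needed}")
```

Under these assumptions the estimate is of the same order as the 8 CD-ROMs listed; a different assumed disc capacity would shift the disc count.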

Price for ELRA members:
for research use: 1,000 ECU
for commercial use: 2,000 ECU

Price for non-members:
for research use: 2,000 ECU
for commercial use: 4,000 ECU

**************************************************************
* ELRA-S0047 SpeechDat Speaker Verification database *
**************************************************************

This subset of PolyVar consists of 20 speakers who recorded 50 sessions.
The format in use is a-law with SAM headers.

Medium: 3 CD-ROMs

Price for ELRA members:
for research use: 750 ECU
for commercial use: 1500 ECU

Price for non-members:
for research use: 1500 ECU
for commercial use: 3000 ECU

--[3]------------------------------------------------------------------
Date: Tue, 7 Apr 1998 08:54:33 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: ELRA News - New speech resources 2/2

>> From: info-elra@calva.net (Valerie Mapelli)

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA News

*** NEW SPEECH RESOURCES - Part 2 ***

The ELRA catalogue is growing. Since our last announcement on this list,
the following resources have been added to our catalogue.

************************************************
* ELRA-S0048 SIelex (Siemens Phonetic Lexicon) *
************************************************
The lexicon consists of a list of 100,000 entries with phonetic
transcriptions, main stress markers and syllable boundary markers. Most of
the entries were selected from the politics and economics sections of two
German newspapers, namely the 'Süddeutsche Zeitung' (SZ) and the 'Frankfurter
Allgemeine Zeitung' (FAZ). The transcription largely follows
so-called standard German pronunciation. Departures are described in the
documentation. For some entries multiple pronunciations are taken into
account, especially in the case of homographs and abbreviations. The
alphabet chosen is extended German SAM-PA, but it can easily be translated
into other alphabets. The character set chosen is ISO-8859-1; a tool for
conversion into LaTeX is provided with the CD-ROM.

Price for ELRA members: 27500 ECU
Price for non-members: 35000 ECU

*********************************
* ELRA-S0049 The SPK database *
*********************************

SPK is an Italian speech database of isolated and connected digits. It was
designed and collected at the Istituto per la Ricerca Scientifica e
Tecnologica (ITC/IRST), Trento, Italy. SPK was conceived for speaker
recognition and verification purposes.

With this CD-ROM, speech material corresponding to isolated digits
acquired from 100 speakers (30 females and 70 males, from 23 to 50 years
old) is released. Most of the speakers are from the North-East of Italy.
Speech material was collected from each speaker during five recording
sessions scheduled on different days. During a recording session four
repetitions of the ten Italian digits were acquired from a speaker. A total
of 20,000 speech waveform files form the corpus.
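
The stated total follows from the collection design described above. The short sketch below reproduces the arithmetic; it assumes one waveform file per digit utterance, which the announcement implies but does not state outright.

```python
# Corpus dimensions as stated in the announcement.
SPEAKERS = 100
SESSIONS_PER_SPEAKER = 5
REPETITIONS_PER_SESSION = 4
DIGITS = 10


def expected_files(speakers, sessions, repetitions, digits):
    """File count, assuming each digit utterance is stored in its own file."""
    return speakers * sessions * repetitions * digits


total_files = expected_files(
    SPEAKERS, SESSIONS_PER_SPEAKER, REPETITIONS_PER_SESSION, DIGITS
)
print(total_files)
```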

Recordings were performed in a quiet room. Speech was acquired at 48 kHz,
with 16-bit accuracy, by means of a Sony TCD-D10PRO Digital Audio Tape
recorder and a Sennheiser MKH 416-T super-cardioid microphone.
The digital recordings were then downsampled to 16 kHz. Speech waveform files
in the corpus were stored in the NIST-SPHERE format by using the SPHERE
library, version 2.6a.

Price for ELRA members:
for research use: 400 ECU
for commercial use: 800 ECU

Price for non-members:
for research use: 800 ECU
for commercial use: 1,600 ECU

-------------------------------------------------------------------------
Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>
<http://www.princeton.edu/~mccarty/humanist/>
=========================================================================
