12.0111 LDC & ELRA Releases

Humanist Discussion Group (humanist@kcl.ac.uk)
Wed, 1 Jul 1998 22:46:10 +0100 (BST)

Humanist Discussion Group, Vol. 12, No. 111.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: "David L. Gants" <dgants@english.uga.edu> (36)
Subject: A New Corpus from the LDC

[2] From: "David L. Gants" <dgants@english.uga.edu> (69)
Subject: A New Release From the LDC

[3] From: "David L. Gants" <dgants@english.uga.edu> (27)
Subject: SPECIAL ANNOUNCEMENT FROM ELRA - VALIDATION MANUALS

--[1]------------------------------------------------------------------
Date: Wed, 1 Jul 1998 16:31:28 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: A New Corpus from the LDC

>> From: LDC Office <ldc@unagi.cis.upenn.edu>

Announcing a NEW RELEASE from the
Linguistic Data Consortium

****************************************
1997 Spanish Broadcast News (HUB-4NE)
****************************************

This corpus contains a portion of the acoustic data
designated as the training set for the 1997 DARPA HUB-4
Spanish Benchmark. It contains speech and transcripts of
30 hours of broadcast news from the following sources:

VOA
Univision
Televisa

All acoustic files are in NIST SPHERE format, without compression.
The sample data are 16-bit linear PCM, 16-kHz sample frequency,
single channel. Most files contain 30 minutes of recorded
material, and some contain 60 or 120 minutes (approximately); the
sampling format requires roughly 2 megabytes (MB) per minute of
recording, so the file sizes are typically around 60 MB, with some
files ranging up to 120 or 240 MB.
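
The size figures above follow directly from the sampling format. As a
rough check, here is a minimal Python sketch of the arithmetic. The
function name is illustrative only, "MB" is assumed to mean one million
bytes, and the small SPHERE file header is ignored.

```python
def pcm_bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM payload for one minute of audio, in bytes."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


MB = 1_000_000  # assumption: decimal megabytes

# Format stated in the announcement: 16-bit linear PCM, 16 kHz, one channel.
per_minute = pcm_bytes_per_minute(16_000, 16, 1)
print(f"per minute: {per_minute / MB:.2f} MB")

# Nominal file durations mentioned in the announcement.
for minutes in (30, 60, 120):
    print(f"{minutes:>3} min: about {per_minute * minutes / MB:.1f} MB")
```

One minute comes to 1.92 MB, so the quoted "roughly 2 MB per minute"
slightly overstates the size, and the 60, 120 and 240 MB figures are
rounded up accordingly.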

The transcripts are in SGML format, using the same markup
conventions that have been applied to the other 1997 Broadcast
News speech corpora (in English and Mandarin). They are distributed
by ftp rather than on the CD-ROMs that carry the speech data.

Because of restrictions imposed by the copyright holders, this
corpus is available to 1998 LDC members only.

If you would like to order a copy of this corpus, please email
your request to <ldc@unagi.cis.upenn.edu>. If you need additional
information before placing your order, or would like to inquire
about membership in the LDC, please send email or call (215)
898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL:

http://www.ldc.upenn.edu/

Information is also available via ftp at ftp.cis.upenn.edu under
pub/ldc; for ftp access, please use "anonymous" as your login
name, and give your email address when asked for password.
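
For readers who script their transfers, the anonymous login described
above might be sketched in Python with the standard ftplib module. The
function name and the ftp_factory parameter are hypothetical (the latter
only makes the sketch testable offline). The host and directory are the
ones given in the announcement and may no longer be reachable.

```python
import ftplib


def list_ldc_files(email, host="ftp.cis.upenn.edu", directory="pub/ldc",
                   ftp_factory=ftplib.FTP):
    """Log in anonymously and return the names found in `directory`."""
    ftp = ftp_factory(host)  # opens the control connection
    try:
        # Anonymous ftp: login name "anonymous", email address as password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        return ftp.nlst()
    finally:
        ftp.close()


# Example call (needs network access and a server that still exists):
# print(list_ldc_files("you@example.org"))
```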

--[2]------------------------------------------------------------------
Date: Wed, 1 Jul 1998 16:32:58 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: A New Release From the LDC

>> From: LDC Office <ldc@unagi.cis.upenn.edu>

Announcing a NEW RELEASE from the
Linguistic Data Consortium

************************************************
TAIWANESE PUTONGHUA SPEECH AND TRANSCRIPT CORPUS
************************************************

This set of data on Taiwanese-accented Putonghua
(PTH) was recorded in Taiwan from December 1994 to
January 1995. Taiwanese-accented PTH refers to PTH
spoken by people who were born in Taiwan and whose
first language is Taiwanese (Southern Min). A total
of 40 speakers, ranging in age, education,
birthplace, and family dialect, were recorded. There were
5 two-speaker dialogues and 30 single-speaker
monologues. The dialogues were about 20 minutes each
and the monologues were about 10 minutes each.
Dialogues were recorded on two tracks, one for each
speaker. Monologues were recorded on one track.

The recordings were made in ordinary but quiet
rooms. The speakers were asked in advance to speak in
a conversational style, without notes, on any topic they
chose, or no topic at all. Most speakers spoke
spontaneously and the topic drifted freely. Some
speakers talked about their professional work in a
rather formal way. One speaker (#20, a public health
official) used notes. We consider this variation in
speech style a merit of the data.

The recording equipment consisted of a portable DAT
recorder (Teac), which recorded at a 44.1 kHz sampling
rate with 16-bit linear quantization. The microphones
were AudioTechnica lapel microphones with a preamp and
XLR connection to the DAT. The XLR connection helped
keep recording noise low, and the AudioTechnica
microphones provided a wide-bandwidth, flat response
over the speech range of interest, were unidirectional
to minimize cross-talk, and were very light in
comparison with standard microphones. Both
single-speaker monologues and
two-speaker dialogues were recorded using this system
on standard DAT tape.

Before recording, all speakers read and signed the
'Informed Consent Form', which was written in Chinese
and which largely followed the standard format
approved by the Human Subject Committee of the
University of Michigan. The form stated that
participation in the recording was entirely voluntary
and that the speech might be used for linguistic
teaching and research purposes.

The speech data are accompanied by transcripts. The
monologues have start and end time stamps. The 5
dialogues are time stamped by speaker turn.

Institutions that have membership in the LDC during
the 1998 Membership Year will be able to receive this
corpus in the same manner as all other text and
speech corpora published by the LDC.

Nonmembers can receive a copy of the Taiwanese
Putonghua Speech and Transcript Corpus for $750.

If you would like to order a copy of this corpus,
please email your request to
<ldc@unagi.cis.upenn.edu>. If you need additional
information before placing your order, or would like
to inquire about membership in the LDC, please send
email or call (215) 898-0464.

Further information about the LDC and its available
corpora can be accessed on the Linguistic Data
Consortium WWW Home Page at URL:

http://www.ldc.upenn.edu/

Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access,
please use "anonymous" as your login name, and give
your email address when asked for password.

--[3]------------------------------------------------------------------
Date: Wed, 1 Jul 1998 16:35:28 -0400 (EDT)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: SPECIAL ANNOUNCEMENT FROM ELRA - VALIDATION MANUALS

>> From: info-elra@calva.net (Valerie Mapelli)

EUROPEAN LANGUAGE RESOURCES ASSOCIATION
ELRA News

*** SPECIAL ANNOUNCEMENT FROM ELRA - NEW RELEASE OF THE
VALIDATION MANUAL FOR LEXICA ***

ELRA is happy to announce that the validation manual entitled "A Draft
Manual for the Validation of Lexica", by N. Underwood & C. Navaretta, has
been revised and can now be obtained free of charge from the ELRA Web site.

Available validation manuals for lexica and corpora are listed below:

Lexicon validation:
  "Towards a standard for the evaluation of lexica",
    N Underwood & C Navaretta.
  "A Draft Manual for the Validation of Lexica - Final Report",
    N Underwood & C Navaretta; Release 1.1, 1st Revision, June 1998.

Corpora validation:
  "An analytic framework for the validation of language corpora",
    P Baker, L Burnard, A McEnery & A Wilson.
  "Techniques for the validation of corpora",
    P Baker, L Burnard, A McEnery & A Wilson.

To obtain copies, download them from the ELRA Web site:

http://www.icp.grenet.fr/ELRA/home.html

*****************************************
* ELRA *
* 55-57, rue Brillat Savarin *
* 75013 Paris, France *
* tel. +33 1 43 13 33 33 *
* fax. +33 1 43 13 33 30 *
* email. info-elra@calva.net *
*****************************************

-------------------------------------------------------------------------
Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>
<http://www.princeton.edu/~mccarty/humanist/>
=========================================================================