11.0562 new corpora on WWW

Humanist Discussion Group (humanist@kcl.ac.uk)
Wed, 4 Feb 1998 18:32:23 +0000 (GMT)

Humanist Discussion Group, Vol. 11, No. 562.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: Chris Powell <sooty@umich.edu> (23)
Subject: New Middle English texts online

[2] From: Daniel Ridings <ridings@svenska.gu.se> (64)
From: "Jan-Gunnar Tingsell <tingsell@hum.gu.se>"
<jgt@hum.gu.se>
Subject: Wordclassed tagged Swedish corpus on the web (fwd)

--[1]------------------------------------------------------------------
Date: Tue, 3 Feb 1998 10:58:24 -0500 (EST)
From: Chris Powell <sooty@umich.edu>
Subject: New Middle English texts online

The Humanities Text Initiative at the University of Michigan is pleased to
announce that six new texts have been added to the Corpus of Middle
English Prose and Verse: An Alphabet of Tales, The Book of the Knight of
La Tour-Landry, Merlin, Malory's La Morte Darthur, Three Prose Versions
of the Secreta Secretorum, and Chaucer's Treatise on the Astrolabe.

The Corpus of Middle English Prose and Verse is the largest collection of
texts in Middle English on the World Wide Web, with 42 titles comprising
almost three million words. The entire Corpus can be browsed or searched
all at once, or searches can be restricted to an individual text or a
user-defined group of texts using the Personal Collection search mode.
The texts are encoded in Standard Generalized Mark-up Language (SGML)
using the TEI Guidelines and converted into HTML on-the-fly. New texts
continue to be added to the collection; we hope to add ten more before
the end of the semester. Full information about the creation of these
texts -- scanning, keyboarding, proofing, and markup -- is available in
the text headers.

The Corpus of Middle English Prose and Verse is available at
http://www.hti.umich.edu/english/mideng/

Christina Powell
Coordinator, Humanities Text Initiative
University of Michigan
http://www.hti.umich.edu/

--[2]------------------------------------------------------------------
Date: Mon, 2 Feb 1998 16:55:39 +0100 (MET)
From: Daniel Ridings <ridings@svenska.gu.se>
Subject: Wordclassed tagged Swedish corpus on the web (fwd)

There is now access to 10,000,000 words of corpus material via the web.
It has been tagged with the Swedish version of the PAROLE tagset (156 tags).
You can search for individual words, phrases, or tags, and so extract
patterns based on the morphosyntactic tags.
(See http://www2.echo.lu/langeng/en/le2/le-parole/le-parole.html for
more information about PAROLE.)

Please note that in the following the quotation marks are part of the query
language and are therefore essential.

Truncation is not a simple * but .* (period-star).
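
The period-star notation is ordinary regular-expression syntax: the period
matches any character and the star repeats it zero or more times. A minimal
Python sketch of the idea, assuming that a pattern has to match the whole
word form; the word list is made up for illustration, and this is not the
corpus engine itself:

```python
import re

def matches(pattern: str, word: str) -> bool:
    """True if the entire word matches the pattern (anchoring is an assumption)."""
    return re.fullmatch(pattern, word) is not None

# Hypothetical sample vocabulary, not taken from the corpus.
words = ["skattemedel", "skattebetalare", "läkemedel", "medel", "skatt"]

# Right truncation: anything that begins with "skatte".
prefix_hits = [w for w in words if matches("skatte.*", w)]

# Left truncation: anything that ends with "medel".
suffix_hits = [w for w in words if matches(".*medel", w)]

print(prefix_hits)
print(suffix_hits)
```

A bare * would fail here because, in regular-expression syntax, the star
needs something to repeat.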

WORD (followed by truncated examples)

[word="skattemedel"]
(tax revenue)

Since word searches are so frequent, the above can be abbreviated to:

"skattemedel"

With truncations:

"skatte.*"
".*medel"

PHRASES

"för" "egen" "del"
(for his/her part)

"för" [] "del"
(för followed by any word and then "del")

"för" []{1,3} "del"
(för followed by 1-3 words then followed by "del")

"för" []{1,3} "del" within S
(as above, but limited in range to within an s-unit (sentence))
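
To make the gap notation concrete, here is a rough Python sketch of what
[]{1,3} asks for: a first word, then between one and three arbitrary tokens,
then a last word. The function name find_gapped and the sample sentence are
invented for this illustration, and working on one sentence at a time stands
in for "within S". The real query engine may choose between overlapping
matches differently.

```python
def find_gapped(sentence, first, last, min_gap, max_gap):
    """Return (start, end) index pairs where `first` is followed by
    min_gap..max_gap arbitrary tokens and then `last`."""
    hits = []
    for i, token in enumerate(sentence):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap  # position of the closing word for this gap size
            if j < len(sentence) and sentence[j] == last:
                hits.append((i, j))
    return hits

# Hypothetical sentence, already split into tokens.
sentence = ["hon", "talade", "för", "sin", "egen", "del", "i", "går"]

hits = find_gapped(sentence, "för", "del", 1, 3)
print(hits)
```

Here "för" and "del" are two tokens apart, so the sentence matches with a
gap of 1-3 but would not match "för" [] "del", which allows exactly one
intervening word.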

TAGS (msd = MorphoSyntactic Description)

[msd="DF@US@S"] []{0,4} [msd="NCUSN@DS"]
(all NPs consisting of a determiner in the definite form, 0-4 words, and
a noun with genus=utrum, numerus=singular, case=normal, and the feature for
definite or indefinite set to "definite").

Similar searches can be done for prepositions (msd=SPS).

The first letter of a tag gives the word class (N=noun, A=adjective, V=verb,
S=preposition, R=adverb, D=determiner, etc.). The remaining positions are
features: genus, numerus, case, and definite/indefinite for nouns; mood,
tense, and passive/active (actually s-form, not passive, for these tags) for
verbs. There are two tables providing correspondences between the Swedish
PAROLE tags and the tags used in the Stockholm-Umeå Corpus.
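
Because the word class sits in the first position of the tag, it can be read
off mechanically. A small Python sketch, covering only the six letters
listed above (the full tagset has more, so anything else is reported as
"unknown"); the names WORD_CLASSES and word_class are invented here:

```python
# Only the letters mentioned in this message; the real tagset is larger.
WORD_CLASSES = {
    "N": "noun",
    "A": "adjective",
    "V": "verb",
    "S": "preposition",
    "R": "adverb",
    "D": "determiner",
}

def word_class(msd: str) -> str:
    """Look up the word class from the first character of an msd tag."""
    return WORD_CLASSES.get(msd[:1], "unknown")

# Tags that appear in the examples of this message.
for tag in ["DF@US@S", "NCUSN@DS", "SPS", "FI"]:
    print(tag, word_class(tag))
```

The same idea explains why a query such as [msd="N.*"] would pick out every
noun regardless of its features.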

Fairly advanced queries can be made. For example, the periphrastic future
in Swedish consists of "kommer" (come) + infinitive marker ("att") +
infinitive. Recently the infinitive marker has been left out with growing
frequency. This can be confirmed by comparing the corpus from 1965 with the
PAROLE corpus (both are on the web page). The search string would be as
follows:

"kommer" [word!="att" & msd!="(V@I.*|FI)"]{0,4} [msd="V@N.*"] within S

!= (not equal), & (and), | (or)

"kommer" followed by 0-4 words (none of which is "att", a verb in the
indicative, or internal punctuation (FI)), followed by an infinitive
(V@N.*), all within a sentence.
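
For readers who want to see the logic of this query spelled out, the
following Python sketch applies the same conditions to one sentence given as
(word, msd) pairs. It is an approximation under stated assumptions, not the
query engine: the function names are invented, the msd tags in the sample
sentences are illustrative rather than checked against the tagset, and the
sketch reports only the nearest infinitive after each "kommer".

```python
import re

def is_allowed_filler(word: str, msd: str) -> bool:
    """Mirror [word!="att" & msd!="(V@I.*|FI)"]: not "att", not an
    indicative verb, not internal punctuation."""
    return word != "att" and re.fullmatch(r"(V@I.*|FI)", msd) is None

def is_infinitive(msd: str) -> bool:
    """Mirror [msd="V@N.*"]."""
    return re.fullmatch(r"V@N.*", msd) is not None

def find_bare_future(sentence, max_gap=4):
    """Return (kommer_index, infinitive_index) pairs within one sentence."""
    hits = []
    for i, (word, _) in enumerate(sentence):
        if word != "kommer":
            continue
        j = i + 1
        # Allow at most max_gap filler tokens before the infinitive.
        for _ in range(max_gap + 1):
            if j >= len(sentence):
                break
            w, m = sentence[j]
            if is_infinitive(m):
                hits.append((i, j))
                break
            if not is_allowed_filler(w, m):
                break
            j += 1
    return hits

# Hypothetical tagged sentences; the tags are for illustration only.
with_marker = [("han", "PF@USS@S"), ("kommer", "V@IPAS"),
               ("att", "CIS"), ("resa", "V@N0AS")]
without_marker = [("han", "PF@USS@S"), ("kommer", "V@IPAS"),
                  ("inte", "RG0S"), ("resa", "V@N0AS")]

print(find_bare_future(with_marker))
print(find_bare_future(without_marker))
```

Only the sentence without "att" is reported, which is the pattern the query
is meant to isolate.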

The same query run against the contemporary material and the material from
thirty years ago is revealing.

The address is: http://ldb20.svenska.gu.se

The query engine is the one from IMS in Stuttgart (Oli Christ et al.):
http://www.ims.uni-stuttgart/Tools/CorpusTools

Granted, not everyone is interested in Swedish, but for those who are, this
could be quite helpful. Eventually the interface will be made nicer,
and in the course of the next few weeks the material will be
lemmatized. The tagging has been performed by yours truly with a version
of Eric Brill's tagger that I'm working on. I'm almost satisfied with it,
but not quite. There are mistakes, and this tagged version is one stage in
my efforts to improve the tagger.

Daniel Ridings
Språkdata
Göteborgs universitet
Sweden
