3.1067 queries, various and interesting (169)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Fri, 16 Feb 90 22:45:05 EST


Humanist Discussion Group, Vol. 3, No. 1067. Friday, 16 Feb 1990.

(1) Date: Thu, 15 Feb 90 22:19 CST (15 lines)
From: <THEOBIBLE@STMARYTX>
Subject: ALLIANT MINISUPERCOMPUTER

(2) Date: Fri, 16 Feb 90 11:06 MET (98 lines)
From: "Pieter C. Masereeuw" <PIETER@ALF.LET.UVA.NL>
Subject: Searching in tagged text corpora.

(3) Date: Thu, 15 Feb 90 08:54:08 EST (11 lines)
From: Tzvee Zahavy <MAIC@UMINN1>
Subject: Computer Applications in Judaic Studies

(4) Date: 16 Feb 90 08:47 EST (13 lines)
From: Malcolm Hayward <MHAYWARD@IUPCP6.BITNET>
Subject: Humanist Question

(1) --------------------------------------------------------------------
Date: Thu, 15 Feb 90 22:19 CST
From: <THEOBIBLE@STMARYTX>
Subject: ALLIANT MINISUPERCOMPUTER

ST. MARY'S UNIVERSITY WILL SOON RECEIVE A DONATED ALLIANT
MINISUPERCOMPUTER THAT WILL MULTIPLY THE POWER OF OUR CURRENT
11/780 VAX BY TWENTY TO FORTY TIMES. OUR ENGINEERS ARE, OF
COURSE, DELIGHTED.
BUT AS DEAN OF HUMANITIES AND SOCIAL SCIENCES, I WOULD LIKE
SOME HELP IN GATHERING INFORMATION ABOUT WHAT KINDS OF SOFTWARE
PACKAGES IN THE HUMANITIES ARE AVAILABLE FOR THAT MUCH POWER.
WHAT CAN WE DO WITH THE ALLIANT THAT WE CAN'T DO WITH PC'S OR
MACS? THANKS FOR YOUR ASSISTANCE.
CHARLES H. MILLER (theobible@stmarytx)
(2) --------------------------------------------------------------108---
Date: Fri, 16 Feb 90 11:06 MET
From: "Pieter C. Masereeuw" <PIETER@ALF.LET.UVA.NL>
Subject: Searching in tagged text corpora.

Dear Humanists,

I would like to raise a discussion about programs that search for
patterns in tagged text corpora. By "tagged text corpora" I mean
electronic texts that have been enriched with some kind of information,
for instance, a tag and/or lemma (="dictionary entry") for each word. A
good example of a tagged corpus is the Brown corpus of American English,
developed at Brown University by Francis and Kucera.

Of all text corpora, tagged corpora are a minority, because it is much
harder to produce a tagged corpus than an ordinary one. Tagged corpora
are, however, much more useful for linguistic and literary research.

There must be many programs around that perform searches in such corpora.
Of course, there are WordCruncher, AskSam, Freebase and the like, but
these programs are better suited to ordinary texts than to tagged corpora
(though with some tricks, they can do some useful things).

What I am interested in is:

o what features should a program for searching in tagged text corpora
  ideally have
o what programs are there already in use and what features do they
  have

In my view, such a search program should have the following features:

o it should have a knowledge of things like sentences (maybe even broader
  units than that), words, punctuation marks, lemmata and tags
o it should be able to search for patterns that are constituted by
  sequences of these elements (sentences, words, etc.)
o in the patterns, one should be able to use wildcards, boolean
  operators and sequence operators in order to create more complex
  patterns
o for corpora with a more advanced tagging system, the program should
  allow different levels of information, all of which can be combined
  in a search pattern. These levels do not necessarily have a
  (hierarchical) relation with each other. Several kinds of levels can
  be imagined:

  - The word level
    This level contains the text "just as it is".
  - The lemma level
    This level contains the lemmata of the words in the word level. Note
    that it is in principle possible to have one lemma correspond to
    more than one word: think of constructions like the German
    "Gestern RIEF er seinen Bruder AN". One might even think of
    a hierarchy of lemmata, where ANRUFEN is the general lemma with,
    in this sentence, the hierarchically lower lemmata "RUFEN" and "AN".
  - The word-tag level
    As with lemmata, one tag can in principle correspond to more than
    one word (e.g. in periphrastic verb constructions, which are found
    in many languages, among them English: "He HAD CALLED his brother").
  - The clause-tag level
    This level contains syntactic tags like "NP", "SUBJECT" and so on.
    This level should not only allow hierarchy; maybe it should even
    allow recursion (for instance, an NP being subordinate to another
    NP). A special problem in this respect is that the order of tags in
    this level can deviate dramatically from the order of the words that
    are 'covered'. This is especially true for free word-order
    languages like Latin.
o finally, the program should perform its tasks fast and behave
  in a user-friendly way.
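
To make the idea of independent levels concrete, here is a minimal Python
sketch. It assumes that every annotation simply lists the indices of the
words it covers, so that one lemma or tag may span several words that
are not adjacent. The names Annotation, Sentence, covering and find are
invented for this illustration, and the NP annotation on "seinen Bruder"
is an assumed example; none of this describes an existing program.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    level: str     # hypothetical level name: "lemma", "word-tag", "clause-tag"
    value: str     # e.g. "ANRUFEN" or "NP"
    tokens: tuple  # indices of the covered words; need not be contiguous


@dataclass
class Sentence:
    words: list
    annotations: list = field(default_factory=list)

    def covering(self, index, level):
        """Return the annotations on `level` that cover the word at `index`."""
        return [a for a in self.annotations
                if a.level == level and index in a.tokens]

    def find(self, level, value):
        """Return the word groups covered by annotations with this value."""
        return [[self.words[i] for i in a.tokens]
                for a in self.annotations
                if a.level == level and a.value == value]


sentence = Sentence(
    words=["Gestern", "rief", "er", "seinen", "Bruder", "an"],
    annotations=[
        Annotation("lemma", "ANRUFEN", (1, 5)),   # general lemma, two words
        Annotation("lemma", "RUFEN", (1,)),       # hierarchically lower lemma
        Annotation("lemma", "AN", (5,)),          # hierarchically lower lemma
        Annotation("clause-tag", "NP", (3, 4)),   # assumed tag for illustration
    ],
)

anrufen_hits = sentence.find("lemma", "ANRUFEN")
lemmas_of_rief = [a.value for a in sentence.covering(1, "lemma")]
```

Because each level refers to word positions rather than to the other
levels, the levels need not stand in any hierarchical relation, and a
clause tag may cover words in an order that differs from the text.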

Here in Amsterdam, we developed (some 12 years ago) a program that was
(surprisingly) called Query. It was developed for the "Eindhoven
corpus" (a Dutch variant of the Brown corpus) and has been successfully
employed for other corpora, such as the Brown corpus itself, the LOB
Corpus (British English), the Lund corpus (spoken English), the Liege
corpus (Latin), and corpora of Hungarian, Russian, Quechua, Swedish,
Greek and maybe more. The Query program searches for sentences that
contain patterns of words, lemmata and codes, which can be combined by
boolean operators (and, or, not) and sequence operators (followed by) on
the word, code and sentence level. There are wildcards for words,
letters and tags. For a program with a linear search algorithm, it
performs its task surprisingly fast (10 minutes for the entire Brown
corpus on a Data General MV/4000 computer).
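
The kind of search described above might be sketched in Python as
follows. This is not the Query program, whose internals are not given
here; it is a small illustration that assumes each token is a
(word, lemma, tag) triple and borrows shell-style wildcards from the
standard fnmatch module. All function names, the tag names and the tiny
corpus are invented for the example.

```python
from fnmatch import fnmatchcase


def token(word="*", lemma="*", tag="*"):
    """Build a test for one token; '*' and '?' act as wildcards."""
    def test(tok):
        return (fnmatchcase(tok[0], word)
                and fnmatchcase(tok[1], lemma)
                and fnmatchcase(tok[2], tag))
    return test


def followed_by(*tests):
    """Sentence test: the token tests match consecutive tokens somewhere."""
    def test(sentence):
        n = len(tests)
        return any(all(t(tok) for t, tok in zip(tests, sentence[i:i + n]))
                   for i in range(len(sentence) - n + 1))
    return test


def and_(*patterns):
    return lambda sentence: all(p(sentence) for p in patterns)


def or_(*patterns):
    return lambda sentence: any(p(sentence) for p in patterns)


def not_(pattern):
    return lambda sentence: not pattern(sentence)


def search(corpus, pattern):
    """Linear search: test every sentence of the corpus in turn."""
    return [sentence for sentence in corpus if pattern(sentence)]


# A made-up three-sentence corpus with made-up tags.
corpus = [
    [("He", "he", "PRON"), ("had", "have", "AUX"), ("called", "call", "VERB"),
     ("his", "his", "DET"), ("brother", "brother", "NOUN")],
    [("She", "she", "PRON"), ("calls", "call", "VERB"), ("often", "often", "ADV")],
    [("They", "they", "PRON"), ("had", "have", "VERB"), ("lunch", "lunch", "NOUN")],
]

# A form of "have" directly followed by a verb, in a sentence without "lunch".
pattern = and_(
    followed_by(token(lemma="have"), token(tag="VERB")),
    not_(followed_by(token(word="lunch"))),
)
hits = search(corpus, pattern)
wildcard_hits = search(corpus, followed_by(token(word="call*")))
```

Each pattern is an ordinary function from a sentence to true or false,
so boolean and sequence operators combine freely, at the cost of
visiting every sentence on every search.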

Destiny wants this program to be replaced by another program within one
year, since our good old MV/4000 is being replaced by a Digital VAX with
VMS (no one asked for it, but we had to buy it). The Query program
itself is practically unportable: it was written partly in assembler
and partly in a foreign tongue called DG/L.

We have some expertise to create a new search program for VAX/VMS and
micros. Before we even contemplate the design of such a program, we
would like to know what is already there in the world (and whether it
is useful) and what features others would like to see.

Pieter C. Masereeuw
Dept. of Computational Linguistics
University of Amsterdam
The Netherlands

email: PIETER@ALF.LET.UVA.NL

(3) --------------------------------------------------------------15----
Date: Thu, 15 Feb 90 08:54:08 EST
From: Tzvee Zahavy <MAIC@UMINN1>
Subject: Computer Applications in Judaic Studies

I would be grateful for any information on the subject of computers
and Judaic Studies for use in an article I am writing. I will include
information on (1) research applications, (2) teaching applications,
(3) databases, (4) wordprocessing, (5) other.
Please reply via bitnet or ordinary mail: Professor Tzvee Zahavy,
University of Minnesota, Classical and Near Eastern Studies,
176 Klaeber Court, Minneapolis, MN 55455. Thank you.
(4) --------------------------------------------------------------19----
Date: 16 Feb 90 08:47 EST
From: Malcolm Hayward <MHAYWARD@IUPCP6.BITNET>
Subject: Humanist Question

A graduate student of mine in a poetics class last semester used
a spreadsheet program for prosodic analysis. The results were rather
interesting. I'd be interested in hearing from anyone
working with this sort of thing.

Malcolm Hayward                  MHayward@IUP
Department of English            Phone: 412-357-2322 or
IUP                                     412-357-2261
Indiana, PA 15705
