4.1292 Computational Linguistics & Humanities Computing (1/185)

Elaine Brennan & Allen Renear (EDITORS@BROWNVM.BITNET)
Sat, 27 Apr 91 14:17:26 EDT

Humanist Discussion Group, Vol. 4, No. 1292. Saturday, 27 Apr 1991.

Date: Thu, 25 Apr 91 13:48 EST
From: "NANCY M. IDE (914) 437 5988" <IDE@VASSAR>
Subject: Computational Linguistics and Humanities Computing

Before answering the question about computational linguistics and trying
to point out where humanities computing and computational linguistics
overlap, I want to advertise a bit: There will be a special issue of
_Computers and the Humanities_ devoted to the intersection of
computational linguistics and humanities computing, edited by Donald
Walker and myself, which should come out in spring of 1992. Included in
this issue will be articles reporting work which falls into this
intersection, as well as an overview article which will answer in much
more detail the question of where the overlap exists and may exist in
the future. Also, at this year's Association for Computational
Linguistics conference in Berkeley in June, I will give a workshop on
this topic. Finally, for those interested in a broad consideration of
how work in computational linguistics can be of use in humanities
computing, especially content analysis, I can refer you to a somewhat
outdated but possibly still useful paper entitled "The Computational
Determination of Meaning in Literary Texts," which appeared in
_Computers and the Humanities: Today's Research, Tomorrow's Teaching_,
Toronto: University of Toronto, 1986.

What follows is a short answer to the question posed, and to
Allen's additional request:

Computational Linguistics is a field concerned with a computer's
handling of natural languages such as English. "Handling" includes
(principally) understanding, generating, and translating. Accomplishing
these things would enable machines to understand and respond to user
queries and commands expressed in the users' native languages rather than
in an artificial computer language, and would enable computers to do
things like translation from one language to another. The European
Community, with the advent of 1992 and at least nine languages to deal
with, is very interested in automatic translation.

Very broadly speaking, the task of handling natural language
by machine involves the following:

o morphology (finding root forms for words in inflected or derived forms)
o syntax (determining the constituent grammatical pieces of sentences)
o semantics (determining the relations among elements of a sentence in
  terms of their meaning)
o pragmatics (determining which pieces of general world knowledge affect
  the meaning of a sentence, and how)

This list does not include speech processing, and it says nothing about
elements of discourse larger than the sentence.

For example, in the sentence "The rain stopped the tennis game,"
morphological analysis determines things like the fact that STOP is the
root form of STOPPED; syntactic analysis determines that the subject of
the sentence is a noun phrase with determiner THE and head RAIN, and that
the main verb is STOP, marked as 3rd person, past tense, transitive or
intransitive; semantic analysis identifies STOP as the action of the
sentence, with the subject noun phrase THE RAIN in the agent role and the
direct object noun phrase THE TENNIS GAME in the object role; pragmatics
would fill in the information that tennis games are played outside and
(usually) cannot be played in the rain. To do these things one requires
knowledge of words and their parts of speech (since most words have more
than one, this can be tricky), which sense of a word is intended, and what
the possible role fillers for any verb may be, along with their semantic
properties (such as animate, physical object, etc.).
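
To make this concrete, here is a toy sketch in Python of the analysis
just described, applied to the rain-and-tennis sentence. The tiny lexicon
and the hard-wired parse are invented purely for illustration; they stand
in for the vastly richer resources and machinery a real system requires.

    # Toy illustration of morphological, syntactic, and semantic analysis
    # for one fixed sentence. Everything here is hand-built for the example.

    LEXICON = {
        "the":     {"pos": "DET",  "root": "the"},
        "rain":    {"pos": "NOUN", "root": "rain"},
        "stopped": {"pos": "VERB", "root": "stop", "tense": "past"},
        "tennis":  {"pos": "NOUN", "root": "tennis"},
        "game":    {"pos": "NOUN", "root": "game"},
    }

    def analyze(sentence):
        tokens = sentence.lower().rstrip(".").split()
        # Morphology: map each surface form to its root.
        morphology = [(tok, LEXICON[tok]["root"]) for tok in tokens]
        # Syntax: hard-wired for the pattern DET NOUN VERB DET NOUN NOUN.
        syntax = {
            "subject_np": tokens[0:2],   # "the rain"
            "main_verb":  tokens[2],     # "stopped"
            "object_np":  tokens[3:6],   # "the tennis game"
        }
        # Semantics: map grammatical functions onto meaning roles.
        semantics = {
            "action": LEXICON[syntax["main_verb"]]["root"],  # stop
            "agent":  " ".join(syntax["subject_np"]),        # the rain
            "object": " ".join(syntax["object_np"]),         # the tennis game
        }
        return {"morphology": morphology, "syntax": syntax, "semantics": semantics}

    print(analyze("The rain stopped the tennis game."))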

This is a tremendously simplified explanation. In fact it is
astoundingly difficult to handle natural language by machine. Just to
give you an idea, it is a fact that no existing parser can correctly
identify the syntactic constituents in unconstrained language (that is,
language in any domain, without specific constraints on form or content)
more than about 30% of the time, to say nothing of coping with semantic
issues in any comprehensive way. Because of these difficulties, computational
linguistics has had to focus increasingly on a variety of
sub-sub-sub-problems which are at times not readily identifiable as
contributing to the overall goals of the discipline. Examples which
reflect my own interests include things like trying to extract
information about words from definition texts in everyday dictionaries
which happen to exist in machine readable form; determining the different
relevant collocates for words in order to distinguish semantic
properties; building lists of words and root forms; trying to find a
tagging system capable of representing all the differing opinions about
what the set of parts of speech actually is, or which can represent
every linguist's different idea of a syntactic analysis; gathering
corpora that represent a valid sample of general language use and finding
a suitable means to represent them in machine readable form; etc., etc.
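
As an illustration of one such sub-problem, the following Python sketch
collects the collocates of a target word within a fixed window. The tiny
corpus and the window size are invented for illustration; real work of this
kind uses corpora of millions of words.

    # Count the words that co-occur with a target word within a fixed window,
    # a first step toward distinguishing its semantic properties.

    from collections import Counter

    def collocates(tokens, target, window=3):
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(t for t in tokens[lo:hi] if t != target)
        return counts

    corpus = "the bank raised its interest rate while the river bank flooded".split()
    print(collocates(corpus, "bank").most_common(5))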

The overlap between humanities computing and computational linguistics
is first of all simply due to a shared interest: both humanists using
computing (mainly those concerned with literary and linguistic analysis)
and computational linguists are trying to use computers to analyze
texts. In humanities computing, the analysis has traditionally involved
such things as analysis of style, analysis of content and theme, as well
as providing better access to textual materials by providing
concordances and other retrieval tools. Now, for analyses of style and
content, humanists need information such as the part of speech for each
word in the text, which they need to determine the syntactic
constituents of sentences. For analysis of content, one needs things
like sets of semantically related words (and all the inflected and
derived forms in which they may appear in a text); to go further, one
could use information about semantic properties (animate, etc.) and the
roles (e.g., agent, object) words and phrases
play in various sentences in a text. These are all things that
computational linguists need to know too. So, both groups have been doing
a lot of the same things for a long time. The sad part is that the two
groups have been working almost entirely independently of one another.
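
A crude Python sketch of this kind of content analysis follows: it counts
how often any member of an (invented) set of semantically related words
appears in a text, folding inflected forms onto a root with a deliberately
naive suffix-stripping rule. Both the word set and the rule are only
illustrative.

    # Count occurrences of a theme, here "rain", via a set of related words
    # plus a very rough reduction of inflected forms to a root.

    RAIN_WORDS = {"rain", "storm", "shower", "drizzle"}

    def crude_root(word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def theme_count(text, word_set):
        tokens = [tok.strip(".,;:!?") for tok in text.lower().split()]
        return sum(1 for tok in tokens if crude_root(tok) in word_set)

    print(theme_count("It rained and rained; the storms and showers never stopped",
                      RAIN_WORDS))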

The independence I think resulted because both groups saw themselves as
working with entirely independent methodologies. Humanities people
tended to be concerned with texts and not individual sentences. The
texts involved literary language--what Martin Kay called "remarkable"
language in his keynote address at the recent ACH/ALLC conference in
Tempe AZ, as opposed to the "unremarkable" language with which
computational linguists are concerned (that is, they want first to deal
with the straightforward before coping with things as complex as
metaphor, irony, etc.). Humanists were concerned with broad features of
style and content--general patterns across texts that indicated trends
and frequent patterns or usages, while computational linguists struggled
to come up with a deep and complete representation of the syntax and
full meaning of a handful of sentences like the one about rain and
tennis cited above. Thus humanists used statistical methods to find
the probable and the characteristic, while computational linguists
relied on linguistic theory. At the same time, humanists were amassing
corpora of texts, paying close attention to things like genre, as well
as word lists and various other resources. Computational linguists,
focussed on the sentence, used 35-word hand-constructed lexicons and
never thought much about texts.

Recently, both humanists and computational linguists have found
themselves up against a wall. Humanists have found that they cannot go
further in analyzing things like style and theme without deeper
information about syntax and semantics. Computational linguists have
found that linguistic theories predict the possible, but they need
information about the probable and characteristic properties of
language in order to make any more progress in handling language by
machine. So, humanities people are becoming interested in some of the
methods and results computational linguists have been working on for the
past few decades, while the computational linguists are beginning to
apply statistical methods to large corpora in order to gather
information about general properties of language use, and have begun to
use tools like concordances and word lists. The methods they are using
and the resources they are applying them to are those of humanities
computing. The main difference is that humanists have been concerned
with remarkable language and computational linguists are interested, for
the time being, in unremarkable language.
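
One of the shared tools, the keyword-in-context (KWIC) concordance, is
simple enough to sketch in a few lines of Python; the sample text and the
formatting choices below are only illustrative.

    # Keyword-in-context concordance: each occurrence of the keyword is shown
    # with a few words of left and right context.

    def kwic(tokens, keyword, width=3):
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword:
                left = " ".join(tokens[max(0, i - width): i])
                right = " ".join(tokens[i + 1: i + 1 + width])
                lines.append(f"{left:>25} | {tok} | {right}")
        return lines

    text = "The rain stopped the tennis game before the rain itself stopped".split()
    print("\n".join(kwic(text, "rain")))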

Recognizing the increasing overlap, there have been some efforts to bring
together the two groups, who still persist in working independently in
large measure. At the ACH/ALLC conference in Tempe, the Association for
Computational Linguistics held a special session in which a number of
computational linguists presented some of their recent work. Their
methodologies, all involving statistics, were familiar to those who have
worked in this area of humanities computing. The applications were also
familiar for the most part: one (Church) analyzed word co-occurrence
patterns in a corpus of unremarkable texts to determine semantic
properties of related words, one (Liberman) analyzed vocabulary use in a
corpus of unremarkable texts. Two (Mercer, Kay) spoke about aligning
bilingual corpora using statistical methods, something which I am not
aware has been done in humanities computing. The session served two
purposes: to show that the methods were the same in both fields, and to
make humanists aware that these methods applied to unremarkable texts can
give them information which could be useful (for comparison purposes,
for example) in analyzing remarkable ones. Another purpose the session
served may have been to make the computational linguists aware of the
decades of work in humanities computing--such as studies of vocabulary
use--in exactly the same areas with very similar methods. They could
gain a lot from studying humanists' methods, and considering their
results. Remarkable language can also tall us a lot about the unremarkable.

Efforts like the Text Encoding Initiative, the ACL Data Collection
Initiative, the Consortium for Lexical Research, and the Dictionary
Encoding Initiative are all concerned with gathering, developing, and
representing resources that are of direct and obvious use for both humanities
computing and computational linguistics. Here, the overlap is most apparent,
since both groups are now concerned with textual materials in large amounts,
and therefore necessarily with their nature and representation.

There is much more to say about the overlap, but I hope this provides
the flavor. And I hope the ACL session at Tempe and other efforts to
increase communication between humanists and computational linguists
begin a profitable interchange for both sides.

--Nancy Ide,
Department of Computer Science, Vassar College
President, Association for Computers and the Humanities