11.0402 award for Rav-Milim Project

Humanist Discussion Group (humanist@kcl.ac.uk)
Thu, 13 Nov 1997 21:05:21 +0000 (GMT)

Humanist Discussion Group, Vol. 11, No. 402.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

Date: Thu, 13 Nov 1997 10:09:02 GMT
From: Yaacov Choueka <choueka@macs.biu.ac.il>
Subject: award for the Rav-Milim Project

[Those of you familiar with the work of Professor Choueka may know that
the award reported here has a broader significance for the work we do:
a commitment to long-term research. All too often, it seems, we are pushed
to undertake short-term projects, with their quick rewards and relative
safety. As a result, big projects do not get done, and the field suffers
accordingly. I wonder how many of us, able to afford the luxury, have
sworn off the quick fix, numerous repetitive conference papers and
small topics. What Choueka reports below is none of these, and I think we
all might applaud him for sticking with the big job. --WM]

(Announcement, translated from Hebrew):

Prof. Yaacov Choueka, from the Department of Mathematics and
Computer Science of Bar-Ilan University, and Head of its Institute of
Information Retrieval and Computational Linguistics, was
recently awarded the Israel Prime Minister Prize for Computer
Programming - 1997.
The prize was awarded by the Prime Minister Mr. Netanyahu on
16 September 1997 at a ceremony held at the Knesset (Israeli
Parliament) auditorium and attended by a large number of guests
from academic circles of computer science as well as by
representatives from the high-tech industry and from governmental
and parliamentary agencies.
The award was given to Prof. Choueka for his program
"Nakdan-Text" (Text Vocalizer) and for the underlying "Rav-Milim"
computerized infrastructure for intelligent processing of modern
Hebrew.
Once installed as an add-on to a Hebrew word-processor
(e.g., Microsoft Hebrew Word), Nakdan-Text
can then automatically vocalize any given Hebrew text (letter,
document, or even entire books) on the fly, with 95% accuracy.
A first version of Nakdan-Text already won the 1992 IPA
(Information Processing Association of Israel) prize for best
technological applications of computing.

The Rav-Milim system, highlights of which are given below, was
developed at the Center for Educational Technology (C.E.T.)
in Tel-Aviv by a large team of linguists, programmers and
lexicographers directed by Prof. Choueka, in the years 1989-1996.
Yoni Ne'eman, from CET, the chief programmer of Rav-Milim and
in charge of the linguistic algorithms, was a co-recipient of
the prize.

Other recipients of the prize were Prof. David Harel from the
Weizmann Institute of Science (a former student of Prof. Choueka)
for his "Dynamic Statecharts" system, and the managers of
Checkpoint Co., a high-tech start-up created about four
years ago by three students in their twenties from the Hebrew
University in Jerusalem to develop firewalls for the Internet,
and now valued on Wall Street at close to 1 billion dollars.

Prof. Amir Pnueli, Chairman of the Award Committee (himself
a recipient of the prestigious Turing award), presented the
committee recommendations and Prof. Choueka responded
in the name of the prize's recipients, stressing the critical
importance of building and continuously updating the basic tools
necessary for the intelligent processing of Hebrew, the language
which embodies the very soul of the Jewish people and of its
immense 3,000 years of cultural heritage.

-----------------------------------------------------------------

"Rav-Milim" (Multi-Words)

A Computerized Infrastructure for
Intelligent Processing of Modern Hebrew

Principal Investigator: Yaacov Choueka

(Highlights)

"Rav-Milim" is a broad, comprehensive, robust and integrated
computerized infrastructure for the intelligent processing of
modern Hebrew, developed in the years 1989-1996 at the Center for
Educational Technology in Tel-Aviv. Large teams of programmers,
linguists, computational linguists, lexicographers and editors
were involved in this project, which was initiated, directed and
supervised by Prof. Y. Choueka from Bar-Ilan University.
Yoni Ne'eman was in charge of the linguistic algorithms as well
as chief programmer of the project. The names of some of the other
major team members are given at the end.
A few papers on the system and its various components are now in
preparation.

The basic modules of the system, from which scores of products and
applications (both computerized and printed) have been derived,
are as follows:

- "Milim": A complete, accurate, comprehensive and portable
morphological analyzer and lemmatizer for modern Hebrew (there are an
estimated 70 million word-forms in Hebrew). The program takes as
input any word (string of characters) in Hebrew and outputs the
set of all its (linguistically correct) grammatical analyses,
including: root, dictionary entry, part-of-speech, gender-number for
nouns and adjectives, mode-tense-person-gender-number for verbs,
attached prepositions, attached pronouns (including
person-gender-number of the pronoun), and more.
Milim recognizes all common modes of Hebrew spelling
(defective - "hasser" and plene - "male") and also
some extra-linguistic units such as acronyms (abundant in Hebrew),
abbreviations, and frequent proper nouns (of persons, places, products).
The program, a library of subroutines in C, occupies a few hundred
kilobytes and can analyze about 1,000 words per second on a Pentium PC.

- "Katvan" (spelling checker): Unlike its English counterpart, an adequate
spelling checker for Hebrew cannot consist of long lists of
words with some rudimentary suffix stripping; it has to be based on a
morphological analyzer. Katvan is an accurate and
comprehensive Hebrew spelling-checker based on Milim, that recognizes
both the "defective" and "plene" spellings, and can correctly convert
from one mode to the other (it also suggests corrections to flawed
strings). Katvan was chosen by Microsoft and WordPerfect to be
the standard spelling-checker for their Hebrew word-processors.

- "Nakdan" (Vocalizer): A program that, given a word-form and its
grammatical analysis, will output its (unique) vocalization
(including long and short vowels, stresses, etc.) according to
the rules of grammatical Hebrew vocalization. Given any word in Hebrew
(without context), the program will activate "Milim" to get all its
possible morphological analyses, and will attach to each of
them the appropriate vocalization, thus producing as output the set of
all (linguistically correct, context-free) possible vocalizations of
that word.

- "Nakdan-Text" (Text Vocalizer): Given a sentence in Hebrew,
this program will vocalize it, by first activating "Nakdan" to find
all possible morphological analyses and attached vocalizations of
every word in the sentence, then choosing, for every such word,
the "correct" context-dependent one, using short-context syntactical
rules as well as some probabilistic and statistical modules.
The program works with 95% accuracy, and is available, e.g., as an
off-the-shelf add-on to Microsoft Hebrew Word.
After installation, any Word document (or even book) can be
vocalized by just marking it and clicking on the pertinent icon;
the vocalization is done online, and the document can be printed
with the diacritic vocalization points on any (Word-supported)
printer. Proofreading and correcting the erroneous vocalizations
are very easy and do not require a professional linguist (as is the
case generally with manual vocalization).
Nakdan-Text is an essential step for Text-to-Speech applications in
Hebrew; without such vocalization, computerized "reading" is
obviously impossible.

- "Hamilon" (The Dictionary): A new dictionary of Hebrew,
built a-priori on modern lexicographical principles and with an
architecture that is easy to use and embed in computerized processing
contexts. Radically different in philosophy and approach from the
available classical dictionaries of Hebrew, the Rav-Milim dictionary is
synchronic (rather than historical), descriptive (rather than
normative, although bad usage is clearly tagged as such),
comprehensive - covering all registers of the language (from the
literary to the slang and vulgar) and all strata (from the biblical
to the modern) - but not exhaustive (omitting historical curiosities,
discarded inventions, etc.) and user-oriented. Following the new
sensitivity to meaning-in-context acquired by the extensive
processing of large corpora, the full and rich spectrum of the
different meanings of an entry is deployed, and usage examples
for every (non-encyclopedic) entry, carefully designed to highlight
its appropriate sociolinguistic context, are given. For each entry,
the family of its related terms (words with the same root and the
same semantic field) is detailed. Special attention is given to
collocations (a generic term used here loosely for compound nouns,
verbal attachments, fixed phrases, idioms, etc., that deserve a
special dictionary heading and explanation): every collocation
appears under each of its pertinent entries, and some 8,000
new collocations (out of a total of 20,000), never recorded before,
are explained.
The printed version of the dictionary was published in April 1997
(by C.E.T., Steimatzky and Miskal) as a 6-volume set, and the
computerized version appeared at about the same time, as part of
"The Hebrew Language CD", described below.

- "The Hebrew Language CD": All of the grammatical and lexicographic
modules described above, and more, are integrated in this CD-ROM,
which is in fact a complete "laboratory" of Hebrew processing
(on the word level).
Keying in any word, the user can spell-check it or ask for its
(correct) spelling in the different modes, see its vocalization(s)
and its decomposition into meaningful components, look at its complete
morphological analysis (or analyses), see the full family (in the
sense defined above) of related terms, review all collocations that
contain it (there may be hundreds of them) and, for each one that is
marked, read its explanation, ask for the full conjugation table
of the corresponding base-form (in both vocalized and non-vocalized
forms and spellings), ask for all entries that have the same
vocalization pattern, and, of course, ask to see the full dictionary
record of the appropriate entry. It should be noted here that looking
for a word in a printed Hebrew dictionary can be a frustrating experience
even for experienced users, since one has first to reduce the word, in
the form encountered, to its base-form (or its root), a task that
is not needed here. The user enters the word in any variant encountered,
and the program will automatically display the pertinent entry
(or, sometimes, entries). This feature also allows the user to mark any
string in an explanation or a usage-example, and the appropriate
entry and explanations will be displayed, ad infinitum.

- "Young Rav-Milim - The Dictionary": A dictionary of modern Hebrew
(2 vols., 1,000 pp., same publishers as above) for the young (ages 7-16),
with 1,000 color illustrations (the first of its kind ever in
Hebrew). All of the dictionary contents (entries and subentries,
collocations, explanations, usage examples, etc.) reflect the young
reader's world of knowledge and associations. A unique feature of the
dictionary is
the thousands of annotations scattered in it, giving the reader
a wealth of additional interesting information on morphological,
grammatical, semantic, historical and cultural aspects of the
entry. The page layout is reminiscent of a Talmudic page: a rectangular
box of basic text, surrounded by related glossaries, commentaries and notes.
The dictionary thus functions as an attractive book to read and browse
through, in addition to its basic function as a reference book.

- "Young Rav-Milim - The Multimedia CD-ROM": A multimedia version of
the dictionary that reflects the whole contents of the printed one
and, in addition, offers pre-taped pronunciation of the entries, typical
sounds for appropriate entries (animals, musical instruments, special
verbs, etc.), linguistic and "dictionary" games, etc.
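
The way the word-level modules above build on one another (Milim's
analyses feeding Katvan, Nakdan and Nakdan-Text) can be sketched roughly
as follows. This is only an illustrative toy in Python, not the Rav-Milim
code (which, per the description above, is a library of C subroutines).
The names Analysis, TOY_LEXICON, analyze, is_spelled_correctly,
vocalizations and vocalize_sentence, the Latin transliterations, the
three-word lexicon and the single context rule are all assumptions made
for this example. In particular, a real analyzer would derive its analyses
by morphological rules rather than by table lookup, and the real context
module is described as combining syntactic rules with statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str      # dictionary entry
    pos: str        # part of speech
    vocalized: str  # vocalization attached to this analysis


# Hypothetical toy lexicon keyed by unvocalized, transliterated word-forms.
# The first analysis in each list is treated as the default reading.
TOY_LEXICON = {
    "spr": [
        Analysis("sefer", "noun", "sefer"),   # "book"
        Analysis("sapar", "noun", "sapar"),   # "barber"
        Analysis("safar", "verb", "safar"),   # "he counted"
    ],
    "hwa": [Analysis("hu", "pronoun", "hu")],  # "he"
}


def analyze(word):
    """Milim-like step: all analyses of a word-form (empty if unknown)."""
    return list(TOY_LEXICON.get(word, []))


def is_spelled_correctly(word):
    """Katvan-like step: accept a word if it has at least one analysis."""
    return bool(analyze(word))


def vocalizations(word):
    """Nakdan-like step: context-free set of possible vocalizations."""
    return sorted({a.vocalized for a in analyze(word)})


def vocalize_sentence(words):
    """Nakdan-Text-like step: choose one analysis per word from context."""
    result = []
    prev_pos = None
    for word in words:
        candidates = analyze(word)
        if not candidates:
            result.append(word)  # unknown word: leave it unvocalized
            prev_pos = None
            continue
        # Toy short-context rule: after a pronoun, prefer a verbal reading.
        preferred = "verb" if prev_pos == "pronoun" else None
        chosen = next(
            (a for a in candidates if a.pos == preferred), candidates[0]
        )
        result.append(chosen.vocalized)
        prev_pos = chosen.pos
    return result
```

With this toy data, vocalizations("spr") yields three candidates, while
vocalize_sentence(["hwa", "spr"]) picks the verbal reading after the
pronoun and vocalize_sentence(["spr"]) falls back to the default one.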

Rav-Milim Team (major participants):
------------------------------------

Yaacov Choueka, PI and Director
Yoni Ne'eman, Chief programmer and in charge of linguistic
algorithms

Programmers: Avi Danon, Yosi Sarousi

Linguistics: Rahel Finkel, Hagit Avioz

The Dictionary:

Steering Committee:
Prof. Yaacov Choueka, Prof. M.Z. Kaddari (Vice-President, Academy
of Hebrew Language), Prof. R. Nir (Hebrew University), Prof.
R. Mirkin (Academy of Hebrew Language), Prof. O. Schwarzwald (Bar-Ilan
University), M. Zinger.

Editor-in-Chief: Uzzi Freidkin
Senior Editors: Dr. Haym Cohen, Yael Zachi-Yannai
Science and Technology Editor: Yakhin Unna
Assistant Editors: Rahel Finkel, Hagit Avioz, Sara Choueka

Dictionary for the Young:

Steering Committee:

Prof. R. Berman (Tel-Aviv University), Dr. Zvia Walden (Berl
College), Prof. R. Nir, Dr. Dorit Ravid, Prof. Maya Fruchtman,
Prof. O. Schwarzwald

Editor: Yael Zachi-Yannai
Assistant Editors: Hagit Avioz, Sara Choueka
Consultants: Uzzi Freidkin (lexicography), Dr. Haym Cohen
(linguistics), Dr. Zvia Walden (educational approach and design).

Multimedia version:

Design and supervision: Ofra Razel
