14.0274 counting words

From: by way of Willard McCarty (willard@lists.village.Virginia.EDU)
Date: 09/27/00

Next message: by way of Willard McCarty: "14.0275 noisy libraries"

Previous message: by way of Willard McCarty: "14.0273 conferences"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

               Humanist Discussion Group, Vol. 14, No. 274.
       Centre for Computing in the Humanities, King's College London
               <http://www.princeton.edu/~mccarty/humanist/>
              <http://www.kcl.ac.uk/humanities/cch/humanist/>

   [1]   From:    Eric Johnson <johnsone@jupiter.dsu.edu>             (10)
         Subject: Counting words

   [2]   From:    "Jim Marchand" <marchand@ux1.cso.uiuc.edu>          (30)
         Subject: letter counts

   [3]   From:    Randall Pierce <rpierce@jsucc.jsu.edu>               (8)
         Subject: ETAOIN SHRDLU

   [4]   From:    Einat Amitay <einat@ics.mq.edu.au>                  (13)
         Subject: Latin letter frequency & word lists

--[1]------------------------------------------------------------------
         Date: Wed, 27 Sep 2000 09:29:13 +0100
         From: Eric Johnson <johnsone@jupiter.dsu.edu>
         Subject: Counting words

     Some years back, I wrote a computer program to count the words in an
ASCII text file.  My program can be downloaded from the web at:

http://www.dsu.edu/~johnsone/sno.html

WORDS and WordsNT (the version I recommend) run from a PC command line.
An article on the same web page describes the program.
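
For readers who want to see what counting the words in an ASCII text file involves, here is a minimal Python sketch. It is not Johnson's WORDS program and makes no claim about how that program works. The function name count_words and the definition of a "word" (a run of ASCII letters, optionally joined by internal apostrophes or hyphens) are assumptions made for illustration; a different definition will give different counts.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with internal
# apostrophes or hyphens (e.g. "cat's", "mat-side"). This is one possible
# definition among many, chosen only for illustration.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words(text: str) -> Counter:
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


if __name__ == "__main__":
    sample = "The cat sat on the mat. The cat's mat-side nap."
    counts = count_words(sample)
    print(sum(counts.values()), "running words,", len(counts), "distinct words")
    for word, n in counts.most_common(3):
        print(word, n)
```

To count a file rather than a string, the text could be read with open(path, encoding="ascii").read() and passed to count_words.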

     I would be interested to know if anyone finds my program useful.

     --Eric Johnson
       johnsone@jupiter.dsu.edu
       http://www.dsu.edu/~johnsone/

--[2]------------------------------------------------------------------
         Date: Wed, 27 Sep 2000 09:30:03 +0100
         From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
         Subject: letter counts

Talking about primitives of all kinds, the discussion of the
frequency of letters in Latin brings up a number of questions:

1. What corpus should one use?  How many running words seem
enough? Zipf, although he was a statistician, used a corpus of 5000
running words from one author (Cicero), surely too little. Certainly, Anne's
count (7.8 million words!) is large enough.

2. Do we not need to label our count carefully?  Latin, German,
French, etc. are surely too large and inclusive.  Zipf used a
chrestomathy from many different periods for French, surely a bad
notion as a first step.

3. Letter counts vs. phoneme counts. Letters are graphemes at best,
not phonemes.

4. What to do with non-ASCII letters?  The count cited by Zipf (and
others) for German contains no umlauts, that for French no accented
letters, no distinction between c and c-cedilla.

5. What do we do with editions, and which do we choose?  For
example, it is common to distinguish between i and j, u and v in
editions, though these are Renaissance inventions for the most
part. What do we do with assimilations?  Some editions use inl- for
ill- even.  Whatever is done, it needs to be spelled out carefully,
if it is important.

6. Where an edition includes two or more witnesses, we need to be
careful to distinguish between them, perhaps not so much in the
matter of letter counts, but otherwise.  If our editions contain
conjectural emendations, do we not need to excise and/or label
these? In making a frequency count of Gothic words, for example,
should we not be careful not to use two versions of the same text
twice?
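
Points 3 to 5 above can be made concrete with a short Python sketch in which every normalization choice is an explicit, labeled option rather than a silent default. It is offered only as an illustration under stated assumptions: the names letter_counts, fold_ij_uv and strip_diacritics are hypothetical, and the code counts graphemes (letters), not phonemes.

```python
import unicodedata
from collections import Counter


def letter_counts(text: str, fold_ij_uv: bool = False,
                  strip_diacritics: bool = False) -> Counter:
    """Count letters (graphemes) under explicitly stated editorial choices.

    fold_ij_uv:       count j as i and v as u, undoing the i/j and u/v
                      distinction that many editions introduce.
    strip_diacritics: count accented letters as their base letter
                      (so u-umlaut counts as u, c-cedilla as c).
    Both options are off by default, so nothing is merged silently.
    """
    # Compose characters first so an accented letter is one unit.
    text = unicodedata.normalize("NFC", text).lower()
    if strip_diacritics:
        # Decompose, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    if fold_ij_uv:
        text = text.translate(str.maketrans("jv", "iu"))
    # Keep letters only; digits, spaces and punctuation are ignored.
    return Counter(ch for ch in text if ch.isalpha())


def relative_frequencies(counts: Counter) -> dict:
    """Return each letter's share of the total, most frequent first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.most_common()}


if __name__ == "__main__":
    line = "Julius vidit"
    print("as printed:", dict(letter_counts(line)))
    print("i/j, u/v folded:", dict(letter_counts(line, fold_ij_uv=True)))
```

Reporting the option values alongside the resulting table is one way to label a count, so that two counts of the "same" language can be compared knowing what each one merged.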

As I look through Martin Joos's excellent dissertation (Wisconsin,
1942), a count of Gothic graphemes, I notice all these problems, so
that the frequencies seem skewed to me.  Perhaps I am too finicky.

--[3]------------------------------------------------------------------
         Date: Wed, 27 Sep 2000 09:31:02 +0100
         From: Randall Pierce <rpierce@jsucc.jsu.edu>
         Subject: ETAOIN SHRDLU

   The above mnemonic seems to be losing meaning and relevance to many in
the linguistics community. I have mentioned it to some full professors
in the field of language arts and gotten a blank stare. I had an
instructor who told me that frequency in language is a myth based on
ignorance of what language is all about. That was some twenty years ago.
I suggest that he ask a cryptographer about that theory. Just a point I
would like to make. Frequency in Latin? Hmmm. I think that etymology is
a fast-fading study and we are losing much by its possible disappearance
from the regular curriculum. Randall

--[4]------------------------------------------------------------------
         Date: Wed, 27 Sep 2000 09:31:20 +0100
         From: Einat Amitay <einat@ics.mq.edu.au>
         Subject: Latin letter frequency & word lists

Hi All,

I believe most of you already know about the CORPORA mailing list
(http://www.hit.uib.no/corpora/). It is a list dedicated to the study of
language corpora and the tools we can develop to work with them. Its members
have answered many questions similar to the ones some Humanist members have
posed recently. I think it will be a good source for answers (and maybe more
questions) about language sample collections and how these can be analysed.

Just a thought,
+:o)
einat

--
Einat Amitay
einat@ics.mq.edu.au
http://www.ics.mq.edu.au/~einat

