Humanist Discussion Group, Vol. 14, No. 274.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

  [1]   From:    Eric Johnson <johnsone@jupiter.dsu.edu>            (10)
        Subject: Counting words

  [2]   From:    "Jim Marchand" <marchand@ux1.cso.uiuc.edu>         (30)
        Subject: letter counts

  [3]   From:    Randall Pierce <rpierce@jsucc.jsu.edu>              (8)
        Subject: ETAOIN SHRDLU

  [4]   From:    Einat Amitay <einat@ics.mq.edu.au>                 (13)
        Subject: Latin letter frequency & word lists

--[1]------------------------------------------------------------------
        Date: Wed, 27 Sep 2000 09:29:13 +0100
        From: Eric Johnson <johnsone@jupiter.dsu.edu>
        Subject: Counting words

Some years back, I wrote a computer program to count the words in an
ASCII text file. My program can be downloaded from the web at:

http://www.dsu.edu/~johnsone/sno.html

WORDS and WordsNT (the version I recommend) run from a PC command line.
An article on the same web page describes the program.

I would be interested to know if anyone finds my program useful.

--Eric Johnson
johnsone@jupiter.dsu.edu
http://www.dsu.edu/~johnsone/

--[2]------------------------------------------------------------------
        Date: Wed, 27 Sep 2000 09:30:03 +0100
        From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
        Subject: letter counts

Talking about primitives of all kinds, the discussion of the frequency
of letters in Latin brings up a number of questions:

1. What corpus should one use? How many running words seem enough?
Zipf, although he was a statistician, used a corpus of 5000 running
words from one author (Cicero), surely too little. Certainly, Anne's
count (7.8 million words!) is large enough.

2. Do we not need to label our count carefully? Latin, German, French,
etc. are surely too large and inclusive. Zipf used a chrestomathy from
many different periods for French, surely a bad notion as a first step.

3. Letter counts vs. phoneme counts.
Letters are graphemes at best, not phonemes.

4. What to do with non-ASCII letters? The count cited by Zipf (and
others) for German contains no umlauts, that for French no accented
letters, no distinction between c and c-cedilla.

5. What do we do with editions, and which do we choose? For example, it
is common to distinguish between i and j, u and v in editions, though
these are Renaissance inventions for the most part. What do we do with
assimilations? Some editions even use inl- for ill-. Whatever is done,
it needs to be spelled out carefully, if it is important.

6. Where an edition includes two or more witnesses, we need to be
careful to distinguish between them, perhaps not so much in the matter
of letter counts, but otherwise. If our editions contain conjectural
emendations, do we not need to excise and/or label these? In making a
frequency count of Gothic words, for example, should we not be careful
not to use two versions of the same text twice?

As I look through Martin Joos's excellent dissertation (Wisconsin,
1942), a count of Gothic graphemes, I notice all these problems, so
that the frequencies seem skewed to me. Perhaps I am too finicky.

--[3]------------------------------------------------------------------
        Date: Wed, 27 Sep 2000 09:31:02 +0100
        From: Randall Pierce <rpierce@jsucc.jsu.edu>
        Subject: ETAOIN SHRDLU

The above mnemonic seems to be losing meaning and relevance to many in
the linguistics community. I have mentioned it to some full professors
in the field of language arts and gotten a blank stare. I had an
instructor who told me that frequency in language is a myth based on
ignorance of what language is all about. That was some twenty years
ago. I suggest that he ask a cryptographer about that theory. Just a
point I would like to make. Frequency in Latin? Hmmm. I think that
etymology is a fast-fading study and we are losing much by its possible
disappearance from the regular curriculum.
Randall

--[4]------------------------------------------------------------------
        Date: Wed, 27 Sep 2000 09:31:20 +0100
        From: Einat Amitay <einat@ics.mq.edu.au>
        Subject: Latin letter frequency & word lists

Hi All,

I believe most of you already know about the CORPORA mailing list
(http://www.hit.uib.no/corpora/). It is a list dedicated to the study
of language corpora and the tools we can develop to work with them.
Together its members have answered many questions similar to the ones
some of Humanist's people have posed recently. I think it will be a
good source for answers (and maybe more questions) about language
sample collections and how these can be analysed.

Just a thought,
+:o)
einat

--
Einat Amitay
einat@ics.mq.edu.au
http://www.ics.mq.edu.au/~einat
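
Editorial sketch, not part of any posting above. Points 4 and 5 in message [2] ask that a letter count spell out what it does with accented letters and with the i/j and u/v distinction. One possible way to make those choices explicit is to expose each as a named option, as in the Python sketch below. The function name `letter_frequencies` and the parameters `merge_ij_uv` and `strip_diacritics` are hypothetical names chosen for illustration; this is not the method used by Zipf, Joos, or the WORDS program.

```python
import unicodedata
from collections import Counter


def letter_frequencies(text, merge_ij_uv=False, strip_diacritics=False):
    """Return relative letter frequencies under explicitly stated choices.

    merge_ij_uv      -- assumption: count j as i and v as u, as a pre-Renaissance
                        spelling would (hypothetical option for illustration)
    strip_diacritics -- assumption: count accented letters as their base letter,
                        so that c-cedilla is counted as c
    """
    # Compose characters first and fold case, so "E" and "e" are one letter.
    text = unicodedata.normalize("NFC", text).lower()

    if strip_diacritics:
        # Decompose, then drop combining marks (accents, cedillas, umlauts).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    if merge_ij_uv:
        text = text.replace("j", "i").replace("v", "u")

    # Count alphabetic characters only; spaces, digits, punctuation are ignored.
    counts = Counter(ch for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


if __name__ == "__main__":
    sample = "juvenis"
    print(letter_frequencies(sample))
    print(letter_frequencies(sample, merge_ij_uv=True))
```

Reporting the option values alongside the resulting table is one way of labelling a count in the sense of point 2, since the same text yields different frequencies under different settings.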