14.0274 counting words

From: by way of Willard McCarty (willard@lists.village.Virginia.EDU)
Date: 09/27/00


                   Humanist Discussion Group, Vol. 14, No. 274.
           Centre for Computing in the Humanities, King's College London
                   <http://www.princeton.edu/~mccarty/humanist/>
                  <http://www.kcl.ac.uk/humanities/cch/humanist/>
    
       [1]   From:    Eric Johnson <johnsone@jupiter.dsu.edu>             (10)
             Subject: Counting words
    
       [2]   From:    "Jim Marchand" <marchand@ux1.cso.uiuc.edu>          (30)
             Subject: letter counts
    
       [3]   From:    Randall Pierce <rpierce@jsucc.jsu.edu>               (8)
             Subject: ETAOIN SHRDLU
    
       [4]   From:    Einat Amitay <einat@ics.mq.edu.au>                  (13)
             Subject: Latin letter frequency & word lists
    
    
    --[1]------------------------------------------------------------------
             Date: Wed, 27 Sep 2000 09:29:13 +0100
             From: Eric Johnson <johnsone@jupiter.dsu.edu>
             Subject: Counting words
    
    
         Some years back, I wrote a computer program to count the words in an
    ASCII text file.  My program can be downloaded from the web at:
    
    http://www.dsu.edu/~johnsone/sno.html
    
    WORDS and WordsNT (the version I recommend) run from a PC command line.
    An article on the same web page describes the program.
    
         I would be interested to know if anyone finds my program useful.
    
         --Eric Johnson
           johnsone@jupiter.dsu.edu
           http://www.dsu.edu/~johnsone/
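
For readers who want to see what counting the words in an ASCII text file involves, here is a minimal Python sketch. It is not the WORDS or WordsNT program, whose method is not described in this message. The definition of a "word" (a run of ASCII letters, optionally joined by internal apostrophes or hyphens) and the name count_words are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# internal apostrophes or hyphens (e.g. "cat's", "well-known").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


if __name__ == "__main__":
    sample = "The cat sat. The cat's hat sat on the mat."
    counts = count_words(sample)
    # Running words (tokens) versus distinct words (types).
    print(sum(counts.values()), "running words,", len(counts), "distinct")
    for word, n in counts.most_common(3):
        print(word, n)
```

Changing the regular expression changes the totals, so any published count should state the word definition it used.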
    
    
    
    
    --[2]------------------------------------------------------------------
             Date: Wed, 27 Sep 2000 09:30:03 +0100
             From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
             Subject: letter counts
    
    Talking about primitives of all kinds, the discussion of the
    frequency of letters in Latin brings up a number of questions:
    
    1. What corpus should one use?  How many running words seem
    enough? Zipf, although he was a statistician, used a corpus of 5000
    running words from one author (Cicero), surely too little. Certainly, Anne's
    count (7.8 million words!) is large enough.
    
    2. Do we not need to label our count carefully?  Latin, German,
    French, etc. are surely too large and inclusive.  Zipf used a
    chrestomathy from many different periods for French, surely a bad
    notion as a first step.
    
    3. Letter counts vs. phoneme counts. Letters are graphemes at best,
    not phonemes.
    
    4. What to do with non-ASCII letters?  The count cited by Zipf (and
    others) for German contains no umlauts, that for French no accented
    letters, no distinction between c and c-cedilla.
    
    5. What do we do with editions, and which do we choose?  For
    example, it is common to distinguish between i and j, u and v in
    editions, though these are Renaissance inventions for the most
    part. What do we do with assimilations?  Some editions use inl- for
    ill- even.  Whatever is done, it needs to be spelled out carefully,
    if it is important.
    
    6. Where an edition includes two or more witnesses, we need to be
    careful to distinguish between them, perhaps not so much in the
    matter of letter counts, but otherwise.  If our editions contain
    conjectural emendations, do we not need to excise and/or label
    these? In making a frequency count of Gothic words, for example,
    should we not be careful not to use two versions of the same text
    twice?
    
    As I look through Martin Joos's excellent dissertation (Wisconsin,
    1942), a count of Gothic graphemes, I notice all these problems, so
    that the frequencies seem skewed to me.  Perhaps I am too finicky.
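
Points 4 and 5 above can be made concrete with a short Python sketch in which each normalization decision is an explicit, labeled option rather than a silent default. This is only an illustration under stated assumptions: the function name letter_counts and the flags fold_ij_uv and strip_diacritics are hypothetical, and it counts graphemes, not phonemes (point 3).

```python
import unicodedata
from collections import Counter


def letter_counts(text, fold_ij_uv=False, strip_diacritics=False):
    """Count letters in text, with every normalization choice spelled out.

    fold_ij_uv       -- hypothetical flag: count j as i and v as u, undoing
                        the editorial i/j and u/v distinction.
    strip_diacritics -- hypothetical flag: merge accented letters with their
                        base letter (c-cedilla counts as c, u-umlaut as u).
    """
    # Compose first so that each accented letter is a single code point.
    text = unicodedata.normalize("NFC", text).lower()
    if strip_diacritics:
        # Decompose, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    if fold_ij_uv:
        text = text.translate(str.maketrans("jv", "iu"))
    # Keep letters only; digits, spaces and punctuation are ignored.
    return Counter(ch for ch in text if ch.isalpha())


if __name__ == "__main__":
    sample = "Julius vivit"
    print(letter_counts(sample).most_common())
    print(letter_counts(sample, fold_ij_uv=True).most_common())
```

Reporting the flag settings alongside the resulting table is one way of labeling a count so that others can reproduce or dispute it.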
    
    --[3]------------------------------------------------------------------
             Date: Wed, 27 Sep 2000 09:31:02 +0100
             From: Randall Pierce <rpierce@jsucc.jsu.edu>
             Subject: ETAOIN SHRDLU
    
       The above mnemonic seems to be losing meaning and relevance to many in
    the linguistics community. I have mentioned it to some full professors
    in the field of language arts and gotten a blank stare. I had an
    instructor who told me that frequency in language is a myth based on
    ignorance of what language is all about. That was some twenty years ago.
    I suggest that he ask a cryptographer about that theory. Just a point I
    would like to make. Frequency in Latin? Hmmm. I think that etymology is
    a fast-fading study and we are losing much by its possible disappearance
    from the regular curriculum. Randall
    
    --[4]------------------------------------------------------------------
             Date: Wed, 27 Sep 2000 09:31:20 +0100
             From: Einat Amitay <einat@ics.mq.edu.au>
             Subject: Latin letter frequency & word lists
    
    Hi All,
    
    I believe most of you already know about the CORPORA mailing list
    (http://www.hit.uib.no/corpora/). It is a list dedicated to the study of
    language corpora and what tools we can develop to work with those. Together
    they have answered many questions similar to the ones some of Humanist's people
    have posed recently. I think it will be a good source for answers (and maybe more
    questions) about language sample collections and how these can be analysed.
    
    Just a thought,
    +:o)
    einat
    
    --
    Einat Amitay
    einat@ics.mq.edu.au
    http://www.ics.mq.edu.au/~einat
    


