Humanist Discussion Group, Vol. 18, No. 478.
Centre for Computing in the Humanities, King's College London
www.kcl.ac.uk/humanities/cch/humanist/
www.princeton.edu/humanist/
Submit to: humanist_at_princeton.edu
[1] From: "Okyere, Emmanuel II" <chief_at_okyere.org> (79)
Subject: RE: 18.474 indexing local machines
[2] From: Willard McCarty <willard.mccarty_at_kcl.ac.uk> (35)
Subject: structured & unstructured searching
[3] From: David Sewell <dsewell_at_virginia.edu> (17)
Subject: Re: 18.474 indexing local machines
--[1]------------------------------------------------------------------
Date: Sun, 09 Jan 2005 09:11:20 +0000
From: "Okyere, Emmanuel II" <chief_at_okyere.org>
Subject: RE: 18.474 indexing local machines
| --[1]------------------------------------------------------------------
| Date: Sat, 08 Jan 2005 10:16:29 +0000
| From: Jan Christoph Meister <jan-c-meister_at_uni-hamburg.de>
|
| A most interesting topic indeed! However, let me pose a trivial
| question: how good is an automatic indexing tool when it comes to
| searching for NEW information,
A valid question. However, I think that if the information is _new_ (and I am
assuming here that you had control over where it was put), then for the most
part you know where to look for it, so searching for it instead of just going
to get it wouldn't be the right route. The other part of this, which is the
more direct response to the question, is that your search results are only as
good as your index cache is current. One of the things I do with the Copernic
desktop search tool (www.copernic.com) is make sure it re-runs the indexing at
4am every day, so that deleted and new items are captured; I'm not really sure
how successful this is, but hopefully it works quite well.
| or for information that might be
| available (local or distributed), but classified, contextualized and,
| most importantly, VERBALIZED in an unanticipated manner?
You are right. Because these tools are provided for a generic audience,
there's no way they can satisfy custom/proprietary contexts _well enough_; I
do think, however, that all the popular full-text search engines come close
enough in approximating what you are looking for, as long as they can, to a
certain degree, parse the contexts in question.
| It seems that clever algorithms plus enough brute force computing
| power have made obsolete the deeply nested systematic index structures
| of yonder. Having a rough idea of what you're looking for is good
| enough to make you choose the Google over the Yahoo-directory route.
| However, the snag is that you can only find what you are able to
| represent in a string that matches or approximates something captured in
| the index database. That's exactly why libraries have systematic catalogues: we
| can do deductive searches, starting with some generic top-level
| concept and then drill down to find the new and unexpected (rather
| than simply re-find stuff we knew we had somewhere, but just couldn't
| trace). To put it in a philosophical nutshell: if we decide to go the
| unstructured route and subscribe to what I'd like to call the 'Google
| paradigm' of knowledge representation then aren't we locking ourselves
| into the static configuration of knowledge as we have it here and now,
| expressed in the strings indexed by the machine? In fact, is an
| unsystematic index 're-presenting' knowledge in the first place? Is it
| not just simply defining its relational coordinates in the database,
| without any semantic surplus value generated that would allow us to
| retool and reconfigure our knowledge elements?
You touch on a number of issues here, and I certainly agree with you for the
most part. For instance, I have "doc" folders that are further broken down into
the subject areas I read on ("softeng" for software engineering, "java" for
purely Java-related stuff, and such); this makes my collection of documents
more valuable and easier to track down. You can see, however, that at a
certain level the document areas begin to overlap; for instance, I could
have material that could very well be put in either the "softeng" or the "java"
folder (and I could certainly make the argument that both folders have
unique-enough reasons to co-exist). Clearly, a search tool (one that indexes
the keywords I'm likely to use to drill down to what I'm looking for, as you
mention) is invaluable here, as it saves me the effort of manually looking up
the information, to a certain extent.
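A minimal sketch of the two points above, assuming a toy in-memory collection
rather than any real desktop search tool: a keyword index finds a document
whichever folder it was filed in, and results are only as current as the last
index run. The file names, sample texts, and the functions build_index and
search are hypothetical, chosen for illustration only; this is not how
Copernic or any other product mentioned here is implemented.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document paths containing it."""
    index = defaultdict(set)
    for path, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the paths that contain every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents filed under overlapping subject folders.
docs = {
    "doc/softeng/patterns.txt": "Design patterns for Java software engineering",
    "doc/java/generics.txt": "Java generics and collections",
}

index = build_index(docs)

# The keyword search cuts across the folder structure.
hits = search(index, "java")

# A document added after the index run is invisible until the next run.
docs["doc/softeng/refactoring.txt"] = "Refactoring Java code safely"
stale_hits = search(index, "refactoring")

index = build_index(docs)  # the scheduled re-index
fresh_hits = search(index, "refactoring")
```

Here hits contains both the "softeng" and the "java" document, stale_hits is
empty, and fresh_hits contains the new file only after the index is rebuilt.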
The other part of this is that the "libraries have systematic catalogues"
comment describes what you are trying to reproduce here, at least in part, in
that the indexing tool creates a "system" for locating your records. The
indexes collated are only as valuable as how quickly they lead you to what
you are looking for, so, though I am not sure of this, you might have an
inverse relationship between value and document collection size here. I want
to believe that for individual use, the document collection is small
enough, and the individual's domain narrow/unique enough, for the keywords
used to lead effectively to what you are searching for and to make the
indexes accrued worth having.
Emmanuel
---
Emmanuel OKYERE II
CTO - AKUABA, LLC
Phone/Fax: 703.815.4702
PGP Key ID: 0xA7FD6168
MSN: compubandit
AIM: compubndit
http://www.okyere.org/
--[2]------------------------------------------------------------------
Date: Sun, 09 Jan 2005 09:12:06 +0000
From: Willard McCarty <willard.mccarty_at_kcl.ac.uk>
Subject: structured & unstructured searching
In Humanist 14.484 Jan Christoph Meister points out one strong reason for
maintaining highly structured representations: "we can do deductive
searches, starting with some generic top-level concept and then drill down
to find the new and unexpected (rather than simply re-find stuff we knew we
had somewhere, but just couldn't trace)."
Certainly when searching a large collection that you've not assembled
yourself, a highly structured catalogue can be invaluable. I wonder,
however, whether the situation isn't rather different with your own
collection, particularly if as a matter of course it contains notes you've
made on the collected items. And I wonder too if the strategies for
searching do not change as collections get larger -- whether there are not
thresholds past which one technique becomes more valuable than another.
This is, I think, a subject to be explored by experiment rather than a
priori.
In evaluating googling, I think we have to take care not to conceptualize
what happens as a single, non-interactive query with a singular set of
answers. Often when I google I begin with one character string and then
modify it or scrap it and substitute another, depending on the results.
Indeed, the results are very important in determining the searches that
follow. Often what happens is that my ideas change as the search
progresses. Some of the most valuable finds have been of things I have not
started out searching for.
I wouldn't ever advocate using one way of searching only. If we had done
that as hunters and gatherers, we would never have survived. There are many
ways to find and follow clues.
But I do think that the Doctrine of Full and Perfect Enlightenment through
Metadata needs to be subjected to the same criterion, particularly since
encoding the world's resources is an exceedingly expensive project. Again,
I would think that the level of encoding is a pragmatic question rather
than a principled one.
Yours,
WM
[NB: If you do not receive a reply within 24 hours please resend]
Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
Humanities | King's College London | Kay House, 7 Arundel Street | London
WC2R 3DX | U.K. | +44 (0)20 7848-2784 fax: -2980 ||
willard.mccarty_at_kcl.ac.uk www.kcl.ac.uk/humanities/cch/wlm/
--[3]------------------------------------------------------------------
Date: Sun, 09 Jan 2005 09:12:38 +0000
From: David Sewell <dsewell_at_virginia.edu>
Subject: Re: 18.474 indexing local machines
For people who store email on a Unix-ish machine where they are also able
to install software, a good email indexing solution might be "mairix":
http://www.rc0.org.uk/mairix/
It handles the traditional Unix mbox format as well as MH and maildir ones.
There's a Debian Linux package as well as source code for rolling your own.
If you're comfortable installing *nix programs and setting basic options in
a configuration file, you won't have any problem getting it to work.
(Caveat: don't just try to do a quickstart by editing your config file
without reading the brief user documentation, which gives clear
explanations of available options.)
DS
--
David Sewell, Editorial and Technical Manager
Electronic Imprint, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell_at_virginia.edu Tel: +1 434 924 9973
Received on Sun Jan 09 2005 - 04:44:02 EST
This archive was generated by hypermail 2.2.0 : Sun Jan 09 2005 - 04:44:07 EST