18.478 indexing local machines

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Sun, 9 Jan 2005 09:37:54 +0000

               Humanist Discussion Group, Vol. 18, No. 478.
       Centre for Computing in the Humanities, King's College London
                   www.kcl.ac.uk/humanities/cch/humanist/
                        www.princeton.edu/humanist/
                     Submit to: humanist_at_princeton.edu

   [1] From: "Okyere, Emmanuel II" <chief_at_okyere.org> (79)
         Subject: RE: 18.474 indexing local machines

   [2] From: Willard McCarty <willard.mccarty_at_kcl.ac.uk> (35)
         Subject: structured & unstructured searching

   [3] From: David Sewell <dsewell_at_virginia.edu> (17)
         Subject: Re: 18.474 indexing local machines

--[1]------------------------------------------------------------------
         Date: Sun, 09 Jan 2005 09:11:20 +0000
         From: "Okyere, Emmanuel II" <chief_at_okyere.org>
         Subject: RE: 18.474 indexing local machines

| --[1]------------------------------------------------------------------
| Date: Sat, 08 Jan 2005 10:16:29 +0000
| From: Jan Christoph Meister <jan-c-meister_at_uni-hamburg.de>
|
| A most interesting topic indeed! However, let me pose a trivial
| question: how good is an automatic indexing tool when it comes to
| searching for NEW information,

A valid question. However, I think that if the information is _new_ (and I am
assuming here that you had control over where it was put), then for the most
part you know where to look for it, so searching for it, rather than simply
going to get it, would not be the right route. The other part of this, which
is the more direct response to the question, is that your search results are
only as good as how current your index cache is. One of the things I do with
the Copernic Desktop Search tool (www.copernic.com) is make sure it re-runs
the indexing at 4 a.m. every day, so that deleted and new items are picked
up; I'm not really sure how successful this is, but hopefully it works quite
well.
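
For what it's worth, on a Unix-like machine the same nightly re-run could be
scheduled with cron. The sketch below is purely illustrative: "myindexer" and
its --rebuild flag are hypothetical stand-ins for whatever command your
indexing tool provides (Copernic itself handles scheduling from within its
own interface):

    # crontab entry (sketch): re-run the indexer at 4:00 a.m. every day
    # so that new and deleted items are picked up; "myindexer" and the
    # --rebuild flag are hypothetical placeholders, not a real tool
    0 4 * * *  /usr/local/bin/myindexer --rebuild /home/me/doc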

| or for information that might be
| available (local or distributed), but classified, contextualized and,
| most importantly, VERBALIZED in an unanticipated manner?

You are right. Because these tools are built for a generic audience, there is
no way they can satisfy custom or proprietary contexts _well enough_; I do
think, however, that all the popular full-text search engines come close
enough to what you are looking for, as long as they can parse the contexts in
question to some degree.

| It seems that clever algorithms plus enough brute force computing
| power have made obsolete the deeply nested systematic index structures
| of yore. Having a rough idea of what you're looking for is good
| enough to make you choose the Google over the Yahoo-directory route.
| However, the snag is that you can only find what you are able to
| represent in a string that matches or approximates something captured in
| the index database. That's exactly why libraries have systematic catalogues: we
| can do deductive searches, starting with some generic top-level
| concept and then drill down to find the new and unexpected (rather
| than simply re-find stuff we knew we had somewhere, but just couldn't
| trace). To put it in a philosophical nutshell: if we decide to go the
| unstructured route and subscribe to what I'd like to call the 'Google
| paradigm' of knowledge representation then aren't we locking ourselves
| into the static configuration of knowledge as we have it here and now,
| expressed in the strings indexed by the machine? In fact, is an
| unsystematic index 're-presenting' knowledge in the first place? Is it
| not just simply defining its relational coordinates in the database,
| without any semantic surplus value generated that would allow us to
| retool and reconfigure our knowledge elements?

You touch on a number of issues here, and I certainly agree with you for the
most part. For instance, I keep "doc" folders that are further broken down
into the subject areas I read on: "softeng" for software engineering, "java"
for purely Java-related material, and so on; this makes my collection of
documents more valuable and easier to track down. You can see, however, that
at a certain level the subject areas begin to overlap; for instance, I could
have material that could very well be put in either the "softeng" or the
"java" folder (and I could certainly argue that both folders have unique
enough reasons to co-exist). Clearly, a search tool (one that indexes the
keywords I'm likely to use to drill down to what I'm looking for, as you
mention) is invaluable here, since it saves me, to a certain extent, the
effort of manually looking up the information.

The other part of this is that the "libraries have systematic catalogues"
point is what you are trying to reproduce here, at least in part, in that the
indexing tool creates a "system" for locating your records. The indexes
collated are only as valuable as the time they save in leading you to what
you are looking for; so, though I am not sure of this, you may well see an
inverse relationship between value and document collection size here. I want
to believe that for individual use the document collection is small enough,
and the individual's domain narrow and distinctive enough, that the keywords
used lead effectively to what you are searching for and make the indexes
accrued worth having.

Emmanuel

---
Emmanuel OKYERE II
CTO - AKUABA, LLC
Phone/Fax:  703.815.4702
PGP Key ID: 0xA7FD6168
MSN: compubandit
AIM: compubndit
http://www.okyere.org/
--[2]------------------------------------------------------------------
         Date: Sun, 09 Jan 2005 09:12:06 +0000
         From: Willard McCarty <willard.mccarty_at_kcl.ac.uk>
         Subject: structured & unstructured searching

In Humanist 18.474 Jan Christoph Meister points out one strong reason for
maintaining highly structured representations: "we can do deductive
searches, starting with some generic top-level
concept and then drill down to find the new and unexpected (rather than
simply re-find stuff we knew we had somewhere, but just couldn't trace)."
Certainly when searching a large collection that you've not assembled
yourself, a highly structured catalogue can be invaluable. I wonder,
however, whether the situation isn't rather different with your own
collection, particularly if as a matter of course it contains notes you've
made on the collected items. And I wonder too if the strategies for
searching do not change as collections get larger -- whether there are not
thresholds past which one technique becomes more valuable than another.
This is, I think, a subject to be explored by experiment rather than a priori.

In evaluating googling, I think we have to take care not to conceptualize
what happens as a single, non-interactive query with a singular set of
answers. Often when I google I begin with one character string and then
modify it or scrap it and substitute another, depending on the results.
Indeed, the results are very important in determining the searches that
follow. Often what happens is that my ideas change as the search
progresses. Some of the most valuable finds have been of things I have not
started out searching for.

I wouldn't ever advocate using one way of searching only. If we had done
that as hunters and gatherers, we would never have survived. There are many
ways to find and follow clues. But I do think that the Doctrine of Full and
Perfect Enlightenment through Metadata needs to be subjected to the same
criterion, particularly since encoding the world's resources is an
exceedingly expensive project. Again, I would think that the level of
encoding is a pragmatic question rather than a principled one.

Yours,
WM
[NB: If you do not receive a reply within 24 hours please resend]
Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
Humanities | King's College London | Kay House, 7 Arundel Street | London
WC2R 3DX | U.K. | +44 (0)20 7848-2784 fax: -2980 ||
willard.mccarty_at_kcl.ac.uk www.kcl.ac.uk/humanities/cch/wlm/
--[3]------------------------------------------------------------------
         Date: Sun, 09 Jan 2005 09:12:38 +0000
         From: David Sewell <dsewell_at_virginia.edu>
         Subject: Re: 18.474 indexing local machines

For people who store email on a Unix-ish machine where they are also able
to install software, a good email indexing solution might be "mairix":
          http://www.rc0.org.uk/mairix/
It handles the traditional Unix mbox format as well as MH and maildir ones.
There's a Debian Linux package as well as source code for rolling your own.
If you're comfortable installing *nix programs and setting basic options in
a configuration file, you won't have any problem getting it to work.
(Caveat: don't just try to do a quickstart by editing your config file
without reading the brief user documentation, which gives clear
explanations of available options.)
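
As an illustration only, a minimal setup looks roughly like the sketch below;
the option names and search syntax are as I recall them from the mairix
documentation, so check the man page for your version before relying on them:

    # ~/.mairixrc (sketch)
    base=/home/me/Mail            # directory containing the mail folders
    maildir=inbox:archive         # folders to index (mbox= and mh= also exist)
    mfolder=search-results        # folder where matching messages are collected
    database=/home/me/.mairix_db  # where the index itself is stored

    $ mairix                      # (re)build the index; suitable for a nightly cron job
    $ mairix s:indexing f:mccarty # search by subject and sender, then read
                                  # the matches from the "search-results" folder
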
DS
--
David Sewell, Editorial and Technical Manager
Electronic Imprint, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell_at_virginia.edu   Tel: +1 434 924 9973
Received on Sun Jan 09 2005 - 04:44:02 EST

This archive was generated by hypermail 2.2.0 : Sun Jan 09 2005 - 04:44:07 EST