Humanist Discussion Group, Vol. 16, No. 667.
Centre for Computing in the Humanities, King's College London
www.kcl.ac.uk/humanities/cch/humanist/
Submit to: humanist@princeton.edu
Date: Tue, 06 May 2003 08:43:20 +0100
From: Wendell Piez <wapiez@mulberrytech.com>
Subject: Re: 16.656 help with the moving target
Hi Bob,
>The files need to be made much more useful, in addition to receiving
>further verification for accuracy in some instances. Some of them would
>perhaps be more effective in some sort of database configuration, although
>over the years I have avoided abandoning "flat file" presentation to avoid
>the vagaries of incompatible updating in more complex formats.
Moving from a (structured) flat file format to a markup-based format is
generally straightforward, but significant. Markup has the advantage of far
greater self-containment, at the price of verbosity and some trade-offs in
legibility. (It can be more legible; but it can also be less so, if your
parsing rules for your flat file are simple enough.) In the kind of dataset
you're looking at, you may find that a combination of old and new
approaches works -- e.g. bare markup to delimit token boundaries (thereby
making them accessible to markup-aware processors like XSLT engines), and
an external data set to track everything. This is sometimes called
"standoff markup".
> I have
>become proficient in using Dreamweaver for HTML conversion and creation
>purposes, but suspect that there are more efficient and effective ways to
>proceed, and that I probably should be jumping directly to XML at this
>stage and eliminating the middle element (HTML).
There are many questions here regarding how the data set is expected to be
used, but given a spec of the current data format(s) -- what fields are in
various positions -- it shouldn't be hard to render the stuff in XML, even
as an XML "table" of HTML-style divs and spans... (on the way to TEI
feature structures, some might say).
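For instance -- a sketch with invented field names, not a claim about your
actual record layout -- a tab-delimited line such as

   aleph[TAB]noun[TAB]ox

could come out the other end as

   <entry>
     <form>aleph</form>
     <pos>noun</pos>
     <gloss>ox</gloss>
   </entry>

or, in the div-and-span idiom, as

   <div class="entry"><span class="form">aleph</span>
     <span class="pos">noun</span> <span class="gloss">ox</span></div>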
>I'm wide open to suggestions. How does Wendell work directly in XML
...well, commonly with a plain ASCII text editor (with support for regular
expressions and a way to call the shell to run a parse or transformation),
but not only...
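For example -- a hypothetical pattern, assuming the three-field records
sketched above, and whatever regex dialect your editor speaks -- a single
search-and-replace gets you from flat lines to tagged ones:

   Search:  ^([^\t]+)\t([^\t]+)\t([^\t]+)$
   Replace: <entry><form>\1</form><pos>\2</pos><gloss>\3</gloss></entry>

Run that over the file, wrap the result in a single root element (say
<entries>), and you have well-formed XML ready for a parse.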
>, and is
>that something I should be doing -- or training my incipient staff to do?
>If not, do I have better alternatives than mentioned above for my
>particular purposes (web accessibility, linking of various sorts both
>within and between files, addition of images [e.g. paleographical
>features, odd forms], and the like)? I'm looking for shortcuts to the most
>effective path to the unforeseen future, now that a relatively productive
>past has become passé.
Well, it's the "unforeseen" part that rubs. A well-designed generic
encoding -- whether it's a tag set in common use, like TEI, or a private
tag set optimized to local requirements -- certainly opens the door. Yet
behind it is not all plush and comfort.
In order to assess whether XML is even really suitable, an honest
consultant would still look very hard at *specifically* what you want to
accomplish with the data. Then would come, no doubt, a prototyping phase,
testing some of the "how" as well as the "what".
Often this stuff reminds me of the Monty Python routine, "How to Do It".
"How to Rid the World of All Known Diseases": first, you become a famous
doctor, etc. A programmer who knows text-munging could XMLize your data
easily enough. That *opens* options; unfortunately, once those options are
opened, the real work begins: selecting from among them, picking the right
tools for the job, and implementing.
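To pick just one such option, purely as a sketch: assuming the invented
<entry> records above, wrapped in a root <entries> element, a dozen lines
of XSLT already web-publish the file:

   <xsl:stylesheet version="1.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <!-- wrap the whole set of records in a bare HTML page -->
     <xsl:template match="/entries">
       <html><body><dl><xsl:apply-templates select="entry"/></dl></body></html>
     </xsl:template>
     <!-- render each record as a definition-list item -->
     <xsl:template match="entry">
       <dt><xsl:value-of select="form"/> (<xsl:value-of select="pos"/>)</dt>
       <dd><xsl:value-of select="gloss"/></dd>
     </xsl:template>
   </xsl:stylesheet>

The same source could just as well feed a print pipeline or a search
index; that is the "options" part.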
It is worth noting that my conversation with AM presupposes these issues
are addressed, if only implicitly by drawing bounds around the problem. I
write my papers in XML these days -- largely because what constitutes a
"paper" for these purposes is, although fairly capacious, also quite well
known. If not in general, then at least to *me* for my own purposes. Add to
that the network effects of all that XML out there -- I don't have to make
my own tools -- and for the XML-experienced developer, the up-front costs
get to be well worth it. But note I've already absorbed much of what keeps
the initial investment prohibitive to many new users.
I know this comes as a poor answer to your question. IMO the learnability
of markup technologies -- notwithstanding various improvements in some
respects -- is still an issue. (And one I have hopes of addressing in one
way or another.) The technologies present, it seems, significant
challenges: partly because, though simple at their heart, the related
standards and technologies keep sprouting fiddly bits that are just hard
to assimilate; and partly because -- simple at their heart, again -- they
form a family of approaches to handling electronic data that is notably at
odds with many of the assumptions that otherwise tend to govern development.
(Much of the reason why the technologies are so promising and, when done
well, effective, starts in their questioning of assumptions that are
usually more appropriate for an IT department than they are for an academic
project.) While this may seem to leave many projects out in the cold, it's
a salutary point. It's important to note, in the context of the
"consumptive humanities" thread, that these things aren't either/or, but
rather half-full/half-empty. What use to you is the ubiquity of free tools
if you can't come by instructions in their use?
Let me know off list if you want more particulars.
Cheers,
Wendell
======================================================================
Wendell Piez mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================