database management, cont. (267)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Sat, 25 Mar 89 00:33:19 EST


Humanist Mailing List, Vol. 2, No. 765. Saturday, 25 Mar 1989.


(1) Date: Fri, 24 Mar 89 09:50:55 EST (27 lines)
From: Greg Goode <grgo@uhura.cc.rochester.edu>
Subject: Bio/bibliographical database

(2) Date: 24 March 1989, 20:42:59 EST (220 lines)
From: ROBERT E. SINKEWICZ (416) 926-7128 ROBERTS at UTOREPAS
Subject: Database Advice

(1) --------------------------------------------------------------------
Date: Fri, 24 Mar 89 09:50:55 EST
From: Greg Goode <grgo@uhura.cc.rochester.edu>
Subject: Bio/bibliographical database

A possible idea for Paul Delany, who is searching on behalf of a
friend for a good microcomputer database: for the IBM world, one
candidate would be Pro-Cite. It's probably the top-shelf
bibliographical database out there for IBM PCs and compatibles. It
will automatically format your entries into any one of a variety of
accepted styles, including MLA. I believe you can also design and
define your own styles, so that any bibliographical information
you enter will print in that style. It allows very large comment
fields, which is where the body of *biographical* data could go.

I believe it's upwards of $350 retail, and available from the
original vendor,

Pro-Cite bibliographical database management system
Personal Software Systems
412 Longshore Drive
Ann Arbor, MI 48105
U.S.A. Phone: (313) 996-1580

--Greg Goode


(2) --------------------------------------------------------------224---
Date: 24 March 1989, 20:42:59 EST
From: ROBERT E. SINKEWICZ (416) 926-7128 ROBERTS at UTOREPAS
Subject: Database Advice

A recent posting from Paul Delany asking for advice on databases reminded
me of the sort of questions I have often been asked by colleagues. There
must be many advisors, consultants, local gurus, experienced database
users on HUMANIST who frequently get this sort of question. I have
therefore collected some of my own thoughts on the matter. If this sort of
summary seems useful to anyone, I would be happy to receive further
additions, corrections, and suggestions, and will try to incorporate
them and post the results as a file available on the LISTSERV.

Bob Sinkewicz
Pontifical Institute of Mediaeval Studies
Toronto
ROBERTS@UTOREPAS
ROBERTS@VM.EPAS.UTORONTO.CA
------------------------------------------------------------------------

CHOOSING A DATABASE
===================

BASIC QUESTIONS

1. What kind or kinds of data do you want to store?
- primarily numerical (financial, economic, demographic)
- primarily character (texts, brief biographies, bibliographies, lists
of places, names, etc.)
- a mix of the above two

2. How much data do you or will you have?
- Will the database be small, medium or large?
small = less than 5,000 records
medium = between 5,000-30,000 records
large = more than 30,000 records

3. What do you want to do with the data you have stored?
- mathematical statistics
- collation and manipulation of data in more general ways (that may
include some mathematical functions)
- linguistic, semantic, stylistic analysis of text data
- What are your requirements for character string manipulation
functions?
- What sorts of mathematical functions do you require?

4. How are you going to get the data into the database?
- Is the data already available in some electronic form?
- How much sophistication do you require at the data entry stage?
    creation of data entry forms
    different types of data validation:
        only numeric data in numeric columns
        valid date format in date columns
        unique and/or NOT NULL values in certain columns
    automated triggers (data accessed or updated automatically in
        other tables under certain conditions)
    (a concrete sketch of these features follows the list of questions)
- If there is a lot of text data, what sort of SGML markup do you require?

5. How do you want to get the data out of the database?
- graphs and charts
- table oriented reports (columns, sub-columns, groups, etc.)
- line oriented data (non-tabular: e.g., bibliographic citations)
- special formatting requirements on output (e.g., flexible
bibliographic style sheet formats)

6. What is your religious affiliation?
Apple
MS-DOS
UNIX

7. What can you afford?
- a basic XT clone / Mac Plus
- a good AT-class machine at 10 MHz / Mac SE
- an 80386-class DOS machine / Mac II
- a Sun workstation

8. How much time do you want to invest?
- in learning the basic operations of the software
- in learning the accompanying tools that may require
programming skills
- in data entry (Do the potential results really merit the time
or cost involved?)
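
The data validation features listed under question 4 look roughly like
the following in a relational database. This is purely an illustrative
sketch: Python's built-in sqlite3 module stands in here for whatever
RDBMS you actually use, and the table and column names are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE manuscript (
            shelfmark   TEXT    NOT NULL UNIQUE,     -- must be present and unique
            folios      INTEGER CHECK (folios > 0),  -- numeric data only
            date_copied TEXT    CHECK (date_copied GLOB
                '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')  -- valid date format
        );
        CREATE TABLE audit_log (shelfmark TEXT, note TEXT);

        -- an automated trigger: every new manuscript row is logged in a second table
        CREATE TRIGGER log_entry AFTER INSERT ON manuscript
        BEGIN
            INSERT INTO audit_log VALUES (NEW.shelfmark, 'entered');
        END;
    """)

    con.execute("INSERT INTO manuscript VALUES ('Vat. gr. 504', 300, '1250-01-01')")
    # The next line would violate the CHECK constraint on folios and be rejected:
    # con.execute("INSERT INTO manuscript VALUES ('Vat. gr. 505', -1, '1250-01-01')")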


DATABASE TYPES

1. Statistical Analysis

e.g. LOTUS 1-2-3 (not to be forgotten)
SAS
SPSSX

2. Text Oriented

A. Static Text
MTAS (UTORONTO-CCH)
TACT (UTORONTO-CCH)
WordCruncher
OCP

B. Dynamic Text
HyperCard
LOTUS Agenda

3. Relational (pseudo-/semi-/fully)

e.g. dBase III/IV
Ingres
Oracle
Advanced Revelation


BASIC CONSIDERATIONS

Relational databases are good for handling small items of information
about several distinct subjects stored in a number of separate tables. The
basic idea is that of a table divided into rows and columns. Only small
data items can be manipulated effectively. However, larger data items can
usually be stored and retrieved in some fashion. The normal limit on
character data is 240-254 characters per field or column. Many
software packages make provision for storing much larger chunks of
alphanumeric data, but there are usually some awkward restrictions on
inputting the data (e.g., only through a customized C-language
utility which you must write yourself), on manipulating it (very
little or none at all), or on displaying and retrieving it. Although
there may sometimes be good reasons for storing larger chunks of data
like this, the relational model was really not designed with this in mind.
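
To make the row-and-column idea concrete, here is a minimal two-table
example with a join. Again, Python's sqlite3 module serves only as a
stand-in for whatever RDBMS you use, and all the names are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- small data items about two distinct subjects, in two separate tables
        CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE work   (work_id   INTEGER PRIMARY KEY,
                             author_id INTEGER REFERENCES author(author_id),
                             title     TEXT,
                             incipit   TEXT);  -- a short character field, not a full text
    """)
    con.execute("INSERT INTO author VALUES (1, 'Anonymous of Toronto')")
    con.execute("INSERT INTO work VALUES (1, 1, 'De basibus datorum', 'Quaeritur utrum ...')")

    # a join reassembles related rows drawn from the separate tables
    for name, title in con.execute("""
            SELECT author.name, work.title
            FROM work JOIN author ON work.author_id = author.author_id"""):
        print(name, "-", title)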

The amount of data you have and the internal storage technique of the
software are often important considerations. A DOS machine will have
difficulties with a database of more than 30,000 records. Much of this has
to do with the lack of memory management capabilities in DOS. Some
software tries to get around this by using expanded (Ingres) or extended
(Oracle) memory. Even so, 30,000 records is pretty much the maximum a DOS
machine can handle. If your database approaches that size, anything less
than a 386 machine with a very fast hard disk (under 20 ms) will be
excruciatingly slow. If you have more than 1,000 records, an AT-class
machine at 10 MHz with a 28 ms hard disk is practically a must. There may
indeed be someone out there who has succeeded in stuffing 100,000 records
onto a 6 MHz AT running dBase III. I am not saying that it is impossible,
but I don't think it can be recommended. Storage efficiency is also a big
consideration if you have a large database with a lot of variable-length
character fields. Some software uses very inefficient storage techniques.
If you define a character field (column) as 254 bytes, some software will
use up 254 bytes of your hard disk even if you only put 5 characters in
that field in a given record (row). Even a large hard disk can be filled
up very fast.
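A single fixed-length 254-byte column in a 30,000-record table, for
example, occupies roughly 30,000 x 254 = 7.6 million bytes on disk even
if most of its fields hold only a few characters; two or three such
columns will swallow a typical 20 MB hard disk outright.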

It is important to realize that most relational database software is
designed for the environment of the business world. The output formatting
capabilities are no exception. If you require tabular formatted reports,
even very sophisticated ones, the software is certainly adequate and
sometimes excellent (though not always easy to use). However, if you
require some form of line oriented output, then you may encounter
difficulties, since many standard software packages either do not provide
for this at all or do so with varying degrees of difficulty. It is
possible to maintain a bibliography with a relational database and some
formatters or report writers will handle output in an acceptable format.
However, it is often better either to use dedicated bibliography
management software for large databases (e.g., Pro-Cite), or to rely on a
good word processor for smaller bibliographies (e.g., WordPerfect or
NotaBene).

In some situations it might be best to consider two software tools
rather than trying to force an RDBMS to do something it does not do well.
You might, for example, use dBase III/IV in combination with a good word
processor or a simple text database tool like MTAS. There are ways of
tagging the data in a text database so that it corresponds to data
elements in your RDBMS.
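
One sketch of what such tagging might look like, with an invented tag
format and Python standing in for whatever extraction tool you use: each
passage in the text carries the key of the RDBMS row it belongs to, so
the results of a query on the relational side can be used to pull out
the matching passages.

    import re

    tagged_text = """
    <rec 1017>
    Marginal note on fol. 12v concerning the scribe ...
    <rec 1023>
    Colophon giving the date of copying ...
    """

    def passages_for(record_ids, text=tagged_text):
        """Collect the text lines belonging to the given RDBMS record keys."""
        wanted, current, found = set(record_ids), None, {}
        for raw in text.splitlines():
            line = raw.strip()
            m = re.match(r"<rec (\d+)>", line)
            if m:
                current = int(m.group(1))
            elif current in wanted and line:
                found.setdefault(current, []).append(line)
        return found

    # the key 1017 here would come from a query run against the relational side,
    # e.g. "all records entered before a certain date"
    print(passages_for([1017]))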

Relational databases are measured against a standard of perfection
known as Codd's twelve rules (THE TWELVE COMMANDMENTS OF CODD, as they
are sometimes irreverently called). The degree of conformity to this
standard may have some bearing on your choice of database. If your
database will have more than one table, you should learn more about
relational databases and consider how many of the twelve rules would
affect the functional operation of your database.

Beware of the benchmarks that you may see in the trade literature. A
lot depends on the mix of your data and the design of your database.
Software that performs less than best on these "standard benchmarks" may
well be more suited to your specific needs and perform better with your
data sets than the top ranked software.

If you are planning to do any kind of statistical analysis on your
data, you must decide just how sophisticated your requirements are. Most
relational databases offer a good range of mathematical functions that
will meet the needs of many projects. However, if your work requires more
advanced types of statistical modelling, you will have to look at some of
the specialized packages such as SAS and SPSS. You can probably find some
local experts in this sort of thing in your Economics and Social Sciences
departments.

In practical terms (philosophical questions aside) text databases can
be divided into two types, namely, static and dynamic. If you have a dozen
files in a directory on your hard disk, you could consider the files in
that directory as a database of texts. You then need some software to
consult your database. If you are using UNIX or have a DOS version of the
basic UNIX text manipulation tools, you then have a simple and inexpensive
way of querying a set of texts flexibly. This toolbox approach
has many advantages that you may wish to explore.
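
The sort of ad hoc query this approach allows can be sketched as
follows. Python here merely stands in for grep and its relatives, and
the directory name and search word are invented.

    import glob, re

    # report file, line number and text for every line matching a pattern,
    # across all the plain-text files in a directory -- essentially what grep does
    pattern = re.compile(r"\bpietas\b", re.IGNORECASE)
    for path in sorted(glob.glob("texts/*.txt")):
        with open(path, encoding="utf-8") as f:
            for number, line in enumerate(f, start=1):
                if pattern.search(line):
                    print(path, number, line.rstrip(), sep=":")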

If you have a lengthy text (or texts) that you wish to have access to
on an ad hoc basis (where in the text does x or y occur) or that you wish
to analyze stylistically or morphologically, you may want to look at other
software. Some of these packages work on text that requires no sort of
preparation or preprocessing (e.g., MTAS), but most software requires that
you insert markup codes in the text and then either index it (as with
WordCruncher or the Textbase in NotaBene) or run programmes on it to
generate concordances, word lists, etc. (e.g., Micro-OCP).
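
The most elementary product of such processing, a plain word-frequency
list, amounts to little more than the following; this Python fragment,
with an invented sample sentence, stands in for a small part of what
packages like OCP or MTAS actually do.

    import re
    from collections import Counter

    sample = "In the beginning was the Word, and the Word was with God."
    words = re.findall(r"[a-z]+", sample.lower())
    for word, count in Counter(words).most_common():
        print(count, word)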

The text databases discussed so far I would call "static". The
"dynamic" type is perhaps best represented by Apple's HyperCard. This
combines some of the capabilities of the static text database with those
of the relational database. The basic model is that of a filecard
containing a variety of data items, including medium-sized chunks of text.
From one "filecard" you can access other filecards with related or more
detailed data on the same subject. Other types of hypertext software
function somewhat differently but the basic feature is the ability to
follow dynamic links (often ordered hierarchically) between elements of
text.
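
The underlying model can be sketched as a set of cards, each holding a
chunk of text and named links to other cards. The card names below are
invented, and Python stands in for HyperCard's own scripting language.

    # a toy "stack": each card holds some text and named links to other cards
    cards = {
        "Evagrius":   {"text": "Fourth-century monastic writer ...",
                       "links": {"works": "Praktikos", "milieu": "Desert Fathers"}},
        "Praktikos":  {"text": "A treatise in one hundred chapters ...",
                       "links": {"author": "Evagrius"}},
        "Desert Fathers": {"text": "The monastic communities of Egypt ...",
                       "links": {}},
    }

    def follow(card_name, link_name):
        """Return the card reached by following a named link from another card."""
        return cards[cards[card_name]["links"][link_name]]

    # from one "filecard" to another along a dynamic link
    print(follow("Evagrius", "works")["text"])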