Proud's Text Archive Report (224)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Thu, 16 Mar 89 00:08:30 EST


Humanist Mailing List, Vol. 2, No. 730. Thursday, 16 Mar 1989.

Date: Wed, 15 Mar 89 14:24 GMT
From: Oxford Text Archive <ARCHIVE@VAX.OXFORD.AC.UK>
Subject: Text Archive Report



[Judith Proud's Report on the Oxford Text Archive has now been
submitted to the British Library, and should be published in full
as a Research Report in the near future. Because of its interest
to the Humanist readership, I have prepared the following
condensed version of the Report's main findings, with the
author's blessing. Comments and views would be gratefully
received. - LB]

The report has five main sections. Section 1, which I have not
summarised here, describes what machine readable texts are, the
uses to which they are typically put in the research environment,
and the Archive itself. The aims and methods of the research project
are also outlined. Section 2 describes the origins and
nature of the Archive's holdings. Section 3 summarises the usage
made of the holdings to date. Section 4 addresses the question of
copyright as it affects the Archive. Section 5 makes
recommendations for the future.

Section 2: The Archive's Holdings

The rate of deposits in the Archive has shown little increase
over the last 5 years, which indicates a decline in real terms.
Of the 900 or so texts currently held, 60% are in category U,
i.e. generally available; 26% are in category A (requiring the
depositor's permission before they may be released); and the
remainder are in categories X or 0 (not available outside
Oxford). In terms of volume, 30% of the texts are derived from other
major corpora (such as TLG or ICAME); 20% come from typesetting
tapes, 18% are deposited by individual researchers, 16% from the
OUCS KDEM service and 16% from other Archives and facilities such
as the OUCS Lasercomp.

Considerable effort has been put into investigating and improving the
bibliographical information held about the texts, which was found
to be highly deficient as a result of the particular
accessions policy of the Archive and the limited resources with
which it operates. Source editions have now been identified for about 50%
of the texts, compared with approximately 6% known at the start of the
project. Pilot studies were also carried out to assess the accuracy of
the holdings. In one such study, samples taken from the
beginning, middle and end of texts prepared on the KDEM
were checked: 31% had 0 errors; 46% had 1-3 errors; 19% had 4-20
errors; 4% had over 20 errors. Initial investigation of the
tagging schemes employed in the Archive's holdings has highlighted the
very varied nature of these schemes and the lack of adequate
documentation and standards to describe them. In a sample of
100 texts (excluding any from established corpora), 16
had no markup at all, 1 had fixed column refs, 6 had typesetting
codes, and the rest (77) had either a few COCOA tags or embedded
special characters to mark various features. In the last case, there
was no clear distinction made between coding used to mark special
characters, variants, editorial comment, presentational features,
structural features etc. The report stresses
the importance of the success of the Text Encoding Initiative in
this connexion.

Section 3: Usage of the Archive

Usage of the texts has steadily increased, from 28 orders p.a. in
1981 to 92 in 1988. (It should also be noted that the number of
texts per order varies greatly.) Text Archive users are
geographically widespread (21 countries, spanning N. America,
Israel, Japan and Australia as well as Europe). In 1988, for the
first time, US orders equalled those received from the UK.

To assess usage in more detail, a questionnaire was sent to all
who had placed orders since 1980 (about 400), generating 91 replies
(about 25%). 65% of these had used texts successfully in a research
project, the chief areas being in linguistics, lexical research,
computer aided language instruction, teaching of quantitative
methods, literary research, computational linguistics and
hardcore computer science. Details of many of the projects are
given in the report, which also includes a bibliography of
resulting publications. In some cases, OTA texts had been
successfully combined with texts from other sources. However,
many project descriptions given were rather vague and not
primarily research oriented.

35% of respondents had been unable to use the texts supplied. The
reasons varied (technical difficulty, wrong texts, insufficient
documentation to understand the text, text too inaccurate), but
most had simply not had enough time to work on the texts owing to
other commitments.

Over 300 enquiries from people who didn't subsequently use the
Archive were also analysed. Many of these were general enquiries
only, and many others asked for specific texts known not to be
available or in formats not provided. A follow-up questionnaire
addressed to these enquirers had a poorer response rate (18%). About
25% of those who replied had abandoned their project after contacting the
Archive, but nearly 30% had gone on to do research using texts
from elsewhere. 17% had done the research without using machine
readable texts at all.

Section 4: Copyright

Section 4 of the report discusses in considerable detail the
legal problems of copyright with reference to machine-readable
texts. It relates only to British law, under which a literary
work has artistic copyright for 50 years after the author's death,
and a publisher has 25 years' copyright in the typographic
arrangement of a publication.

During the 1980s there has been growing concern over the
applicability of the 1956 Act to electronic rights. Its 'Fair
dealing' section allows use of literary works for purposes of
research or private study: this is what underlies the Archive's
current Conditions of Use agreement. But increasingly texts are
being used for purposes barely describable as private study or
research, for example in teaching packages or the creation of new
critical editions. At the same time there is a growing awareness
among publishers of the potential market for electronic texts.
(Examples discussed include ETC, NeXT, OEP, and the reuse of
typesetting tapes being archived by Knowledge Warehouse).

The 1988 Copyright Act is due to become law in April 1989. It
introduces a new restriction into the fair dealing section which
specifically prohibits copying of copyright literary works for
use by several people at the same time for the same purpose ('copying'
is explicitly defined in the Act as including storage in electronic
form). This obviously has important consequences for the Oxford
Text Archive and many similar institutions. The clear definition of
copying given in the Act also means that for any work which is not strictly
private scholarly research, of which, as the report shows, there is
an ever-increasing amount, permission must be sought
from the copyright owner or his licensee before the text may be put into
machine-readable form.

As far as public domain texts are concerned, a publisher only has
copyright in the typographical arrangement of the text, that is its
physical appearance. As the copyright law in such matters prohibits
the making of a 'facsimile copy' of that physical appearance, the
making of a machine-readable version of such a text does not constitute
a contravention, nor does displaying it on a screen or even printing it
as long as the original appearance is not reproduced. The
copying of electronic material does however constitute a restricted act
under the 1988 Act, as a 'facsimile copy' results.

The Archive is, regrettably, not usually involved at the time
that texts are created, which is when proper licensing should
be arranged. A laissez-faire attitude has prevailed, so
that in many cases it is not even known who converted a text, let
alone whether they had permission to do so.

The originals of texts in the Archive span the full range of
possibilities: unpublished mss, copyright or out-of-copyright
editions of public domain works, out-of-copyright editions of
copyright works, and copyright editions of copyright works.
Fortunately, only 25% of titles are copyright works. Some of
these were deposited by their author. Of the rest (about 200
titles), in only a quarter of cases is it known that the
creator of the electronic text received permission to make it,
and that permission did not always extend to depositing it with
us, nor did it necessarily come from the actual owner of the
machine-readable rights. Considerable
investigation is needed to determine the status of these texts,
which should strictly be withdrawn from the Archive when the new
Act becomes law, until their status has been determined.

As regards commercial distribution of texts, the Report suggests
that the Archive could either continue to refuse to provide texts
for commercial purposes or seek appropriate licensing
arrangements with electronic copyright owners, i.e. usually the
depositors. These might include e.g. the payment of a royalty on
commercial applications of the text.

The use of texts for teaching purposes is also discussed in the
section on copyright. The use of short extracts of copyright works
for teaching purposes is no longer allowed under the new Act, unless
permission has been obtained from the copyright owner.

The developing relationship between researcher,
publisher and author is investigated. It is important that authors realise
the range of rights at their disposal before handing them all
over unconditionally to book publishers; that publishers realise that
electronic publication can stimulate book sales; and that researchers
and electronic publishers discuss and co-operate with one another
to serve the needs of academics.

Section 5: Conclusions and recommendations

The main areas where improvement is needed all relate to
distribution of material. As a deposit agency, the Archive continues to
fulfil a unique and much appreciated role. The report suggests
that the rate of technological change makes this archival function
increasingly important.

As a distributor of texts, however, the Archive's procedures and
standards need considerable revision in light of the changing
expectations of the community it serves. A discussion of the
problems of funding these improvements draws attention to the
need for external involvement, and the importance of
collaboration with national and international ventures referred to
in the Report.

The Report recommends action in the following areas:
- documentation of existing holdings
- resolution of the copyright position of current holdings
- proof reading and correction
- formulation of standard encoding guidelines and conversion of
holdings to them
- extension and consolidation of the range of holdings

Of these, the copyright question is regarded as having the
highest priority. This should be followed by a cost/benefit
evaluation of improving the status of the holdings in different
topic areas. The report concludes by stressing again the
importance of international co-operation in disseminating
information about the holdings of other archives and establishing
a common code of practice.
-----------------------------------

LB/JKP 14 mar 89