coding strange languages, cont. (184)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Thu, 13 Apr 89 19:08:33 EDT


Humanist Mailing List, Vol. 2, No. 833. Thursday, 13 Apr 1989.


(1) Date: Wed, 12 Apr 89 21:45:18 -0400 (35 lines)
From: jonathan@eleazar.Dartmouth.EDU (Jonathan Altman)
Subject: coding for sanskrit, greek, etc.

(2) Date: Thu, 13 Apr 89 01:48:45 EDT (10 lines)
From: cbf%faulhaber.Berkeley.EDU@jade.berkeley.edu (Charles Faulhaber)
Subject: Re: coding "strange" languages (37)

(3) Date: Thu, 13 Apr 89 10:17 (84 lines)
From: Wujastyk (on GEC 4190 Rim-C at UCL) <UCGADKW@EUCLID.UCL.AC.UK>
Subject: Pali text archive

(4) Date: Thu, 13 Apr 89 12:39:57 EDT (25 lines)
From: elli@harvunxw.BITNET (Elli Mylonas)
Subject: coding for strange languages (21)

(1) --------------------------------------------------------------------
Date: Wed, 12 Apr 89 21:45:18 -0400
From: jonathan@eleazar.Dartmouth.EDU (Jonathan Altman)
Subject: coding for sanskrit, greek, etc.

There seems to be much discussion about the inability to agree on
standards and the problems with incompatible data formats that
result. I have a question which may sound stupid or simplistic,
but I believe it is not: who cares about formats, exactly? I do not,
for one. Since the hope of standardizing on one format seems very
slim, I would rather discuss how best to convert between
formats.

Given the correct computer tools, I can convert most formats into any
other format. Most standard Unix OS versions, and especially those
derived from the Berkeley Software Distribution, provide me with the
tools to do many different kinds of character manipulation. For
example, accented characters stored in different parts of the high
ASCII range (codes 129-256) are relatively easy to convert. I can
even, using the "od" program (octal dump), reverse-engineer the ASCII
codes from a comparison with printed copy. The large hitch is in
"proprietary formats." As long as you can understand the design of
the format (and any ASCII-character file should fit that
description), you can just do translations.

Our Dante database is housed on a VAX running Unix, and I have yet to
be sent data in a format that we couldn't handle (including IBM disks
which had text in all upper case and used different accenting
symbols), although some things are harder than others. This on a
mainframe that doesn't even HAVE 256-character ASCII: we only have
128. In addition, we have created data correctly encoded for Macs,
PCs, DEC Rainbows, and IBM mainframes. What could be improved are the
tools for doing this, mostly by making them easier to use. Having
said all that, I'm impressed at how much of a Unix snob I've managed
to become in just a few short years.
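
A minimal sketch, in Python, of the kind of byte-level remapping
described above; the code points in the table are invented for
illustration, and in practice would be recovered with od against a
printed copy, exactly as described:

    # Remap "high ASCII" accent codes from one scheme into a 7-bit
    # digraph scheme.  The code points below are hypothetical; recover
    # the real ones by dumping a sample file (e.g. "od -c") and
    # comparing it with a printed copy.
    SOURCE_TO_TARGET = {
        0xE9: b"e'",   # hypothetical source code for e-acute
        0xE8: b"e`",   # hypothetical source code for e-grave
        0xF2: b"o`",   # hypothetical source code for o-grave
    }

    def translate(data: bytes) -> bytes:
        """Expand mapped codes; pass plain 7-bit bytes through."""
        out = bytearray()
        for byte in data:
            out.extend(SOURCE_TO_TARGET.get(byte, bytes([byte])))
        return bytes(out)

    sample = b"perch" + bytes([0xE9])   # e-acute in the hypothetical scheme
    print(translate(sample))            # b"perche'"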

Jonathan Altman
(2) --------------------------------------------------------------21----
Date: Thu, 13 Apr 89 01:48:45 EDT
From: cbf%faulhaber.Berkeley.EDU@jade.berkeley.edu (Charles Faulhaber)
Subject: Re: coding "strange" languages (37)

I would hope that the Text Encoding Initiative would be looking
at funny character sets as they set about their endeavors. And what
is the ISO doing? This looks like a good candidate for a session
in Toronto.

Charles Faulhaber
(3) --------------------------------------------------------------89----
Date: Thu, 13 Apr 89 10:17
From: Wujastyk (on GEC 4190 Rim-C at UCL) <UCGADKW@EUCLID.UCL.AC.UK>
Subject: Pali text archive

Dear Mathieu,

I did not altogether realize that what you have in mind is the
creation of a large Pali text archive, not just a couple of volumes
for your own private study. This changes things a bit. If you
are to undertake such a task, it is important that you should
work as far as possible in the full knowledge of the current
developments affecting text archive creation. You are not alone
in doing this kind of work: many scholars in different fields are
busy creating data banks of material in different languages.
I wonder if you are aware of the recent formation of the Text
Encoding Initiative? If not, you *should* find out about this.
Join the Association for Computers and the Humanities, etc.
Talk to the other HUMANISTS at Toronto. You probably have already,
but some of what you say makes me think that you are about to
undertake a large project with some important sources of help
and advice perhaps still unexplored.

The Text Encoding Initiative has yet to make its recommendations, but
some elements of what will be said are at least adumbrated. You should
definitely try to use a coding scheme that is compatible with SGML,
for instance. I know that an SGML "header" can be written to
accommodate the coding scheme used, but I am not certain whether
an SGML document can legitimately contain characters from an 8-bit
character set. If not, then (since the TEI has not reported yet,
and an SGML scheme is hard to implement from scratch) you would
probably not go far wrong if you used the coding scheme of Plain TeX
for the accented characters of your text (see chapter 9 of The TeXbook,
by Don Knuth). This scheme is unambiguous, clearly documented, and 7-bit:
three strong advantages. It would also allow you easily to format and
print your text extremely clearly, for optimal ease in proofreading.
(Of course you could use any coding you like for your own private
data input.)
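
A rough sketch, in Python, of such a mapping for the common Pali
diacritics; the left-hand characters are written with composed
diacritics purely for readability, and the TeX macros are those
listed in chapter 9 of The TeXbook:

    # Re-encode transliterated Pali into 7-bit Plain TeX accent macros.
    # Illustrative only: how the diacritics are keyed at data-entry
    # time is a separate (and private) decision, as noted above.
    PALI_TO_TEX = {
        "ā": r"\={a}", "ī": r"\={\i}", "ū": r"\={u}",   # long vowels
        "ṭ": r"\d{t}", "ḍ": r"\d{d}",  "ṇ": r"\d{n}",   # retroflexes
        "ṃ": r"\d{m}", "ḷ": r"\d{l}",                   # anusvara, retroflex l
        "ñ": r"\~{n}", "ṅ": r"\.{n}",                   # palatal, velar nasals
    }

    def to_tex(text: str) -> str:
        return "".join(PALI_TO_TEX.get(ch, ch) for ch in text)

    print(to_tex("bhikkhū bhāsati: ñāṇaṃ"))
    # -> bhikkh\={u} bh\={a}sati: \~{n}\={a}\d{n}a\d{m}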

On a more personal note, if you are working for a doctorate, I would
not recommend that you spend a lot of time typing Pali texts. If, as
you say, you have done three volumes already, that is quite enough to
begin to get substantial results from linguistic computing. Stop
typing! Learn Icon (or get MicroOCP or Wordcruncher, or whatever),
and produce some concrete results concerning the linguistic nature
of the texts. If you spend two years typing more texts in, a) no
one will thank you, b) someone else will do it again, faster and better,
ignoring what you have already done, c) you will be no closer to
your academic goals, d) you will be a good typist, granted.
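
To give one example of such a concrete result: the simplest is a
word-frequency table drawn from the volumes already typed. Icon,
MicroOCP, or WordCruncher would be the natural tools for this; the
idea, in a rough Python sketch with a hypothetical file name, is
simply:

    # Count word frequencies in a transliterated text already keyed in.
    # "digha-nikaya-1.txt" is a hypothetical file name.
    import re
    from collections import Counter

    def word_frequencies(path: str) -> Counter:
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                # crude tokenization: runs of letters, diacritics included
                counts.update(w.lower()
                              for w in re.findall(r"[^\W\d_]+", line))
        return counts

    for word, n in word_frequencies("digha-nikaya-1.txt").most_common(20):
        print(f"{n:6d}  {word}")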

If the Pali texts are to become available on disk, then I am
convinced that it must be as a result of an initiative of some weight,
with serious money and institutional backing. And a well-operated
KDEM (Kurzweil Data Entry Machine)
would be far more efficient than you alone typing. There are few enough
South Asian texts printed in roman transliteration in good clear editions;
it seems obtuse not to take full advantage of modern scanning methods for
those texts that lend themselves to this approach. I have long thought
of the PTS edition of the Pali canon as a perfect candidate for KDEM
scanning. It's just that the PTS committee has so far lacked the will
to initiate the job, or to license others to do it.

If the Bangkok digital version of the canon is not of a high standard (I have
no knowledge whatever of the results of this project), then perhaps
a more fruitful approach would be to scan the PTS editions by KDEM, and
then run them through collating software against the Bangkok data files.
I don't know how different the recensions are, but their differences
might emerge as an interesting by-product of such a comparison. It
would also be
instructive to find out why, exactly, the Bangkok data bank is
inadequate. After all, they did the task in a manner identical to the
one you propose to use.

If the data is to be typed, then (as with the TLG project) everything
must be typed twice, by independent typists. Then parsing software
must be written to compare the two versions, and to spot gross errors.
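
A sketch, in Python, of what that comparison amounts to; the same
approach would serve for collating a scanned PTS text against the
Bangkok files. The file names here are hypothetical:

    # Compare two independently typed transcriptions line by line and
    # print every divergence for a human to adjudicate against the
    # printed edition.  File names are hypothetical.
    import difflib

    def report_divergences(path_a: str, path_b: str) -> int:
        with open(path_a, encoding="utf-8") as fa, \
             open(path_b, encoding="utf-8") as fb:
            lines_a, lines_b = fa.readlines(), fb.readlines()
        count = 0
        for line in difflib.unified_diff(lines_a, lines_b,
                                         fromfile=path_a,
                                         tofile=path_b, lineterm=""):
            if (line.startswith(("+", "-"))
                    and not line.startswith(("+++", "---"))):
                count += 1
                print(line.rstrip())
        return count

    n = report_divergences("typist-one.txt", "typist-two.txt")
    print(n, "divergent lines to check")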

This sort of project is not to be undertaken lightly, if it is to be
of long term value.

I am sorry if I sound a bit stern, but I am worried that you are
about to spend a lot of your time doing something that will not serve
your own, or others', long-term aims.


Dominik

(4) --------------------------------------------------------------28----
Date: Thu, 13 Apr 89 12:39:57 EDT
From: elli@harvunxw.BITNET (Elli Mylonas)
Subject: coding for strange languages (21)

I come late to this discussion, but the comment about standards in general
got me interested. In dealing with Greek texts I work extensively
with beta code and (even worse) with its precursor, alpha code.

The goal in developing a standard is not so much to create something
that will work on any machine at any time, so that a character set
can be read by every word processor. The goal is to create a standard
that all (ha!) software can translate into and out of. If you ask the
software builders, you will see that they each have their proprietary
format, which allows them to do whatever they do as well as they do it.
And they are extremely unwilling to switch.
What has to happen is that people have to create products that acknowledge
the existence of a) some standard and b) other products.
Beta code, ugly though it may be, *is* a de facto standard. It is
also capable of handling a lot of strange cases, and of being
expanded. So what we need are systems that can read beta code in
and, when moving to some other system, write it out (in the case of
Greek).
This will enable not only CGA- and VGA-compatible machines to read
Greek, but also Macs, Unix boxes, and whatever comes along in the
future that is bigger and better than what we have now.
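
For concreteness, a toy Python sketch of translating out of beta code
into Greek letters; only a fragment of the letter table is shown, and
accents, breathings, and capitals are left out entirely:

    # Translate (a fragment of) beta code into Greek characters, with
    # final sigma handled at word end.  Accents, breathings, capitals,
    # and the rest of the letter table are omitted.
    BETA_TO_GREEK = {
        "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "h": "η",
        "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν",
        "o": "ο", "p": "π", "r": "ρ", "s": "σ", "t": "τ", "u": "υ",
        "w": "ω",
    }

    def beta_to_greek(text: str) -> str:
        out = []
        for i, ch in enumerate(text.lower()):
            greek = BETA_TO_GREEK.get(ch, ch)
            if ch == "s" and (i + 1 == len(text)
                              or not text[i + 1].isalpha()):
                greek = "ς"   # final sigma at word end
            out.append(greek)
        return "".join(out)

    print(beta_to_greek("logos"))   # -> λογος (unaccented)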