Sanskrit coding (201)

Willard McCarty (MCCARTY@VM.EPAS.UTORONTO.CA)
Tue, 11 Apr 89 20:17:08 EDT

Next message: Willard McCarty: "humanities computing centres, cont. (32)"
Previous message: Willard McCarty: "electronic communications (44)"

Humanist Mailing List, Vol. 2, No. 820. Tuesday, 11 Apr 1989.

Date: Tue, 11 Apr 89 14:11
From: Wujastyk (on GEC 4190 Rim-D at UCL) <UCGADKW@EUCLID.UCL.AC.UK>
Subject: Sanskrit coding scheme

Mathieu Boisvert has raised the very important issue of standardization
of character sets--in this case, for Indian languages in scholarly
transliteration. Mathieu sent an almost identical note to me, and
to Humanist. Here is my reply; as I say at the end of the letter,
I think this discussion would benefit from being carried out on
HUMANIST. I am perfectly prepared to be shouted down, though.
----------------------------
Dear Mathieu,

Thank you for writing to me about the problem of Sanskrit/Pali
character set encoding.

Here, following your layout, is the scheme I have been using:

First, for capital letters:

N with a dot under (retroflex)            char(225)
N with tilde (palatal)                    char(165) (ASCII)
N with a dot above (velar)                char(227)
S with a dot under (retroflex)            char(234)
S with a slash above (palatal)            char(242)
R with a dot under (short r)              char(230)

A long                                    char(228)
I long                                    char(235)
U long                                    char(231)
T with a dot under (retro)                char(233)
D with a dot under (retro)                char(237)
H with a dot under                        char(253)

Lower case letters:

a long                                    char(224)
i long                                    char(229)
u long                                    char(245)
t with a dot under (retro)                char(244)
d with a dot under (retro)                char(238)
n with a dot under (retro)                char(232)
s with a dot under (retro)                char(241)
n with tilde (palatal)                    char(164) (ASCII)
s with slash above (palatal)              char(239)
n with dot above (velar)                  char(243)
r with dot under (short)                  char(240)
r with dot under and dash above (long)    char(246)
h with dot under                          char(247)
l with dot under                          char(157)
   (the retroflex liquid vowel)
m with dot above (anusvara)               char(236)

a with circumflex                         char(131) (ASCII)
i with circumflex                         char(140) (ASCII)
u with circumflex                         char(150) (ASCII)

The entries flagged "(ASCII)" are, of course, unchanged from the
layout of the IBM Extended ASCII scheme, i.e., code page 437. I
am curious about your reasons for creating a second set of lower
case circumflexed vowels, and nasals with a tilde.
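
To show how the table above can be put to work, here is a minimal
sketch in Python that decodes bytes in this scheme into Unicode. The
combining-mark spellings, the reading of "slash above" as an acute
accent, and the names SCHEME and decode are assumptions of the sketch,
not part of the scheme itself. Positions not listed in the table are
decoded as plain code page 437, which also covers the entries flagged
"(ASCII)".

```python
import unicodedata

# Combining marks used to spell the accented letters (an assumption of
# this sketch; "slash above" is read here as an acute accent).
DOT_BELOW, DOT_ABOVE, MACRON, ACUTE = "\u0323", "\u0307", "\u0304", "\u0301"

# Code positions copied from the table in this letter.
SCHEME = {
    225: "N" + DOT_BELOW, 227: "N" + DOT_ABOVE,
    234: "S" + DOT_BELOW, 242: "S" + ACUTE,
    230: "R" + DOT_BELOW,
    228: "A" + MACRON, 235: "I" + MACRON, 231: "U" + MACRON,
    233: "T" + DOT_BELOW, 237: "D" + DOT_BELOW, 253: "H" + DOT_BELOW,
    224: "a" + MACRON, 229: "i" + MACRON, 245: "u" + MACRON,
    244: "t" + DOT_BELOW, 238: "d" + DOT_BELOW, 232: "n" + DOT_BELOW,
    241: "s" + DOT_BELOW, 239: "s" + ACUTE, 243: "n" + DOT_ABOVE,
    240: "r" + DOT_BELOW, 246: "r" + DOT_BELOW + MACRON,
    247: "h" + DOT_BELOW, 157: "l" + DOT_BELOW, 236: "m" + DOT_ABOVE,
}


def decode(data: bytes) -> str:
    """Decode bytes in the scheme above; other bytes fall back to cp437."""
    pieces = []
    for b in data:
        if b in SCHEME:
            pieces.append(SCHEME[b])
        else:
            pieces.append(bytes([b]).decode("cp437"))
    # NFC folds base letter + combining mark into one code point where
    # Unicode provides a precomposed form.
    return unicodedata.normalize("NFC", "".join(pieces))


print(decode(b"\xef\xe0stra"))  # śāstra
```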

I would like to hear your reasons for choosing the codings you
have.

Here are the reasons, such as they are, for the scheme I use:

1/ When I was setting up my character set, I was working with a
Toshiba P321SL printer, which has a limited number of
characters available above 127 for downloaded fonts. So I
decided not to try to implement the full range of accented
characters needed, for example, for Hindi or accented Vedic
texts. Gary Tubb has done a much fuller set that includes
these extra characters (email: gat@harvunxw.bitnet).

2/ I decided to avoid, if possible, usurping the graphic
characters, i.e., chars 176--223. A growing number of
software packages use these characters to provide character
based graphical and windowing interfaces, and I thought that
it would be wise to avoid them.

3/ I was also very keen to keep the French and German characters,
and in general all the characters from 128--156.

4/ Finally, I took a look at the IBM code page swapping system
that was released with DOS version 3.3. I have not read any
technical documentation about this, and I would like to. But
I have had a hard look at the code page layouts 863 (Canada-
French), 865 (Norway), 850 (Multilingual) and 860 (Portugal),
and have tried to divine what they are doing. It seems to me
that IBM feels free to change anything they want, above 127.
But they seem to prefer to change chars 128--175 first, and
only then 176--254. Code page 850 is the one with the most
radical changes, and even here the majority of the graphics
characters in the range 176--223 are left alone. The ones
that are changed are 181--184, 189--190, 198--199, 207--216,
221--222. This set, you will at once see, covers all of the
graphical characters consisting of a single line intersecting
at right angles with two lines. All the plain single
line characters, and the plain double line characters are left
alone.
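
The comparison described in point 4 can be repeated mechanically. The
sketch below assumes that Python's bundled cp437 and cp850 codecs
reflect the IBM layouts faithfully; the function name
changed_positions is made up for the example. It lists the positions
in a given range at which two code pages assign different characters.

```python
def changed_positions(old="cp437", new="cp850", lo=176, hi=223):
    """Return the code positions in lo..hi where the two pages differ."""
    changed = []
    for code in range(lo, hi + 1):
        byte = bytes([code])
        if byte.decode(old) != byte.decode(new):
            changed.append(code)
    return changed


# Positions in the graphics range that code page 850 reassigns.
print(changed_positions())

# The same check works for the other pages mentioned, e.g. Canada-French.
print(changed_positions(new="cp863"))
```

The output is left unstated here on purpose; the point is that the
list can be checked against the layouts rather than read off by eye.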

These reasons account for the character positions I decided to
use. As for the actual ordering of characters that I have used,
I simply have no recollection whatsoever of the reasons I had.
It all looks completely wacky to me now. If I were starting
again, I think I would at least put them in Sanskrit alphabetical
order. I do remember wondering if I could somehow place the
Sanskrit characters at positions that were 64 or 128 positions
displaced from their unaccented counterparts; this might have
helped with automatic sorting in Latin order, and might have kept
the text legible when the high bit was stripped, i.e., when reduced
to a 7-bit code. But I gave up on this.
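
The high-bit idea is easy to illustrate. In this minimal Python sketch
the function name strip_high_bit and the "displaced" position are
hypothetical; only char(224) for long a comes from the scheme listed
in this letter.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Reduce 8-bit text to 7 bits by clearing the top bit of each byte."""
    return bytes(b & 0x7F for b in data)


# Long a as actually placed in the scheme in this letter.
actual = bytes([224])

# Hypothetical alternative: plain 'a' (97) displaced by 128, i.e. 225.
displaced = bytes([ord("a") + 128])

print(strip_high_bit(actual))     # b'`'
print(strip_high_bit(displaced))  # b'a'
```

Only a displacement of 128 survives the stripping in this way. The
capitals A--Z displaced by 128 would land on 193--218, inside the
graphics range 176--223.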

What we *really* want is to set up an Indological code page.
Software that has built-in support for code page swapping (like
PC-Write 3.0, which supports 437 and 850) can make the use of a
code page much nicer, and can adjust such features as upper/lower
case conversion, searching, and sorting. But it is unlikely that
any software manufacturer would provide support for anything
unofficial. And it is impossible for an outsider to communicate
in any intelligent manner with IBM. So the situation is probably
that such a code page would be of use mainly for document exchange.
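
What code-page-aware case conversion would involve can be sketched in
a few lines of Python. The upper/lower pairs are read off the scheme
listed in this letter; the names UPPER_TO_LOWER, lower and upper are
invented for the example, and this is not how any existing package
does it.

```python
# Upper -> lower pairs taken from the scheme in this letter.
UPPER_TO_LOWER = {
    225: 232, 165: 164, 227: 243, 234: 241, 242: 239, 230: 240,
    228: 224, 235: 229, 231: 245, 233: 244, 237: 238, 253: 247,
}


def _table(pairs):
    """Build a 256-byte translation table; unlisted bytes map to themselves."""
    table = bytearray(range(256))
    for src, dst in pairs.items():
        table[src] = dst
    return bytes(table)


# Plain A-Z / a-z plus the accented pairs above.
ASCII_UPPER_TO_LOWER = {c: c + 32 for c in range(65, 91)}
TO_LOWER = _table({**ASCII_UPPER_TO_LOWER, **UPPER_TO_LOWER})
TO_UPPER = _table({v: k for k, v in
                   {**ASCII_UPPER_TO_LOWER, **UPPER_TO_LOWER}.items()})


def lower(data: bytes) -> bytes:
    return data.translate(TO_LOWER)


def upper(data: bytes) -> bytes:
    return data.translate(TO_UPPER)


# "ATMAN" with a long initial A (char 228) becomes lower case (char 224).
print(lower(bytes([228]) + b"TMAN") == bytes([224]) + b"tman")  # True
```

Letters that have only a lower case form in the scheme (the long r,
the l with dot under, the anusvara m) pass through upper() unchanged.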

As you say, it is perfectly easy, if tedious, to convert the
character set scheme of a document simply by using some form of
search and replace facility in a word processor. One can also
use a text filter, such as Ronald Gans's excellent Xword, to
perform a batch conversion, or even write a simple text filter in
Icon or any language (Corre's forthcoming book, "Icon Programming
for Humanists" has an example of exactly this).
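
A filter of the kind meant here might look like the following Python
sketch. The source positions (224, 229, 245) are the long vowels of
the scheme in this letter; the target positions are invented
placeholders and do not describe anybody's real scheme, and the names
REMAP, make_table and convert_stream are likewise hypothetical.

```python
import io

# Source position -> target position. The targets are placeholders.
REMAP = {224: 160, 229: 161, 245: 163}


def make_table(remap):
    """Build a byte translation table from a position-to-position map."""
    return bytes.maketrans(bytes(remap.keys()), bytes(remap.values()))


def convert_stream(src, dst, table, chunk_size=65536):
    """Copy src to dst, translating every byte through the table."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk.translate(table))


# Usage on in-memory data; real files opened in binary mode work the same.
source = io.BytesIO(bytes([224]) + b"tman")
target = io.BytesIO()
convert_stream(source, target, make_table(REMAP))
print(target.getvalue() == bytes([160]) + b"tman")  # True
```

Because every byte is translated in a single pass, two characters can
even swap positions safely, which a sequence of search-and-replace
operations cannot do without a temporary placeholder.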

To be honest, I think it is unlikely that an Indological code
page can be made to stick, unless it is backed by an influential
body such as the AOS, the RAS, or the DMG. There are many
Indologists who have already invested a lot of time and energy in
databases, text banks, and programs that rely on a particular
coding scheme. Gary Tubb, Ronald Emmerick, Paul Griffiths, Peter
Schreiner, Paul Kuepferle, K. R. Norman, yourself, myself, to
mention just a few. And these people will see little reason to
change unless there are substantial, guaranteed advantages.
Maybe I am being unduly pessimistic; I hope so. If only these
people were all available on the network, we might get somewhere.
But most of them are not, and there is little chance of them all
meeting in one place to thrash out the issues.

Wait a minute.

What about the next World Sanskrit Conference in Vienna? Perhaps
that is the answer. A workshop could be held in order to
establish a code page, and an agreement (if reached) coming from
such a meeting would carry the requisite weight. I'll give it
some thought.

On another matter, since you are converting lots of Pali texts
into digital form, I should like to make sure that you know of
the Kern Institute's plans for the same project. And are you
also aware that the job has already been done in Bangkok? Are
you entering text from the PTS editions? If so, you probably
shouldn't be, since it is my understanding that the PTS is at
present opposed to the conversion of their publications into
machine readable form. They are frightened--wrongly, I am
convinced--that their sales revenue from the books will drop if
the texts are available in electronic form. I have remonstrated
with individual members of the PTS committee about this.

But if you have gained their agreement, then the second issue is
whether the texts could not be more efficiently entered using a
Kurzweil data entry machine (KDEM). A KDEM would be perfectly
able to read the PTS editions. In fact a three-volume PTS text
(I'm afraid I can't remember which) was successfully scanned by
the Oxford KDEM only last year.

Finally, as long as you are on the network, I should like to
carry out this discussion through HUMANIST. I think that there
are several points here of general interest, and we might pick up
some helpful points from our Humanist colleagues. Is this all
right with you?

Yours,

Dominik

-------------------------------------------------------------------------------
Dominik Wujastyk, | Janet: wujastyk@uk.ac.ucl.euclid
Wellcome Institute for | Bitnet/Earn/Ean/Uucp: wujastyk@euclid.ucl.ac.uk
the History of Medicine, | Internet/Arpa/Csnet: dow@wjh12.harvard.edu
183 Euston Road, | or: wujastyk%euclid@nss.cs.ucl.ac.uk
London NW1 2BP, England. | Phone: London 387-4477 ext.3013
-------------------------------------------------------------------------------
