11.0538 encoding Asian (CJKV) languages

Humanist Discussion Group (humanist@kcl.ac.uk)
Tue, 27 Jan 1998 20:02:51 +0000 (GMT)

Humanist Discussion Group, Vol. 11, No. 538.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>

[1] From: Christian Wittern <cwitter@gwdg.de> (38)
Subject: Re: 11.0530 Japanese & Chinese encoding & texts

[2] From: "David L. Gants" <dgants@english.uga.edu> (14)
Subject: Re: 11.0530 Japanese & Chinese encoding & texts

--[1]------------------------------------------------------------------
Date: Sun, 25 Jan 1998 22:24:53 +0000
From: Christian Wittern <cwitter@gwdg.de>
Subject: Re: 11.0530 Japanese & Chinese encoding & texts

Peter Evans writes:
> >> From: Peter Evans <peterev@alles.or.jp>
>
> In Vol. 11 [does anyone bind them?] No. 526, Steve McCarty writes that
>
> >Because Chinese characters can exceed 20 strokes, they each
> >require twice as many bytes as plain ASCII text.
>
> Er, no. Because there are a lot more than 256-minus-32 of them, eight bits
> don't provide enough permutations to go around. But sixteen do.

Kind of. How many Chinese characters do you think are out there?
Nobody has finished counting yet, but the largest coded character set
I know of encodes about 75,000 characters, which is more than the
65,536 values that sixteen bits can distinguish.
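
The arithmetic behind this exchange can be checked with a short Python
sketch. The helper name bytes_needed is made up for illustration, and the
75,000 figure is simply the rough count quoted in this message, not an
exact size for any particular character set.

```python
# Number of distinct values available at each fixed width.
ONE_BYTE = 2 ** 8      # 256
TWO_BYTES = 2 ** 16    # 65536

# Rough figure quoted in the message; an approximation, not an exact count.
LARGEST_KNOWN_SET = 75_000


def bytes_needed(count: int) -> int:
    """Smallest whole number of bytes that can number `count` distinct items."""
    width = 1
    while 256 ** width < count:
        width += 1
    return width


print(bytes_needed(128))                # ASCII fits in one byte
print(bytes_needed(TWO_BYTES))          # exactly fills two bytes
print(bytes_needed(LARGEST_KNOWN_SET))  # no longer fits in two bytes
```

This only counts values per fixed-width unit. Real encodings reserve some
byte values for control codes and lead bytes, so their usable ranges are
smaller.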
>
> >Therefore from the viewpoint of Japanese word-processing, where English
> >is a subset of Japanese, ASCII letters take half a space.
>
> Because ASCII is a subset of the various kinds of JIS character set, the
> letters represented by ASCII take two bytes: either the combination of a
> null byte and the ASCII byte, or as "full-width" roman letters, and for all
> I know in some other way too. Thus if you use Buerg's LIST or similar to
> look into a file with only English text (no "foreign" accented characters,
> let alone Japanese) created with the Japanese edition of Word for Windows,
> you'll immediately notice that the text is s p a c e d by nulls.

The behaviour you are describing here is a property of Japanese Word
rather than of Japanese encoding in general. Full-width Roman letters
are encoded in Japanese with a # mark in between, like #T#h#i#s# #i#s#
#f#u#l#l#w#i#d#t#h.
To use "ASCII" for the 256 code points that can be addressed by one
byte is rather misleading, since ASCII encodes only 128 code points.
In ShiftJis, which is the most commonly used encoding on small
computers' operating systems (DOS, MacOS, Windows), the ASCII range is
actually mapped to itself with no spaces or hash marks in between, so
English written in ShiftJis looks exactly like ASCII English.
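
The byte patterns discussed above can be looked at directly with Python's
standard codecs; the following is a sketch under some assumptions. It uses
the ISO-2022-JP codec to stand in for the 7-bit JIS form in which the # mark
shows up, and it uses UTF-16 to imitate the null-padded text reported for
Japanese Word. Neither choice is stated in the messages themselves, and the
variable names are illustrative only.

```python
# The same word in plain ASCII letters and in full-width Roman letters.
ascii_text = "This"
fullwidth_text = "\uff34\uff48\uff49\uff53"  # full-width T, h, i, s

# Shift JIS maps the ASCII letters to themselves: one byte each, no padding.
sjis_ascii = ascii_text.encode("shift_jis")
print(sjis_ascii)

# Full-width letters are two-byte characters in Shift JIS.
sjis_full = fullwidth_text.encode("shift_jis")
print(len(sjis_full))

# In 7-bit JIS (ISO-2022-JP, an assumption here) each full-width letter is
# the byte 0x23 ('#') followed by the byte of the matching ASCII letter,
# with escape sequences around the two-byte run.
jis_full = fullwidth_text.encode("iso2022_jp")
print(jis_full)

# Assumption: a 16-bit Unicode storage form, in which every ASCII letter is
# paired with a null byte, giving the "s p a c e d" look in a byte viewer.
utf16_ascii = ascii_text.encode("utf-16-le")
print(utf16_ascii)
```

Viewed in a plain byte viewer, only the first result is indistinguishable
from an ordinary ASCII file.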

All the best, Christian Wittern

Christian Wittern, University of Goettingen
Visit the Database of Chinese Buddhist texts at http://www.gwdg.de/~cwitter

--[2]------------------------------------------------------------------
Date: Tue, 27 Jan 1998 14:39:51 -0500 (EST)
From: "David L. Gants" <dgants@english.uga.edu>
Subject: Re: 11.0530 Japanese & Chinese encoding & texts

>> From: Peter Evans <peterev@alles.or.jp>

A correction and a tip.

I mentioned the forthcoming book "Understanding CJKV Language Processing"
["CJKV" meaning Chinese, Japanese, Korean, Vietnamese]. Sorry, that should
have been *Understanding CJKV Information Processing*.

The good news is that the nitty-gritty (and it's gritty indeed) is amassed
in a wonderful file of plain ASCII (181 kB uncompressed) that is free for
the downloading:
<ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf>. Despite first
appearances, this also deals with Vietnamese and Mongolian. For other
material on and resources for the encoding and processing of east Asian
languages, peruse and pursue the links from
<http://www.ora.com/people/authors/lunde/cjk_inf.html>.
:::::::::::::::::::::::::::::::::
Peter Evans <peterev@alles.or.jp>
