[tei-council] classy measurements

Sun Nov 20 20:14:17 EST 2005

Syd Bauman <Syd_Bauman at Brown.edu> writes:

> unit symbols
> ---- -------
> First, there is the question as to what to recommend people use as
> the symbol for a variety of common units. In general, our goal is to
> depend on outside standards wherever reasonable. Symbols for units
> seems like a great place for TEI to say only what's necessary, and
> point to other standards for the rest. The problem is, there are lots
> of standards out there, and they often disagree, usually subtly. At a
> minimum, there's
>
[...]

> Several of these were invented without consideration of machine
> interchange of data; several were invented specifically to address
> machine interchange of data *using limited character sets*. But none
> of them seem to address how to perform machine interchange of data
> *using Unicode* (even Unicode itself; I can pretty easily find the
> code-points and names of various characters, but not much about their
> semantics and intended use).
>
> This is a bit of a problem, because we (alright, I) would like some
> guidance on which, if any, of the Unicode characters for unit symbols
> should be used. E.g., Unicode seems to suggest that the double-prime
> character (U+2033) be used for inches or seconds, and there are
> characters for degrees Celsius, ounces, and angstroms. Also there are
> a lot of characters that look like they are designed for specific
> unit use, but I haven't been able to find out for sure. The names are
> a bit odd, and they're in the CJK compatibility block. E.g., U+3392
> looks like it would be used for megahertz. (See, e.g.,
> http://www.fileformat.info/info/unicode/char/3392/index.htm)

As a general principle, compatibility characters are in Unicode for
compatibility with pre-existing standards, so that text in those
other encodings can be converted back and forth to Unicode.  They
should *never* be used in text that started life in Unicode.  For
that reason, we should not recommend the use of any of these at all.

Now with respect to the problem you have here, that is, assigning
single codepoints to units of measurement, this is something the
Unicode Consortium deliberately avoided, except for these
compatibility characters.  Therefore you will find for each of these
compatibility characters a mapping to a codepoint sequence that should
be used in lieu of the single codepoint.  In the case you cite,
U+3392, this is the sequence U+004D (M), U+0048 (H), U+007A (z),
which expands to "MHz".
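
To make the mapping concrete, here is a minimal sketch in Python,
assuming the standard unicodedata module is an acceptable stand-in for
looking the values up in the Unicode character database by hand.  The
helper name expand_compat is mine, purely for illustration; NFKC
normalization is what applies the compatibility mapping.

```python
import unicodedata


def expand_compat(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility
    characters with the codepoint sequences they map to."""
    return unicodedata.normalize("NFKC", text)


square_mhz = "\u3392"

# Character name and raw compatibility mapping from the Unicode database.
print(unicodedata.name(square_mhz))
print(unicodedata.decomposition(square_mhz))

# The single compatibility codepoint becomes the three letters M, H, z.
print(expand_compat("100 \u3392"))
```

Note that NFKC rewrites every compatibility character in the string,
not only unit symbols, so it is a blunt tool for anything other than
checking what a given character maps to.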

I think this was a sensible decision, and I am very glad the UTC
avoided the mistake some other standards bodies made of assigning
codepoints to standard units; it makes our life much easier in this
respect.  It would therefore be enough for us to recommend the
SI-endorsed abbreviation of the unit, IMHO.

> Furthermore, this specification is very detailed. E.g., in some cases
> it forces differentiation of which standard a unit comes from. So,
> e.g., there are three different symbols for "inch":
>
>  [in_i] = inch, international = 2.54 cm
>  [in_us] = the inch as used in USA from 1893 to 1959 = m/39.37
>  [in_br] = the British imperial inch = 2.539998 cm
>
> Not only are these symbols ugly and cumbersome, I think the *vast*
> majority of TEI users do not care which inch they're using. So to be
> forced to pick one would be an annoying burden.
>
>
> Unless someone knows more about this or has a better idea, I'm going
> to suggest that for now the documentation for this class should list
> some of the most common symbols in the "suggested values include"
> list, and suggest users refer to a standard, and list a bunch of
> possible standards (perhaps in the bibliography).

If you look at the Unicode codepoint range U+3380 to U+33DF you will
see a pretty long list -- do you really want to include them all?  I
think it would be better to just state the general principle in the
tagdoc and point to a list where these codes are enumerated.
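
For anyone who wants to see how long that list is without paging
through the code charts, here is a rough sketch, again assuming
Python's unicodedata module.  The function name and the default range
are my own choices for illustration; the filter on the "<square>" tag
is an assumption about how these characters are marked in the Unicode
database.

```python
import unicodedata


def square_unit_expansions(first=0x3380, last=0x33DF):
    """Map each squared compatibility character in the given range
    to the plain sequence it expands to under NFKD."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Keep only characters whose mapping carries the <square> tag.
        if unicodedata.decomposition(ch).startswith("<square>"):
            table[ch] = unicodedata.normalize("NFKD", ch)
    return table


if __name__ == "__main__":
    for ch, expansion in square_unit_expansions().items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')} -> {expansion}")
```

A table generated this way could be pointed to from the tagdoc rather
than copied into it.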

>
> Later, after we've finished straightening out the classes, I think 2
> or 3 of us should come up with a concrete proposal for how to encode
> units in a Unicode environment. How to put such a proposal into a
> tagdoc file is another problem entirely (which I foreshadowed with
> "problems with our declaration system", above :-), because the values
> of ident= of <valItem> need to be xsd:Names. This suggestion defers
> having to deal with this, too.

See above.  None of our business, luckily.

Christian

-- 

 Christian Wittern 
 Institute for Research in Humanities, Kyoto University
 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN