[tei-council] classy measurements

Syd Bauman Syd_Bauman at Brown.edu
Sat Nov 19 18:29:09 EST 2005


At the class meeting in Oxford we decided to create a new attribute
class "att.measurement" to have the attributes unit=, quantity=, and
commodity= for <measure>. This all seems like a very good idea on the
surface, but digging a bit deeper reveals places where our goals
compete with one another, and problems with our declaration system.


unit symbols
---- -------
First, there is the question as to what to recommend people use as
the symbol for a variety of common units. In general, our goal is to
depend on outside standards wherever reasonable. Symbols for units
seem like a great place for TEI to say only what's necessary, and
point to other standards for the rest. The problem is, there are lots
of standards out there, and they often disagree, usually subtly. At a
minimum, there's

* SI (also ISO 1000?), *the* international system of units, aka
  "metric system", by BIPM

* ISO 31, in particular ISO 31-0

* ISO 2955, about which I know very little except that some people
  think it's obsolete

* ANSI X3.50, the American extension of 2955, which also does not
  seem to be readily available 

* ASTM 1238 and its super-set HL7

* ENV 12435 (which I *think* is the European equivalent of the
  American HL7)

* UCUM, the Unified Code for Units of Measure

* Unicode / ISO 10646

Several of these were invented without consideration of machine
interchange of data; several were invented specifically to address
machine interchange of data *using limited character sets*. But none
of them seems to address how to perform machine interchange of data
*using Unicode* (not even Unicode itself: I can pretty easily find the
code-points and names of various characters, but not much about their
semantics and intended use).

This is a bit of a problem, because we (alright, I) would like some
guidance on which, if any, of the Unicode characters for unit symbols
should be used. E.g., Unicode seems to suggest that the double-prime
character (U+2033) be used for inches or seconds, and there are
characters for degrees Celsius, ounces, and angstroms. Also there are
a lot of characters that look like they are designed for specific
unit use, but I haven't been able to find out for sure. The names are
a bit odd, and they're in the CJK compatibility block. E.g., U+3392
looks like it would be used for megahertz. (See, e.g.,
http://www.fileformat.info/info/unicode/char/3392/index.htm)
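
One way to look into what such characters stand for is to ask the
Unicode Character Database for their names and compatibility
decompositions. The following is a minimal Python sketch using only the
standard unicodedata module; the helper name describe() and the
selection of code points are illustrative assumptions, not a
recommendation about which characters to use.

```python
import unicodedata

# Code points taken from the examples in the text; illustrative only.
CANDIDATES = ["\u2033", "\u2103", "\u2125", "\u212B", "\u3392"]

def describe(ch):
    """Return (code point, Unicode name, NFKC form) for one character."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )

for ch in CANDIDATES:
    print(describe(ch))
```

Under NFKC, U+3392 folds to the plain letters "MHz", which at least
hints that it is a compatibility character rather than a preferred unit
symbol. That is an inference from the decomposition data, not a
statement of intended use.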

I also think it very important to provide guidance on how to make
explicit the difference between decimal (SI) and binary prefixes, if
the encoder so desires. (I.e., "MB" vs. "MiB". See, e.g.,
http://www.iec.ch/zone/si/si_bytes.htm if interested.)
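
To make the "MB" vs. "MiB" distinction concrete, here is a small Python
sketch of how software might read the two. The function name to_bytes()
and the short prefix tables are assumptions made for illustration; they
are not part of any TEI proposal.

```python
# Decimal (SI) and binary (IEC) prefix multipliers; deliberately incomplete.
SI_PREFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}
IEC_PREFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(quantity, unit):
    """Convert e.g. (1, 'MB') or (1, 'MiB') to a number of bytes."""
    if not unit.endswith("B"):
        raise ValueError("expected a byte unit such as 'MB' or 'MiB'")
    prefix = unit[:-1]
    if prefix == "":
        return quantity
    if prefix in IEC_PREFIXES:
        return quantity * IEC_PREFIXES[prefix]
    if prefix in SI_PREFIXES:
        return quantity * SI_PREFIXES[prefix]
    raise ValueError("unknown prefix: %r" % prefix)

print(to_bytes(1, "MB"), to_bytes(1, "MiB"))
```

The two readings of "one megabyte" differ by 48,576 bytes, which is why
an encoder may want a way to say which one is meant.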


The Unified Code for Units of Measure is specifically designed to
'unify' the other standards that address machine interchange.
Furthermore, it does include the binary prefixes. Thus, it initially
looks like a really good candidate. However, it is expressly intended
for use with limited character sets. In particular, it only takes
characters from 7-bit ASCII. While I understand that no one may be
crying over the loss of U+3392 for megahertz, the inability to use mu
for micro or omega for ohm seems silly.

Furthermore, this specification is very detailed. E.g., in some cases
it forces differentiation of which standard a unit comes from. So,
e.g., there are three different symbols for "inch":

 [in_i] = inch, international = 2.54 cm
 [in_us] = the inch as used in USA from 1893 to 1959 = m/39.37
 [in_br] = the British imperial inch = 2.539998 cm

Not only are these symbols ugly and cumbersome, but I think the *vast*
majority of TEI users do not care which inch they're using. So being
forced to pick one would be an annoying burden.
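
For a sense of scale, the three definitions quoted above can be
compared with exact arithmetic. This is a Python sketch; the dictionary
and the function relative_difference() are hypothetical names, and the
values are simply the ones given in the list above.

```python
from fractions import Fraction

# Definitions as quoted in the text, expressed in metres.
INCH_IN_METRES = {
    "[in_i]": Fraction(254, 10000),            # 2.54 cm
    "[in_us]": Fraction(100, 3937),            # m/39.37
    "[in_br]": Fraction(2539998, 100000000),   # 2.539998 cm
}

def relative_difference(symbol, reference="[in_i]"):
    """Relative difference from the reference inch, as a float."""
    ref = INCH_IN_METRES[reference]
    return float((INCH_IN_METRES[symbol] - ref) / ref)

for symbol in INCH_IN_METRES:
    print(symbol, relative_difference(symbol))
```

The US and British inches differ from the international inch by
roughly two parts per million or less, which is well below the
precision of most measurements a TEI encoder is likely to transcribe.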


Unless someone knows more about this or has a better idea, I'm going
to suggest that for now the documentation for this class should list
some of the most common symbols in the "suggested values include"
list, and suggest users refer to a standard, and list a bunch of
possible standards (perhaps in the bibliography).

Later, after we've finished straightening out the classes, I think 2
or 3 of us should come up with a concrete proposal for how to encode
units in a Unicode environment. How to put such a proposal into a
tagdoc file is another problem entirely (which I foreshadowed with
"problems with our declaration system", above :-), because the values
of ident= of <valItem> need to be xsd:Names. This suggestion defers
having to deal with this, too.
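
To illustrate the xsd:Name constraint, the following Python sketch
tests a few candidate unit symbols against an ASCII-only approximation
of the XML Name production. The real production also admits many
non-ASCII letters, so this pattern is an assumption made for brevity
and not a full validator; is_ascii_xml_name() is a hypothetical helper.

```python
import re

# ASCII-only approximation of the XML Name production used by xsd:Name.
# Non-ASCII name characters are deliberately ignored in this sketch.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_ascii_xml_name(value):
    """True if value matches the ASCII subset of XML Name."""
    return ASCII_NAME.fullmatch(value) is not None

for symbol in ["mg", "in_i", "[in_i]", "m/s"]:
    print(symbol, is_ascii_xml_name(symbol))
```

Symbols such as "[in_i]" or "m/s" fail because "[" and "/" are not
name characters, so they could not be given as ident= values under the
current constraint.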


Regularization vs. Normalization
-------------- --- -------------
Second is the question of whether these attributes are used for
regularization or normalization. My suggestion is that the Guidelines
explicitly state that these attributes can be used either for
regularization:

  The following has been recommended in neuralgia of the uterus: Mix
  together <measure commodity="belladonna extract" unit="[gr]"
  quantity="1.5">one plus a half grain of alcoholic extract of
  belladonna</measure> and ...

or normalization:

  The following has been recommended in neuralgia of the uterus: Mix
  together <measure commodity="belladonna extract" unit="mg"
  quantity="97.2">one plus a half grain of alcoholic extract of
  belladonna</measure> and ...

and that we have no recommendation for doing both simultaneously.
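
The normalized quantity in the second example follows from the
definition of the grain as exactly 64.79891 mg. A short Python sketch
of the arithmetic, with grains_to_mg() as a hypothetical helper and
rounding to one decimal place as an assumption:

```python
from decimal import Decimal, ROUND_HALF_UP

# One grain is defined as exactly 64.79891 mg.
MG_PER_GRAIN = Decimal("64.79891")

def grains_to_mg(grains, places="0.1"):
    """Convert grains to milligrams, rounded to the given precision."""
    mg = Decimal(str(grains)) * MG_PER_GRAIN
    return mg.quantize(Decimal(places), rounding=ROUND_HALF_UP)

print(grains_to_mg("1.5"))
```

So unit="[gr]" quantity="1.5" and unit="mg" quantity="97.2" describe
the same amount; the first regularizes what the text says, the second
normalizes it to a different unit.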


type=, anyone?
------ -------
If the class subcommittee decided whether the new attributes would
replace type= of <measure> or be added alongside it, this
decision was not recorded in the minutes. Unless someone can think of
a good reason to drop type= of <measure>, I'm going to suggest we
keep it on the somewhat feeble grounds that it can't hurt much, and
it may be quite helpful if we've overlooked something.


I plan to have these three suggestions wrapped into the tagdoc
available on Sourceforge within a few hours.



