[tei-council] datatype issues (part 1)

Syd Bauman Syd_Bauman at Brown.edu
Mon Sep 12 14:28:51 EDT 2005


For the record, I'm leaning towards ISO numeric codes for sex=.

However, the argument in favor of more memorable codes is quite a
strong one.


LB> Yes, it defeats the point of the exercise. Think dates. You can
LB> say <date>wibble</date> if you like, but once you say <date
LB> value="xxxx">wibble</date> your value MUST conform to what the
LB> relevant standard says it should be.

Yes indeed, which is possibly a bit of a problem for some TEI
users, as they may well be interested in calendars other than the
Gregorian, which is all that ISO 8601 purports to represent.
(However, lots of people think it's OK to use it for proleptic
Gregorian, too.)
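
To make the "value MUST conform" point concrete, here is a minimal sketch of checking a value= string, assuming Python and its standard datetime module. The function name is made up for illustration, and it only accepts complete calendar dates, which is a small subset of what ISO 8601 allows. Python's date type is itself proleptic Gregorian, limited to years 1 through 9999.

```python
from datetime import date


def is_iso_calendar_date(value):
    """Return True if value parses as a complete ISO 8601 calendar date.

    Hypothetical helper: partial dates, times, and durations are
    deliberately out of scope here.
    """
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

With this, is_iso_calendar_date("2005-09-12") is accepted and is_iso_calendar_date("wibble") is rejected, while the element content itself stays unconstrained.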


LB> Similarly, with the (one or two) places where sex rears its head
LB> **as a coded attribute value**. It's supposed to be a normalised
LB> value: ISO has normalized in this particular way. If you really
LB> want to retain the ability to say sex="some-arbitrary-code-i-
LB> just-invented" that's a different attribute. We could have both
LB> sex and ISOsex I suppose, but maybe that's going too far.

This seems like circular reasoning. The reason sex= is supposed to be
a normalized value is because the TEI has decided it would be a
useful thing. If TEI (i.e., us, right here, right now) decide that it
is not so useful to normalize against ISO 5218, that's fine. We can
normalize against our own definitions of "m", "f", "u", and "x"
(although some might call that 'regularization', since the
definition, although external to the instance, is not external to the
TEI).

I'm not aware that anybody has recommended arbitrary codes get made
up by the user.


LB> My argument is that iff we are going to go to the trouble of
LB> normalising attribute values, and there is a pre-existing
LB> international norm, then we really have to have much better
LB> reasons not to follow it than to say "it's not intuitive".

I'm inclined to agree. "It's not intuitive" is probably not a
sufficient argument on its own.


LB> Says who? If you're arabic or chinese, why is "m" more intuitive
LB> than "1"? (or "u" than "0")?

That's not really fair, in that if you're Arabic or Chinese you'll
either be dealing with an equally non-intuitive element name (e.g.
"person") and attribute name (i.e. "sex"), or you will have
internationalized your schema, including these values.


LB> We could have much the same argument about whether we should
LB> replace the required values of the xsd "boolean" datatype (which
LB> are "true" and "false") with "0" and "1" or "yes" and "no" ...

Well, yes, but there the XSD values are just as intuitive, so there's
really no argument at all. (And actually, since xsd:boolean takes the
lexical values "true", "false", "1", and "0", if I were to do
anything to change it I'd probably want to restrict it so only one
pair of opposites was permitted.)
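
A rough sketch of that restriction, assuming Python; the names below are invented for illustration and are not part of any TEI or XSD tooling. It compares already whitespace-normalized strings against either the full xsd:boolean lexical space or a single permitted pair.

```python
# The four lexical forms xsd:boolean accepts.
XSD_BOOLEAN = {"true": True, "1": True, "false": False, "0": False}

# Hypothetical restricted datatype: only one pair of opposites is permitted.
RESTRICTED_BOOLEAN = {"true": True, "false": False}


def parse_boolean(lexical, allowed=XSD_BOOLEAN):
    """Map a lexical form to a bool, or raise ValueError if not permitted."""
    try:
        return allowed[lexical]
    except KeyError:
        raise ValueError(f"not a permitted boolean: {lexical!r}") from None
```

Under the default table "1" and "true" mean the same thing; passing allowed=RESTRICTED_BOOLEAN makes "1" and "0" errors.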


LB> p.s. you could in your ODD application redefine tei.data.sex as a
LB> different set of values, of course, ...

Absolutely. Which makes me think this isn't really all that
important.


JC> I'm assuming that when slightly cryptic ISO standards like this
JC> are used for attribute values that there will be a (simplified)
JC> discussion of it in the guidelines and perhaps a pointer to
JC> further information on it?

I think that's a requirement.


SR> If you want your archival XML to have ISO values for sex, but
SR> your editors see "mfu", then you have to use an alternate
SR> authoring DTD, and impose a transformation in your workflow.

In many, many cases this is going to be a really good idea, for lots
of less sexy reasons than this one.
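
Such a workflow step could be sketched roughly as follows, assuming Python's standard xml.etree.ElementTree. The function name is hypothetical, and the pairing of "m", "f", "u", "x" with particular ISO codes is my assumption for illustration, not something anyone has settled.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from mnemonic authoring codes to ISO 5218 codes.
AUTHORING_TO_ISO = {"u": "0", "m": "1", "f": "2", "x": "9"}


def to_archival(xml_text):
    """Rewrite every sex= attribute from authoring codes to ISO codes."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        code = elem.get("sex")
        if code is None:
            continue  # element carries no sex= attribute
        if code not in AUTHORING_TO_ISO:
            raise ValueError(f"unexpected sex value: {code!r}")
        elem.set("sex", AUTHORING_TO_ISO[code])
    return ET.tostring(root, encoding="unicode")
```

A real pipeline would more likely be an XSLT stylesheet, but the shape is the same: editors see the mnemonic values, and the archival copy gets the normalized ones.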


SB> Part of me really wants to just use
SB>   "not known" | "male" | "female" | "not specified"
SB> and avoid the question.

BTW, those are the values that 0, 1, 2, and 9 map to in 5218. I.e.,
use the entire normalized value, not the look-up code to it.
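
As a small illustration of code versus full value, assuming Python: the labels below are the ones quoted in this message and should be checked against the text of the standard before anyone relies on them, and the function name is made up.

```python
# ISO 5218 look-up codes paired with the labels quoted in this message
# (an assumption to verify against the standard itself).
ISO_5218_LABELS = {
    "0": "not known",
    "1": "male",
    "2": "female",
    "9": "not specified",
}


def expand_code(code):
    """Return the full normalized value for an ISO 5218 look-up code."""
    try:
        return ISO_5218_LABELS[code]
    except KeyError:
        raise ValueError(f"not an ISO 5218 code: {code!r}") from None
```

Storing the result of expand_code("2") rather than "2" itself is the "entire normalized value" option.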


LR> There is also the possibility that we keep the TEI values for our
LR> users and use <equiv> to mark the link with ISO values.

SR> Why set the TEI up as a standards-making body in areas for which
SR> standards already exist? As Lou said, the point of @sex is to
SR> produce a standardized form, not something that is readable for
SR> humans.

That seems like it might be a bit of a slippery slope. I mean, one
of the main selling points of XML is that it is human readable. If a
user wanted to store this information in a manner she could never
directly read, she could use a proprietary database instead.
(Although I readily admit that in this particular case the values
are not really unreadable; they just require the user to take an
extra step to ascertain the meaning, a step they might have to take
in cases of "u" and "x", anyway.)


LR> There are cases where we act as an interface towards communities
LR> which may not like to manipulate formats like numbers for sex.

Right.



