[tei-council] xml-colon-thing

Thu Nov 11 18:45:28 EST 2004

> The Character set workgroup has proposed and the Council has (I
> think) accepted that the global lang attribute should be replaced
> by a global xml:lang attribute;

I don't believe Council has approved this; at least, the minutes
don't reflect any such decision.

> ... the Standoff WG has proposed and the Council is (I think)
> still ruminating about the notion that the global "id" attribute
> should be replaced by a global xml:id attribute.

Council explicitly approved replacing id= with xml:id= at the 2004-05
meeting in Ghent.

> I am proposing to add xml:lang as an *alternative* to lang in the
> definition for tei.global.attributes rather than as a replacement
> for it. (And similarly xml:id).

This seems like it could be quite problematic for anyone trying to
write software that does something intelligent with unmodified TEI
P5 files. Under this proposal, authors of such software would need
to be able to process
* ID/IDREF using id=, and
* XPointers using xml:id=, and
* lang= pointing to <language> via ID/IDREF, and
* xml:lang= being an RFC 3066 (or successor) language tag and
  potentially pointing to <language> via ident=
That seems a bit scary. In addition, if I use one methodology and you
use the other, we've got yet another barrier to interchange.
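
To make the burden concrete, here is a rough Python sketch of the
double lookup such software would need if both spellings were allowed.
The helper names (element_id, element_lang) are made up for
illustration and come from no TEI tool, and the sketch ignores
inheritance of language from ancestor elements.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def element_id(el):
    """Return the identifier under either convention (hypothetical helper)."""
    return el.get(XML_NS + "id") or el.get("id")

def element_lang(el):
    """Return (kind, value); kind says how the value must be interpreted."""
    if el.get(XML_NS + "lang") is not None:
        return ("rfc3066-tag", el.get(XML_NS + "lang"))
    if el.get("lang") is not None:
        return ("idref-to-language", el.get("lang"))
    return (None, None)

doc = ET.fromstring(
    '<text><p id="p1" lang="fra">un</p>'
    '<p xml:id="p2" xml:lang="fr">deux</p></text>'
)
results = [(element_id(p), element_lang(p)) for p in doc.iter("p")]
```

Every consumer has to carry both branches, and the two lang branches
yield values that mean different things.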

> I think it is a good idea to hedge our bets about the take up of
> W3C recommendations in this way,

I'm not entirely sure why. It's not as if there are dozens of
processing systems out there that can do the right thing with lang=,
but won't be able to handle xml:lang=. Same for id=, really, with the
obvious exception of validators.

We didn't hedge any bets about the take up of RELAX NG, and it doesn't
have a multinational cartel -- I mean consortium -- of software
companies backing it.

Of course, if it's IP issues you're concerned with ... sigh.

> and it will definitely make the job of moving legacy data to P5 a
> whole lot easier.

My first reaction was "no it won't", but that's not true. It could,
at least for the case of xml:lang=.

For xml:id=, I can't imagine that there won't be software to perform
instance conversion. Heck, if your project has constrained its data
such that an equals sign ("=") is not permitted except as the value
indicator (VI) between an attribute name and its value, the IDREF ->
XPointer conversion can be done with a one-line Perl script[1].
(IDREFS is a little harder.) Even for those projects that do use "="
in places other than as a VI (i.e., in content, attribute values,
comments, PIs, and CDATA marked sections), an XSLT stylesheet to do
the job can't be that difficult, can it? Perhaps my low level of
mastery of XSLT is shining through here, but especially in XSLT 2.0
such a thing should be reasonably easy, no?
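
For what it's worth, a structure-aware version does not look hard even
without XSLT. The Python sketch below walks the parsed tree rather
than the raw text, so a stray "=" in content cannot confuse it, and it
copes with IDREFS by prefixing every whitespace-separated token. The
attribute set is abbreviated from note [1]; the function name
idrefs_to_pointers is made up, and writing an IDREFS value as a
space-separated list of "#" pointers is my assumption about what the
target should look like.

```python
import xml.etree.ElementTree as ET

# Abbreviated; the full list of pointing attributes is in note [1].
POINTER_ATTRS = {"target", "sameAs", "next", "prev", "resp", "who", "hand"}

def idrefs_to_pointers(root):
    """Rewrite ID references as '#id' pointers in place; return count changed."""
    changed = 0
    for el in root.iter():
        for name in POINTER_ATTRS & set(el.attrib):
            tokens = el.attrib[name].split()
            # Leave already-converted tokens alone so a re-run is harmless.
            new = " ".join(t if t.startswith("#") else "#" + t for t in tokens)
            if new != el.attrib[name]:
                el.set(name, new)
                changed += 1
    return changed

doc = ET.fromstring(
    '<body><p id="p1">a = b</p>'
    '<ptr target="p1 p2"/><note resp="LB">see a=b</note></body>'
)
count = idrefs_to_pointers(doc)
```

One caveat: by default ElementTree throws away comments and the
DOCTYPE when parsing, so a real conversion would want a parser that
round-trips them.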

However, if this "choice of lang= or xml:lang= but not both" proposal
is not adopted, then those converting legacy data will have to go
about creating a look-up table that will convert their local lang=
values (which are by definition arbitrary IDREFs) into xml:lang=
values (which will be by definition RFC 3066 or its successor
language tags). While a great many projects will be able to create a
single table capable of being used for all their P4 files, it is
conceivable that a different look-up table would need to be created
for each instance to be converted. Could be a pain.
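
The table itself is trivial; the pain is in filling it. A minimal
Python sketch, in which the table entries, the sample values, and the
function name convert_lang are purely illustrative: anything not in
the table is reported rather than guessed at.

```python
# Hypothetical project table: local lang= value (an IDREF) -> RFC 3066 tag.
LANG_TABLE = {"eng": "en", "fra": "fr", "lat": "la"}

def convert_lang(values, table):
    """Split lang= values into (converted mapping, sorted list of unknowns)."""
    converted, unknown = {}, []
    for v in values:
        if v in table:
            converted[v] = table[v]
        else:
            unknown.append(v)
    return converted, sorted(set(unknown))

converted, unknown = convert_lang(["eng", "fra", "lg7", "eng"], LANG_TABLE)
```

Whatever lands in the unknown list is what somebody has to resolve by
hand, per project or, in the worst case, per file.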

Of course, many projects actually deliberately choose to use RFC 3066
language tags as the value of id= of <language> (and thus of lang=)
already, and thus wouldn't have to do anything special.

<aside>I just checked the WWP text-base, and of the over 850 valid
occurrences of id= of <language>[2], it seems at first glance as
though only 3 are not already valid RFC 3066 language tags. They are
"British", "German", and "zzx", which is our local code for "I don't
know".</aside>

The main problem I foresee pertains to people who have stretched the
meaning of "language" a wee bit, and have things like
  <eg lang="xml">
and
  <formula lang="TeX">
in their files.

> Not least for us!

Who's us? While the TEI-C certainly has lots of data lying around in
P4 format, is there a strong reason to migrate it all to P5? It
appears we have a history of leaving things in older formats -- there
are still files on the website in Waterloo GML and LaTeX. (Although
to somebody's credit (probably Lou's), plain-text equivalents are
usually provided as well.)

Notes
-----
[1] s{(code
      |copyOf
      |depPtr
      |end
      |follow
      |from
      |grpPtr
      |hand
      |lang
      |location
      |mergedin
      |new
      |next
      |old
      |origin
      |parent
      |prev
      |render
      |resp
      |sameAs
      |scheme
      |script
      |since
      |start
      |target
      |to
      |value
      |who)=(["'])([^'" \t\n\r\f]+['"])}
     {$1=$2#$3}xg;
[2] Embarrassingly, there were a few *invalid* specifications, which I
    should probably go fix shortly. Sigh.