[tei-council] idno, xml:lang, ref and att.pointing
Stuart A. Yeates
syeates at gmail.com
Thu Sep 15 01:42:03 EDT 2011
I'm currently doing some work with automatic language detection (as
per my thesis), and am seeing interesting features in headers. The
features are worst (or perhaps more consistent) with the idno tag.
The underlying problem is that this tag is most commonly used with
non-linguistic text (i.e. URLs, ISBNs, DOIs, etc), but current TEI
practice doesn't include using xml:lang="" (meaning unknown) or
xml:lang="zxx" (meaning non-linguistic content) for such text. The
character string "http:" (for example) is arguably English, but when
it appears in a script which doesn't include the letter 'h' it is clearly
wrong and ends up corrupting the language model of the language I'm
Supplemental issues are: (a) that URLs are used with no indication of
whether they are URL-encoded, and (b) that the ref and idno tags are
used in practice to do very similar things, but idno doesn't have
access to att.pointing.
I would like to suggest that the definition of idno be updated to
make it clear that
is syntactic sugar for
when XXX (matched case insensitively) is a standard or commonly used
URI scheme. See
Ideally I'd like to switch to:
<idno type="XXX"><ref url="XXX:YYYYYYY"/></idno> or
<idno type="XXX" url="XXX:YYYYYYY"/>
for representing this. But that may be a little disruptive.
Recommend that ISBNs, ISSNs, etc, be represented as URNs and fit
within the above. See
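As a rough sketch of what this recommendation might look like in
practice, assuming the type attribute carries the URI scheme name as in
the XXX pattern above. The ISBN and ISSN values are placeholders, not
real identifiers, and the bibl wrapper is only there to make the
fragment well-formed:

```xml
<!-- Hypothetical example: placeholder identifiers, not real ones. -->
<bibl>
  <!-- An ISBN expressed as a URN rather than as a bare number. -->
  <idno type="urn">urn:isbn:0-000-00000-0</idno>
  <!-- An ISSN expressed the same way. -->
  <idno type="urn">urn:issn:0000-0000</idno>
</bibl>
```

A processor could then treat every such idno uniformly as a URI,
instead of needing a separate rule for each identifier type.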
Recommend the use of xml:lang="" for content whose language is unknown
and xml:lang="zxx" for non-linguistic content.
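A minimal illustration of the xml:lang recommendation, with a
placeholder URL and title (both are assumptions for the sake of the
example, not taken from any real header):

```xml
<!-- Hypothetical fragment: the title and URL are placeholders. -->
<bibl xml:lang="en">
  <title>An Example Title</title>
  <!-- Non-linguistic content: the URL says nothing about English. -->
  <idno type="URL" xml:lang="zxx">http://www.example.org/texts/42</idno>
  <!-- Empty xml:lang overrides the inherited "en": language unknown. -->
  <note xml:lang="">text whose language has not been determined</note>
</bibl>
```

With markup like this, a language-detection tool could skip anything
marked "zxx" and avoid attributing the "http:" string to the language
inherited from the enclosing element.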