[tei-council] idno, xml:lang, ref and att.pointing

Stuart A. Yeates syeates at gmail.com
Thu Sep 15 01:42:03 EDT 2011


I'm currently doing some work on automatic language detection (as
per my thesis), and am seeing interesting features in headers. The
features are worst (or perhaps most consistent) with the idno tag.

The underlying problem is that this tag is most commonly used with
non-linguistic text (e.g. URLs, ISBNs, DOIs), but current TEI
practice doesn't include using xml:lang="" (meaning unknown) or
xml:lang="zxx" (meaning non-linguistic content) for such text. The
character string "http:" (for example) is arguably English, but when
it appears in text in a script which doesn't include the letter 'h'
it is clearly wrong, and it ends up corrupting the language model I'm
building for that language.

Supplemental issues are: (a) that URLs are being used with no
indication of whether they're URL-encoded and (b) that the ref and
idno tags are used in practice to do very similar things, but idno
doesn't have access to att.pointing.

I would like to suggest that the definition of idno be updated to:

(I)

make it clear that

<idno type="XXX">XXX:YYYYYYY</idno>

is syntactic sugar for

<ref url="XXX:YYYYYYY"/>

when XXX (matched case insensitively) is a standard or commonly used
URI scheme. See
https://secure.wikimedia.org/wikipedia/en/wiki/URI_scheme

Ideally I'd like to switch to:

<idno type="XXX"><ref url="XXX:YYYYYYY"/></idno> or
<idno type="XXX" url="XXX:YYYYYYY"/>

for representing this. But that may be a little disruptive.
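
A rough sketch of the check in (I), in Python, purely for
illustration. The function name and the KNOWN_SCHEMES set are
hypothetical; a real implementation would take its list of schemes
from a maintained registry rather than hard-coding a handful.

```python
# Hypothetical, deliberately tiny subset of URI schemes (an assumption
# for this sketch, not an authoritative list).
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "urn"}


def idno_as_ref_target(idno_type, content):
    """Return the idno content as a URI when the idno is 'sugar' for a
    ref, i.e. its content starts with a known scheme that matches the
    type attribute case-insensitively. Otherwise return None."""
    text = content.strip()
    scheme, sep, rest = text.partition(":")
    if not sep or not rest:
        return None  # no "scheme:rest" shape at all
    if scheme.lower() != idno_type.strip().lower():
        return None  # type attribute and scheme disagree
    if scheme.lower() not in KNOWN_SCHEMES:
        return None  # not a scheme this sketch knows about
    return text


print(idno_as_ref_target("HTTP", "http://www.tei-c.org/"))
print(idno_as_ref_target("isbn", "0-395-36341-1"))
```

The second call returns None because the bare ISBN carries no scheme
prefix, which is the gap (II) is meant to close.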

(II)

Recommend that ISBNs, ISSNs, etc, be represented as URNs and fit
within the above. See
https://secure.wikimedia.org/wikipedia/en/wiki/Uniform_Resource_Name
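
As a sketch of (II), assuming the urn:isbn and urn:issn forms
described on that page: the helper below (to_urn is a hypothetical
name) only prefixes the identifier and does no checksum validation or
hyphen normalisation.

```python
def to_urn(kind, value):
    """Build a URN string for an ISBN or ISSN.

    Illustrative only: the value is trimmed but otherwise passed
    through unchanged, with no validation of the number itself.
    """
    kind = kind.strip().lower()
    if kind not in ("isbn", "issn"):
        raise ValueError("unsupported identifier type: " + kind)
    return "urn:" + kind + ":" + value.strip()


print(to_urn("ISBN", "0-395-36341-1"))
print(to_urn("issn", "1560-1560"))
```

The result could then be carried as <idno type="urn">...</idno> and
handled by the same scheme check as any other URI.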

(III)

Recommend the use of xml:lang="" or xml:lang="zxx" for content whose
language is unknown and for non-linguistic content, respectively.
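
To show why (III) helps with language modelling, here is a minimal
sketch using Python's standard xml.etree.ElementTree. It assumes
documents are marked up as recommended above; the function name and
the sample fragment are made up for illustration, and TEI namespaces
are ignored.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Values that mark text to keep out of a language model.
SKIP = {"", "zxx"}


def linguistic_text(elem, inherited=None):
    """Collect text nodes, skipping any whose in-scope xml:lang is
    "" (unknown) or "zxx" (non-linguistic). xml:lang is inherited
    from the nearest ancestor that declares it."""
    lang = elem.get(XML_LANG, inherited)
    parts = []
    if lang not in SKIP and elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.extend(linguistic_text(child, lang))
        # Tail text sits after the child but belongs to this element,
        # so it is governed by this element's language.
        if lang not in SKIP and child.tail:
            parts.append(child.tail)
    return parts


sample = ET.fromstring(
    '<bibl xml:lang="fr">Voir '
    '<idno type="http" xml:lang="zxx">http://example.org/x</idno>'
    ' ici</bibl>'
)
print("".join(linguistic_text(sample)))
```

With the idno marked as "zxx", the URL never reaches the text that
feeds the language model, while the surrounding prose is kept.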

cheers
stuart

