[tei-council] idno, xml:lang, ref and att.pointing

Piotr Bański bansp at o2.pl
Fri Sep 16 05:57:53 EDT 2011


I'm happy that these issues got raised; it's an interesting discussion.

Looking at the end of the thread, I think I agree with Kevin's and
Laurent's sentiment for treating certain elements as marking islands /
black boxes for linguistic processing.

And then I would agree with Stuart's (III) to mark non-linguistic
content in cases where you would otherwise expect linguistic content.

What I'm thinking about is e.g.:
* defining <idno>, among others, as an 'island' (an extralinguistic
label, in this case; you don't want to translate it, or it will lose its
identifying function);
* using xml:lang="zxx" on items (<p>, <span>, etc.) that you would be
tempted to process linguistically otherwise.
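
As a rough illustration of the two points above, here is a minimal Python sketch of how a processor might collect only the text meant for linguistic processing. The ISLANDS set, the function name linguistic_text, and the sample fragment (including the placeholder ISBN) are assumptions made up for the example, not anything the Guidelines define.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical list of elements treated as islands / black boxes.
ISLANDS = {"idno"}


def linguistic_text(elem, inherited_lang=None):
    """Yield text chunks that are candidates for linguistic processing."""
    tag = elem.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
    if tag in ISLANDS:
        return  # skip the whole island, descendants included
    lang = elem.get(XML_LANG, inherited_lang)  # xml:lang is inherited
    keep = lang != "zxx"  # "zxx" marks non-linguistic content
    if keep and elem.text:
        yield elem.text
    for child in elem:
        # A child may override xml:lang, so always look inside it.
        yield from linguistic_text(child, lang)
        # The tail text after a child belongs to the current element.
        if keep and child.tail:
            yield child.tail


SAMPLE = (
    "<bibl>"
    "<title>Some title</title>"
    '<idno type="ISBN">978-0-00-000000-2</idno>'  # placeholder value
    '<p xml:lang="zxx">http://example.org/x</p>'
    "<note>A remark.</note>"
    "</bibl>"
)

root = ET.fromstring(SAMPLE)
chunks = [t.strip() for t in linguistic_text(root) if t.strip()]
print(chunks)
```

The island test is applied before the language test so that an <idno> is skipped even when nobody has bothered to put xml:lang="zxx" on it.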

One more comment:
> make it clear that
>
> <idno type="XXX">XXX:YYYYYYY</idno>
>
> is syntactic sugar for
>
> <ref url="XXX:YYYYYYY"/>
>
> when XXX (matched case insensitively) is a standard or commonly used
> URI scheme. See
> https://secure.wikimedia.org/wikipedia/en/wiki/URI_scheme

I don't see how <idno> could ever serve as syntactic sugar for a <ref>
-- <idno> has the extra semantics of being a label, while <ref> is a
pointer from a (possibly zero-length) span of text.
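
For what it's worth, the surface condition in the quoted proposal (the type equals the URI scheme of the content, ignoring case) is easy to test mechanically. The sketch below assumes a hypothetical helper named type_matches_scheme and uses placeholder identifier values; it only checks that condition and says nothing about whether the <idno> is meant as a label or as a pointer.

```python
from urllib.parse import urlsplit


def type_matches_scheme(idno_type, idno_text):
    """Return True if idno_type equals the URI scheme of idno_text, ignoring case."""
    # urlsplit() returns the scheme in lower case, or "" if there is none.
    scheme = urlsplit(idno_text.strip()).scheme
    return bool(scheme) and scheme == idno_type.lower()


# Placeholder values, not real identifiers.
print(type_matches_scheme("URN", "urn:isbn:0000000000"))
print(type_matches_scheme("ISBN", "0000000000"))
```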

And going even further (sigh), at some point we might want to pick on
the name of @url, especially when/if we want it to store URNs...

Best,

  Piotr

On 15/09/11 07:42, Stuart A. Yeates wrote:
> I'm currently doing some work with automatic language detection (as
> per my thesis), and am seeing interesting features in headers. The
> features are worst (or perhaps most consistent) with the idno tag.
> 
> The underlying problem is that this tag is most commonly used with
> non-linguistic text (i.e. URLs, ISBNs, DOIs, etc), but current TEI
> practice doesn't include using xml:lang="" (meaning unknown) or
> xml:lang="zxx" (meaning non-linguistic content) for such text. The
> character string "http:" (for example) is arguably English, but when
> it appears in a text whose script doesn't include the letter 'h', it is
> clearly out of place and ends up corrupting the language model I'm
> building.
> 
> Supplemental issues are: (a) that URLs are being used with no
> indication of whether they're URL-encoded and (b) that the ref
> and idno tags are used in practice to do very similar things, but idno
> doesn't have access to att.pointing.
> 
> I would like to suggest that the definition of idno be updated to
> 
> (I)
> 
> make it clear that
> 
> <idno type="XXX">XXX:YYYYYYY</idno>
> 
> is syntactic sugar for
> 
> <ref url="XXX:YYYYYYY"/>
> 
> when XXX (matched case insensitively) is a standard or commonly used
> URI scheme. See
> https://secure.wikimedia.org/wikipedia/en/wiki/URI_scheme
> 
> Ideally I'd like to switch to:
> 
> <idno type="XXX"><ref url="XXX:YYYYYYY"/></idno> or
> <idno type="XXX" url="XXX:YYYYYYY"/>
> 
> for representing this. But that may be a little disruptive.
> 
> (II)
> 
> Recommend that ISBNs, ISSNs, etc, be represented as URNs and fit
> within the above. See
> https://secure.wikimedia.org/wikipedia/en/wiki/Uniform_Resource_Name
> 
> (III)
> 
> Recommend the use of xml:lang="" or xml:lang="zxx" for content whose
> language is unknown and for non-linguistic content, respectively.
> 
> cheers
> stuart
> _______________________________________________
> tei-council mailing list
> tei-council at lists.village.Virginia.EDU
> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
> 
> PLEASE NOTE: postings to this list are publicly archived
> 

