[tei-council] idno, xml:lang, ref and att.pointing
Piotr Bański
bansp at o2.pl
Fri Sep 16 05:57:53 EDT 2011
I'm happy that these issues got raised; it's an interesting discussion.
Looking at the end of the thread, I think I agree with Kevin's and
Laurent's sentiment for treating certain elements as marking islands /
black boxes for linguistic processing.
And then I would agree with Stuart's (III) to mark non-linguistic
content in cases where you would otherwise expect linguistic content.
What I'm thinking about is e.g.:
* defining <idno>, among others, as an 'island' (an extralinguistic
label, in this case; you don't want to translate it, or it will lose its
identifying function);
* using xml:lang="zxx" on items (<p>, <span>, etc.) that you would be
tempted to process linguistically otherwise.
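To make the two points above concrete, here is a minimal Python sketch of how a language-processing tool might honour them. The ISLANDS set, the linguistic_text helper and the sample fragment are all hypothetical, made up for illustration; nothing here is prescribed by the Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: the set of elements treated as islands is a local choice.
ISLANDS = {f"{{{TEI_NS}}}idno"}


def linguistic_text(elem, islands=ISLANDS):
    """Collect text chunks for linguistic processing.

    Skips island elements and anything marked xml:lang="zxx".
    Simplification: a skipped subtree is dropped whole, so a descendant
    that re-declares a real language inside it is not recovered.
    """
    if elem.tag in islands or elem.get(XML_LANG) == "zxx":
        return []
    parts = []
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.extend(linguistic_text(child, islands))
        # Tail text belongs to the parent, so keep it even if the child was skipped.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


# Hypothetical sample fragment.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" xml:lang="pl">
  <p>Zobacz <idno type="URI">http://example.org/x</idno> dzisiaj.</p>
  <p xml:lang="zxx">0x3F 0x2A 0x11</p>
</div>"""

root = ET.fromstring(SAMPLE)
print(linguistic_text(root))
```

On the sample, only "Zobacz" and "dzisiaj." are kept: the <idno> content is skipped as an island, and the second <p> is skipped because of xml:lang="zxx".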
One more comment:
> make it clear that
>
> <idno type="XXX">XXX:YYYYYYY</idno>
>
> is syntactic sugar for
>
> <ref url="XXX:YYYYYYY"/>
>
> when XXX (matched case insensitively) is a standard or commonly used
> URI scheme. See
> https://secure.wikimedia.org/wikipedia/en/wiki/URI_scheme
I don't see how <idno> could ever serve as syntactic sugar for a <ref>
-- <idno> has the extra semantics of being a label, while <ref> is a
pointer from a (possibly zero-length) span of text.
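For reference, the condition in the quoted proposal (the type value matching, case-insensitively, the scheme prefix of the content) can be sketched as below. The idno_as_uri helper and the KNOWN_SCHEMES subset are hypothetical and purely illustrative. The check only tests the textual condition; it says nothing about whether the label semantics of <idno> survive the rewrite.

```python
# Assumption: an illustrative subset of "standard or commonly used" schemes.
KNOWN_SCHEMES = {"http", "https", "ftp", "urn", "doi"}


def idno_as_uri(idno_type, content):
    """Return content if it meets the quoted condition, else None.

    Condition: content starts with "<scheme>:", the scheme equals the
    idno type case-insensitively, and the scheme is in KNOWN_SCHEMES.
    """
    scheme, sep, _rest = content.partition(":")
    if not sep:
        return None  # no scheme prefix at all
    if scheme.lower() != idno_type.lower():
        return None  # type and scheme disagree
    if scheme.lower() not in KNOWN_SCHEMES:
        return None  # not a scheme we recognise
    return content


print(idno_as_uri("URN", "urn:isbn:0451450523"))
print(idno_as_uri("ISBN", "0451450523"))
```

The first call passes the check and returns the string unchanged; the second returns None, because a bare ISBN has no scheme prefix.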
And going even further (sigh), at some point we might want to pick on
the name of @url, especially when/if we want it to store URNs...
Best,
Piotr
On 15/09/11 07:42, Stuart A. Yeates wrote:
> I'm currently doing some work with automatic language detection (as
> per my thesis), and am seeing interesting features in headers. The
> features are worst (or perhaps most consistent) with the idno tag.
>
> The underlying problem is that this tag is most commonly used with
> non-linguistic text (i.e. URLs, ISBNs, DOIs, etc), but current TEI
> practice doesn't include using xml:lang="" (meaning unknown) or
> xml:lang="zxx" (meaning non-linguistic content) for such text. The
> character string "http:" (for example) is arguably English, but when
> it appears in text whose script doesn't include the letter 'h', it is
> clearly wrong and ends up corrupting the language model I'm
> building.
>
> Supplemental issues are: (a) that URLs are being used with no
> indication of whether they're being URL-encoded and (b) that the ref
> and idno tags are used in practice to do very similar things, but idno
> doesn't have access to att.pointing.
>
> I would like to suggest that the definition of idno be updated to
>
> (I)
>
> make it clear that
>
> <idno type="XXX">XXX:YYYYYYY</idno>
>
> is syntactic sugar for
>
> <ref url="XXX:YYYYYYY"/>
>
> when XXX (matched case insensitively) is a standard or commonly used
> URI scheme. See
> https://secure.wikimedia.org/wikipedia/en/wiki/URI_scheme
>
> Ideally I'd like to switch to:
>
> <idno type="XXX"><ref url="XXX:YYYYYYY"/></idno> or
> <idno type="XXX" url="XXX:YYYYYYY"/>
>
> for representing this. But that may be a little disruptive.
>
> (II)
>
> Recommend that ISBNs, ISSNs, etc, be represented as URNs and fit
> within the above. See
> https://secure.wikimedia.org/wikipedia/en/wiki/Uniform_Resource_Name
>
> (III)
>
> Recommend the use of xml:lang="" or xml:lang="zxx" for content of
> unknown language and for non-linguistic content, respectively.
>
> cheers
> stuart