[tei-council] Internationalised domains

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Thu Oct 6 23:59:35 EDT 2011


I'm coming around to Stuart's point of view on this, and a reading of 
RFC 3986 makes me think I was quite mistaken about the form of URIs. 
Let me try to summarize the situation.  There are a few separate matters:

A) Attribute values of the data.pointer macro must follow RFC 3986. 
Section 2 of that document mandates use of a subset of US-ASCII for 
URIs, and sections 1.2.1 and 2.1 say you can use percent encoding for 
characters outside of this subset.  For some reason, the validation 
Stuart ran on some non-US-ASCII data on 22 September didn't actually 
prevent use of these characters in such attributes in his TEI documents. 
Not sure whether this is a fault of the validation tool(s), the 
schema(s), or ultimately the appropriate TEI ODD, but someone who 
understands these things better than me should see whether an ODD needs 
changing.  If it's really a tool problem, it's not ideal to have to add 
a note to P5 reminding users about percent encoding, but P5 does have 
occasional notes about common tools, so I wouldn't stand in the way of it.
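
For anyone who wants to see what percent encoding amounts to in 
practice, here is a minimal Python sketch.  The function name 
percent_encode_path is made up for this example, and it assumes the 
input is only the path part of a URI, with non-ASCII characters 
converted to UTF-8 bytes before escaping.  The sample path is taken 
from one of the wikisource URLs quoted further down.

```python
from urllib.parse import quote


def percent_encode_path(path: str) -> str:
    """Percent-encode a URI path (hypothetical helper for illustration).

    Every character outside the RFC 3986 unreserved set is replaced by
    %XX triplets of its UTF-8 bytes; "/" is kept as the path separator.
    """
    return quote(path, safe="/")


# "é" is not in the US-ASCII subset, so it becomes two escaped bytes.
print(percent_encode_path("/wiki/Bartolomé_Esteban_Murillo"))
# prints "/wiki/Bartolom%C3%A9_Esteban_Murillo"
```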

B) A number of attributes (such as many in att.global-linking) take 1–∞ 
occurrences of data.pointer separated by whitespace.  This is fine as 
long as you use percent-encoding as in (A), but if you don't, processors 
can't tell apart whitespace that is supposed to be within a URI and 
whitespace that is supposed to separate data.pointer values.  I think 
people just need to follow the rules of RFC 3986 here.  I don't see any 
way we can validate the value of these attributes to make sure that 
whitespace is properly escaped, though we could possibly test for proper 
percent encoding of other characters.
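
To make the limits of such a test concrete, here is a rough Python 
sketch.  The function check_pointers and the example.org URIs are 
invented for illustration, and the check is only at the level of 
characters (it does not implement the full RFC 3986 grammar).  It can 
flag a stray "%" or a non-ASCII character, but an unescaped space 
simply yields two tokens that each look fine.

```python
import re

# Characters RFC 3986 permits in a URI: unreserved and reserved
# characters, plus "%" only when followed by two hex digits.
_ALLOWED = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})+"
)


def check_pointers(value: str) -> list:
    """Split an attribute value on whitespace and report, per token,
    whether it contains only permitted characters and %XX triplets."""
    return [(tok, _ALLOWED.fullmatch(tok) is not None) for tok in value.split()]


# A malformed percent sign is detectable ...
print(check_pointers("#a1 http://example.org/100%"))
# ... but an unescaped space is not: it silently produces two tokens.
print(check_pointers("http://example.org/te komiti"))
```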

C) There are attributes in TEI such as @key which need not take a URI as 
the attribute value but which some people like to put URIs in.  In such 
cases, there is nothing in P5 which prescribes -- and certainly nothing 
that would validate -- a URI in such an attribute to verify whether it 
follows RFC 3986 by properly using percent encoding.  If you had a set 
of documents with URIs in @key, this would obviously be useful; 
otherwise, as in Stuart's most recent message below, you wouldn't know 
whether the URI needs further escaping before processing.  However, it 
seems to me that if someone wants to use a URI here, they should create 
a customization of the TEI in which the datatype for @key or other such 
attributes is changed to be data.pointer.  And then, as in (A), we hope 
a validation tool would catch improperly escaped values.
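
The ambiguity is easy to reproduce.  The following Python sketch uses 
the two Wikipedia URLs from Stuart's most recent message; the helper 
name escape_once is an assumption made for this example.  A processor 
that doesn't know whether the stored value is already escaped will 
handle one of the two forms incorrectly.

```python
from urllib.parse import quote, unquote


def escape_once(uri: str) -> str:
    """Percent-encode a URI string, leaving ":" and "/" alone.
    The function cannot tell whether its input was already escaped."""
    return quote(uri, safe=":/")


raw = "https://secure.wikimedia.org/wikipedia/en/wiki/%"
already_escaped = "https://secure.wikimedia.org/wikipedia/en/wiki/%25"

# Right answer if the stored value was the raw form.
print(escape_once(raw) == already_escaped)   # prints True
# Wrong answer if it was already escaped: "%" is escaped a second time.
print(escape_once(already_escaped))          # ends in "%2525"
# Unescaping the escaped form gives back the raw form.
print(unquote(already_escaped) == raw)       # prints True
```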

D) As far as I can tell, Punycode offers an alternative to percent 
encoding for representing strings that don't conform to RFC 3986.  Both 
syntaxes are opaque to humans, so I don't think there's really an 
advantage to one over the other -- that is, no advantage to recommending 
RFC 3492 (Punycode) over RFC 3986 (which prescribes percent encoding). 
But I fear that I have missed the point of Punycode here.
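
For comparison, here is a small Python sketch of the two encodings 
side by side.  It is a simplification under stated assumptions: 
host_to_ascii is a made-up helper that applies the raw Punycode codec 
to each label of a host name and adds the "xn--" prefix, skipping the 
normalisation and validation steps of full IDNA processing.  If 
Stuart's earlier description is right (Punycode in the domain part, 
percent escaping in the file part), the two apply to different 
components of a URI rather than being interchangeable.

```python
from urllib.parse import quote


def host_to_ascii(host: str) -> str:
    """Encode each non-ASCII label of a host name with Punycode
    (RFC 3492) and prefix it with "xn--"; ASCII labels pass through."""
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


host = "موقع.وزارة-الاتصالات.مصر"

# Host name: one compact ASCII label per original label.
print(host_to_ascii(host))
# Same first label, percent-encoded as UTF-8 bytes instead.
print(quote("موقع"))   # prints "%D9%85%D9%88%D9%82%D8%B9"
```

The first line of output can be compared against the xn-- form quoted 
in the thread below.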

--Kevin

On 10/6/11 7:42 PM, Stuart A. Yeates wrote:
> I've just come across a very interesting example:
>
> https://secure.wikimedia.org/wikipedia/en/wiki/%25 /
> https://secure.wikimedia.org/wikipedia/en/wiki/%
>
> This is a wiki page about the percent symbol. If you don't know
> whether or not it's escaped, you can end up with a malformed URL, and
> the obvious heuristic ('if it has a percent sign it's escaped') is
> wrong.
>
> cheers
> stuart
>
> On Tue, Sep 27, 2011 at 1:07 PM, Stuart A. Yeates <syeates at gmail.com> wrote:
>> I don't know about anyone else's TEI, but here is a selection of URLs
>> with escaped characters, drawn from the TEI we already have live at
>> the NZETC. None of these URLs were coined by us, but by other
>> universities, wikimedia and various units of government. At least one
>> contains escaped spaces. Based on URLs like these, getting escaping
>> right is a priority for us. I've separated each by a double newline in
>> case email mangles them.
>>
>> http://www.jps.auckland.ac.nz/document/Volume_2_1893/Volume_2%2C_No.1%2C_March_1893/The_genealogy_of_the_Pomare_Family_of_Tahiti%2C_from_the_papers_of_the_Rev._J._M._Orsmond%2C_with_notes_thereon_by_S._Percy_Smith%2C_p_25-_42/p1?action=null
>>
>> http://en.wikisource.org/w/index.php?title=Catholic_Encyclopedia_(1913)/Bartolom%C3%A9_Esteban_Murillo&amp;oldid=2142578
>>
>> http://paperspast.natlib.govt.nz/cgi-bin/paperspast?a=d&cl=search&d=TO18961128.2.30&srpos=11&e=-------100--1----2%22the+angel+isafrel%22--http://www.austlit.edu.au/run?ex=ShowAgent&
>>
>> http://www.natlib.govt.nz/about-us/friends-advisors/komiti-maori/?searchterm=te%20komiti%20maori
>>
>> http://en.wikipedia.org/wiki/The_March_%281945%29
>>
>> http://www.nzhistory.net.nz/search?keys=%22john+a.+lee%22&amp;op.x=7&amp;op.y=16&amp;op=Search
>>
>> cheers
>> stuart
>>
>>
>> On Tue, Sep 27, 2011 at 4:44 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>> This only applies if people are silly enough to use whitespace in URIs.
>>> And if they're linking to a resource with such a URI, and they (for
>>> instance) copy-paste it from a browser URI box, it'll come with
>>> percent-escapes anyway.
>>>
>>> I really don't think this is an issue, but if we want to add a note to
>>> the effect that URIs containing whitespace should be appropriately
>>> escaped, I think that would be enough.
>>>
>>> Cheers,
>>> Martin
>>>
>>> On 11-09-25 12:41 PM, Stuart A. Yeates wrote:
>>>> Full UTF-8 in the file part of URIs would seem to be a disaster for
>>>> us. Without whitespace being escaped we can't have whitespace
>>>> separated lists of URLs, as the definition of @corresp as "1–∞
>>>> occurrences of data.pointer separated by whitespace" no longer works?
>>>>
>>>> cheers
>>>> stuart
>>>>
>>>> On Fri, Sep 23, 2011 at 9:03 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>>> I think I see the source of the confusion. Older W3C drafts seem to have
>>>>> explicitly addressed the issue of encoding URIs in US-ASCII:
>>>>>
>>>>> <http://www.w3.org/TR/2001/WD-charmod-20010126/#sec-URIs>
>>>>>
>>>>> but that section seems to have disappeared from the current draft:
>>>>>
>>>>> <http://www.w3.org/TR/charmod/>
>>>>>
>>>>> which, on a quick reading, leaves me with the impression that UTF-8,
>>>>> UTF-16 etc. are acceptable encodings.
>>>>>
>>>>> Cheers,
>>>>> Martin
>>>>>
>>>>> On 11-09-22 12:25 PM, Stuart A. Yeates wrote:
>>>>>> I was under the impression that non-latin-1 wasn't allowed in
>>>>>> data.pointer (and looking through the relevant standards I still can't
>>>>>> see how it is), but such things seem to validate, so I guess you are.
>>>>>>
>>>>>> So I'd like to apologize for my misunderstanding and withdraw
>>>>>> my suggestion.
>>>>>>
>>>>>> cheers
>>>>>> stuart
>>>>>>
>>>>>> On Thu, Sep 22, 2011 at 4:03 AM, Kevin Hawkins
>>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>>> I still don't see why Stuart wouldn't simply put this in the TEI:
>>>>>>>
>>>>>>> <name sameAs="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>>>> Communication and Information Technology</name>
>>>>>>>
>>>>>>> <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>>>
>>>>>>> and be done with it.  Generation of percent encoding and Punycode would
>>>>>>> be done by XSLT that produces whatever is used by the delivery system.
>>>>>>>
>>>>>>> --Kevin
>>>>>>>
>>>>>>> On 9/20/2011 11:27 PM, Stuart A. Yeates wrote:
>>>>>>>> The situations I am trying to avoid are:
>>>>>>>>
>>>>>>>> <name sameAs="urn:example:%D9%85%D9%88%D9%82%D8%B9.%D9%88%D8%B2%D8%A7%D8%B1%D8%A9-%D8%A7%D9%84%D8%A7%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA.%D9%85%D8%B5%D8%B1"
>>>>>>>> copyOf="urn:example:xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"
>>>>>>>> corresp="http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/"
>>>>>>>> key="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>>>>> Communication and Information Technology</name>
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> <idno>http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/</idno>
>>>>>>>> vs <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>>>>
>>>>>>>> etc
>>>>>>>>
>>>>>>>> URLs are already using punycode in the domain part and percent
>>>>>>>> escaping in the file part (at least when they're used in data.pointer
>>>>>>>> ), and XML has some pretty strong dependencies on URLs, so neither can
>>>>>>>> be prohibited without serious consequences.
>>>>>>>>
>>>>>>>> Both punycode and percent encoding are mappings of UTF-8, they can be
>>>>>>>> converted back and forth with a 1:1 mapping. They are not violations
>>>>>>>> of the "use UTF-8" rule.
>>>>>>>>
>>>>>>>> cheers
>>>>>>>> stuart
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 21, 2011 at 10:21 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>>>>>>> I agree. I think punycode is a temporary solution to problems with
>>>>>>>>> Internet infrastructure and user-agent limitations; if it's to be used,
>>>>>>>>> it should be generated during output processing, rather than being part
>>>>>>>>> of the core document. TEI XML should be in UTF-8, I think.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>> On 11-09-20 03:12 PM, Kevin Hawkins wrote:
>>>>>>>>>> I guess what I'm saying is that Punycode is prescribed for use with the
>>>>>>>>>> Domain Name System, but our TEI documents might outlive DNS or be used
>>>>>>>>>> in a system that doesn't use DNS.  After all, even URIs (as
>>>>>>>>>> prescribed in RFC 3986) give DNS as an example of a name registry
>>>>>>>>>> mechanism, not the only one.
>>>>>>>>>>
>>>>>>>>>> We tie ourselves to a few external standards (maintained by the W3C)
>>>>>>>>>> which may become obsolete at some point, but I'm not sure whether we
>>>>>>>>>> should add systems maintained by ICANN to the list.
>>>>>>>>>>
>>>>>>>>>> --Kevin
>>>>>>>>>>
>>>>>>>>>> On 9/20/2011 2:31 PM, Stuart A. Yeates wrote:
>>>>>>>>>>> Punycode is already required (and happens automatically with modern
>>>>>>>>>>> tools and formats) for URIs. View the source of the (UTF-8) web page
>>>>>>>>>>> of my example website to see what I mean.
>>>>>>>>>>>
>>>>>>>>>>> The issue is when people put URIs in free text fields where the
>>>>>>>>>>> tools are unaware that these are URIs and expect them to 'just work'.
>>>>>>>>>>>
>>>>>>>>>>> cheers
>>>>>>>>>>> stuart
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 21, 2011 at 1:26 AM, Kevin Hawkins
>>>>>>>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>>>>>>>> I'm not sure about prescribing use of RFC 3492.  This seems to me like
>>>>>>>>>>>> prescribing use of US-ASCII with character entity references instead of
>>>>>>>>>>>> UTF-8 within XML documents to ensure that we can use our documents with
>>>>>>>>>>>> a full range of software tools -- something that fewer and fewer people
>>>>>>>>>>>> support doing.
>>>>>>>>>>>>
>>>>>>>>>>>> On 9/20/2011 4:49 AM, Stuart A. Yeates wrote:
>>>>>>>>>>>>> Currently domain names in TEI can occur in typed fields (such as
>>>>>>>>>>>>> data.pointer) or in many other fields where type checking is more
>>>>>>>>>>>>> relaxed (or non-existent). I would like to propose the following note
>>>>>>>>>>>>> to appear somewhere in the standard (I'm thinking the data.pointer
>>>>>>>>>>>>> page, but I'm open to suggestions). The URL in the example is perhaps
>>>>>>>>>>>>> the best-known punycode URL (see
>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Masr_%28domain_name%29 ), but if Arabic
>>>>>>>>>>>>> script causes problems in the publishing process I can probably find a
>>>>>>>>>>>>> more Latin-esque one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> cheers
>>>>>>>>>>>>> stuart
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----
>>>>>>>>>>>>>
>>>>>>>>>>>>> Internationalised domains containing non-ASCII characters should
>>>>>>>>>>>>> always be escaped using RFC 3492 syntax ("punycode"). Thus
>>>>>>>>>>>>> http://موقع.وزارة-الاتصالات.مصر/ is written
>>>>>>>>>>>>> http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/ Such escaping
>>>>>>>>>>>>> permits internationalised domains to be used with a full range of
>>>>>>>>>>>>> software tools.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>>>
>>>>>>>>>>>>> PLEASE NOTE: postings to this list are publicly archived
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Martin Holmes
>>>>>>>>> University of Victoria Humanities Computing and Media Centre
>>>>>>>>> (mholmes at uvic.ca)

