[tei-council] Internationalised domains

Stuart A. Yeates syeates at gmail.com
Thu Oct 6 19:42:47 EDT 2011


I've just come across a very interesting example:

https://secure.wikimedia.org/wikipedia/en/wiki/%25 /
https://secure.wikimedia.org/wikipedia/en/wiki/%

This is a wiki page about the percent symbol. If you don't know
whether or not it's escaped, you can end up with a malformed URL, and
the obvious heuristic ('if it has a percent sign it's escaped') is
wrong.
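
A minimal Python sketch of the ambiguity, using only the standard library's urllib.parse. The helper name looks_escaped is hypothetical and stands in for the naive heuristic; it is not taken from any real tool.

```python
from urllib.parse import quote, unquote

def looks_escaped(segment: str) -> bool:
    """Naive heuristic: treat any percent sign as evidence of escaping."""
    return "%" in segment

title = "%"                      # the wiki page title, not yet escaped
escaped = quote(title, safe="")  # percent-encode it for use in a URL path

# The heuristic misfires on the raw title: it contains a percent sign
# but has not been escaped.
assert looks_escaped(title)

# Escaping changes the string every time it is applied, so guessing the
# state wrongly produces a different (and wrong) URL.
assert escaped == "%25"
assert unquote(escaped) == "%"
assert quote(escaped, safe="") == "%2525"   # escaped twice by mistake
```

The only reliable fix this sketch suggests is to record whether a value is escaped, rather than to infer it from the characters.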

cheers
stuart

On Tue, Sep 27, 2011 at 1:07 PM, Stuart A. Yeates <syeates at gmail.com> wrote:
> I don't know about anyone else's TEI, but here is a selection of URLs
> with escaped characters, drawn from the TEI we already have live at
> the NZETC. None of these URLs were coined by us, but by other
> universities, wikimedia and various units of government. At least one
> contains escaped spaces. Based on URLs like these, getting escaping
> right is a priority for us. I've separated each by a double newline in
> case email mangles them.
>
> http://www.jps.auckland.ac.nz/document/Volume_2_1893/Volume_2%2C_No.1%2C_March_1893/The_genealogy_of_the_Pomare_Family_of_Tahiti%2C_from_the_papers_of_the_Rev._J._M._Orsmond%2C_with_notes_thereon_by_S._Percy_Smith%2C_p_25-_42/p1?action=null
>
> http://en.wikisource.org/w/index.php?title=Catholic_Encyclopedia_(1913)/Bartolom%C3%A9_Esteban_Murillo&amp;oldid=2142578
>
> http://paperspast.natlib.govt.nz/cgi-bin/paperspast?a=d&cl=search&d=TO18961128.2.30&srpos=11&e=-------100--1----2%22the+angel+isafrel%22--http://www.austlit.edu.au/run?ex=ShowAgent&
>
> http://www.natlib.govt.nz/about-us/friends-advisors/komiti-maori/?searchterm=te%20komiti%20maori
>
> http://en.wikipedia.org/wiki/The_March_%281945%29
>
> http://www.nzhistory.net.nz/search?keys=%22john+a.+lee%22&amp;op.x=7&amp;op.y=16&amp;op=Search
>
> cheers
> stuart
>
>
> On Tue, Sep 27, 2011 at 4:44 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>> This only applies if people are silly enough to use whitespace in URIs.
>> And if they're linking to a resource with such a URI, and they (for
>> instance) copy-paste it from a browser URI box, it'll come with
>> percent-escapes anyway.
>>
>> I really don't think this is an issue, but if we want to add a note to
>> the effect that URIs containing whitespace should be appropriately
>> escaped, I think that would be enough.
>>
>> Cheers,
>> Martin
>>
>> On 11-09-25 12:41 PM, Stuart A. Yeates wrote:
>>> Full UTF-8 in the file part of URIs would seem to be a disaster for
>>> us. Without whitespace being escaped we can't have whitespace
>>> separated lists of URLs, as the definition of @corresp as "1–∞
>>> occurrences of data.pointer separated by whitespace" no longer works?
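
A small Python sketch of why unescaped whitespace breaks a whitespace-separated pointer list. The example.org URLs and the helper split_pointers are assumptions made up for illustration; only the splitting rule comes from the @corresp definition quoted above.

```python
from urllib.parse import quote

def split_pointers(attr_value: str) -> list[str]:
    """Split an attribute value on whitespace, as a pointer list is defined."""
    return attr_value.split()

# Hypothetical URLs: the first one has literal spaces in its path.
with_space = "http://example.org/te komiti maori"
plain = "http://example.org/other"

# Unescaped, the spaces are indistinguishable from list separators.
assert split_pointers(f"{with_space} {plain}") == [
    "http://example.org/te", "komiti", "maori", "http://example.org/other",
]

# Percent-escaping the spaces keeps each pointer in one piece.
escaped = quote(with_space, safe=":/")
assert escaped == "http://example.org/te%20komiti%20maori"
assert split_pointers(f"{escaped} {plain}") == [escaped, plain]
```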
>>>
>>> cheers
>>> stuart
>>>
>>> On Fri, Sep 23, 2011 at 9:03 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>> I think I see the source of the confusion. Older W3C drafts seem to have
>>>> explicitly addressed the issue of encoding URIs in US-ASCII:
>>>>
>>>> <http://www.w3.org/TR/2001/WD-charmod-20010126/#sec-URIs>
>>>>
>>>> but that section seems to have disappeared from the current draft:
>>>>
>>>> <http://www.w3.org/TR/charmod/>
>>>>
>>>> which, on a quick reading, leaves me with the impression that UTF-8,
>>>> UTF-16 etc. are acceptable encodings.
>>>>
>>>> Cheers,
>>>> Martin
>>>>
>>>> On 11-09-22 12:25 PM, Stuart A. Yeates wrote:
>>>>> I was under the impression that non-latin-1 wasn't allowed in
>>>>> data.pointer (and looking through the relevant standards I still can't
>>>>> see how it is), but such things seem to validate, so I guess you are.
>>>>>
>>>>> So I'd like to apologize for my misunderstanding and withdraw
>>>>> my suggestion.
>>>>>
>>>>> cheers
>>>>> stuart
>>>>>
>>>>> On Thu, Sep 22, 2011 at 4:03 AM, Kevin Hawkins
>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>> I still don't see why Stuart wouldn't simply put this in the TEI:
>>>>>>
>>>>>> <name sameAs="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>>> Communication and Information Technology</name>
>>>>>>
>>>>>> <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>>
>>>>>> and be done with it.  Generation of percent encoding and Punycode would
>>>>>> be done by XSLT that produces whatever is used by the delivery system.
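
One way to read this suggestion, sketched in Python rather than XSLT. The function name iri_to_uri is hypothetical. The sketch assumes the stored value is an unescaped IRI with a host and no port or user info, and it relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Turn an unescaped IRI into an ASCII-only URI (hypothetical helper).

    Assumes a host is present and that there is no port or user info.
    The query and fragment are passed through unchanged.
    """
    parts = urlsplit(iri)
    # Host: Punycode via Python's built-in "idna" codec (IDNA 2003).
    host = parts.hostname.encode("idna").decode("ascii")
    # Path: percent-encode the UTF-8 bytes, keeping "/" as the separator.
    path = quote(parts.path, safe="/")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

# The TEI keeps the readable form; the delivery step derives the ASCII one.
print(iri_to_uri("http://موقع.وزارة-الاتصالات.مصر/"))
print(iri_to_uri("http://example.org/Bartolomé_Esteban_Murillo"))
```

Because the function assumes unescaped input, running it on a value that is already percent-encoded would escape the percent signs a second time.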
>>>>>>
>>>>>> --Kevin
>>>>>>
>>>>>> On 9/20/2011 11:27 PM, Stuart A. Yeates wrote:
>>>>>>> The situations I am trying to avoid are:
>>>>>>>
>>>>>>> <name sameas="urn:example:%D9%85%D9%88%D9%82%D8%B9.%D9%88%D8%B2%D8%A7%D8%B1%D8%A9-%D8%A7%D9%84%D8%A7%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA.%D9%85%D8%B5%D8%B1"
>>>>>>> copyOf="urn:example:xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"
>>>>>>> corresp="http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/"
>>>>>>> key="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>>>> Communication and Information Technology</name>
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> <idno>http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/</idno>
>>>>>>> vs <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>>>
>>>>>>> etc
>>>>>>>
>>>>>>> URLs are already using punycode in the domain part and percent
>>>>>>> escaping in the file part (at least when they're used in
>>>>>>> data.pointer), and XML has some pretty strong dependencies on URLs,
>>>>>>> so neither can be prohibited without serious consequences.
>>>>>>>
>>>>>>> Both punycode and percent encoding are mappings of UTF-8; they can be
>>>>>>> converted back and forth with a 1:1 mapping. They are not violations
>>>>>>> of the "use UTF-8" rule.
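
A quick round-trip check of that claim in Python, for a single label. It assumes Python's built-in "idna" codec (IDNA 2003) and urllib.parse; the label is the top-level part of the example domain, written with escapes so the code points are unambiguous.

```python
from urllib.parse import quote, unquote

label = "\u0645\u0635\u0631"  # the Arabic top-level label of the example domain

# Punycode with the "xn--" prefix, as used in the domain part.
ace = label.encode("idna")
assert ace == b"xn--wgbh1c"
assert ace.decode("idna") == label

# Percent encoding of the UTF-8 bytes, as used in the file part.
pct = quote(label)
assert pct == "%D9%85%D8%B5%D8%B1"
assert unquote(pct) == label
```

Both encoded forms match the endings of the escaped values in the example above, and both decode back to the original label.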
>>>>>>>
>>>>>>> cheers
>>>>>>> stuart
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 21, 2011 at 10:21 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>>>>>> I agree. I think punycode is a temporary solution to problems with
>>>>>>>> Internet infrastructure and user-agent limitations; if it's to be used,
>>>>>>>> it should be generated during output processing, rather than being part
>>>>>>>> of the core document. TEI XML should be in UTF-8, I think.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Martin
>>>>>>>>
>>>>>>>> On 11-09-20 03:12 PM, Kevin Hawkins wrote:
>>>>>>>>> I guess what I'm saying is that Punycode is prescribed for use with the
>>>>>>>>> Domain Name System, but our TEI documents might outlive DNS or be used
>>>>>>>>> in a system that doesn't use DNS.  After all, even URIs (as
>>>>>>>>> prescribed in RFC 3986) give DNS as an example of a name registry
>>>>>>>>> mechanism, not the only one.
>>>>>>>>>
>>>>>>>>> We tie ourselves to a few external standards (maintained by the W3C)
>>>>>>>>> which may become obsolete at some point, but I'm not sure whether we
>>>>>>>>> should add systems maintained by ICANN to the list.
>>>>>>>>>
>>>>>>>>> --Kevin
>>>>>>>>>
>>>>>>>>> On 9/20/2011 2:31 PM, Stuart A. Yeates wrote:
>>>>>>>>>> Punycode is already required (and happens automatically with modern
>>>>>>>>>> tools and formats) for URIs. View the source of the (UTF-8) web page
>>>>>>>>>> of my example website to see what I mean.
>>>>>>>>>>
>>>>>>>>>> The issue is when people put URIs in free text fields where the
>>>>>>>>>> tools are unaware that these are URIs and expect them to 'just work'.
>>>>>>>>>>
>>>>>>>>>> cheers
>>>>>>>>>> stuart
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 21, 2011 at 1:26 AM, Kevin Hawkins
>>>>>>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>>>>>>> I'm not sure about prescribing use of RFC 3492.  This seems to me like
>>>>>>>>>>> prescribing use of US-ASCII with character entity references instead of
>>>>>>>>>>> UTF-8 within XML documents to ensure that we can use our documents with
>>>>>>>>>>> a full range of software tools -- something that fewer and fewer people
>>>>>>>>>>> support doing.
>>>>>>>>>>>
>>>>>>>>>>> On 9/20/2011 4:49 AM, Stuart A. Yeates wrote:
>>>>>>>>>>>> Currently domain names in TEI can occur in typed fields (such as
>>>>>>>>>>>> data.pointer) or in many other fields where type checking is more
>>>>>>>>>>>> relaxed (or non-existent). I would like to propose the following note
>>>>>>>>>>>> to appear somewhere in the standard (I'm thinking the data.pointer
>>>>>>>>>>>> page, but I'm open to suggestions). The URL in the example is perhaps
>>>>>>>>>>>> the best-known punycode URL (see
>>>>>>>>>>>> http://en.wikipedia.org/wiki/Masr_%28domain_name%29 ), but if Arabic
>>>>>>>>>>>> script causes problems in the publishing process I can probably find a
>>>>>>>>>>>> more Latin-esque one.
>>>>>>>>>>>>
>>>>>>>>>>>> cheers
>>>>>>>>>>>> stuart
>>>>>>>>>>>>
>>>>>>>>>>>> ----
>>>>>>>>>>>>
>>>>>>>>>>>> Internationalised domains containing non-ASCII characters should
>>>>>>>>>>>> always be escaped using RFC 3492 syntax ("punycode"). Thus
>>>>>>>>>>>> http://موقع.وزارة-الاتصالات.مصر/ is written
>>>>>>>>>>>> http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/ Such escaping
>>>>>>>>>>>> permits internationalised domains to be used with a full range of
>>>>>>>>>>>> software tools.
>>>>>>>>>>>>
>>>>>>>>>>>> ----
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>>
>>>>>>>>>>>> PLEASE NOTE: postings to this list are publicly archived
>>>>>>>>
>>>>>>>> --
>>>>>>>> Martin Holmes
>>>>>>>> University of Victoria Humanities Computing and Media Centre
>>>>>>>> (mholmes at uvic.ca)
>>>>
>>>> --
>>>> Martin Holmes
>>>> University of Victoria Humanities Computing and Media Centre
>>>> (mholmes at uvic.ca)
>>
>> --
>> Martin Holmes
>> University of Victoria Humanities Computing and Media Centre
>> (mholmes at uvic.ca)
>

