[tei-council] Internationalised domains

Stuart A. Yeates syeates at gmail.com
Mon Sep 26 20:07:08 EDT 2011


I don't know about anyone else's TEI, but here is a selection of URLs
with escaped characters, drawn from the TEI we already have live at
the NZETC. None of these URLs were coined by us, but by other
universities, Wikimedia and various units of government. At least one
contains escaped spaces. Based on URLs like these, getting escaping
right is a priority for us. I've separated each by a double newline in
case email mangles them.

http://www.jps.auckland.ac.nz/document/Volume_2_1893/Volume_2%2C_No.1%2C_March_1893/The_genealogy_of_the_Pomare_Family_of_Tahiti%2C_from_the_papers_of_the_Rev._J._M._Orsmond%2C_with_notes_thereon_by_S._Percy_Smith%2C_p_25-_42/p1?action=null

http://en.wikisource.org/w/index.php?title=Catholic_Encyclopedia_(1913)/Bartolom%C3%A9_Esteban_Murillo&oldid=2142578

http://paperspast.natlib.govt.nz/cgi-bin/paperspast?a=d&cl=search&d=TO18961128.2.30&srpos=11&e=-------100--1----2%22the+angel+isafrel%22--

http://www.austlit.edu.au/run?ex=ShowAgent&

http://www.natlib.govt.nz/about-us/friends-advisors/komiti-maori/?searchterm=te%20komiti%20maori

http://en.wikipedia.org/wiki/The_March_%281945%29

http://www.nzhistory.net.nz/search?keys=%22john+a.+lee%22&op.x=7&op.y=16&op=Search
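
As a rough illustration of the point (a sketch only, not NZETC's actual tooling): the Python below assumes the standard library's urllib.parse module and built-in "idna" codec, which implements the older IDNA 2003 rules. The helper names split_pointers, escape_spaces, to_unicode and to_ace are hypothetical, made up for this example. It shows percent-escapes being decoded, an unescaped space breaking a whitespace-separated list of pointers, and the punycode form of a domain round-tripping.

```python
from urllib.parse import unquote

# ACE (punycode) form of the example domain quoted later in this thread.
ACE_DOMAIN = "xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"


def split_pointers(value):
    """Split an attribute value into whitespace-separated pointers."""
    return value.split()


def escape_spaces(uri):
    """Percent-escape literal spaces only; existing escapes are left alone."""
    return uri.replace(" ", "%20")


def to_unicode(ace_domain):
    """Decode an ACE domain with Python's built-in 'idna' codec (IDNA 2003)."""
    return ace_domain.encode("ascii").decode("idna")


def to_ace(unicode_domain):
    """Encode a Unicode domain back to its ACE form."""
    return unicode_domain.encode("idna").decode("ascii")


escaped = "http://en.wikipedia.org/wiki/The_March_%281945%29"
unescaped = ("http://www.natlib.govt.nz/about-us/friends-advisors/"
             "komiti-maori/?searchterm=te komiti maori")

# Percent-escapes are decoded as UTF-8 by default.
print(unquote(escaped))

# With a literal space in one URL, the list splits into too many tokens.
print(len(split_pointers(unescaped + " " + escaped)))

# Once the spaces are escaped, each URL is exactly one token.
print(len(split_pointers(escape_spaces(unescaped) + " " + escaped)))

# The ACE form survives a decode/encode round trip.
print(to_ace(to_unicode(ACE_DOMAIN)) == ACE_DOMAIN)
```

With the two URLs above, the unescaped pair splits into four tokens and the escaped pair into two, which is the difference that matters for a whitespace-separated attribute value.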

cheers
stuart


On Tue, Sep 27, 2011 at 4:44 AM, Martin Holmes <mholmes at uvic.ca> wrote:
> This only applies if people are silly enough to use whitespace in URIs.
> And if they're linking to a resource with such a URI, and they (for
> instance) copy-paste it from a browser URI box, it'll come with
> percent-escapes anyway.
>
> I really don't think this is an issue, but if we want to add a note to
> the effect that URIs containing whitespace should be appropriately
> escaped, I think that would be enough.
>
> Cheers,
> Martin
>
> On 11-09-25 12:41 PM, Stuart A. Yeates wrote:
>> Full UTF-8 in the file part of URIs would seem to be a disaster for
>> us. Without whitespace being escaped we can't have whitespace
>> separated lists of URLs, as the definition of @corresp as "1–∞
>> occurrences of data.pointer separated by whitespace" no longer works?
>>
>> cheers
>> stuart
>>
>> On Fri, Sep 23, 2011 at 9:03 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>> I think I see the source of the confusion. Older W3C drafts seem to have
>>> explicitly addressed the issue of encoding URIs in US-ASCII:
>>>
>>> <http://www.w3.org/TR/2001/WD-charmod-20010126/#sec-URIs>
>>>
>>> but that section seems to have disappeared from the current draft:
>>>
>>> <http://www.w3.org/TR/charmod/>
>>>
>>> which, on a quick reading, leaves me with the impression that UTF-8,
>>> UTF-16 etc. are acceptable encodings.
>>>
>>> Cheers,
>>> Martin
>>>
>>> On 11-09-22 12:25 PM, Stuart A. Yeates wrote:
>>>> I was under the impression that non-latin-1 wasn't allowed in
>>>> data.pointer (and looking through the relevant standards I still can't
>>>> see how it is), but such things seem to validate, so I guess you are.
>>>>
>>>> So I'd like to apologize for my misunderstanding and withdraw
>>>> my suggestion.
>>>>
>>>> cheers
>>>> stuart
>>>>
>>>> On Thu, Sep 22, 2011 at 4:03 AM, Kevin Hawkins
>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>> I still don't see why Stuart wouldn't simply put this in the TEI:
>>>>>
>>>>> <name sameAs="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>> Communication and Information Technology</name>
>>>>>
>>>>> <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>
>>>>> and be done with it.  Generation of percent encoding and Punycode would
>>>>> be done by XSLT that produces whatever is used by the delivery system.
>>>>>
>>>>> --Kevin
>>>>>
>>>>> On 9/20/2011 11:27 PM, Stuart A. Yeates wrote:
>>>>>> The situations I am trying to avoid are:
>>>>>>
>>>>>> <name sameAs="urn:example:%D9%85%D9%88%D9%82%D8%B9.%D9%88%D8%B2%D8%A7%D8%B1%D8%A9-%D8%A7%D9%84%D8%A7%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA.%D9%85%D8%B5%D8%B1"
>>>>>> copyOf="urn:example:xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"
>>>>>> corresp="http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/"
>>>>>> key="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>>> Communication and Information Technology</name>
>>>>>>
>>>>>> and
>>>>>>
>>>>>> <idno>http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/</idno>
>>>>>> vs <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>>
>>>>>> etc
>>>>>>
>>>>>> URLs are already using punycode in the domain part and percent
>>>>>> escaping in the file part (at least when they're used in data.pointer
>>>>>> ), and XML has some pretty strong dependencies on URLs, so neither can
>>>>>> be prohibited without serious consequences.
>>>>>>
>>>>>> Both punycode and percent encoding are mappings of UTF-8; they can be
>>>>>> converted back and forth with a 1:1 mapping. They are not violations
>>>>>> of the "use UTF-8" rule.
>>>>>>
>>>>>> cheers
>>>>>> stuart
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 21, 2011 at 10:21 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>>>>> I agree. I think punycode is a temporary solution to problems with
>>>>>>> Internet infrastructure and user-agent limitations; if it's to be used,
>>>>>>> it should be generated during output processing, rather than being part
>>>>>>> of the core document. TEI XML should be in UTF-8, I think.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>> On 11-09-20 03:12 PM, Kevin Hawkins wrote:
>>>>>>>> I guess what I'm saying is that Punycode is prescribed for use with the
>>>>>>>> Domain Name System, but our TEI documents might outlive DNS or be used
>>>>>>>> in a system that doesn't use DNS.  After all, even URIs (as
>>>>>>>> prescribed in RFC 3986) give DNS as an example of a name registry
>>>>>>>> mechanism, not the only one.
>>>>>>>>
>>>>>>>> We tie ourselves to a few external standards (maintained by the W3C)
>>>>>>>> which may become obsolete at some point, but I'm not sure whether we
>>>>>>>> should add systems maintained by ICANN to the list.
>>>>>>>>
>>>>>>>> --Kevin
>>>>>>>>
>>>>>>>> On 9/20/2011 2:31 PM, Stuart A. Yeates wrote:
>>>>>>>>> Punycode is already required (and happens automatically with modern
>>>>>>>>> tools and formats) for URIs. View the source of the (UTF-8) web page
>>>>>>>>> of my example website to see what I mean.
>>>>>>>>>
>>>>>>>>> The issue is when people put URIs in free text fields where the
>>>>>>>>> tools are unaware that these are URIs and expect them to 'just work'.
>>>>>>>>>
>>>>>>>>> cheers
>>>>>>>>> stuart
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 21, 2011 at 1:26 AM, Kevin Hawkins
>>>>>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>>>>>> I'm not sure about prescribing use of RFC 3492.  This seems to me like
>>>>>>>>>> prescribing use of US-ASCII with character entity references instead of
>>>>>>>>>> UTF-8 within XML documents to ensure that we can use our documents with
>>>>>>>>>> a full range of software tools -- something that fewer and fewer people
>>>>>>>>>> support doing.
>>>>>>>>>>
>>>>>>>>>> On 9/20/2011 4:49 AM, Stuart A. Yeates wrote:
>>>>>>>>>>> Currently domain names in TEI can occur in typed fields (such as
>>>>>>>>>>> data.pointer) or in many other fields where type checking is more
>>>>>>>>>>> relaxed (or non-existent). I would like to propose the following note
>>>>>>>>>>> to appear somewhere in the standard (I'm thinking the data.pointer
>>>>>>>>>>> page, but I'm open to suggestions). The URL in the example is perhaps
>>>>>>>>>>> the best-known punycode URL (see
>>>>>>>>>>> http://en.wikipedia.org/wiki/Masr_%28domain_name%29 ), but if Arabic
>>>>>>>>>>> script causes problems in the publishing process I can probably find a
>>>>>>>>>>> more Latin-esque one.
>>>>>>>>>>>
>>>>>>>>>>> cheers
>>>>>>>>>>> stuart
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> Internationalised domains containing non-ASCII characters should
>>>>>>>>>>> always be escaped using RFC 3492 syntax ("punycode"). Thus
>>>>>>>>>>> http://موقع.وزارة-الاتصالات.مصر/ is written
>>>>>>>>>>> http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/. Such escaping
>>>>>>>>>>> permits internationalised domains to be used with a full range of
>>>>>>>>>>> software tools.
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>
>>>>>>>>>>> PLEASE NOTE: postings to this list are publicly archived
>>>>>>>
>>>>>>> --
>>>>>>> Martin Holmes
>>>>>>> University of Victoria Humanities Computing and Media Centre
>>>>>>> (mholmes at uvic.ca)

