[tei-council] Internationalised domains

Martin Holmes mholmes at uvic.ca
Mon Sep 26 11:44:12 EDT 2011


This only applies if people are silly enough to use whitespace in URIs. 
And if they're linking to a resource with such a URI, and they (for 
instance) copy-paste it from a browser URI box, it'll come with 
percent-escapes anyway.

I really don't think this is an issue, but if we want to add a note to 
the effect that URIs containing whitespace should be appropriately 
escaped, I think that would be enough.
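
For what it's worth, here is a minimal sketch in Python of what "appropriately escaped" could mean in practice. The function names and the example.org URIs are made up for illustration, and it assumes that only whitespace needs escaping, so non-ASCII characters are left as they are:

```python
import re
from urllib.parse import quote


def escape_whitespace(uri):
    """Percent-escape whitespace only; non-ASCII characters are left untouched."""
    # quote() with safe="" turns " " into "%20", a tab into "%09", and so on.
    return re.sub(r"\s", lambda m: quote(m.group(0), safe=""), uri)


def split_pointers(value):
    """Split a whitespace-separated list of pointers, as in @corresp."""
    return value.split()


# Hypothetical URIs, one of which contains a literal space.
uris = ["http://example.org/my file.xml", "http://example.org/other.xml#a1"]
attribute_value = " ".join(escape_whitespace(u) for u in uris)
pointers = split_pointers(attribute_value)
```

Once each URI is escaped, splitting the attribute value on whitespace gives back exactly two pointers, so the whitespace-separated list still works.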

Cheers,
Martin
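
On the point quoted below that punycode and percent-encoding can be converted back and forth, and so could be generated at output time, here is a rough Python sketch of the round trip. It is not the XSLT Kevin mentions; the helper names are invented, and it uses Python's bare "punycode" codec per label, which skips the nameprep normalisation a real IDNA implementation would apply:

```python
from urllib.parse import quote, unquote


def label_to_ascii(label):
    """Encode one domain label with punycode (no nameprep; illustration only)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def label_to_unicode(label):
    """Decode one "xn--" label back to Unicode; other labels pass through."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def host_to_ascii(host):
    return ".".join(label_to_ascii(part) for part in host.split("."))


def host_to_unicode(host):
    return ".".join(label_to_unicode(part) for part in host.split("."))


# The host name from the thread, in its punycode form.
ascii_host = "xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"
unicode_host = host_to_unicode(ascii_host)
round_tripped = host_to_ascii(unicode_host)

# Percent-encoding of the UTF-8 bytes is reversible in the same way.
escaped = quote("مصر")
restored = unquote(escaped)
```

If the conversion is lossless in both directions like this, the source document can keep the readable form and the escaped forms can be produced whenever a delivery system needs them.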

On 11-09-25 12:41 PM, Stuart A. Yeates wrote:
> Full UTF-8 in the file part of URIs would seem to be a disaster for
> us. Without whitespace being escaped we can't have whitespace-separated
> lists of URLs, as the definition of @corresp as "1–∞ occurrences of
> data.pointer separated by whitespace" no longer works.
>
> cheers
> stuart
>
> On Fri, Sep 23, 2011 at 9:03 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>> I think I see the source of the confusion. Older W3C drafts seem to have
>> explicitly addressed the issue of encoding URIs in US-ASCII:
>>
>> <http://www.w3.org/TR/2001/WD-charmod-20010126/#sec-URIs>
>>
>> but that section seems to have disappeared from the current draft:
>>
>> <http://www.w3.org/TR/charmod/>
>>
>> which, on a quick reading, leaves me with the impression that UTF-8,
>> UTF-16 etc. are acceptable encodings.
>>
>> Cheers,
>> Martin
>>
>> On 11-09-22 12:25 PM, Stuart A. Yeates wrote:
>>> I was under the impression that non-latin-1 wasn't allowed in
>>> data.pointer (and looking through the relevant standards I still can't
>>> see how it is), but such things seem to validate, so I guess you are.
>>>
>>> So I'd like to apologize for my misunderstanding and withdraw
>>> my suggestion.
>>>
>>> cheers
>>> stuart
>>>
>>> On Thu, Sep 22, 2011 at 4:03 AM, Kevin Hawkins
>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>> I still don't see why Stuart wouldn't simply put this in the TEI:
>>>>
>>>> <name sameAs="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>> Communication and Information Technology</name>
>>>>
>>>> <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>
>>>> and be done with it.  Generation of percent encoding and Punycode would
>>>> be done by XSLT that produces whatever is used by the delivery system.
>>>>
>>>> --Kevin
>>>>
>>>> On 9/20/2011 11:27 PM, Stuart A. Yeates wrote:
>>>>> The situations I am trying to avoid are:
>>>>>
>>>>> <name sameas="urn:example:%D9%85%D9%88%D9%82%D8%B9.%D9%88%D8%B2%D8%A7%D8%B1%D8%A9-%D8%A7%D9%84%D8%A7%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA.%D9%85%D8%B5%D8%B1"
>>>>> copyOf="urn:example:xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c"
>>>>> corresp="http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/"
>>>>> key="http://موقع.وزارة-الاتصالات.مصر/">Egyptian Ministry of
>>>>> Communication and Information Technology</name>
>>>>>
>>>>> and
>>>>>
>>>>> <idno>http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/</idno>
>>>>> vs <idno>http://موقع.وزارة-الاتصالات.مصر/</idno>
>>>>>
>>>>> etc
>>>>>
>>>>> URLs are already using punycode in the domain part and percent
>>>>> escaping in the file part (at least when they're used in
>>>>> data.pointer), and XML has some pretty strong dependencies on URLs,
>>>>> so neither can be prohibited without serious consequences.
>>>>>
>>>>> Both punycode and percent encoding are mappings of UTF-8; they can be
>>>>> converted back and forth with a 1:1 mapping. They are not violations
>>>>> of the "use UTF-8" rule.
>>>>>
>>>>> cheers
>>>>> stuart
>>>>>
>>>>>
>>>>> On Wed, Sep 21, 2011 at 10:21 AM, Martin Holmes <mholmes at uvic.ca> wrote:
>>>>>> I agree. I think punycode is a temporary solution to problems with
>>>>>> Internet infrastructure and user-agent limitations; if it's to be used,
>>>>>> it should be generated during output processing, rather than being part
>>>>>> of the core document. TEI XML should be in UTF-8, I think.
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>> On 11-09-20 03:12 PM, Kevin Hawkins wrote:
>>>>>>> I guess what I'm saying is that Punycode is prescribed for use with the
>>>>>>> Domain Name System, but our TEI documents might outlive DNS or be used
>>>>>>> in a system that doesn't use DNS.  After all, even URIs (as
>>>>>>> prescribed in RFC 3986) give DNS as an example of a name registry
>>>>>>> mechanism, not the only one.
>>>>>>>
>>>>>>> We tie ourselves to a few external standards (maintained by the W3C)
>>>>>>> which may become obsolete at some point, but I'm not sure whether we
>>>>>>> should add systems maintained by ICANN to the list.
>>>>>>>
>>>>>>> --Kevin
>>>>>>>
>>>>>>> On 9/20/2011 2:31 PM, Stuart A. Yeates wrote:
>>>>>>>> Punycode is already required (and happens automatically with modern
>>>>>>>> tools and formats) for URIs. View the source of the (UTF-8) web page
>>>>>>>> of my example website to see what I mean.
>>>>>>>>
>>>>>>>> The issue is when people put URIs in free text fields where the
>>>>>>>> tools are unaware that these are URIs and expect them to 'just work'.
>>>>>>>>
>>>>>>>> cheers
>>>>>>>> stuart
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 21, 2011 at 1:26 AM, Kevin Hawkins
>>>>>>>> <kevin.s.hawkins at ultraslavonic.info> wrote:
>>>>>>>>> I'm not sure about prescribing use of RFC 3492.  This seems to me like
>>>>>>>>> prescribing use of US-ASCII with character entity references instead of
>>>>>>>>> UTF-8 within XML documents to ensure that we can use our documents with
>>>>>>>>> a full range of software tools -- something that fewer and fewer people
>>>>>>>>> support doing.
>>>>>>>>>
>>>>>>>>> On 9/20/2011 4:49 AM, Stuart A. Yeates wrote:
>>>>>>>>>> Currently domain names in TEI can occur in typed fields (such as
>>>>>>>>>> data.pointer) or in many other fields where type checking is more
>>>>>>>>>> relaxed (or non-existent). I would like to propose the following note
>>>>>>>>>> to appear somewhere in the standard (I'm thinking the data.pointer
>>>>>>>>>> page, but I'm open to suggestions). The URL in the example is perhaps
>>>>>>>>>> the best-known punycode URL (see
>>>>>>>>>> http://en.wikipedia.org/wiki/Masr_%28domain_name%29 ), but if Arabic
>>>>>>>>>> script causes problems in the publishing process I can probably find a
>>>>>>>>>> more Latin-esque one.
>>>>>>>>>>
>>>>>>>>>> cheers
>>>>>>>>>> stuart
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>>
>>>>>>>>>> Internationalised domains containing non-ASCII characters should
>>>>>>>>>> always be escaped using RFC 3492 syntax ("punycode"). Thus
>>>>>>>>>> http://موقع.وزارة-الاتصالات.مصر/ is written
>>>>>>>>>> http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c/ Such escaping
>>>>>>>>>> permits internationalised domains to be used with a full range of
>>>>>>>>>> software tools.
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>> _______________________________________________
>>>>>>>>>> tei-council mailing list
>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>
>>>>>>>>>> PLEASE NOTE: postings to this list are publicly archived
>>>>>>
>>>>>> --
>>>>>> Martin Holmes
>>>>>> University of Victoria Humanities Computing and Media Centre
>>>>>> (mholmes at uvic.ca)
>>
>> --
>> Martin Holmes
>> University of Victoria Humanities Computing and Media Centre
>> (mholmes at uvic.ca)

-- 
Martin Holmes
University of Victoria Humanities Computing and Media Centre
(mholmes at uvic.ca)

