[tei-council] [Fwd: Re: comments on CH from Brett Zamir]

Wed Dec 26 12:00:36 EST 2007

Brett Zamir wrote:
> In case you wanted my cc to also go to the Council list, I see it was 
> returned to me...
>
I forwarded your helpful reply to the list for you, since I think it 
would be useful to get more input on some of the questions you raise.

>
> If a whole paragraph has no manual returns added to it, some poor 
> quality text editors might not be able to wrap the text for you, but 
> will instead cause the text to go off the screen. However, most text 
> editors are able to "wrap" the text onto the next line(s) so that 
> everything is visible on screen (if, as is usual, this is desired).
>

On this topic, however, I think I can answer authoritatively: the XML 
source of the Guidelines has been through many processes and edited by 
many tools in its history. Occasionally the source text gets put through 
a global XSLT filter which may also change its appearance drastically. 
Personally, I use emacs, which does not wrap automatically unless you 
ask it to; others use Oxygen which does (but adds huge indents which I 
find annoying). If working on a chapter I will typically tell emacs to 
wrap at or around 72 characters per line for ease of screen reading, but 
this has never seemed to me to be particularly important, since the 
source text is not what we expect people to read. One reason for not 
reformatting the whole lot is that it would introduce a lot of 
uninteresting changes into the svn repository which might be hard to 
distinguish from real changes (a bit like your double deblanking) -- and 
would have to be done all over again periodically.

>> 2) I find this paragraph under "Compatibility characters" confusing 
>> (I'm not sure if it is actually correct or not):
>>
>> However, by the time the Unicode standard
>> was first being debated, it had become common practice to include
>> single glyphs representing the more common ligatures in the
>> repertoires of some typesetting devices and high-end printers, and
>> for the coded character sets built into those devices to use a
>> single code point for such glyphs, even though they represent two
>> distinct abstract characters.
>>
>> The context I thought was about items which "should not have been 
>> regarded as abstract characters in their own right", so I don't 
>> understand the last part of the above paragraph, as it seems to me it 
>> ought perhaps to be saying the opposite.
>>
>> [It doesn't seem wrong to me: maybe "in their own right" is not quite 
>> the right phrase?
>> LB]
> My apologies... I hadn't paid close enough attention to what a 
> ligature was. I thought that this must be talking about Unicode 
> including multiple glyphs for the same abstract character (like the 
> single- and double- story 'a') or that it was a handwriting unit 
> smaller than a letter.
>
> I might still suggest clarifying the clause "even though they represent 
> two distinct abstract characters" to read something like "even though 
> they are comprised of, and ought only to be represented as, two 
> distinct abstract characters".

"comprised of" (rather than "comprised") is a barbarism, but more 
significantly I fear I disagree with "ought only to be" -- that seems to 
be quite wrong in the current climate of uncertainty which you go on to 
discuss! Some "characters made up of two characters" are regarded as 
ligatures and others as digraphs. It would be nice if Unicode uniformly 
had a single code point for all of the latter, and none of the 
former. But the world ain't perfect, and there are some cases where a 
single Unicode code point corresponds with something we'd prefer to 
regard as a ligature (a glyph combining two distinct characters) rather 
than as a digraph (a character visibly combining two glyphs), but where 
it would simply be too much of an uphill struggle against existing 
practice to do so. Although in modern English ae ligature is purely a 
glyphic variation on "ae", in Old English or modern Icelandic ae 
ligature stands for a completely distinct letter (ash) and some would 
argue that it is just plain wrong to regard it as "ae".
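
To make the ligature/digraph distinction concrete, here is a minimal 
sketch (my own illustration, not anything taken from the Guidelines) 
using Python's standard unicodedata module. It assumes that the presence 
of a compatibility decomposition is a fair proxy for "Unicode regards 
this as a mere ligature"; the helper name compat_decomposition is 
invented for the example.

```python
import unicodedata


def compat_decomposition(ch):
    """Return the NFKD (compatibility-decomposed) form of one character."""
    return unicodedata.normalize("NFKD", ch)


# Three code points that all look like "two letters joined together".
samples = ["\uFB01", "\u00E6", "\u0153"]

for ch in samples:
    decomposed = compat_decomposition(ch)
    status = "decomposes" if decomposed != ch else "no decomposition"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposed!r} ({status})")
```

U+FB01 (the fi ligature) decomposes to the two letters "fi", whereas 
U+00E6 and U+0153 are left untouched, even though the Unicode name of 
the latter is still LATIN SMALL LIGATURE OE. That inconsistency is 
exactly the "world ain't perfect" situation described above.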
>
> And as a separate issue, the Guidelines state later in the same 
> context as above, "Such ligatures should not be confused with 
> digraphs...as in the French word "cœur"...Where a digraph occurs in a 
> source text, it should normally be encoded using the appropriate 
> code-point for the single abstract character which it indeed 
> represents". While the Wikipedia article on Ligatures (at 
> http://en.wikipedia.org/wiki/Typographical_ligature ) might not be 
> authoritative, I wonder if the following statement there might nuance 
> the previous statement: "the use of the special Unicode ligature 
> characters is "discouraged" (*though it is unclear that this should be 
> extended to distinctive and well-established ligatures such as æ and 
> œ*)." Maybe it is debatable whether certain forms (still) are 
> ligatures or not and whether even these should be deprecated as well?
>> 3) From the XML Standard at 
>> http://www.w3.org/TR/2006/REC-xml-20060816/:
>> "*Note that non-validating processors are not obligated to 
>> <http://www.w3.org/TR/2006/REC-xml-20060816/#include-if-valid> 
>> read and process entity declarations occurring in parameter entities 
>> or in the external subset;* for such documents, the rule that an 
>> entity must be declared is a well-formedness constraint only if 
>> standalone='yes' <http://www.w3.org/TR/2006/REC-xml-20060816/#sec-rmd>."
>>
>> This seems to me to contradict this statement in this chapter:
>>
>> The XML standard requires a
>> non-validating parser to read and act on entity declarations
>> only if they are located within the document's internal subset
>> (which does not, of course, mean that the entity declarations
>> have to be manually merged into the document instance in advance
>> of processing: character entity sets, for instance, count as
>> being in the internal subset if they are placed there via a
>> parameter entity, as is normal TEI practice).
>>
>> So it DOES seem to me from the above that in such non-validating 
>> parsers, parameter entities in the internal subset will also not 
>> work--the entities will need to be included manually. Am I wrong?
>>
>> [I think these sentences are talking about different things and they 
>> don't contradict each other. The W3C statement says that a document is 
>> not well formed if it references undeclared entities (unless it 
>> explicitly says standalone="no"); the chapter is explaining that 
>> entity declarations don't necessarily appear in the doc instance. LB]
> What I'm disputing is this clause in the Guidelines: "character entity 
> sets, for instance, count as being in the internal subset if they are 
> placed there via a parameter entity". While they may count as being in 
> the internal subset as far as validating parsers giving these entities 
> priority over any subsequent externally-referenced ones, the context 
> in the Guidelines is for /non-validating/ parsers.
>
> When the W3C statement says "non-validating processors are not 
> obligated to read and process entity declarations occurring in 
> parameter entities or in the external subset", the latter part of this 
> clause, in contrasting parameter entities with the *external* subset, 
> seems to me to imply that non-validating processors (which is also the 
> context of the Guidelines segment above) do not need to read entity 
> declarations referenced from parameter entities in the internal 
> subset, while the Guidelines state that "character entity 
> sets...*count as being in the internal subset* if they are placed 
> there via a parameter entity".
>
> In other words, the Guidelines seem to be implying that for 
> non-validating parsers, you don't have to merge entity declarations 
> into the internal subset and you can just use an (external) parameter 
> entity reference to them in the internal subset, while the W3C seems 
> to say that with such non-validating parsers, you DO have to manually 
> merge all entity declarations (at least if you want the entities to be 
> read) since an (external) parameter entity reference, even if placed 
> in the internal subset, will not necessarily be read.
>

Thanks for patiently explaining the point again: I think you are rightly 
identifying a misleading implication of the bald statement about how 
parameter entity references work, which implicitly assumes a validating 
environment. I'm less sure what to do about it. Maybe it would be better 
to approach this issue by describing exactly what is or is not 
well-formed: a document which contains references to (inaccessible) 
parameter entities *is* well formed whereas one that contains references 
to inaccessible character entities is not.
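
For what it's worth, the difference can be observed with one particular 
non-validating parser. The sketch below uses Python's standard 
xml.etree.ElementTree (built on expat), which as far as I know does not 
fetch external parameter entities by default. The file name 
charents.ent, the entity name aelig, and the helper parse_text are all 
invented for the illustration, and other non-validating parsers may 
well behave differently.

```python
import xml.etree.ElementTree as ET

# Entity declared directly in the internal subset.
DECLARED_INLINE = """<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY aelig "&#xE6;">
]>
<p>Encyclop&aelig;dia</p>"""

# Same entity, but supposedly supplied by an external parameter entity
# (hypothetical file charents.ent) referenced from the internal subset.
DECLARED_VIA_PE = """<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY % charents SYSTEM "charents.ent">
  %charents;
]>
<p>Encyclop&aelig;dia</p>"""


def parse_text(doc):
    """Return the root element's text, or a description of the parse error."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


print(parse_text(DECLARED_INLINE))
print(parse_text(DECLARED_VIA_PE))
```

In the first document the reference is expanded; in the second the 
parser never reads charents.ent, so &aelig; is unknown to it and the 
parse is rejected. This is consistent with Brett's reading: placing the 
parameter entity reference in the internal subset does not, by itself, 
guarantee that a non-validating parser sees the declarations.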