[tei-council] documenting xml:space

Martin Holmes mholmes at uvic.ca
Wed Oct 24 12:19:27 EDT 2012


On 12-10-24 08:08 AM, Rebecca Welzenbach wrote:
>     If you're able to clean up the prose now, I'd have a go at it, and avoid
>     the second new ticket.
>
>     Cheers,
>     Martin
>
> Good idea, Martin--I tried to do this last night and gave up (thus the
> cop-out proposal) but have done it now, and the revised proposed
> language is below. We're getting down to the wire now, but if it's
> acceptable I'll put it in ASAP.
>
> In reviewing this language closely I did find an important point of
> conflict between the Guidelines and John's argument. The Guidelines say
> more than once that where whitespace characters are significant, XML
> "requires that a processor preserve all of them." This suggests to me
> that if I foolishly decide to use two spaces between all of my sentences
> in a <p>, I can assume that each space character will be preserved by
> default. On the contrary, John argues that in practice many or most
> processors normalize (collapse and trim) whitespace by default, so the
> presence of space will be preserved, but not each individual whitespace
> character.
>
> I've hedged on this in my proposed revision by dropping "all of" from
> the phrase "requires that a processor preserve all of them," but we
> should clarify this. Can we agree on which behavior is really most
> likely to be the "default"?

The XML 1.0 and 1.1 specs say exactly the same thing about whitespace,
and both appear to support "all of":

"An XML processor MUST always pass all characters in a document that are 
not markup through to the application. A validating XML processor MUST 
also inform the application which of these characters constitute white 
space appearing in element content.
[...]
"...the value "preserve" indicates the intent that applications preserve 
all the white space."

<http://www.w3.org/TR/xml11/#sec-white-space>

The normalization of whitespace (collapsing whitespace sequences to 
single instances of whitespace) is not the behaviour typically required 
when you use "preserve"; one of the examples given in the spec is the 
(presumably XHTML) <pre> element, in which whitespace-based linebreaks 
and indenting are expected to be retained, so normalization would be 
wrong. So I think the existing "all of" should be retained.
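
To make the difference concrete, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree, which is only my choice for illustration; the sample markup and the helper name collapse() are invented. It suggests that the parser hands every whitespace character to the application, and that collapsing is a separate step the application chooses to take, one that would damage <pre>-style content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this illustration.
TWO_SPACES = "<p>First sentence.  Second sentence.</p>"
PRE_LIKE = '<pre xml:space="preserve">line one\n    indented line\n</pre>'


def collapse(text):
    """Collapse each whitespace run to one space and trim both ends."""
    return " ".join(text.split())


# The parser passes every character that is not markup to the application.
p_text = ET.fromstring(TWO_SPACES).text
pre_text = ET.fromstring(PRE_LIKE).text

print(repr(p_text))              # both spaces between the sentences survive
print(repr(collapse(p_text)))    # normalization is the application's own step
print(repr(pre_text))            # linebreaks and indentation survive parsing
print(repr(collapse(pre_text)))  # collapsing here would destroy the layout
```

This shows only one parser's behaviour; tools further down the pipeline may still normalize what they receive.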

Other than that, I think the text below is fine.

It might also be worth doing here what we've been doing elsewhere (e.g. 
with the language subcodes), and instead of trying to paraphrase and 
explain the specification, just point to it.

Cheers,
Martin


>
> Proposed revision to the piece of
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#index-body.1_div.1_div.3_div.1_div.1_div.4
> that deals with whitespace:
>
> The XML Recommendation defines /whitespace/ as a single term for the
> space, tab, and linebreak characters which may appear in a document. By
> default, XML processors treat whitespace in predictable ways, depending
> on where it occurs:
>
>   * When whitespace characters occur as part of a text node, within the
>     content of an element, XML generally considers them significant and
>     requires that a processor preserve them.
>   * When whitespace characters occur within an element that contains
>     mixed content, that is, an element that contains both element and
>     text nodes, XML assumes that they are significant and requires that
>     a processor preserve them.
>   * When whitespace characters occur between elements (not inside those
>     elements or mixed with text), XML generally assumes that they are
>     /not/ significant and may be ignored by an XML processor. This kind
>     of whitespace is most commonly introduced by an encoder or by XML
>     editing software to enhance the readability of the displayed text.
>     This should only happen at locations where the whitespace can be
>     reliably understood as insignificant (so there is no conflict with
>     significant whitespace), but not all processors can detect this
>     reliably.
>
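
The three cases in the list above can be sketched in Python with the standard library's xml.etree.ElementTree. The sample markup is invented, and ElementTree is only one non-validating parser among many. With no DTD it cannot tell which whitespace sits in element content, so it passes everything through and leaves the decision to ignore it to the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: indentation between elements, plus mixed content in <p>.
SOURCE = "<div>\n  <p>some <hi>mixed</hi> content</p>\n  <p>more</p>\n</div>"

div = ET.fromstring(SOURCE)
first_p, second_p = list(div)
hi = first_p[0]

# Whitespace between elements: delivered by the parser, but it holds nothing
# except whitespace, so an application may choose to ignore it.
between = [div.text, first_p.tail, second_p.tail]
print([repr(s) for s in between])

# Whitespace in mixed content: part of the text, needed to keep words apart.
print(repr(first_p.text), repr(hi.tail))
print("".join(first_p.itertext()))
```

Dropping the whitespace-only strings is harmless here, while dropping the spaces around <hi> would run the words together.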
> The function of the xml:space attribute is to indicate whether the
> default processing described above should be used (indicated by the
> value “default”) or whether whitespace should be preserved (indicated by
> the value “preserve”) everywhere within the element on which it is used.
> However, it is rarely necessary to do this: most TEI elements permit
> mixed content, and consequently the presence or absence of whitespace is
> usually significant in a TEI document. In most cases where whitespace
> may be desired in the output, this should be indicated using native TEI
> elements (such as <l>) to convey the structure of the text, with
> whitespace for display introduced in processing, rather than by
> introducing whitespace into the text and using xml:space="preserve". It
> is worth noting that while the value of "preserve" on xml:space
> indicates the encoder's intention that whitespace be preserved, not all
> processors will obey this.
>
>
> There are a few situations in which it may be essential to use
> xml:space="preserve", typically where complex markup is being used
> within the context of a tool that by default introduces whitespace in
> order to enhance display of the text. For example, when transcribing an
> inscription with the elements described in chapter 11 Representation of
> Primary Sources
> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html>, a single
> word may well gain several additional tags to mark parts of the word
> which are supplied or conjectural. Such tags do not interrupt the word,
> however, and hence introducing space where they occur would be
> misleading. The value "preserve" for the xml:space attribute on the
> parent div
> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html> element
> may be used to indicate that all and only the spaces actually present in
> the XML source should be regarded as significant; an XML editor or other
> processor is not then permitted to introduce additional spaces.
>
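
A rough Python sketch of the inscription case follows. Here minidom's pretty-printer stands in for a tool that adds display whitespace, and the markup and the helper name word_text() are invented. As far as I know, toprettyxml() does not consult xml:space; if so, it is also an instance of a processor not obeying "preserve".

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical transcription: one word, partly supplied by the editor.
SOURCE = '<div xml:space="preserve"><w>in<supplied>scrip</supplied>tion</w></div>'


def word_text(xml_source):
    """Return the character data of the first <w> element, markup removed."""
    return "".join(ET.fromstring(xml_source).find("w").itertext())


# minidom's pretty-printer stands in for a tool that adds display whitespace.
pretty = minidom.parseString(SOURCE).toprettyxml(indent="  ")

print(repr(word_text(SOURCE)))  # the word as encoded
print(pretty)                   # the indented serialization
print(repr(word_text(pretty)))  # the same word after a round trip
```

Comparing the first and last lines of output shows whether the tool has split the word with whitespace that was never in the source.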

-- 
Martin Holmes
University of Victoria Humanities Computing and Media Centre
(mholmes at uvic.ca)

