[tei-council] documenting xml:space

Rebecca Welzenbach rwelzenbach at gmail.com
Wed Oct 24 11:08:59 EDT 2012


>
> If you're able to clean up the prose now, I'd have a go at it, and avoid
> the second new ticket.
>
> Cheers,
> Martin
>
>
Good idea, Martin--I tried to do this last night and gave up (thus the
cop-out proposal) but have done it now, and the revised proposed language
is below. We're getting down to the wire now, but if it's acceptable I'll
put it in ASAP.

In reviewing this language closely I did find an important point of
conflict between the Guidelines and John's argument. The Guidelines say
more than once that where whitespace characters are significant, XML "requires
that a processor preserve all of them." This suggests to me that if I
foolishly decide to use two spaces between all of my sentences in a <p>, I
can assume that each space character will be preserved by default. To the
contrary, John argues that in practice, many/most processors normalize
(collapse and trim) whitespace by default--so the space will be preserved,
but each whitespace character will not be.
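
To make the two behaviors concrete, here is a minimal sketch in Python. It
assumes the standard-library ElementTree parser as one example of a
processor, and the normalizing step is a hypothetical stand-in for what a
downstream tool might do; it does not settle which behavior is the real
"default" across processors.

```python
import xml.etree.ElementTree as ET

# Two spaces after each full stop, as in the hypothetical <p> above.
source = "<p>First sentence.  Second sentence.  Third.</p>"

p = ET.fromstring(source)

# What this parser hands to the application: every space character intact.
parsed_text = p.text

# What a normalizing step (hypothetical here) would produce: leading and
# trailing whitespace trimmed, internal runs collapsed to a single space.
normalized_text = " ".join(parsed_text.split())

print(repr(parsed_text))
print(repr(normalized_text))
```

In this sketch the parser itself keeps both spaces; the collapse only
happens in the later normalizing step, which may be where the two accounts
diverge.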

I've hedged on this in my proposed revision by dropping "all of" from the
phrase "requires that a processor preserve all of them," but we should
clarify this. Can we agree on which behavior is really most likely to be
the "default"?

Proposed revision to the piece of
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#index-body.1_div.1_div.3_div.1_div.1_div.4
that deals with whitespace:

The XML Recommendation defines *whitespace* as a single term for the space,
tab, and linebreak characters which may appear in a document. By default,
XML processors treat whitespace in predictable ways, depending on where it
occurs:


   - When whitespace characters occur as part of a text node, within the
   content of an element, XML generally considers them significant and
   requires that a processor preserve them.
   - When whitespace characters occur within an element that contains mixed
   content, that is, an element that contains both element and text nodes, XML
   assumes that they are significant and requires that a processor preserve
   them.
   - When whitespace characters occur between elements (not inside those
   elements or mixed with text), XML generally assumes that they are *not*
   significant and that an XML processor may ignore them. This kind of
   whitespace is most commonly introduced by an encoder or by XML editing
   software to enhance the readability of the displayed text. Such
   whitespace should only be introduced at locations where it can be
   reliably understood as insignificant (so there is no conflict with
   significant whitespace), but not all processors can detect this reliably.
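
The three cases in the list can be illustrated with a short Python sketch.
It assumes the standard-library ElementTree parser and a made-up sample
document; other processors may report whitespace differently. Because this
parser reads no DTD or schema, it cannot know which whitespace is
insignificant, so it reports all of it and leaves the decision to the
application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an indented <div> (whitespace between elements)
# holding a <p> with mixed content (text plus a child element).
source = (
    "<div>\n"
    "  <p>Some <hi>mixed</hi> content</p>\n"
    "  <p>More text</p>\n"
    "</div>"
)

div = ET.fromstring(source)
first_p = div[0]

# Whitespace between elements: ElementTree keeps it as .text / .tail.
between_elements = [div.text, first_p.tail, div[1].tail]

# Whitespace in mixed content: the spaces around <hi> belong to the text.
mixed = "".join(first_p.itertext())


def is_whitespace_only(s):
    """True for non-empty strings made up solely of XML whitespace."""
    return bool(s) and s.strip(" \t\r\n") == ""


print(between_elements)
print(repr(mixed))
```

An application that wants to discard the indentation would have to test
each string with something like is_whitespace_only itself, which is one way
significant and insignificant whitespace can end up confused.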

The function of the xml:space attribute is to indicate whether the default
processing described above should be used (indicated by the value
“default”) or whether whitespace should be preserved (indicated by the
value “preserve”) everywhere within the element on which it is used.
However, it is rarely necessary to do this: most TEI elements permit mixed
content, and consequently the presence or absence of whitespace is usually
significant in a TEI document. In most cases where whitespace may be
desired in the output, this should be indicated using native TEI elements
(such as <l>) to convey the structure of the text, with whitespace for
display introduced in processing, rather than by introducing whitespace
into the text and using xml:space="preserve". It is worth noting that while
the value of "preserve" on xml:space indicates the encoder's intention that
whitespace be preserved, not all processors will obey this.
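
As a rough sketch of how an application might honor the attribute, the
Python below works out which xml:space value is in effect for each element
and normalizes text only where "preserve" does not apply. The helper names
effective_policies and render are hypothetical, and the sketch assumes that
the application's default behavior is to collapse and trim whitespace,
which is not guaranteed for any given processor.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_policies(root):
    """Map each element to the xml:space value in effect for it,
    inheriting from the nearest ancestor that sets one."""
    policies = {}

    def walk(elem, inherited):
        policy = elem.get(XML_SPACE, inherited)
        policies[elem] = policy
        for child in elem:
            walk(child, policy)

    walk(root, "default")
    return policies


def render(elem, policies):
    """Return elem's text, normalized unless 'preserve' is in effect.
    Simplification: the policy of elem is applied to all its descendants."""
    text = "".join(elem.itertext())
    if policies[elem] == "preserve":
        return text
    return " ".join(text.split())


source = (
    "<body>"
    "<p>a   b</p>"
    '<lg xml:space="preserve"><l>  indented   line</l></lg>'
    "</body>"
)
body = ET.fromstring(source)
policies = effective_policies(body)
para, line = body[0], body[1][0]

print(repr(render(para, policies)))
print(repr(render(line, policies)))
```

Note that the <l> carries no attribute of its own; it picks up "preserve"
from the enclosing <lg>, since the value applies everywhere within the
element on which it is set.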


There are a few situations in which it may be essential to use
xml:space="preserve", typically where complex markup is being used within
the context of a tool that by default introduces whitespace in order to
enhance display of the text. For example, when transcribing an inscription
with the elements described in chapter 11 Representation of Primary
Sources <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html>,
a single word may well gain several additional tags to mark parts of the
word which are supplied or conjectural. Such tags do not interrupt the word,
however, and hence introducing space where they occur would be misleading.
The value of preserve for the xml:space attribute on the parent div
<http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html> element
may be used to indicate that all and only the spaces actually present in
the XML source should be regarded as significant; an XML editor or other
processor is not then permitted to introduce additional spaces.
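
The risk described here can be reproduced with a small Python sketch. It
assumes the standard-library minidom pretty-printer as a stand-in for an
editor that indents for display, and a made-up one-word transcription; it
does not test whether any particular tool honors xml:space.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical transcription: one word, "abc", split across two tags.
source = "<div><w><supplied>ab</supplied><unclear>c</unclear></w></div>"

original_word = "".join(ET.fromstring(source).itertext())

# Pretty-print for display, as an editor might do by default.
pretty = xml.dom.minidom.parseString(source).toprettyxml(indent="  ")

# Re-read the pretty-printed document: the indentation is now in the text.
reread_word = "".join(ET.fromstring(pretty).itertext())

print(repr(original_word))
print(pretty)
print(repr(reread_word))
```

After the round trip the single word is split by newlines and indentation
between the two tags, which is the kind of misleading space that
xml:space="preserve" on the parent div is meant to rule out.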

