[tei-council] soft hyphens (again)
lou.burnard at oucs.ox.ac.uk
Sun May 23 17:49:54 EDT 2010
Although the record of our Dublin discussion of this topic is largely
complete, I think it leaves out a few critical details. It also talks
repeatedly about hyphenation as the problem, whereas the real problem is
just hyphenation at the end of a line, in the specific case where you
are also encoding the ends of lines. The discussion moved on to talk
about the related problems caused by strange practices in other
languages when words are broken across ends of lines -- involving not
only hyphenation, but also changes in orthography. The recommendation
made (which doesn't appear in your notes) was to use <choice> for these
more complex cases, which is what led to the comment about where to put
the <lb> if you didn't want to repeat it. I don't recall Brett's comment
about a "standoff choice" and am not sure what that would look like.
On the question you raise about values for @type, there is already a
(carefully worded) recommendation in the Guidelines: "The type attribute
may be used to characterize the line break in any respect, but its most
common use is to specify that the presence of the line break does not
imply the end of the word in which it is embedded. A value such as
inWord or nobreak is recommended for this purpose, but encoders are free
to choose whichever values are appropriate."
(I say "carefully worded" because quite a lot of virtual ink was spilled
on the topic over on the Epidoc list some time back)
If we are using @type to characterise the linebreak from a tokenization
point of view, there really can be only three possible states: "inWord"
(i.e. the tokeniser needs to combine the string before the linebreak
with the string after it to form a single token), "betweenWords" (i.e.
the string before the linebreak and the string after it are two separate
tokens), and "uncertain" (i.e. it could be either of the other two and
we're unwilling or unable to decide).
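To make the three states concrete, here is a rough Python sketch of how a tokeniser might act on them. It is only an illustration under several assumptions of mine: the function name tokenize and its uncertain_as parameter are hypothetical, the fragment is un-namespaced XML, an <lb> with no @type is treated as "betweenWords", and the encoder has not retained the hyphen as a character in the text.

```python
import xml.etree.ElementTree as ET


def tokenize(fragment, uncertain_as="betweenWords"):
    """Split the text of one XML element into whitespace-delimited tokens.

    Hypothetical helper: each <lb> child is interpreted by its type
    attribute. "inWord" joins the strings on either side, "betweenWords"
    separates them, and "uncertain" is resolved by the caller's policy.
    """
    root = ET.fromstring(fragment)
    text = root.text or ""
    for child in root:
        tail = child.tail or ""
        if child.tag == "lb":
            # Assumption: a missing type attribute means "betweenWords".
            state = child.get("type", "betweenWords")
            if state == "uncertain":
                state = uncertain_as
            if state == "inWord":
                # Drop layout whitespace around the break and join.
                text = text.rstrip() + tail.lstrip()
            else:
                text = text + " " + tail
        else:
            # Any other child just contributes its text content.
            text += "".join(child.itertext()) + tail
    return text.split()


sample = '<p>hyphen<lb type="inWord"/>ation at line<lb type="betweenWords"/>ends</p>'
print(tokenize(sample))
```

The point of the uncertain_as parameter is that "uncertain" records the encoder's doubt; what to do about it is left to whoever processes the text.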
If we use @rend on linebreak at all it can only be to say something
about how the linebreak was or should be rendered. It's by no means
certain that an <lb type="inWord"/> will always be rendered by means of
a hyphen (hard or soft) and a linebreak: it might be some other
character, or it might be nothing at all. Furthermore, some encoders
will prefer to retain the hyphen as a character in the text while others
will prefer to discard it and have it be generated by rendering
software. (The same applies to <q> and quotation marks.) When the topic
was discussed on TEI-L, both options were strongly advocated -- those who
care about this topic really do care about it.
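As a sketch of why rendering is a separate decision from tokenisation, the following Python fragment generates the mark at an in-word break from a caller-supplied setting. The names render_break, render and in_word_mark are hypothetical, and the sketch assumes the encoder discarded the hyphen and left it to the rendering software.

```python
def render_break(lb_type, in_word_mark="-"):
    """Return the string emitted at a line break for a given lb type.

    in_word_mark is whatever the renderer shows at an in-word break:
    a hyphen, some other character, or the empty string for nothing.
    """
    return (in_word_mark if lb_type == "inWord" else "") + "\n"


def render(segments, breaks, in_word_mark="-"):
    """Interleave text segments with rendered line breaks.

    segments holds the strings between breaks; breaks holds the lb type
    that sits between each consecutive pair of segments.
    """
    out = segments[0]
    for lb_type, segment in zip(breaks, segments[1:]):
        out += render_break(lb_type, in_word_mark) + segment
    return out


print(render(["hyphen", "ation at line", "ends"], ["inWord", "betweenWords"]))
```

Passing in_word_mark="" gives the "nothing at all" case, and any other character can be substituted without touching the encoded type values.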
Does this help?