[tei-council] soft hyphens (again)

Lou Burnard lou.burnard at oucs.ox.ac.uk
Sun May 23 17:49:54 EDT 2010


Hi Kevin!

Although the record of our Dublin discussion of this topic is largely
complete, I think it leaves out a few critical details. It also talks
repeatedly about hyphenation as the problem, whereas the real problem is
just hyphenation at the end of a line, in the specific case where you
are also encoding the ends of lines. The discussion moved on to talk
about the related problems caused by strange practices in other
languages when words are broken across ends of lines, involving not
only hyphenation but also changes in orthography. The recommendation
made (which doesn't appear in your notes) was to use <choice> for these
more complex cases, which is what led to the comment about where to put
the <lb> if you didn't want to repeat it. I don't recall Brett's comment
about a "standoff choice" and am not sure what that would look like.

On the question you raise about values for @type, there is already a 
(carefully worded) recommendation in the Guidelines: "The type attribute 
may be used to characterize the line break in any respect, but its most 
common use is to specify that the presence of the line break does not 
imply the end of the word in which it is embedded. A value such as 
inWord or nobreak is recommended for this purpose, but encoders are free 
to choose whichever values are appropriate."

(I say "carefully worded" because quite a lot of virtual ink was spilled 
on the topic over on the Epidoc list some time back)

If we are using @type to characterise the linebreak from a tokenization
point of view, there really can be only three possible states: "inWord"
(i.e. the tokeniser needs to combine the string before the linebreak
with the string after it to form a single token), "betweenWords" (i.e.
the string before the linebreak and the string after it are two separate
tokens), and "uncertain" (i.e. it could be either of the other two and
we're unwilling or unable to decide).
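
As a rough illustration of the three states, here is a minimal Python
sketch of a tokeniser that consults @type on <lb>. It rests on several
assumptions that are not part of the Guidelines: the input is a flat,
un-namespaced element containing only text and <lb/>, any hyphen has
already been discarded from the text, and an <lb/> with no @type (or
any other value) is treated as an ordinary token boundary. The function
name tokenize and its return shape are invented for this example.

```python
import xml.etree.ElementTree as ET


def tokenize(fragment):
    """Split a flat element (text plus <lb/> children only) into tokens.

    Returns (tokens, uncertain), where uncertain lists each index i such
    that tokens[i] and tokens[i + 1] might really be one word.
    """
    root = ET.fromstring(fragment)
    # Pair every run of text with the @type of the <lb/> preceding it.
    segments = [(None, root.text)]
    for child in root:
        if child.tag != "lb":
            raise ValueError("this sketch handles only text and <lb/>")
        segments.append((child.get("type"), child.tail))

    tokens, uncertain = [], []
    for lb_type, text in segments:
        words = (text or "").split()
        if not words:
            continue
        if tokens and lb_type == "inWord":
            # The break falls inside a word: glue the two halves together.
            tokens[-1] += words.pop(0)
        elif tokens and lb_type == "uncertain":
            # Keep the halves apart, but record that they may belong together.
            uncertain.append(len(tokens) - 1)
        # "betweenWords" (and, by assumption, anything else) is a plain boundary.
        tokens.extend(words)
    return tokens, uncertain
```

Given '<p>soft hy<lb type="inWord"/>phens are<lb type="uncertain"/>here</p>',
this returns the tokens soft, hyphens, are, here, and flags the boundary
after "are" (index 2) as uncertain. Real TEI input would also need the
TEI namespace and nested elements handled, which the sketch leaves out.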

If we use @rend on <lb> at all, it can only be to say something
about how the linebreak was or should be rendered. It's by no means
certain that an <lb type="inWord"/> will always be rendered by means of
a hyphen (hard or soft) and a linebreak: it might be some other
character, or it might be nothing at all. Furthermore, some encoders
will prefer to retain the hyphen as a character in the text, while
others will prefer to discard it and have it generated by rendering
software. (The same applies to <q> and quotation marks.) When the topic
was discussed on TEI-L, both options were strongly advocated: those who
care about this topic really do care about it.
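
One practical consequence is that software joining the two halves of an
in-word break has to know which of the two conventions the encoder
followed. A minimal Python sketch follows; the function name
join_in_word and the hyphen_retained flag are hypothetical, standing in
for whatever a project documents about its own practice.

```python
SOFT_HYPHEN = "\u00ad"


def join_in_word(before, after, hyphen_retained):
    """Join the strings either side of an in-word line break.

    hyphen_retained says whether the encoder kept the end-of-line hyphen
    as a character in the text (True) or discarded it (False).
    """
    before = before.rstrip()
    after = after.lstrip()
    if hyphen_retained and before.endswith(("-", SOFT_HYPHEN)):
        # Drop exactly one trailing hyphen (hard or soft) before joining.
        before = before[:-1]
    return before + after
```

Note that with hyphen_retained set, a compound broken at its own hyphen
("well-" plus "known") would lose it, which is one reason the choice of
convention matters and needs to be stated somewhere.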

Does this help?

best wishes

Lou

