[tei-council] soft hyphens (again)

Sun Jun 6 12:35:18 EDT 2010

Hello Lou and Fellow Council Members,

Apologies for the delay in responding.  An early summer vacation and 
transitioning back to Michigan from Dublin have left me weeks behind on 
email.

On 5/23/2010 5:49 PM, Lou Burnard wrote:
> Hi Kevin!
>
> Although largely complete, I think the record of our Dublin discussion
> of this topic leaves out a few critical details. It also talks
> repeatedly about hyphenation as the problem, whereas the real problem is
> just hyphenation at the end of a line, in the specific case where you
> are also encoding the ends of lines. The discussion moved on to talk
> about the related problems caused by strange practices in other
> languages when words are broken across ends of lines -- involving not
> only hyphenation, but also changes in orthography. The recommendation
> made (which doesn't appear in your notes) was to use <choice> for these
> more complex cases, which is what led to the comment about where to put
> the <lb> if you didn't want to repeat it. I don't recall Brett's comment
> about a "standoff choice" and am not sure what that would look like.

Lou and the rest of you should embellish and correct my minutes. 
Please!  This was the whole idea of doing them in the wiki.  I've tried 
to address Lou's complaints, but please double-check my work.

> On the question you raise about values for @type, there is already a
> (carefully worded) recommendation in the Guidelines: "The type attribute
> may be used to characterize the line break in any respect, but its most
> common use is to specify that the presence of the line break does not
> imply the end of the word in which it is embedded. A value such as
> inWord or nobreak is recommended for this purpose, but encoders are free
> to choose whichever values are appropriate."
>
> (I say "carefully worded" because quite a lot of virtual ink was spilled
> on the topic over on the EpiDoc list some time back.)

Ah yes, I had not seen this.

> If we are using @type to characterise the linebreak from a tokenization
> point of view, there really can be only three possible states: "inWord"
> (i.e. the tokeniser needs to combine the string before the linebreak
> with the string after it to form a single token), "betweenWords" (i.e.
> the string before the linebreak and the string after it are two separate
> tokens), and "uncertain" (i.e. it could be either of the other two and
> we're unwilling or unable to decide).

Agreed.
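
To make the three states concrete, here is a minimal sketch of how a
tokeniser might act on them. It is only an illustration, not anything the
Guidelines prescribe: the function name tokenize, the un-namespaced flat
<p> fragment, the choice to treat a missing @type as "betweenWords", and
the choice to split at "uncertain" breaks while recording their positions
are all my own assumptions.

```python
import xml.etree.ElementTree as ET


def tokenize(fragment):
    """Split a flat fragment containing <lb/> elements into word tokens.

    Returns (tokens, uncertain_at), where uncertain_at lists the index of
    the first token following each <lb type="uncertain"/>.
    Assumption: <lb/> elements are direct, un-namespaced children of the root.
    """
    root = ET.fromstring(fragment)
    tokens = []
    uncertain_at = []

    def add(text, join_first):
        words = (text or "").split()
        # "inWord": glue the first word after the break onto the last token.
        if join_first and words and tokens:
            tokens[-1] += words.pop(0)
        tokens.extend(words)

    add(root.text, False)
    for lb in root.iter("lb"):
        # Assumption: an <lb/> without @type is treated as "betweenWords".
        kind = lb.get("type", "betweenWords")
        if kind == "uncertain":
            # Undecided: keep the strings apart but remember where.
            uncertain_at.append(len(tokens))
        add(lb.tail, kind == "inWord")
    return tokens, uncertain_at


sample = (
    '<p>soft hyphen<lb type="inWord"/>ation at the end'
    '<lb type="betweenWords"/>of a line<lb type="uncertain"/>break</p>'
)
tokens, uncertain_at = tokenize(sample)
print(tokens)
print(uncertain_at)
```

In this sketch only "inWord" changes the token stream; "uncertain" is
reported back to the caller rather than resolved, which seems closest to
"unwilling or unable to decide".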

> If we use @rend on linebreak at all it can only be to say something
> about how the linebreak was or should be rendered. It's by no means
> certain that an <lb type="inWord"/> will always be rendered by means of
> a hyphen (hard or soft) and a linebreak: it might be some other
> character, or it might be nothing at all.  Furthermore, some encoders
> will prefer to retain the hyphen as a character in the text while others
> will prefer to discard it and have it be generated by rendering
> software. (Similar case for <q> and quote marks.) When the topic was
> discussed on TEI-L, both options were strongly advocated -- those who
> care about this topic really do care about it.
>
> Does this help?

Yes!
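
Lou's point that an inWord break need not be rendered with a hyphen can
be sketched the same way. This is a hypothetical renderer, assuming the
same flat, un-namespaced fragment as input; the name render_lines and the
in_word_marker parameter are invented for illustration. Encoders who
retain the hyphen as a character in the text would pass an empty marker,
while those who discard it would let the renderer generate one.

```python
import xml.etree.ElementTree as ET


def render_lines(fragment, in_word_marker="-"):
    """Render a flat fragment as lines, one per <lb/>.

    in_word_marker is appended to the line that ends at an
    <lb type="inWord"/>; pass "" to generate nothing at all.
    """
    root = ET.fromstring(fragment)
    lines = [(root.text or "").strip()]
    for lb in root.iter("lb"):
        if lb.get("type") == "inWord":
            # The marker is generated here, not stored in the transcription.
            lines[-1] += in_word_marker
        lines.append((lb.tail or "").strip())
    return "\n".join(lines)


sample = (
    '<p>soft hyphen<lb type="inWord"/>ation at the'
    '<lb type="betweenWords"/>end of a line</p>'
)
print(render_lines(sample))                      # hyphen generated
print(render_lines(sample, in_word_marker=""))   # nothing generated
```

The @type value stays the same in every case; only the rendering choice
varies, which is why it seems right to keep the two concerns apart.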

Kevin