[tei-council] soft hyphens (again)

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Tue Jun 15 11:25:44 EDT 2010

I'm resending this message with fewer typos and more context.


I brought the folks revising the *Best Practices for TEI in Libraries* 
up to speed on our hyphenation discussion.  Perry Willett raised a good 
point: if we have encoding like:

(A) This is not a run-<lb type="betweenWords"/>on sentence.

(B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.

(C) Some people say TEI is a mark-<lb type="uncertain"/>up language.

One might read (C) as if the encoder is unsure whether a line break really 
occurs here, when the doubt is actually about the hyphen.  We're using an 
attribute of one element to describe the character that appears before it.

Lou suggested these three type values (see excerpt from his message 
below), but I think we might need a better value for @type in (C). 

The Best Practices team is trying to finish revising the prose by the 
beginning of July so that we have a stable basis for creating ODD specs.  
So I'd be grateful for a response soon (especially from Lou).


> On the question you raise about values for @type, there is already a
> (carefully worded) recommendation in the Guidelines: "The type attribute
> may be used to characterize the line break in any respect, but its most
> common use is to specify that the presence of the line break does not
> imply the end of the word in which it is embedded. A value such as
> inWord or nobreak is recommended for this purpose, but encoders are free
> to choose whichever values are appropriate. "
> (I say "carefully worded" because quite a lot of virtual ink was spilled
> on the topic over on the Epidoc list some time back)
> If we are using @type to characterise the linebreak from a tokenization
> point of view, there really can be only three possible states: "inWord"
> (i.e. the tokeniser needs to combine the string before the linebreak
> with the string after it to form a single token) "betweenWords" (i.e.
> the string before the linebreak and the string after it are two separate
> tokens) and "uncertain" (i.e. it could be either of the other two and
> we're unwilling or unable to decide).
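
To make the three tokenization states concrete, here is a minimal sketch 
of how a tokenizer might act on them.  It is only an illustration, not 
anything the Guidelines prescribe: the function name tokenize, the 
placeholder character, and the treatment of the hyphen (dropped for 
"inWord", kept otherwise) are all assumptions made for this example.

```python
import re

# Matches an optional hyphen followed by <lb/> with one of the three
# type values discussed in this thread.
LB = re.compile(r'(-?)<lb type="(inWord|betweenWords|uncertain)"/>')

# Placeholder marking an unresolved break (an assumption of this sketch).
MARK = "\x00"


def _resolve(match):
    hyphen, kind = match.group(1), match.group(2)
    if kind == "inWord":
        # Assumption: the hyphen is soft, so drop it and join the halves.
        return ""
    if kind == "betweenWords":
        # Keep the hyphen and split into two separate tokens.
        return hyphen + " "
    # uncertain: keep the hyphen and flag the token for later review.
    return hyphen + MARK


def tokenize(text):
    """Return (token, is_uncertain) pairs for whitespace-separated text."""
    tokens = []
    for raw in LB.sub(_resolve, text).split():
        tokens.append((raw.replace(MARK, ""), MARK in raw))
    return tokens


if __name__ == "__main__":
    for sample in (
        'This is not a run-<lb type="betweenWords"/>on sentence.',
        'UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.',
        'Some people say TEI is a mark-<lb type="uncertain"/>up language.',
    ):
        print(tokenize(sample))
```

Note that the sketch has to decide what to do with the hyphen in each 
case, which is exactly the information @type on <lb/> does not state 
directly.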
