[tei-council] soft hyphens (again)

Martin Holmes mholmes at uvic.ca
Tue Jun 15 11:51:57 EDT 2010

Looking again at this, I'm struck also by what strikes you: the <lb/> is 
not the hyphen, and if we're trying to encode information about the 
hyphen, the <lb/> seems to be the wrong place to do it.

I'd really prefer a container tag which encloses the hyphen and implies 
(for at least one rendering scenario) a linebreak. But I don't know what 
that tag should be. In the case of the three examples:

(A) This is not a run-<lb/>on sentence.

(This is just a line-break, isn't it? The hyphen is incidental.)

(B) UTF-8 is a char<someTag type="breakInWord">-</someTag>acter encoding 
for Unicode.

(The processor can decide whether to render the hyphen, depending on 
whether it's rendering the linebreaks or not; tokenizers can ignore the 
tag completely.)

(C) Some people say TEI is a mark-<lb/>up language. OR

(C) Some people say TEI is a mark<someTag 
type="breakInWord">-</someTag>up language.

(If it's impossible to decide, then perhaps some kind of <choice> 
structure would be appropriate.)
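
A minimal sketch of the rendering decision described for (B), in Python. It assumes the placeholder names used above (<someTag> and type="breakInWord"), which are not real TEI, and it only handles a flat fragment with no nested markup. The function name render and the keep_linebreaks flag are invented for illustration.

```python
import xml.etree.ElementTree as ET

# "someTag" and type="breakInWord" are the placeholder names from this
# message, not real TEI elements or values.
def render(fragment, keep_linebreaks):
    """Render a flat fragment, keeping or dropping the end-of-line hyphen."""
    root = ET.fromstring("<p>" + fragment + "</p>")
    parts = [root.text or ""]
    for child in root:
        if child.tag == "someTag" and child.get("type") == "breakInWord":
            if keep_linebreaks:
                # Show the hyphen and break the line after it.
                parts.append((child.text or "") + "\n")
            # Otherwise the hyphen is dropped and the word is rejoined.
        else:
            # Any other element is flattened to its text content.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

example_b = ('UTF-8 is a char<someTag type="breakInWord">-</someTag>'
             'acter encoding for Unicode.')

print(render(example_b, keep_linebreaks=True))
print(render(example_b, keep_linebreaks=False))
```

With keep_linebreaks=False the word comes out as "character"; with keep_linebreaks=True the hyphen is kept and followed by a newline.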


On 10-06-15 08:25 AM, Kevin Hawkins wrote:
> I'm resending this message with fewer typos and more context.
> ***
> I brought the folks revising the *Best Practices for TEI in Libraries*
> up to speed on our hyphenation discussion.  Perry Willett raised a good
> point: if we have encoding like:
> (A) This is not a run-<lb type="betweenWords"/>on sentence.
> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
> One might read (C) as if the encoder is unsure whether a line break really
> occurs here.  We're using an attribute of one element to describe the
> character that appears before it.
> Lou suggested these three type values (see excerpt from his message
> below), but I think we might need a better value for @type in (C).
> Suggestions?
> The Best Practices team is trying to finish revising the prose by the
> beginning of July so that we have a stable basis for creating ODD specs.
> So I'd be grateful for a response soon (especially from Lou).
> Kevin
>> On the question you raise about values for @type, there is already a
>> (carefully worded) recommendation in the Guidelines: "The type attribute
>> may be used to characterize the line break in any respect, but its most
>> common use is to specify that the presence of the line break does not
>> imply the end of the word in which it is embedded. A value such as
>> inWord or nobreak is recommended for this purpose, but encoders are free
>> to choose whichever values are appropriate."
>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>> on the topic over on the Epidoc list some time back)
>>
>> If we are using @type to characterise the linebreak from a tokenization
>> point of view, there really can be only three possible states: "inWord"
>> (i.e. the tokeniser needs to combine the string before the linebreak
>> with the string after it to form a single token), "betweenWords" (i.e.
>> the string before the linebreak and the string after it are two separate
>> tokens), and "uncertain" (i.e. it could be either of the other two and
>> we're unwilling or unable to decide).
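
A rough Python sketch of the three tokenization states described above, applied to a single line containing one <lb/>. The function name readings is invented for illustration, and dropping the trailing hyphen when the two halves are joined is an assumption of this sketch; the quoted description does not say what happens to the hyphen.

```python
import re

# Matches the word fragment before the <lb/>, its @type, and the fragment after.
LB = re.compile(r'(\S*)<lb type="(inWord|betweenWords|uncertain)"/>(\S*)')

def readings(text):
    """Return every token list consistent with the @type on the <lb/>."""
    m = LB.search(text)
    if m is None:
        return [text.split()]
    before, kind, after = m.groups()
    head = text[:m.start()].split()
    tail = text[m.end():].split()
    # Assumption: a hyphen before the break is dropped when joining.
    joined = [before.rstrip("-") + after]
    # Two separate tokens; the first is left exactly as transcribed.
    separate = [before, after]
    options = {
        "inWord": [joined],
        "betweenWords": [separate],
        "uncertain": [joined, separate],  # keep both possibilities
    }[kind]
    return [head + option + tail for option in options]

example_b = 'UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.'
example_c = 'Some people say TEI is a mark-<lb type="uncertain"/>up language.'

print(readings(example_b))
print(readings(example_c))
```

For (B) this gives one reading containing "character"; for (C) it gives two readings, one with "markup" and one with "mark-" and "up" as separate tokens.
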

Martin Holmes
University of Victoria Humanities Computing and Media Centre
(mholmes at uvic.ca)
Half-Baked Software, Inc.
(mholmes at halfbakedsoftware.com)
martin at mholmes.com
