[tei-council] soft hyphens (again)
lou.burnard at oucs.ox.ac.uk
Tue Jun 15 12:17:59 EDT 2010
These boil down to saying "use two different ways of marking up the
hyphen", one for when *it* is to be treated as a separator, one when it
isn't. Which is sort of where we came in, several months ago, when
But the issue currently on the table is what to do about LINEBREAKS. As
I said in an earlier post, it isn't necessarily a hyphen character which
is used to mark where a word (despite appearances) runs on to the next
line. It may be something else entirely. It may be nothing at all.
Martin Holmes wrote:
> Looking again at this, I'm struck also by what strikes you: the <lb/> is
> not the hyphen, and if we're trying to encode information about the
> hyphen, the <lb/> seems to be the wrong place to do it.
> I'd really prefer a container tag which encloses the hyphen and implies
> (for at least one rendering scenario) a linebreak. But I don't know what
> that tag should be. In the case of the three examples:
> (A) This is not a run-<lb/>on sentence.
> (This is just a line-break, isn't it? The hyphen is incidental.)
> (B) UTF-8 is a char<someTag type="breakInWord">-</someTag>acter encoding
> for Unicode.
> (The processor can decide whether to render the hyphen, depending on
> whether it's rendering the linebreaks or not; tokenizers can ignore the
> tag completely.)
> (C) Some people say TEI is a mark-<lb/>up language. OR
> (C) Some people say TEI is a mark<someTag
> type="breakInWord">-</someTag>up language.
> (If it's impossible to decide, then perhaps some kind of <choice>
> structure would be appropriate.)
> On 10-06-15 08:25 AM, Kevin Hawkins wrote:
>> I'm resending this message with fewer typos and more context.
>> I brought the folks revising the *Best Practices for TEI in Libraries*
>> up to speed on our hyphenation discussion. Perry Willett raised a good
>> point: if we have encoding like:
>> (A) This is not a run-<lb type="betweenWords"/>on sentence.
>> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
>> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
>> One might read (C) as if the encoder is unsure whether a line break really
>> occurs here. We're using an attribute of one element to describe the
>> character that appears before it.
>> Lou suggested these three type values (see excerpt from his message
>> below), but I think we might need a better value for @type in (C).
>> The Best Practices team is trying to finish revising the prose by the
>> beginning of July so that we have a stable basis for creating ODD specs.
>> So I'd be grateful for a response soon (especially from Lou).
>>> On the question you raise about values for @type, there is already a
>>> (carefully worded) recommendation in the Guidelines: "The type attribute
>>> may be used to characterize the line break in any respect, but its most
>>> common use is to specify that the presence of the line break does not
>>> imply the end of the word in which it is embedded. A value such as
>>> inWord or nobreak is recommended for this purpose, but encoders are free
>>> to choose whichever values are appropriate."
>>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>>> on the topic over on the Epidoc list some time back)
>>> If we are using @type to characterise the linebreak from a tokenization
>>> point of view, there really can be only three possible states: "inWord"
>>> (i.e. the tokeniser needs to combine the string before the linebreak
>>> with the string after it to form a single token), "betweenWords" (i.e.
>>> the string before the linebreak and the string after it are two separate
>>> tokens), and "uncertain" (i.e. it could be either of the other two and
>>> we're unwilling or unable to decide).
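
To make the three states concrete, here is a rough sketch of how a tokeniser might act on them. It is only an illustration, not anything the Guidelines prescribe: the function name tokenize_line_breaks, the strip_hyphen switch, the choice to treat a missing @type as "betweenWords", and the choice to keep "uncertain" breaks as two tokens while flagging them are all assumptions made for the example. It also assumes a flat fragment whose only child elements are <lb/>.

```python
import xml.etree.ElementTree as ET


def tokenize_line_breaks(fragment, strip_hyphen=True):
    """Tokenise a flat text fragment containing <lb type="..."/> elements.

    Returns (tokens, uncertain), where `uncertain` lists the index of the
    token immediately before each <lb type="uncertain"/>.
    Hypothetical helper: the name, defaults, and return shape are
    assumptions for illustration only.
    """
    # Wrap the fragment so it parses as a single well-formed element.
    root = ET.fromstring("<s>" + fragment + "</s>")
    tokens = (root.text or "").split()
    uncertain = []
    for lb in root:
        if lb.tag != "lb":
            raise ValueError("sketch only handles <lb/> children")
        after = (lb.tail or "").split()
        # Assumption: an <lb/> with no @type separates two words.
        kind = lb.get("type", "betweenWords")
        if kind == "inWord" and tokens and after:
            # Combine the string before the break with the string after it.
            head = tokens.pop()
            if strip_hyphen and head.endswith("-"):
                # Assumption: a trailing hyphen is only a line-end marker.
                head = head[:-1]
            tokens.append(head + after[0])
            after = after[1:]
        elif kind == "uncertain" and tokens:
            # Keep both strings separate, but record where the doubt lies.
            uncertain.append(len(tokens) - 1)
        tokens.extend(after)
    return tokens, uncertain


if __name__ == "__main__":
    print(tokenize_line_breaks(
        'UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.'))
```

Note that the hyphen handling is exactly the part under dispute in this thread: the sketch has to guess, from the character before the <lb/>, what the encoder meant, which is why strip_hyphen is a switch rather than a rule.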