[tei-council] soft hyphens (again)
kevin.s.hawkins at ultraslavonic.info
Tue Jun 15 14:22:37 EDT 2010
I vote for (1) (either string is okay with me) in the interest of
backwards compatibility. But we should keep in mind (2) for a future
major version of the TEI.
On 6/15/2010 12:06 PM, Lou wrote:
> Yes, I agree that @type=uncertain isn't much use if what we really mean
> is @type="I'm_uncertain_whether_this_is_a_word_break_or_not".
> The problem is that we've appropriated the very general typology which
> @type provides for a very specific function. If we wanted to
> characterise linebreaks in some *other* respect than whether or not they
> coincided with word boundaries, we'd need to add other values
> ("authorial" and "scribal" suggest themselves as possible candidates)
> and then "uncertain" becomes *really* uncertain... and in any case we
> couldn't support multiple values -- it might be both "inWord" and
> "authorial" (or whatever). Yuk. It's yet another example of why
> @type shouldn't be global and why its values should be mutually exclusive.
> I can only think of two possible solutions to this. No, make it three.
> 1. come up with a better word than "uncertain" for the third case (wbsu
> or wordBreakStatusUnknown?)
> 2. use a different attribute @wordBreaking = "true|false|unknown"
> 3. redefine the semantics of @type="wordBreaking" to mean just "this is
> probably a word breaker but possibly not"
> Solutions 1 and 3 have the advantage of leaving things more or less
> unchanged for the only people I am aware of who actually care about this
> problem i.e. epigraphers. Solution 2 has the advantage of being more
> explicit and elegant, but I wouldn't want it to replace the status quo
> for obvious reasons of backwards compatibility.
> Kevin Hawkins wrote:
>> I'm resending this message with fewer typos and more context.
>> I brought the folks revising the *Best Practices for TEI in Libraries*
>> up to speed on our hyphenation discussion. Perry Willett raised a good
>> point: if we have encoding like:
>> (A) This is not a run-<lb type="betweenWords"/>on sentence.
>> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
>> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
>> One might read (C) as if the encoder is unsure whether a line break
>> really occurs here. We're using an attribute of one element to
>> describe the character that appears before it.
>> Lou suggested these three type values (see excerpt from his message
>> below), but I think we might need a better value for @type in (C).
>> The Best Practices team is trying to finish revising the prose by the
>> beginning of July so that we have a stable basis for creating ODD
>> specs. So I'd be grateful for a response soon (especially from Lou).
>>> On the question you raise about values for @type, there is already a
>>> (carefully worded) recommendation in the Guidelines: "The type attribute
>>> may be used to characterize the line break in any respect, but its most
>>> common use is to specify that the presence of the line break does not
>>> imply the end of the word in which it is embedded. A value such as
>>> inWord or nobreak is recommended for this purpose, but encoders are free
>>> to choose whichever values are appropriate."
>>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>>> on the topic over on the Epidoc list some time back)
>>> If we are using @type to characterise the linebreak from a tokenization
>>> point of view, there really can be only three possible states: "inWord"
>>> (i.e. the tokeniser needs to combine the string before the linebreak
>>> with the string after it to form a single token), "betweenWords" (i.e.
>>> the string before the linebreak and the string after it are two separate
>>> tokens), and "uncertain" (i.e. it could be either of the other two and
>>> we're unwilling or unable to decide).
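
The three tokenization states described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions,
not anything the Guidelines prescribe: the function name tokenize and the
join_uncertain switch are hypothetical, the input is treated as a plain
string rather than parsed XML, and the hyphen before the <lb/> is left
untouched, since how to treat that character is a separate question.

```python
import re

# Matches <lb type="..."/> plus any whitespace (such as the physical
# newline) that follows it in the transcription.
LB = re.compile(r'<lb\s+type="([^"]*)"\s*/>\s*')


def tokenize(text, join_uncertain=False):
    """Split text into whitespace-delimited tokens, honouring <lb/> types.

    join_uncertain is a hypothetical policy switch: the encoder declined
    to decide for type="uncertain", so the caller has to.
    """
    def replace(match):
        kind = match.group(1)
        if kind == "inWord":
            # Strings before and after the break form a single token.
            return ""
        if kind == "betweenWords":
            # Strings before and after the break are separate tokens.
            return " "
        if kind == "uncertain":
            return "" if join_uncertain else " "
        raise ValueError(f"unhandled lb type: {kind!r}")

    return LB.sub(replace, text).split()


a = tokenize('This is not a run-<lb type="betweenWords"/>on sentence.')
b = tokenize('UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.')
c_split = tokenize('Some people say TEI is a mark-<lb type="uncertain"/>up language.')
c_join = tokenize('Some people say TEI is a mark-<lb type="uncertain"/>up language.',
                  join_uncertain=True)
```

With the three examples from this thread, (A) yields the separate tokens
"run-" and "on", (B) yields the single token "char-acter", and (C) yields
either "mark-" and "up" or "mark-up" depending on the policy chosen. Any
other @type value raises an error in this sketch, which shows why mixing
unrelated values such as "authorial" into the same attribute is awkward
for a tokenizer.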