[tei-council] soft hyphens (again)

Tue Jun 15 12:06:52 EDT 2010

Yes, I agree that @type="uncertain" isn't much use if what we really mean 
is @type="I'm_uncertain_whether_this_is_a_word_break_or_not".

The problem is that we've appropriated the very general typology which 
@type provides for a very specific function. If we wanted to 
characterise linebreaks in some *other* respect than whether or not they 
coincided with word boundaries, we'd need to add other values 
("authorial" and "scribal" suggest themselves as possible candidates) 
and then "uncertain" becomes *really* uncertain... and in any case we 
couldn't support multiple values -- a break might be both "inWord" and 
"authorial" (or whatever). Yuk. It's yet another example of why 
@type shouldn't be global and why its values should be mutually exclusive.

I can only think of two possible solutions to this. No, make it three.

1. come up with a better word than "uncertain" for the third case (wbsu 
or wordBreakStatusUnknown?)

2. use a different attribute @wordBreaking = "true|false|unknown"

3. redefine the semantics of @type="wordBreaking" to mean just "this is 
probably a word breaker but possibly not"

Solutions 1 and 3 have the advantage of leaving things more or less 
unchanged for the only people I am aware of who actually care about this 
problem i.e. epigraphers. Solution 2 has the advantage of being more 
explicit and elegant, but I wouldn't want it to replace the status quo 
for obvious reasons of backwards compatibility.
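
To make the three cases concrete, here is a minimal sketch of how a tokeniser 
might treat the existing @type values. It is an illustration only: the 
function name, the regular expression, the "[?]" marker, and the decision to 
drop the hyphen before an "inWord" break are all assumptions of mine, not 
anything the Guidelines prescribe.

```python
import re

# Optional end-of-line hyphen followed by an <lb/> carrying one of the three
# @type values discussed in this thread. Illustrative pattern only; real TEI
# should be handled with an XML parser rather than a regex.
LB = re.compile(r'(-?)<lb type="(inWord|betweenWords|uncertain)"/>')

def tokenize(text):
    """Split text into whitespace-delimited tokens, resolving <lb/> by @type."""
    def resolve(match):
        hyphen, kind = match.groups()
        if kind == "inWord":
            # Assumption: the hyphen is soft, so drop it and join the halves.
            return ""
        if kind == "betweenWords":
            # Two separate tokens; the hyphen stays with the first one.
            return hyphen + " "
        # "uncertain": decide nothing, leave a visible marker in the token.
        return hyphen + "[?]"
    return LB.sub(resolve, text).split()

print(tokenize('UTF-8 is a char-<lb type="inWord"/>acter encoding.'))
print(tokenize('TEI is a mark-<lb type="uncertain"/>up language.'))
```

Under solution 2 the same logic would simply switch on the value of 
@wordBreaking instead of @type.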

Votes?

Kevin Hawkins wrote:
> I'm resending this message with fewer typos and more context.
> 
> ***
> 
> I brought the folks revising the *Best Practices for TEI in Libraries* 
> up to speed on our hyphenation discussion.  Perry Willett raised a good 
> point: if we have encoding like:
> 
> (A) This is not a run-<lb type="betweenWords"/>on sentence.
> 
> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
> 
> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
> 
> One might read (C) as if the encoder is unsure whether a line break really 
> occurs here.  We're using an attribute of one element to describe the 
> character that appears before it.
> 
> Lou suggested these three type values (see excerpt from his message 
> below), but I think we might need a better value for @type in (C). 
> Suggestions?
> 
> The Best Practices team is trying to finish revising the prose by the 
> beginning of July so that we have a stable basis for creating ODD specs. 
> So I'd be grateful for a response soon (especially from Lou).
> 
> Kevin
> 
>> On the question you raise about values for @type, there is already a
>> (carefully worded) recommendation in the Guidelines: "The type attribute
>> may be used to characterize the line break in any respect, but its most
>> common use is to specify that the presence of the line break does not
>> imply the end of the word in which it is embedded. A value such as
>> inWord or nobreak is recommended for this purpose, but encoders are free
>> to choose whichever values are appropriate."
>>
>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>> on the topic over on the Epidoc list some time back)
>>
>> If we are using @type to characterise the linebreak from a tokenization
>> point of view, there really can be only three possible states: "inWord"
>> (i.e. the tokeniser needs to combine the string before the linebreak
>> with the string after it to form a single token), "betweenWords" (i.e.
>> the string before the linebreak and the string after it are two separate
>> tokens), and "uncertain" (i.e. it could be either of the other two and
>> we're unwilling or unable to decide).