[tei-council] soft hyphens (again)
Lou
lou.burnard at oucs.ox.ac.uk
Tue Jun 15 12:06:52 EDT 2010
Yes, I agree that @type=uncertain isn't much use if what we really mean
is @type="I'm_uncertain_whether_this_is_a_word_break_or_not".
The problem is that we've appropriated the very general typology which
@type provides for a very specific function. If we wanted to
characterise linebreaks in some *other* respect than whether or not they
coincided with word boundaries, we'd need to add other values
("authorial" and "scribal" suggest themselves as possible candidates)
and then "uncertain" becomes *really* uncertain... and in any case we
couldn't support multiple values -- it might be both "inWord" and
"authorial" (or whatever). Yuk. It's yet another example of why
@type shouldn't be global and why its values should be mutually exclusive.
I can only think of two possible solutions to this. No, make it three.
1. come up with a better word than "uncertain" for the third case (wbsu
or wordBreakStatusUnknown?)
2. use a different attribute @wordBreaking = "true|false|unknown"
3. redefine the semantics of @type="wordBreaking" to mean just "this is
probably a word breaker but possibly not"
Solutions 1 and 3 have the advantage of leaving things more or less
unchanged for the only people I am aware of who actually care about this
problem i.e. epigraphers. Solution 2 has the advantage of being more
explicit and elegant, but I wouldn't want it to replace the status quo
for obvious reasons of backwards compatibility.
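To make the tokenisation consequences concrete, here is a rough Python
sketch of how a tokeniser might treat the three values under discussion.
Everything in it is illustrative only: the function name readings() is
invented, it assumes a flat, un-namespaced <p> containing nothing but
text and <lb/>, and it assumes a trailing hyphen is dropped whenever the
two strings are joined. An "uncertain" break simply yields both readings.

```python
import itertools
import xml.etree.ElementTree as ET


def readings(fragment):
    """Return every possible token list for a flat <p> with <lb/> children.

    Assumptions (not from the Guidelines): no namespace, no nested
    elements, and a hyphen before a joined break is a soft hyphen.
    """
    root = ET.fromstring(fragment)
    pieces = [root.text or ""]   # text runs between line breaks
    choices = []                 # per <lb/>: the joins a tokeniser may use
    for lb in root.iter("lb"):
        kind = lb.get("type")
        if kind == "inWord":
            choices.append([""])        # one token spans the break
        elif kind == "betweenWords":
            choices.append([" "])       # two separate tokens
        else:
            choices.append(["", " "])   # "uncertain": keep both options
        pieces.append(lb.tail or "")

    results = []
    for joins in itertools.product(*choices):
        text = pieces[0]
        for join, piece in zip(joins, pieces[1:]):
            text = text.rstrip()
            if join == "":
                text = text.rstrip("-")  # assumed soft hyphen, dropped
            text += join + piece.lstrip()
        results.append(text.split())
    return results


if __name__ == "__main__":
    sample = '<p>TEI is a mark-<lb type="uncertain"/>up language.</p>'
    for tokens in readings(sample):
        print(tokens)
```

The point of returning a list of readings is that "uncertain" describes
the tokenisation, not the line break itself: the break is always there,
and only the number of tokens around it varies.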
Votes?
Kevin Hawkins wrote:
> I'm resending this message with fewer typos and more context.
>
> ***
>
> I brought the folks revising the *Best Practices for TEI in Libraries*
> up to speed on our hyphenation discussion. Perry Willett raised a good
> point: if we have encoding like:
>
> (A) This is not a run-<lb type="betweenWords"/>on sentence.
>
> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
>
> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
>
> One might read (C) as if the encoder is unsure whether a line break really
> occurs here. We're using an attribute of one element to describe the
> character that appears before it.
>
> Lou suggested these three type values (see excerpt from his message
> below), but I think we might need a better value for @type in (C).
> Suggestions?
>
> The Best Practices team is trying to finish revising the prose by the
> beginning of July so that we have a stable basis for creating ODD specs.
> So I'd be grateful for a response soon (especially from Lou).
>
> Kevin
>
>> On the question you raise about values for @type, there is already a
>> (carefully worded) recommendation in the Guidelines: "The type attribute
>> may be used to characterize the line break in any respect, but its most
>> common use is to specify that the presence of the line break does not
>> imply the end of the word in which it is embedded. A value such as
>> inWord or nobreak is recommended for this purpose, but encoders are free
>> to choose whichever values are appropriate."
>>
>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>> on the topic over on the Epidoc list some time back)
>>
>> If we are using @type to characterise the linebreak from a tokenization
>> point of view, there really can be only three possible states: "inWord"
>> (i.e. the tokeniser needs to combine the string before the linebreak
>> with the string after it to form a single token) "betweenWords" (i.e.
>> the string before the linebreak and the string after it are two separate
>> tokens) and "uncertain" (i.e. it could be either of the other two and
>> we're unwilling or unable to decide).
> _______________________________________________
> tei-council mailing list
> tei-council at lists.village.Virginia.EDU
> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council