[tei-council] soft hyphens (again)
gabriel.bodard at kcl.ac.uk
Wed Jun 16 06:07:35 EDT 2010
While (2) is attractive in terms of explicitness and elegance, I am
tempted to vote for (3) on the grounds that if you're really uncertain
about the status of a line-break there are other ways to express this
(<certainty> element inside <lb> anyone?--to resurrect a TEI-L query
that seems to have been met with deafening indifference...)
Has anyone ever had any use-case for characterizing linebreaks (and cb,
pb, etc.) other than by whether they break words or not?
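
To make the tokenization reading concrete, here is a minimal sketch in Python of how a tokenizer might consume the three @type values described in the quoted thread below ("inWord", "betweenWords", "uncertain"). It is illustrative only and not part of any TEI tooling: the function name `readings` is hypothetical, the input is assumed to be a plain string with literal `<lb type="..."/>` markers and no surrounding whitespace, and hyphen handling is deliberately left out.

```python
import itertools
import re

# Matches an <lb/> carrying one of the three @type values from the thread.
LB = re.compile(r'<lb type="(inWord|betweenWords|uncertain)"/>')

# How each value resolves at tokenization time (an assumption for this sketch):
# "" joins the strings on either side, " " keeps them as separate tokens.
RESOLUTIONS = {
    "inWord": [""],
    "betweenWords": [" "],
    "uncertain": ["", " "],  # either reading is possible
}


def readings(text):
    """Return every token sequence that the encoded line breaks allow."""
    types = LB.findall(text)
    # With one capturing group, re.split interleaves text and group values;
    # the even positions are the text segments between line breaks.
    segments = LB.split(text)[::2]
    choices = [RESOLUTIONS[t] for t in types]
    results = []
    for combo in itertools.product(*choices):
        joined = segments[0]
        for separator, segment in zip(combo, segments[1:]):
            joined += separator + segment
        results.append(joined.split())
    return results
```

Under these assumptions an "uncertain" break simply doubles the number of candidate token sequences, which is one way of seeing that the value describes the word-break status rather than the presence of the line break itself.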
On 15/06/2010 17:06, Lou wrote:
> Yes, I agree that @type=uncertain isn't much use if what we really mean
> is @type="I'm_uncertain_whether_this_is_a_word_break_or_not".
> The problem is that we've appropriated the very general typology which
> @type provides for a very specific function. If we wanted to
> characterise linebreaks in some *other* respect than whether or not they
> coincided with word boundaries, we'd need to add other values
> ("authorial" and "scribal" suggest themselves as possible candidates)
> and then "uncertain" becomes *really* uncertain... and in any case we
> couldn't support multiple values -- it might be both "inWord" and
> "authorial" (or whatever). Yuk. It's yet another example of why
> @type shouldn't be global and why its values should be mutually exclusive.
> I can only think of two possible solutions to this. No, make it three.
> 1. come up with a better word than "uncertain" for the third case (wbsu
> or wordBreakStatusUnknown?)
> 2. use a different attribute @wordBreaking = "true|false|unknown"
> 3. redefine the semantics of @type="wordBreaking" to mean just "this is
> probably a word breaker but possibly not"
> Solutions 1 and 3 have the advantage of leaving things more or less
> unchanged for the only people I am aware of who actually care about this
> problem i.e. epigraphers. Solution 2 has the advantage of being more
> explicit and elegant, but I wouldn't want it to replace the status quo
> for obvious reasons of backwards compatibility.
> Kevin Hawkins wrote:
>> I'm resending this message with fewer typos and more context.
>> I brought the folks revising the *Best Practices for TEI in Libraries*
>> up to speed on our hyphenation discussion. Perry Willett raised a good
>> point: if we have encoding like:
>> (A) This is not a run-<lb type="betweenWords"/>on sentence.
>> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
>> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
>> One might read (C) as if the encoder is unsure whether a line break really
>> occurs here. We're using an attribute of one element to describe the
>> character that appears before it.
>> Lou suggested these three type values (see excerpt from his message
>> below), but I think we might need a better value for @type in (C).
>> The Best Practices team is trying to finish revising the prose by the
>> beginning of July so that we have a stable basis for creating ODD specs.
>> So I'd be grateful for a response soon (especially from Lou).
>>> On the question you raise about values for @type, there is already a
>>> (carefully worded) recommendation in the Guidelines: "The type attribute
>>> may be used to characterize the line break in any respect, but its most
>>> common use is to specify that the presence of the line break does not
>>> imply the end of the word in which it is embedded. A value such as
>>> inWord or nobreak is recommended for this purpose, but encoders are free
>>> to choose whichever values are appropriate."
>>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>>> on the topic over on the Epidoc list some time back)
>>> If we are using @type to characterise the linebreak from a tokenization
>>> point of view, there really can be only three possible states: "inWord"
>>> (i.e. the tokeniser needs to combine the string before the linebreak
>>> with the string after it to form a single token) "betweenWords" (i.e.
>>> the string before the linebreak and the string after it are two separate
>>> tokens) and "uncertain" (i.e. it could be either of the other two and
>>> we're unwilling or unable to decide).
>> tei-council mailing list
>> tei-council at lists.village.Virginia.EDU
Dr Gabriel BODARD
(Epigrapher, Digital Classicist, Pirate)
Centre for Computing in the Humanities
King's College London
26-29 Drury Lane
London WC2B 5RL
Email: gabriel.bodard at kcl.ac.uk
Tel: +44 (0)20 7848 1388
Fax: +44 (0)20 7848 2980