[tei-council] soft hyphens (again)

Gabriel Bodard gabriel.bodard at kcl.ac.uk
Wed Jun 16 06:07:35 EDT 2010


While (2) is attractive in terms of explicitness and elegance, I am 
tempted to vote for (3) on the grounds that if you're really uncertain 
about the status of a line-break there are other ways to express this 
(<certainty> element inside <lb> anyone?--to resurrect a TEI-L query 
that seems to have been met with deafening indifference...)

Has anyone ever had any use-case for characterizing linebreaks (and cb, 
pb, etc.) other than by whether they break words or not?
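
To make the three tokenization states concrete (inWord, betweenWords and 
uncertain, as in Kevin's examples (A) to (C) quoted below), here is a 
minimal Python sketch of how a tokeniser might consume them. The function 
name tokenize, the <ab> wrapper, treating a missing @type as betweenWords, 
and dropping the trailing hyphen when joining an inWord break are all my 
own assumptions for illustration, not anything the Guidelines specify.

```python
import xml.etree.ElementTree as ET


def tokenize(fragment):
    """Split a TEI-like text fragment into tokens, honouring <lb type="..."/>.

    Returns (tokens, uncertain), where uncertain lists the index of each
    token that directly follows an <lb type="uncertain"/>.
    """
    # Wrap the fragment so that it parses as a single element (assumption).
    root = ET.fromstring("<ab>" + fragment + "</ab>")
    tokens = (root.text or "").split()
    uncertain = []
    for child in root:
        # Only <lb/> is interpreted; the content of any other element is
        # skipped in this sketch, though the text following it is kept.
        words = (child.tail or "").split()
        if child.tag == "lb" and tokens and words:
            kind = child.get("type", "betweenWords")  # default is an assumption
            if kind == "inWord":
                # Combine the string before the break with the string after
                # it, dropping the line-end hyphen (assumption).
                tokens[-1] = tokens[-1].rstrip("-") + words.pop(0)
            elif kind == "uncertain":
                # Keep the tokens separate but flag the boundary for review.
                uncertain.append(len(tokens))
        tokens.extend(words)
    return tokens, uncertain


print(tokenize('UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.'))
print(tokenize('Some people say TEI is a mark-<lb type="uncertain"/>up language.'))
```

Note that the sketch only ever asks one question of each <lb>: does it 
break a word, not break a word, or do we not know? That is all the 
tokeniser needs, whichever attribute name or value ends up carrying it.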

G

On 15/06/2010 17:06, Lou wrote:
> Yes, I agree that @type=uncertain isn't much use if what we really mean
> is @type="I'm_uncertain_whether_this_is_a_word_break_or_not".
>
> The problem is that we've appropriated the very general typology which
> @type provides for a very specific function. If we wanted to
> characterise linebreaks in some *other* respect than whether or not they
> coincided with word boundaries, we'd need to add other values
> ("authorial" and "scribal" suggest themselves as possible candidates)
> and then "uncertain" becomes *really* uncertain... and in any case we
> couldn't support multiple values  -- it might be both "inWord" and
> "authorial" (or whatever). Yuk. It's yet another example of why
> @type shouldn't be global and why its values should be mutually exclusive.
>
> I can only think of two possible solutions to this. No, make it three.
>
> 1. come up with a better word than "uncertain" for the third case (wbsu
> or wordBreakStatusUnknown?)
>
> 2. use a different attribute @wordBreaking = "true|false|unknown"
>
> 3. redefine the semantics of @type="wordBreaking" to mean just "this is
> probably a word breaker but possibly not"
>
>
> Solutions 1 and 3 have the advantage of leaving things more or less
> unchanged for the only people I am aware of who actually care about this
> problem i.e. epigraphers. Solution 2 has the advantage of being more
> explicit and elegant, but I wouldn't want it to replace the status quo
> for obvious reasons of backwards compatibility.
>
> Votes?
>
>
>
>
>
> Kevin Hawkins wrote:
>> I'm resending this message with fewer typos and more context.
>>
>> ***
>>
>> I brought the folks revising the *Best Practices for TEI in Libraries*
>> up to speed on our hyphenation discussion.  Perry Willett raised a good
>> point: if we have encoding like:
>>
>> (A) This is not a run-<lb type="betweenWords"/>on sentence.
>>
>> (B) UTF-8 is a char-<lb type="inWord"/>acter encoding for Unicode.
>>
>> (C) Some people say TEI is a mark-<lb type="uncertain"/>up language.
>>
>> One might read (C) as if the encoder is unsure whether a line break
>> really occurs here.  We're using an attribute of one element to describe
>> the character that appears before it.
>>
>> Lou suggested these three type values (see excerpt from his message
>> below), but I think we might need a better value for @type in (C).
>> Suggestions?
>>
>> The Best Practices team is trying to finish revising the prose by the
>> beginning of July so that we have a stable basis for creating ODD specs.
>> So I'd be grateful for a response soon (especially from Lou).
>>
>> Kevin
>>
>>> On the question you raise about values for @type, there is already a
>>> (carefully worded) recommendation in the Guidelines: "The type attribute
>>> may be used to characterize the line break in any respect, but its most
>>> common use is to specify that the presence of the line break does not
>>> imply the end of the word in which it is embedded. A value such as
>>> inWord or nobreak is recommended for this purpose, but encoders are free
>>> to choose whichever values are appropriate. "
>>>
>>> (I say "carefully worded" because quite a lot of virtual ink was spilled
>>> on the topic over on the Epidoc list some time back)
>>>
>>> If we are using @type to characterise the linebreak from a tokenization
>>> point of view, there really can be only three possible states: "inWord"
>>> (i.e. the tokeniser needs to combine the string before the linebreak
>>> with the string after it to form a single token) "betweenWords" (i.e.
>>> the string before the linebreak and the string after it are two separate
>>> tokens) and "uncertain" (i.e. it could be either of the other two and
>>> we're unwilling or unable to decide).
>> _______________________________________________
>> tei-council mailing list
>> tei-council at lists.village.Virginia.EDU
>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council

-- 
Dr Gabriel BODARD
(Epigrapher, Digital Classicist, Pirate)

Centre for Computing in the Humanities
King's College London
26-29 Drury Lane
London WC2B 5RL
Email: gabriel.bodard at kcl.ac.uk
Tel: +44 (0)20 7848 1388
Fax: +44 (0)20 7848 2980

http://www.digitalclassicist.org/
http://www.currentepigraphy.org/

