[tei-council] Content for <pb/> etc. [was: soft hyphens (again)]

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Sun Jun 27 11:14:40 EDT 2010


Our discussion has wandered quite a bit, so let me try to summarize the 
ideas so far ...

I want to know how to encode a hyphen that occurs at a line, column, or
page break when it is unclear whether the hyphen breaks a word.  We had
settled on doing this with <lb type="?"/>, <cb type="?"/>, and
<pb type="?"/> but needed a value that doesn't imply that we are
uncertain whether a line break, column break, or page break occurs here.

Martin suggested we invent an element to surround the hyphen and put the 
@type on this element rather than on the following <lb/>, <cb/>, or 
<pb/>.  Lou noted that this would lead to two different ways of encoding
a hyphen, and that the point of continuation might be marked by a
different character or by no character at all.
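
To make Martin's idea concrete, here is a rough sketch.  The element
name <hyph> is only a placeholder made up for illustration (no name was
actually proposed), and the @type value is likewise just an example:

Some people say TEI is a mark<hyph type="uncertain">-</hyph><lb/>up language.

Under this sketch the uncertainty sits on the hyphen itself, while the
<lb/> stays a plain, unqualified line break.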

Lou suggested three options (which I've clarified):

1. come up with a better word than "uncertain" for the type attribute, 
such as "wbsu" or "wordBreakStatusUnknown"

2. create a new attribute @wordBreaking, whose value could be 
true|false|unknown

3. redefine the semantics of @type="inWord" to mean just "this is 
probably a word breaker but possibly between words"

I supported (1) for P5 but suggested a fourth option:

4. redefine the semantics of @type='betweenWords' to mean just "this is 
probably between words but possibly a word breaker"
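
As a rough illustration only (not agreed markup), the four options might
look like this for a line break.  The value "wordBreakStatusUnknown" in
(1) is just one of the candidate names, and the @wordBreaking attribute
in (2) does not exist in P5; it is shown here as proposed:

<!-- (1) new @type value meaning "word-break status unknown" -->
mark-<lb type="wordBreakStatusUnknown"/>up

<!-- (2) new attribute, proposed values true|false|unknown -->
mark-<lb wordBreaking="unknown"/>up

<!-- (3) inWord redefined: probably breaks a word, possibly not -->
mark-<lb type="inWord"/>up

<!-- (4) betweenWords redefined: probably between words, possibly not -->
mark-<lb type="betweenWords"/>up

The same pattern would apply to <cb/> and <pb/>.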

Gabby suggested that we do to <lb/>, <cb/>, and <pb/> what has already 
been done to <gap> and <space>: allow <certainty>, <precision>, etc. as 
content.  That way, if you are *unsure whether a break actually occurs*,
you could have something like:

<lb>
   <certainty locus="break" degree="0.5"/>
</lb>

leaving the following way to express that we're *uncertain of the type 
of hyphen*:

Some people say TEI is a mark-<lb type="uncertain"/>up language.

Elena supported Gabby's change to content models since it would also 
work to handle missing, corrected, and incorrect page or line numbers, 
but Lou, Martin, and Dot said to use <fw> for representing numbering as 
it appears in the source.
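
A minimal sketch of the <fw> approach, assuming a page that ought to be
numbered 12 but is misprinted as 21 in the source (both numbers are
invented for illustration), and assuming @n on <pb/> is used for the
editor's corrected number:

<pb n="12"/>
<fw type="pageNum">21</fw>

Here <fw> records the number as printed, so the content model of <pb/>
does not need to change to capture the discrepancy.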

***

I still support (1).  Gabby asked whether anyone would characterize 
breaks by anything other than whether they break words, and I too am 
having trouble imagining being uncertain whether a line, column, or page 
break actually *occurs* in a source.  Therefore, I don't see a pressing
need to expand the content model of <lb/>, <cb/>, and <pb/>.

