[tei-council] how to encode a hyphen at the end of a line, column, or page when you are encoding hyphens

Lou Burnard lou.burnard at oucs.ox.ac.uk
Mon Dec 20 12:35:43 EST 2010

Thanks for reminding us about this partially unresolved issue. If I may,
I'll just nitpick a few bits of your helpful summary, hoping that we
can reach consensus on this issue quite quickly.

On 19/12/10 21:00, Kevin Hawkins wrote:

> word is unclear.  I also don't believe any changes have actually been made to
> P5.  :(

There were some changes in wording following the last time but one this
was raised (by Gabriel and other Epidockers).

> == (A) encode using attributes on pb, lb, and cb  ==
> In Dublin we had settled on not leaving any character to represent the
> hyphen as character data but using type= and rend= to convey this information.

I am not sure that we all agreed with that as a recommendation (i.e. 
that hyphens should always be removed).  I do think we agreed that if
that encoding policy were adopted, the Guidelines should provide a clear 
mechanism for recording where removed hyphens (etc.) had been.

> I find "inWord" and "nobreak" entirely non-intuitive

"inWord" seems fairly obvious to me. More significantly perhaps, it was 
the value which the Epidockers agreed on after a fairly heated debate.

Maybe "inToken" or "internal" ?

> I prefer these values for type=:
> * lexicalBoundary
> * noLexicalBoundary
> * uncertainLexicalBoundary

I am not comfortable with "lexical" here, because where I come from 
"lexical entries" may include multiple "tokens". If I treat "apple pie" 
as a lexical entry, and there happens to be a <lb/> between the "apple" 
and the "pie" I don't think I'd mark the <lb/> any differently from any
other. I think we should stick with the idea that line-end hyphenation
(or not) is to do with simple-minded orthographic tokens, not tricky
things like lexical items.

> However, these may not be expressive enough for everything you'd like to
> encode.  Paul Schaffner provided the following examples (which I've
> annotated):
> a) street<lb/>walker  -- line break between components of a usually
> non-hyphenated compound

Not sure what a "compound" is here. For me, the critical point is 
whether elsewhere in this text I find, or expect to find, 
"streetwalker" (in which case the <lb/> is "inWord") or "street walker" 
(in which case it isn't). And if I don't want to take a stand either 
way, then it is "undecided".
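To make that test concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything in the Guidelines: the function name classify_break is hypothetical, the text is assumed to be already split into orthographic tokens, and the sketch only covers what is actually found in the text, not what an encoder might "expect to find". It returns "inWord" when only the joined form occurs elsewhere, None (no special marking) when only the two-token form occurs, and "undecided" otherwise.

```python
def classify_break(left, right, tokens):
    """Suggest how to mark an <lb/> that falls between `left` and `right`.

    Hypothetical helper: `tokens` is a list of orthographic tokens taken
    from the rest of the same text. Comparison is case-insensitive.
    """
    lowered = [t.lower() for t in tokens]
    left, right = left.lower(), right.lower()

    # Does the joined form ("streetwalker") occur as a single token?
    has_joined = (left + right) in lowered

    # Do the two parts occur as adjacent tokens ("street walker")?
    has_split = any(a == left and b == right
                    for a, b in zip(lowered, lowered[1:]))

    if has_joined and not has_split:
        return "inWord"      # the break falls inside a token
    if has_split and not has_joined:
        return None          # an ordinary <lb/>, no special marking
    return "undecided"       # no evidence, or conflicting evidence
```

When both spellings turn up, or neither does, the sketch declines to take a stand, which is all that "undecided" is meant to say.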

> b) bag-<lb/>lady -- line break and hyphen between components of a
> usually hyphenated compound
> c) win-<lb/>some -- line break and hyphen between syllables (or
> morphemes) in a single word
> d) iP-<lb/>hone -- line break and hyphen within a word but misplaced
> according to usual rules of breaking words across lines
> e) gentle-<lb/>man -- line break and hyphen inside of something that
> may or may not be regarded as a compound
> f) abusive-<lb/>tagger -- line break between words; hyphen included for
> unclear reasons

These all seem to evince a desire to do much more about characterising 
the text than seems appropriate for <lb/> (which is sort of Martin's 
point, I think) -- in my (possibly addled by snow) mind, it's a fairly 
simple issue: usually an automatic tokenisation can safely assume that 
the presence of an <lb/> should be treated in the same way as the 
presence of white space; the "inWord" attribute value just cancels that.

There may also be real white space hanging around on either side of the
<lb/> of course; the tokeniser will then have to decide for itself what
it wants to do about that, but in principle I think the sequence
characters + whitespace + <lb type="inWord"/> + characters is probably
an error (assuming that characters and whitespace are mutually exclusive
for most tokenisation purposes!)
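The rule is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the input is a plain string in which the only markup is empty <lb/> elements, the attributes are picked out with regular expressions rather than a real XML parser, and the names tokenise and has_suspect_break are hypothetical. Any hyphen left as character data is passed through untouched.

```python
import re

# Assumption: <lb/> is always written as an empty element in the string.
LB = re.compile(r'<lb\b([^>]*?)/>')
TYPE = re.compile(r'\btype="([^"]*)"')

# White space immediately before an <lb/> whose type is "inWord".
SUSPECT = re.compile(r'\s<lb\b[^>]*?\btype="inWord"[^>]*?/>')


def tokenise(text):
    """Split on white space, treating <lb/> as white space
    unless it carries type="inWord"."""
    def replace(match):
        found = TYPE.search(match.group(1))
        in_word = found is not None and found.group(1) == "inWord"
        return "" if in_word else " "
    return LB.sub(replace, text).split()


def has_suspect_break(text):
    """True if white space directly precedes an inWord line break,
    the sequence described above as probably an error."""
    return SUSPECT.search(text) is not None
```

For example, tokenise('street<lb/>walker') gives two tokens, while tokenise('street<lb type="inWord"/>walker') gives one. What to do with a break flagged by has_suspect_break is left to the tokeniser, as above.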

> As for values of rend=, we might have:
> * hyphen
> * duplicatedLetter (for cases like Old Irish, Dutch, and German)

I don't know enough about these languages to know whether just saying 
"duplicated letter" would be enough. I think during the discussion we 
felt that there were tricky cases which might need the full pomp and 
majesty of <choice> or some such.

> == (B) Allow certainty, precision, etc. as content of pb, lb, and cb
> (cf. gap and space) ==

This seems to me an orthogonal issue. On the grounds of simplicity, I am 
quite reluctant to complexify the use of <lb/> in this way -- a 
gazillion stylesheet developers will not thank us for making it a 
non-empty element, especially if the only use case is the somewhat 
esoteric example cited so far. Uncertainty about the locus of a 
milestone can also be done by pointing at it, after all.

O blimey, it's started snowing again. Definitely time for another cup of 

