[tei-council] how to encode a hyphen at the end of a line, column, or page when you are encoding hyphens
lou.burnard at oucs.ox.ac.uk
Mon Dec 20 12:35:43 EST 2010
Thanks for reminding us about this partially unresolved issue. If I may,
I'll just nit pick on a few bits of your helpful summary, hoping that we
can reach consensus on this issue quite quickly.
On 19/12/10 21:00, Kevin Hawkins wrote:
> word is unclear. I also don't believe any changes have actually been made to
> P5. :(
There were some changes in wording following the last time but one this
was raised (by Gabriel and other Epidockers).
> == (A) encode using attributes on pb, lb, and cb ==
> In Dublin we had settled on not leaving any character to represent the
> hyphen as character data but using type= and rend= to convey this information.
I am not sure that we all agreed with that as a recommendation (i.e.
that hyphens should always be removed). I do think we agreed that if
that encoding policy were adopted, the Guidelines should provide a clear
mechanism for recording where removed hyphens (etc.) had been.
> I find "inWord" and "nobreak" entirely non-intuitive
"inWord" seems fairly obvious to me. More significantly perhaps, it was
the value which the Epidockers agreed on after a fairly heated debate.
Maybe "inToken" or "internal" ?
> I prefer these values for type=:
> * lexicalBoundary
> * noLexicalBoundary
> * uncertainLexicalBoundary
I am not comfortable with "lexical" here, because where I come from
"lexical entries" may include multiple "tokens". If I treat "apple pie"
as a lexical entry, and there happens to be a <lb/> between the "apple"
and the "pie", I don't think I'd mark the <lb/> any differently from any
other. I think we should stick with the idea that line-end hyphenation
(or not) is to do with simple-minded orthographic tokens, not tricky
things like lexical items.
> However, these may not be expressive enough for everything you'd like to
> encode. Paul Schaffner provided the following examples (which I've
> a) street<lb/>walker -- line break between components of a usually
> non-hyphenated compound
Not sure what a "compound" is here. For me, the critical point is
whether elsewhere in this text I find, or expect to find,
"streetwalker" (in which case the <lb/> is "inWord") or "street walker"
(in which case it isn't). And if I don't want to take a stand either
way, then it is "undecided".
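To make that test concrete, here is a rough sketch in Python, offered as an illustration only. The function name classify_break and its arguments are hypothetical, "undecided" is just the word used above rather than an agreed attribute value, and the sketch assumes the rest of the text is available as a plain string with its own line breaks already resolved.

```python
import re


def classify_break(before, after, text):
    """Suggest a type= value for an <lb/> falling between `before` and `after`.

    `text` is assumed to be the remainder of the document as plain text.
    Returns "inWord", None (an ordinary <lb/>), or "undecided".
    """
    joined_form = r"\b" + re.escape(before + after) + r"\b"
    spaced_form = r"\b" + re.escape(before) + r"\s+" + re.escape(after) + r"\b"
    joined = re.search(joined_form, text) is not None
    spaced = re.search(spaced_form, text) is not None
    if joined and not spaced:
        return "inWord"      # e.g. "streetwalker" is attested elsewhere
    if spaced and not joined:
        return None          # e.g. "street walker" is attested: plain <lb/>
    return "undecided"       # both forms, or neither: take no stand
```

An encoder would of course be free to overrule the suggestion; the point is only that the decision rests on orthographic tokens found in the text, not on any notion of "compound".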
> b) bag-<lb/>lady -- line break and hyphen between components of a
> usually hyphenated compound
> c) win-<lb/>some -- line break and hyphen between syllables (or
> morphemes) in a single word
> d) iP-<lb/>hone -- line break and hyphen within a word but misplaced
> according to usual rules of breaking words across lines
> e) gentle-<lb/>man -- line break and hyphen inside of something that
> may or may not be regarded as a compound
> f) abusive-<lb/>tagger -- line break between words; hyphen included for
> unclear reasons
These all seem to evince a desire to do much more about characterising
the text than seems appropriate for <lb/> (which is sort of Martin's
point, I think) -- in my (possibly addled by snow) mind, it's a fairly
simple issue: usually an automatic tokenisation can safely assume that
the presence of an <lb/> should be treated in the same way as the
presence of white space; the "inWord" attribute value just cancels that.
There may also be real white space hanging around on either side of the
<lb/> of course; the tokeniser will then have to decide for itself what
it wants to do about that, but in principle I think
the sequence characters + whitespace + <lb type="inWord"/> + characters
is probably an error (assuming that characters and whitespace are
mutually exclusive for most tokenisation purposes!)
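The tokenisation rule just described could be sketched along these lines in Python. This is a minimal illustration, not a proposal for how any real tokeniser should work: the function names flatten, tokenise and whitespace_before_in_word_breaks are hypothetical, tokens are assumed to be whitespace-delimited runs of characters, and element names are matched by local name so that the TEI namespace does not matter.

```python
import xml.etree.ElementTree as ET


def flatten(el):
    """Return the text of `el`, with each <lb/> read as a space
    unless it carries type="inWord", in which case it adds nothing."""
    parts = [el.text or ""]
    for child in el:
        local = child.tag.rsplit("}", 1)[-1]   # drop any namespace prefix
        if local == "lb":
            if child.get("type") != "inWord":
                parts.append(" ")              # ordinary <lb/> acts like white space
        else:
            parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def tokenise(xml_fragment):
    """Split a well-formed XML fragment into whitespace-delimited tokens."""
    return flatten(ET.fromstring(xml_fragment)).split()


def whitespace_before_in_word_breaks(el):
    """Count <lb type="inWord"/> elements directly preceded by white space,
    i.e. the characters + whitespace + <lb type="inWord"/> sequence.
    Only text immediately before the <lb/> at the same level is inspected."""
    count = 0
    before = el.text or ""
    for child in el:
        local = child.tag.rsplit("}", 1)[-1]
        if local == "lb" and child.get("type") == "inWord":
            if before[-1:].isspace():
                count += 1
        else:
            count += whitespace_before_in_word_breaks(child)
        before = child.tail or ""
    return count
```

For example, tokenise('<p>a street<lb type="inWord"/>walker ate apple<lb/>pie</p>') treats "streetwalker" as one token and "apple" and "pie" as two, while the counting function flags input such as 'street <lb type="inWord"/>walker' for a human to look at.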
> As for values of rend=, we might have:
> * hyphen
> * duplicatedLetter (for cases like Old Irish, Dutch, and German)
I don't know enough about these languages to know whether just saying
"duplicated letter" would be enough. I think during the discussion we
felt that there were tricky cases which might need the full pomp and
majesty of <choice> or some such.
> == (B) Allow certainty, precision, etc. as content of pb, lb, and cb
> (cf. gap and space) ==
This seems to me an orthogonal issue. On the grounds of simplicity, I am
quite reluctant to complexify the use of <lb/> in this way -- a
gazillion stylesheet developers will not thank us for making it a
non-empty element, especially if the only use case is the somewhat
esoteric example cited so far. Uncertainty about the locus of a
milestone can also be done by pointing at it, after all.
O blimey, it's started snowing again. Definitely time for another cup of