[tei-council] how to encode a hyphen at the end of a line, column, or page when you are encoding hyphens
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Sun Dec 19 16:00:56 EST 2010
Martin Mueller's message to TEI-L this week reminds me that, despite all
our discussion in Dublin and on tei-council between May 21 and June 28,
I'm not sure we've reached a conclusion on how to encode a hyphen (when
you are not simply discarding hyphens in your encoding) that occurs at a
line, column, or page break and whose status with respect to breaking a
word is unclear. I also don't believe any changes have actually been made to
P5. :( Let me summarize my understanding of where the discussion
stands. (It's quite a long summary, I'm afraid, but the point is to
keep you from having to spend hours pulling together past emails, as I
just have!)
There are three suggested directions (below, A, B, and C) for handling
this situation.
== (A) encode using attributes on pb, lb, and cb ==
In Dublin we had settled on not leaving any character to represent the
hyphen as character data but instead using type= and rend= to convey this
information.
P5 currently says the following in the note for the definition of <lb/> (
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html ):
"The type attribute may be used to characterize the line break in any
respect, but its most common use is to specify that the presence of the
line break does not imply the end of the word in which it is embedded. A
value such as 'inWord' or 'nobreak' is recommended for this purpose, but
encoders are free to choose whichever values are appropriate."
I find "inWord" and "nobreak" entirely non-intuitive, but Lou explained
these values (and a third he suggested) as follows:
* inWord: the tokeniser needs to combine the string before the linebreak
with the string after it to form a single token
* betweenWords: the string before the linebreak and the string after it
are two separate tokens
* wordBreakStatusUnknown: it could be either of the other two and we're
unwilling or unable to decide
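
To make Lou's three values concrete, here is a minimal Python sketch of
how a tokeniser might act on them. Everything beyond the three value
names is an assumption made for illustration: the function name
tokenize, the JOIN table, the marker character used for the unknown
case, treating a missing or unrecognised type= as a word boundary, and
the absence of whitespace around <lb/>.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from type= values to what the tokeniser inserts
# at the break. The broken bar for the unknown case is an arbitrary choice.
JOIN = {
    "inWord": "",                        # combine both strings into one token
    "betweenWords": " ",                 # keep two separate tokens
    "wordBreakStatusUnknown": "\u00a6",  # leave the decision to a later step
}

def tokenize(xml_fragment):
    """Split one element's text into tokens, honouring type= on <lb/>."""
    root = ET.fromstring(xml_fragment)
    text = root.text or ""
    for child in root:
        if child.tag == "lb":
            # Assumption: no type=, or an unrecognised value, is a word boundary.
            text += JOIN.get(child.get("type"), " ")
        else:
            text += "".join(child.itertext())
        text += child.tail or ""
    return text.split()
```

With this sketch, street<lb type="inWord"/>walker comes out as the
single token "streetwalker", while the unknown case keeps both halves
in one token joined by the marker so that a later pass can decide.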
I prefer these values for type=:
* lexicalBoundary
* noLexicalBoundary
* uncertainLexicalBoundary
However, these may not be expressive enough for everything you'd like to
encode. Paul Schaffner provided the following examples (which I've
annotated):
a) street<lb/>walker -- line break between components of a usually
non-hyphenated compound
b) bag-<lb/>lady -- line break and hyphen between components of a
usually hyphenated compound
c) win-<lb/>some -- line break and hyphen between syllables (or
morphemes) in a single word
d) iP-<lb/>hone -- line break and hyphen within a word but misplaced
according to usual rules of breaking words across lines
e) gentle-<lb/>man -- line break and hyphen inside something that
may or may not be regarded as a compound
f) abusive-<lb/>tagger -- line break between words; hyphen included for
unclear reasons
As for values of rend=, we might have:
* hyphen
* duplicatedLetter (for cases like Old Irish, Dutch, and German)
or any other appropriate description of how the break is rendered.
Whatever value you give for rend=, you would not leave the hyphen,
duplicated character, etc. in the character data of the XML document.
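
Since the hyphen is no longer in the character data under (A), a
processor that wants a diplomatic view has to put it back from rend=.
The following Python sketch shows one way that could work. The function
name diplomatic_lines is hypothetical, only rend="hyphen" is handled,
and the sample sentence with its attribute values is an assumption
combining the values proposed above.

```python
import xml.etree.ElementTree as ET

def diplomatic_lines(xml_fragment):
    """Rebuild the physical lines of the source from <lb rend="..."/>."""
    root = ET.fromstring(xml_fragment)
    text = root.text or ""
    for child in root:
        if child.tag == "lb":
            # Only rend="hyphen" is restored here; any other value
            # (e.g. duplicatedLetter) simply ends the line in this sketch.
            if child.get("rend") == "hyphen":
                text += "-"
            text += "\n"
        else:
            text += "".join(child.itertext())
        text += child.tail or ""
    return text.split("\n")

sample = ('<p>Some people say TEI is a mark'
          '<lb type="noLexicalBoundary" rend="hyphen"/>up language.</p>')
lines = diplomatic_lines(sample)
```

Here lines holds the two physical lines of the source, the first ending
in the restored hyphen.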
== (B) Allow certainty, precision, etc. as content of pb, lb, and cb
(cf. gap and space) ==
Gabby suggested that we do to <lb/>, <cb/>, and <pb/> what has already
been done to <gap> and <space>: allow <certainty>, <precision>, etc. as
content. That way if you are *unsure whether a break actually occurs*,
you could have something like:
<lb>
<certainty locus="break" degree="0.5"/>
</lb>
leaving the following way to express that we're *uncertain of the type
of hyphen*:
<p>Some people say TEI is a mark-<lb type="uncertain"/>up language.</p>
Elena supported Gabby's change to content models since it would also
work to handle missing, corrected, and incorrect page or line numbers,
but Lou, Martin Holmes, and Dot said to use <fw> for representing
numbering -- especially errors in the numbering -- as it appears in the
source.
I'm having trouble imagining ever being uncertain whether a line,
column, or page break actually *occurs* in a source, so it seems that
the only reason you would ever want to use <certainty> etc. within
<lb/>, <pb/>, and <cb/> is with locus=. Is this troubling to anyone
else besides me? Otherwise, I'm okay with this solution.
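
For what it's worth, the two kinds of uncertainty in (B) are easy to
tell apart in processing. A minimal Python sketch, assuming the content
model change were made and that locus="break" and type="uncertain" are
used exactly as in the examples above (the function name
break_certainty is hypothetical):

```python
import xml.etree.ElementTree as ET

def break_certainty(xml_fragment):
    """List (type, degree) for every <lb> in the fragment.

    degree is the value of a child <certainty locus="break">, or None
    when the <lb> has no such child or the child has no degree=.
    """
    root = ET.fromstring(xml_fragment)
    results = []
    for lb in root.iter("lb"):
        cert = lb.find("certainty[@locus='break']")
        degree = None
        if cert is not None and cert.get("degree") is not None:
            degree = float(cert.get("degree"))
        results.append((lb.get("type"), degree))
    return results

sample = ('<p>mark-<lb><certainty locus="break" degree="0.5"/></lb>up and '
          'mark-<lb type="uncertain"/>up</p>')
report = break_certainty(sample)
```

The first tuple in report describes a break whose existence is in
doubt; the second describes a break whose hyphen type is in doubt.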
== (C) Use <w>, <phr>, and other phrase-level elements to encode the
context of a page break, line break, or column break ==
Martin Holmes said:
"I still think that the linebreak tag is the wrong place to supply
information about whatever-it-is-that-is-being-broken (word, phrase or
whatever) and whatever-it-is-that-is-signalling-the-break (hyphen or
whatever). The linebreak tag says there is a linebreak in the text. The
context, and the glyph that precedes the linebreak, are not attributes
of the linebreak.
I think it would be better to encourage the use of <w>, <phr> and other
phrase-level tags to mark the context of the linebreak. Even if such
tags are not being used for any other purpose in a text -- or perhaps
_especially_ if they aren't -- they could be used for exactly this
purpose, and it's easy for a processor to detect when a
linebreak-signalling glyph or a linebreak tag occur within such contexts
and process accordingly."
Martin Holmes said he'd encode Paul Schaffner's examples (again, with my
annotations) like this:
a) <phr>street<lb/>walker</phr> -- line break between components of a
usually non-hyphenated compound
b) <phr type="hyphenated">bag-<lb/>lady</phr> -- line break and hyphen
between components of a usually hyphenated compound
c) <w>win-<lb/>some</w> -- line break and hyphen between syllables (or
morphemes) in a single word
d) <w>iP-<lb/>hone</w> -- line break and hyphen within a word but
misplaced according to usual rules of breaking words across lines
e) <choice><orig>gentle-man</orig><corr>gentleman</corr></choice> --
line break and hyphen inside something that may or may not be
regarded as a compound
f) <w>abusive</w>-<lb/><w>tagger</w> -- line break between words; hyphen
included for unclear reasons
Perhaps he also meant to include type="hyphenated" on <w> in (c), <w> in
(d), and <orig> in (e).
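
Martin's claim that a processor can easily detect a break inside such a
context can be sketched in Python as below. It follows his examples
(a) through (d) as written, and assumes that type="hyphenated" means
the hyphen belongs to the word itself and should be kept. That reading,
and the function name reading_form, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def reading_form(element):
    """Return the text of a <w> or <phr> with internal line breaks resolved."""
    text = element.text or ""
    for child in element:
        if child.tag == "lb":
            # A hyphen directly before <lb/> is treated as a break signal
            # and dropped, unless the wrapper says the form is hyphenated.
            if text.endswith("-") and element.get("type") != "hyphenated":
                text = text[:-1]
        else:
            text += "".join(child.itertext())
        text += child.tail or ""
    return text

examples = [
    '<phr>street<lb/>walker</phr>',
    '<phr type="hyphenated">bag-<lb/>lady</phr>',
    '<w>win-<lb/>some</w>',
    '<w>iP-<lb/>hone</w>',
]
forms = [reading_form(ET.fromstring(x)) for x in examples]
```

Note that if type="hyphenated" were also added to <w> in (c) and (d),
this particular rule would keep those hyphens too, so the meaning of
that value would need to be pinned down first.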
We need to keep in mind that if we recommend encoding use of hyphens on
<w>, <phr>, or other phrase-level elements without specifying which
one(s), we will reduce the ability to interchange TEI documents. Martin
Mueller would surely not be happy with us for that.
***
I support (A) or (B) but with a new section of the Guidelines explaining
how to do this (replacing the brief information currently given at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html ).
Let me also reiterate my request to Lou (and the rest of you) to edit
the minutes from Dublin at
http://wiki.tei-c.org/index.php/Draft_minutes_of_2010-04_Council_meeting#hyphenation_.28and_orthographical_changes_at_line_breaks.29
to set the record straight on what we actually discussed.
--Kevin