[tei-council] how to encode a hyphen at the end of a line, column, or page when you are encoding hyphens

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Sun Dec 19 16:00:56 EST 2010


Martin Mueller's message to TEI-L this week reminds me that, despite all 
our discussion in Dublin and on tei-council between May 21 and June 28, 
I'm not sure we've reached a conclusion on how to encode a hyphen (when 
you are not simply discarding hyphens in your encoding) that occurs at a 
line, column, or page break and whose status with respect to breaking a 
word is unclear.  I also don't believe any changes have actually been 
made to P5.  :(  Let me summarize my understanding of where the discussion 
stands.  (It's quite a long summary, I'm afraid, but the point is to 
keep you from having to spend hours pulling together past emails, as I 
just have!)

There are three suggested directions (below, A, B, and C) for handling 
this situation.

== (A) Encode using attributes on pb, lb, and cb ==

In Dublin we had settled on not leaving any character to represent the 
hyphen as character data but instead using type= and rend= to convey 
this information.

P5 currently says the following in the note for the definition of <lb/> ( 
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html ):

"The type attribute may be used to characterize the line break in any 
respect, but its most common use is to specify that the presence of the 
line break does not imply the end of the word in which it is embedded. A 
value such as 'inWord' or 'nobreak' is recommended for this purpose, but 
encoders are free to choose whichever values are appropriate."

I find "inWord" and "nobreak" entirely non-intuitive, but Lou explained 
these values (and a third he suggested) as follows:

* inWord: the tokeniser needs to combine the string before the linebreak 
with the string after it to form a single token

* betweenWords: the string before the linebreak and the string after it 
are two separate tokens

* wordBreakStatusUnknown: it could be either of the other two and we're 
unwilling or unable to decide
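
To make Lou's three definitions concrete, here is a minimal Python sketch 
of a tokeniser that honours them.  It is only an illustration under 
stated assumptions: the function name tokenise, the "|" marker for an 
undecided break, and the whitespace handling are my own choices, and the 
fragment is parsed without the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

UNKNOWN_MARK = "|"  # arbitrary marker for an undecided break (an assumption)


def tokenise(fragment):
    """Split one XML fragment into tokens, consulting type= on each <lb/>."""
    root = ET.fromstring(fragment)
    text = root.text or ""
    for child in root:
        tail = child.tail or ""
        if child.tag != "lb":
            # Not a line break: keep the element's flattened text as-is.
            text += "".join(child.itertext()) + tail
            continue
        kind = child.get("type")
        if kind == "inWord":
            # One token: glue the strings on either side of the break.
            text = text.rstrip() + tail.lstrip()
        elif kind == "wordBreakStatusUnknown":
            # Undecided: glue, but leave a marker for a later decision.
            text = text.rstrip() + UNKNOWN_MARK + tail.lstrip()
        else:
            # betweenWords (or no type= at all): two separate tokens.
            text = text + " " + tail
    return text.split()


sample = ('<p>a win<lb type="inWord"/>some bag '
          '<lb type="betweenWords"/>lady and '
          'gentle<lb type="wordBreakStatusUnknown"/>man</p>')
# tokenise(sample)
# -> ['a', 'winsome', 'bag', 'lady', 'and', 'gentle|man']
```

Treating a missing type= as a word boundary is just the default chosen 
for this sketch; nothing in P5 mandates it.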

I prefer these values for type=:

* lexicalBoundary
* noLexicalBoundary
* uncertainLexicalBoundary

However, these may not be expressive enough for everything you'd like to 
encode.  Paul Schaffner provided the following examples (which I've 
annotated):

a) street<lb/>walker  -- line break between components of a usually 
non-hyphenated compound

b) bag-<lb/>lady -- line break and hyphen between components of a 
usually hyphenated compound

c) win-<lb/>some -- line break and hyphen between syllables (or 
morphemes) in a single word

d) iP-<lb/>hone -- line break and hyphen within a word but misplaced 
according to usual rules of breaking words across lines

e) gentle-<lb/>man -- line break and hyphen inside of something that 
may or may not be regarded as a compound

f) abusive-<lb/>tagger -- line break between words; hyphen included for 
unclear reasons

As for values of rend=, we might have:

* hyphen
* duplicatedLetter (for cases like Old Irish, Dutch, and German)

or any other appropriate description of how the break is rendered. 
Whatever value you give for rend=, you would not leave the hyphen, 
duplicated character, etc. in the character data of the XML document.
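
Since (A) moves the hyphen out of the character data, a processor that 
wants the lines as printed has to put it back.  The following Python 
sketch shows one way that might work.  The function name source_lines is 
hypothetical, the fragment is parsed without the TEI namespace, and only 
rend="hyphen" is handled, because how a value like duplicatedLetter 
should be resolved has not been specified.

```python
import xml.etree.ElementTree as ET


def source_lines(fragment):
    """Rebuild the printed lines, restoring a hyphen where rend= records one."""
    root = ET.fromstring(fragment)
    lines = [root.text or ""]
    for child in root:
        tail = child.tail or ""
        if child.tag == "lb":
            if child.get("rend") == "hyphen":
                # The hyphen lives in the attribute, not in the text.
                lines[-1] += "-"
            # Every <lb/> starts a new printed line.
            lines.append(tail)
        else:
            # Any other element: keep its flattened text on the current line.
            lines[-1] += "".join(child.itertext()) + tail
    return [line.strip() for line in lines]


sample = ('<p>a win<lb type="noLexicalBoundary" rend="hyphen"/>some smile, '
          'a street<lb type="noLexicalBoundary"/>walker</p>')
# source_lines(sample)
# -> ['a win-', 'some smile, a street', 'walker']
```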

== (B) Allow certainty, precision, etc. as content of pb, lb, and cb 
(cf. gap and space) ==

Gabby suggested that we do to <lb/>, <cb/>, and <pb/> what has already 
been done to <gap> and <space>: allow <certainty>, <precision>, etc. as 
content.  That way if you are *unsure whether a break actually occurs*, 
you could have something like:

<lb>
   <certainty locus="break" degree="0.5"/>
</lb>

leaving the following way to express that we're *uncertain of the type 
of hyphen*:

<p>Some people say TEI is a mark-<lb type="uncertain"/>up language.</p>

Elena supported Gabby's change to content models since it would also 
work to handle missing, corrected, and incorrect page or line numbers, 
but Lou, Martin Holmes, and Dot said to use <fw> for representing 
numbering -- especially errors in the numbering -- as it appears in the 
source.

I'm having trouble imagining ever being uncertain whether a line, 
column, or page break actually *occurs* in a source, so it seems that 
the only reason you would ever want to use <certainty> etc. within 
<lb/>, <pb/>, and <cb/> is with locus=.  Is this troubling to anyone 
besides me?  Otherwise, I'm okay with this solution.

== (C) Use <w>, <phr>, and other phrase-level elements to encode the 
context of a page break, line break, or column break ==

Martin Holmes said:

"I still think that the linebreak tag is the wrong place to supply 
information about whatever-it-is-that-is-being-broken (word, phrase or 
whatever) and whatever-it-is-that-is-signalling-the-break (hyphen or 
whatever). The linebreak tag says there is a linebreak in the text. The 
context, and the glyph that precedes the linebreak, are not attributes 
of the linebreak.

I think it would be better to encourage the use of <w>, <phr> and other 
phrase-level tags to mark the context of the linebreak. Even if such 
tags are not being used for any other purpose in a text -- or perhaps 
_especially_ if they aren't -- they could be used for exactly this 
purpose, and it's easy for a processor to detect when a 
linebreak-signalling glyph or a linebreak tag occur within such contexts 
and process accordingly."

Martin Holmes said he'd encode Paul Schaffner's examples (again, with my 
annotations) like this:

a) <phr>street<lb/>walker</phr>  -- line break between components of a 
usually non-hyphenated compound

b) <phr type="hyphenated">bag-<lb/>lady</phr> -- line break and hyphen 
between components of a usually hyphenated compound

c) <w>win-<lb/>some</w> -- line break and hyphen between syllables (or 
morphemes) in a single word

d) <w>iP-<lb/>hone</w> -- line break and hyphen within a word but 
misplaced according to usual rules of breaking words across lines

e) <choice><orig>gentle-man</orig><corr>gentleman</corr></choice> -- 
line break and hyphen inside of something that may or may not be 
regarded as a compound

f) <w>abusive</w>-<lb/><w>tagger</w> -- line break between words; hyphen 
included for unclear reasons

Perhaps he also meant to include type="hyphenated" on <w> in (c), <w> in 
(d), and <orig> in (e).
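
Martin's claim that a processor can easily detect a break inside such a 
context can be tried out with a short Python sketch.  The rules below are 
my reading of his examples, not anything he specified: inside <w> a 
hyphen before <lb/> is dropped, inside <phr type="hyphenated"> it is 
kept, and inside a plain <phr> the parts are simply joined.  The function 
names are hypothetical and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET


def join_across_break(elem):
    """Return the unbroken form of a <w> or <phr> that contains an <lb/>."""
    keep_hyphen = elem.get("type") == "hyphenated"
    text = elem.text or ""
    for child in elem:
        tail = child.tail or ""
        if child.tag == "lb":
            text = text.rstrip()
            if text.endswith("-") and not keep_hyphen:
                # Hyphen only signals the break: drop it.
                text = text[:-1]
            text += tail.lstrip()
        else:
            text += "".join(child.itertext()) + tail
    return text


def broken_units(fragment):
    """List the joined forms of every <w> or <phr> directly holding an <lb/>."""
    root = ET.fromstring(fragment)
    return [join_across_break(e) for e in root.iter()
            if e.tag in ("w", "phr") and e.find("lb") is not None]


sample = ('<p><phr>street<lb/>walker</phr> '
          '<phr type="hyphenated">bag-<lb/>lady</phr> '
          '<w>win-<lb/>some</w> '
          '<w>abusive</w>-<lb/><w>tagger</w></p>')
# broken_units(sample)
# -> ['streetwalker', 'bag-lady', 'winsome']
```

Example (f) yields nothing here because its <lb/> sits between two <w> 
elements rather than inside one, which is exactly the distinction the 
wrapper elements are meant to carry.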

We need to keep in mind that if we recommend encoding use of hyphens on 
<w>, <phr>, or other phrase-level elements without specifying which 
one(s), we will reduce the ability to interchange TEI documents.  Martin 
Mueller would surely not be happy with us for that.

***

I support (A) or (B) but with a new section of the Guidelines explaining 
how to do this (replacing the brief information currently given at 
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html ).

Let me also reiterate my request to Lou (and the rest of you) to edit 
the minutes from Dublin at

http://wiki.tei-c.org/index.php/Draft_minutes_of_2010-04_Council_meeting#hyphenation_.28and_orthographical_changes_at_line_breaks.29

to set the record straight on what we actually discussed.

--Kevin

