[tei-council] how to encode a hyphen at the end of a line, column, or page when you are encoding hyphens

Kevin Hawkins kevin.s.hawkins at ultraslavonic.info
Wed Jan 5 10:17:14 EST 2011

While Christmas intervened in our discussion of hyphens, the vigorous 
discussion of objectType makes me think that all of you workaholics have 
been reading your email all through the holidays.

On 1/5/2011 8:00 AM, Lou Burnard wrote:
> Well, like Sebastian, I don't think I would attribute the lack of
> response on this issue to any lack of understanding on the part of
> Council members! Myself, I am a bit at a loss to understand what it is
> exactly that needs further explanation. There is a note in the element
> description for <lb> which reads
> "The type attribute may be used to characterize the line break in any
> respect, but its most common use is to specify that the presence of the
> line break does not imply the end of the word in which it is embedded. A
> value such as inWord or nobreak is recommended for this purpose, but
> encoders are free to choose whichever values are appropriate. "

It is not clear to me whether "inWord" and "nobreak" are synonyms that 
both "specify that the presence of the line break does not imply the end 
of the word in which it is embedded", or whether only one of these 
values is supposed to mean this while the other is not supposed to. 
Without any context, I don't understand what exactly "inWord" and 
"nobreak" mean.

> There is also an example in 3.10.3
> (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CORS5)
> which reads
> "The type attribute may be used on milestone elements such as lb
> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html>  and pb
> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-pb.html>  to
> categorize them in any way. One particularly useful way is to indicate
> whether or not these milestone tags are word-breaking. By default it is
> reasonable to assume that words are not broken across page or line
> boundaries, and that therefore a sequence such as
> ...sed imp<lb/>erator dixit...
> should be tokenized as four words (sed, imp, erator, and dixit). To make
> explicit that this is not the case, a tagging such as the following is
> recommended:
> ...sed imp<lb type="nobreak"/>erator dixit...
> Where hyphenation appears before a line or page break, the encoder may
> or may not choose to include it, either explicitly using an appropriate
> Unicode character, or descriptively for example by means of the rend
> attribute; see further 3.2 Treatment of Punctuation
> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COPU>. "
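
To make the tokenization described in that passage concrete, here is a 
minimal sketch in Python. It is only an illustration, not anything the 
Guidelines prescribe: the function name tokenize, the wrapper element, 
and the choice to treat only "nobreak" as non-word-breaking are my own 
assumptions (whether "inWord" belongs in that set is exactly what I am 
asking above).

```python
import xml.etree.ElementTree as ET

# Assumption: only the value used in the quoted example is treated as
# "does not break the word". Whether "inWord" belongs here is unclear.
NON_BREAKING_TYPES = {"nobreak"}


def tokenize(xml_fragment):
    """Split a flat fragment containing <lb/> milestones into words.

    Handles only text interleaved with <lb/> elements; the text content
    of any other child element is ignored in this sketch.
    """
    # Wrap the fragment so it parses as a well-formed document.
    root = ET.fromstring(f"<wrapper>{xml_fragment}</wrapper>")
    parts = [root.text or ""]
    for child in root:
        # By default a line break is taken to end the word before it.
        if child.tag == "lb" and child.get("type") not in NON_BREAKING_TYPES:
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).split()


print(tokenize('sed imp<lb/>erator dixit'))
# ['sed', 'imp', 'erator', 'dixit']
print(tokenize('sed imp<lb type="nobreak"/>erator dixit'))
# ['sed', 'imperator', 'dixit']
```

Note that nothing in this sketch depends on whether the source showed a 
hyphen at the line end.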

The following are unclear to me:

a) whether the source document had a hyphen or other indication that 
"imperator" is to be considered a single word. That is, an image of 
the source document would be helpful here.

b) If the encoder chooses to include hyphenation explicitly using an 
appropriate Unicode character, which Unicode character is recommended?

c) If the encoder chooses to include hyphenation descriptively by means of 
the rend= attribute, what sorts of value(s) for rend= are recommended?

> However, it's true that the referenced section on Punctuation doesn't
> seem to mention hyphenation at all, so maybe it would be a good idea to
> add more discussion there.

The introduction to that section mentions three Unicode characters for 
"hyphens" but does not give any guidance on the use of them.
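
For reference, a small Python sketch that prints the code points and 
official Unicode names of some characters commonly treated as hyphens. 
The candidate list is my own assumption for illustration; I am not 
claiming these are the three characters the section mentions, and the 
helper name describe is hypothetical.

```python
import unicodedata

# Assumption: an illustrative set of hyphen-like characters, not
# necessarily the ones named in the Guidelines.
CANDIDATES = ["\u002D", "\u00AD", "\u2010", "\u2011"]


def describe(chars):
    """Return (code point, Unicode name) pairs for the given characters."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


for code_point, name in describe(CANDIDATES):
    print(code_point, name)
# U+002D HYPHEN-MINUS
# U+00AD SOFT HYPHEN
# U+2010 HYPHEN
# U+2011 NON-BREAKING HYPHEN
```

Several of these look identical in most fonts, which is why guidance on 
which one to use at a line end would help.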

> For me the main issue that needs to be clarified is the interaction
> between <lb/> and whitespace with regard to implicit tokenization. The
> excellent TEI-L posting from one L. Burnard
> (http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1003&L=TEI-L&P=R22031)
> which you mention addresses that at length. Subsequent discussion of the
> issue on TEI-L seems to support the proposals therein too. So maybe what
> I should do is rehash that discussion a bit and bung it into 3.2
> somewhere.  I'll try that anyway, and post a draft here for comment.

