[tei-council] word-dividing

Lou Burnard lou.burnard at oucs.ox.ac.uk
Thu Jul 2 07:34:44 EDT 2009


With a few exceptions, the examples throughout the Guidelines are chosen 
as examples of textual phenomena, not of ways those textual phenomena 
have been treated in particular encoding projects. Since @type on <lb> 
or <pb> was only recently introduced, it's unsurprising that there aren't 
many existing TEI precedents to follow and, while I yield to no-one 
in my admiration for the EpiDoc project, I would resist pressure to make 
it (or any other project) the sole driver for decisions about what goes 
in the Guidelines.

The particular phenomenon we're dealing with here can be, and has been, 
dealt with in several different ways: there are also thousands of texts 
in which the encoder has chosen to treat it in a completely different 
way! In the Bibliothèques Virtuelles Humanistes, for example, they mark 
the word-fragments introduced by the presence of the <lb/> explicitly. 
So they would have something like

<caes full="imperator">imp</caes><lb/><caes>erator</caes>

("caes" is short for "caesura", which means something different in 
French, apparently.)

In other projects (probably even more numerous), they decided just to 
move the <lb> to the end of the nearest word:  imperator<lb/>

My objection to wordDiv, wordDivision, vel sim. is just that it's 
ambiguous between "division between words" and "division within a 
word". Since the whole point of this attribute is to specify exactly 
which of those two is the case, this seems a bad idea. With all 
humility, I still think that "nobreak" is less ambiguous: it implies 
that although the name of the element bearing it suggests some kind of 
"break", in this particular case the break isn't considered to be 
there.  I am perfectly amenable to other suggestions, but the only one 
I've seen so far is David's. "intraword" is certainly unambiguous (at 
least to those who've been properly educated) but does seem a bit 
long-winded. Remember that we'd like these values to be comprehensible 
to native speakers of non-Latin languages as well, if possible.
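
For what it's worth, here is a minimal sketch in Python of how a processor
might tokenize text marked up this way. It is an illustration only, under
stated assumptions: the function name `words`, the set of milestone
elements, and the sample fragments are all hypothetical; "nobreak" is just
the value suggested above, since @type is not constrained by the schema;
and the input is assumed to be un-namespaced, with no whitespace added
around the milestones for pretty-printing.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: these empty milestone elements normally imply a word boundary.
MILESTONES = {"lb", "pb", "cb"}
# Assumption: the (unconstrained) @type value marking "no word boundary here".
NOBREAK = "nobreak"


def _collect(el: ET.Element, out: List[str]) -> None:
    """Append the text of `el` to `out`, turning breaking milestones into spaces."""
    if el.tag in MILESTONES:
        # An unmarked milestone separates words; a "nobreak" one does not.
        if el.get("type") != NOBREAK:
            out.append(" ")
        return
    out.append(el.text or "")
    for child in el:
        _collect(child, out)
        # The tail is the text that follows the child inside its parent.
        out.append(child.tail or "")


def words(fragment: str) -> List[str]:
    """Split a small XML fragment into words, honouring type="nobreak"."""
    pieces: List[str] = []
    _collect(ET.fromstring(fragment), pieces)
    return "".join(pieces).split()


sample = '<ab>imp<lb type="nobreak"/>erator Cae<lb type="nobreak"/>sari<lb/>dixit</ab>'
print(words(sample))  # ['imperator', 'Caesari', 'dixit']
```

On these assumptions an unmarked <lb/> separates words, so foo<lb/>bar
gives two of them, while a "nobreak" one is skipped and the fragments on
either side are joined.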


Gabriel BODARD wrote:
> Sure, it doesn't matter terribly what the attribute value is, since it's 
> not constrained, but aren't these examples supposed to be based on real 
> usage where possible? Why then would you invent an attribute value that 
> no one's using, rather than using the value that has been used in tens 
> of thousands of examples in the real world?
>
> G
>
> Lou Burnard wrote:
>   
>> We considered that, but it's a bit latinate, don't you think?
>>
>> I agree with Dan that there's no available time to sweat this further 
>> (despite the weather :-). If people want to make further changes to 
>> wording (I'm assuming everyone has actually looked at the newly revised 
>> examples and discussion?) they will go into the mix for next time, but 
>> we need to get this error-fixing release out the door today. 
>>
>>
>>
>> David Sewell wrote:
>>     
>>> As a naive non-epigraphist, I would find this unambiguous, for what it's
>>> worth:
>>>
>>>   <lb type="intraword"/>
>>>
>>> David
>>>
>>> On Wed, 1 Jul 2009, Dot Porter wrote:
>>>
>>>   
>>>       
>>>> Dan, I don't think anyone is suggesting the value be technically
>>>> controlled, but we want an example in the Guidelines. And as people
>>>> tend to take the Guidelines' suggestions quite seriously, it's worth
>>>> considering what the suggested value should be.
>>>>
>>>> Dot
>>>>
>>>> On Wed, Jul 1, 2009 at 5:45 PM, O'Donnell, Dan <daniel.odonnell at uleth.ca> wrote:
>>>>     
>>>>         
>>>>> I also don't understand why we are sweating the att value. Are we really interested in controlling this vocabulary? Why?
>>>>>
>>>>> -----------
>>>>> Daniel O'Donnell
>>>>> University of Lethbridge
>>>>> (From my mobile telephone)
>>>>>
>>>>> --- original message ---
>>>>> From: "Dot Porter" <dot.porter at gmail.com>
>>>>> Subject: Re: [tei-council] word-dividing
>>>>> Date: July 1, 2009
>>>>> Time: 10:17:9
>>>>>
>>>>> I don't really understand the concern here. An lb (or cb, or pb) that
>>>>> appears in the middle of a word physically divides that word, hence
>>>>> "worddiv". As long as this usage is defined clearly in the Guidelines
>>>>> ("use @type='worddiv' to mark lb, pb or cb that physically divide
>>>>> words") I don't think there will be any confusion on the part of
>>>>> users. It's clear. And there's a history of usage, since EpiDoc is
>>>>> already doing this, and has been. Why mess with something that works?
>>>>>
>>>>> Dot
>>>>>
>>>>> On Wed, Jul 1, 2009 at 5:08 PM, Gabriel Bodard <gabriel.bodard at kcl.ac.uk> wrote:
>>>>>       
>>>>>           
>>>>>> Right. I guess my only objection is that it sounds more like a
>>>>>> processing instruction than a description of the text. But I take your
>>>>>> point. Let's see if anyone comes up with any suggestions better than
>>>>>> either of ours. :-) (It would be nice if what we suggested in the
>>>>>> example was something that is actually being used... and if we come to a
>>>>>> consensus I'll recommend changing EpiDoc usage to whatever we use in the
>>>>>> example in the Guidelines.)
>>>>>>
>>>>>> (If we don't come to a consensus, as you say, no problem.)
>>>>>>
>>>>>> G
>>>>>>
>>>>>> Lou Burnard wrote:
>>>>>>         
>>>>>>             
>>>>>>> Sorry, but I do not follow your logic. "nobreak" says something about
>>>>>>> the type of <lb> -- it is a "non-breaking" line break.  The implication
>>>>>>> is that other <lb>s (or <cb>s etc.) are "breaking", i.e. they are
>>>>>>> understood not only to mark the start of a line, column etc., but also to
>>>>>>> mark a word break, so that foo<lb/>bar should be considered to be two words.
>>>>>>>
>>>>>>> There are breaks between your words conceptually, I hope? If not, what
>>>>>>> is the point of trying to distinguish types of <lb> anyway?
>>>>>>>
>>>>>>> If EpiDockers don't like this, though, they can always make up their own
>>>>>>> terminology -- the type value is not constrained by the schema.
>>>>>>>
>>>>>>> Gabriel Bodard wrote:
>>>>>>>           
>>>>>>>               
>>>>>>>> I'm not sure I like "nobreak", as it doesn't really say anything about
>>>>>>>> the status of the lb (or, as Dot points out, cb, pb, etc.); especially
>>>>>>>> since there are never (or rarely) breaks _between_ words in our texts.
>>>>>>>> The idea behind "worddiv" was that this is a linebreak that appears
>>>>>>>> mid-word, splitting it atwain, as Dan has it. Let me canvass the EpiDoc
>>>>>>>> markup list, and see if people there have opinions one way or the other
>>>>>>>> to contribute to this...
>>>>>>>>
>>>>>>>> G
>>>>>>>>
>>>>>>>> Lou Burnard wrote:
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> After much head scratching here in Oxford, we've decided on "nobreak"
>>>>>>>>>
>>>>>>>>> I added a couple more examples and a bit more discussion, taking
>>>>>>>>> examples from some real projects too. Affected are the definition for
>>>>>>>>> <lb> and the discussion of milestones in CO.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Daniel Paul O'Donnell wrote:
>>>>>>>>>
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>>>> I think "word-dividing" in this case means "splitting individual words
>>>>>>>>>> atwain" rather than "demarcating their boundaries" ;)
>>>>>>>>>>
>>>>>>>>>> In my edition of Cædmon's Hymn I needed to encode space and lb
>>>>>>>>>> similarly explicitly: i.e. indicating whether it fell within the word
>>>>>>>>>> or between words: the stylesheets (such as they were in those days)
>>>>>>>>>> handled them differently depending on the value of @type (which I'd
>>>>>>>>>> made universal). White space wouldn't have done it for me, because I
>>>>>>>>>> was reformatting the data with and without the word-internal spaces
>>>>>>>>>> and lines depending on the view the user selected.
>>>>>>>>>>
>>>>>>>>>> -dan
>>>>>>>>>>
>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>
>>>>>>>>>>                 
>>>>>>>>>>                     
>>>>>>>>>>> Gabriel BODARD wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>                       
>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                     
>>>>>>>>>>>>                         
>>>>>>>>>>>                   
>>>>>>>>>>>                       
>>>>>>>>>>>>>> (9) lb: should we add an example of the usage of
>>>>>>>>>>>>>> lb/type=word-dividing, which currently sits a little uncomfortably
>>>>>>>>>>>>>> in the note. I suggest "Cae<lb type="worddiv"/>sari".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                         
>>>>>>>>>>>>>>                             
>>>>>>>>>>>>> Don't know what note you're referring to. Don't see the point of
>>>>>>>>>>>>> the @type attribute. Haven't done anything.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                       
>>>>>>>>>>>>>                           
>>>>>>>>>>>> This was discussed some months ago, and is the reason @type was
>>>>>>>>>>>> allowed on <lb> in the first place. There is currently a note at the
>>>>>>>>>>>> bottom of LB that says: "The type attribute may be used to
>>>>>>>>>>>> characterize the linebreak in any respect, for example as
>>>>>>>>>>>> word-breaking or not." We have literally thousands of examples of
>>>>>>>>>>>> this in EpiDoc files, where words are not always tagged explicitly
>>>>>>>>>>>> and it's the only way we can be sure to tokenize correctly. I just
>>>>>>>>>>>> thought an example would help to clarify the use-case.
>>>>>>>>>>>>
>>>>>>>>>>>> (If people feel strongly that [e.g.] "wordDividing" would be a
>>>>>>>>>>>> better recommended value than "worddiv", I'm happy to make that part
>>>>>>>>>>>> of our P5 upgrade script.)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                     
>>>>>>>>>>>>                         
>>>>>>>>>>> I don't mind adding examples, but this one confuses me. Isn't the
>>>>>>>>>>> point that the <lb/> in your example does NOT divide the word? So
>>>>>>>>>>> both "wordDividing" and "worddiv" seem exactly the opposite of what
>>>>>>>>>>> you want here. How about "nowordbreak" or "nwb"?
>>>>>>>>>>>
>>>>>>>>>>> I know I lost this argument last time, but I still think in practice
>>>>>>>>>>> I'd deal with this by putting in whitespace where the <lb> coincided
>>>>>>>>>>> with a word boundary and leaving it out where it didn't!
>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> G
>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>                       
>>>>>> --
>>>>>> Dr Gabriel BODARD
>>>>>> (Epigrapher & Digital Classicist)
>>>>>>
>>>>>> Centre for Computing in the Humanities
>>>>>> King's College London
>>>>>> 26-29 Drury Lane
>>>>>> London WC2B 5RL
>>>>>> Email: gabriel.bodard at kcl.ac.uk
>>>>>> Tel: +44 (0)20 7848 1388
>>>>>> Fax: +44 (0)20 7848 2980
>>>>>>
>>>>>> http://www.digitalclassicist.org/
>>>>>> http://www.currentepigraphy.org/
>>>>> --
>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>> Dot Porter (MA, MSLS)          Metadata Manager
>>>>> Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
>>>>> Pembroke Street, Dublin 2, Ireland
>>>>> -- A Project of the Royal Irish Academy --
>>>>> Phone: +353 1 234 2444        Fax: +353 1 234 2400
>>>>> http://dho.ie          Email: dot.porter at gmail.com
>>>>> *~*~*~*~*~*~*~*~*~*~*