[tei-council] word-dividing

Daniel Paul O'Donnell daniel.odonnell at gmail.com
Thu Jul 2 12:30:52 EDT 2009


Hi Lou,

I had exactly the same question a couple of minutes ago: it's the epidoc 
list http://lsv.uky.edu/scripts/wa.exe. But I'm not sure they accept 
just any riff raff ;)

-dan

Lou Burnard wrote:
> What is the "markup" list? I'm not on it, and no-one who is saw fit to 
> make the suggestion here or I'd have probably seized on it with 
> gratitude.  I really must protest at the implication that we're making 
> arbitrary decisions here. There was a long discussion when the @type 
> attribute was added to these elements ages ago. No one proposed any 
> suggested values at that time. In the first message in the thread below, 
> you see Gabriel saying he's open to suggestions for them, but again no 
> discussion occurred. My report that we'd decided on "nobreak" here at 
> Oxford is at the most recent end of the thread, not the most distant 
> one. Give us a break Dot, we're trying to get the job done, and the 
> weather's not helping!
>
> Incidentally, I just received mail from someone else concerned with how 
> this textual phenomenon is to be encoded. Their practice is to record 
> *two* <lb>s -- one for the "facs" view of the document (faithful to its 
> appearance) and the other for the "editorial" view (in which the word is 
> reassembled). So if they found a word "foobar" hyphenated at a 
> linebreak, it would be recorded like this
>
> foo<lb type="facs"/>bar<lb type="edit"/>
>
> I don't know how many thousands of cases they've got marked up like that 
> already...
>
>
>  Dot Porter wrote:
>   
>> "midword" was suggested on the Markup list. It is not overly Latinate,
>> it is clear, I think, to most people (a line break in the middle of
>> the word), and in my opinion it's better than "nobreak". I'd still
>> rather see worddiv given that 1) (in my opinion again) it's not as
>> ambiguous as some seem to think, 2) it effectively describes the lb
>> (dividing a word) and 3) it has a longstanding history of usage (since
>> at least 2002, and currently used in something in the order of 60,000
>> existing EpiDoc TEI documents). I really do think that should count
>> for something.
>>
>> More generally, again in my opinion, this entire discussion represents
>> what I hope is a one-off problem. In one of his earliest messages in
>> this thread, Lou said, "After much head scratching here in Oxford,
>> we've decided on "nobreak"." No real room left for discussion, just a
>> decision made. No request for suggestions. David did make a
>> suggestion, but it wasn't asked for. The argument Lou sent today,
>> which sets out his thoughts in some detail, should have come before
>> the final text went out in the Guidelines.
>>
>> I think it's dangerous to give a single person, or group of people,
>> the power to override everyone else who should be involved in the
>> decision making process even if, as in this case, it's not a very
>> important matter. That's all.
>>
>> Dot
>>
>> On Thu, Jul 2, 2009 at 12:34 PM, Lou Burnard<lou.burnard at oucs.ox.ac.uk> wrote:
>>   
>>     
>>> With a few exceptions, the examples throughout the Guidelines are chosen
>>> as examples of textual phenomena, not of ways those textual phenomena
>>> have been treated in particular encoding projects. Since @type on <lb>
>>> or <pb> was only recently introduced it's unsurprising that there aren't
>>> that many existing TEI precedents to follow and, while I yield to no-one
>>> in my admiration for the epidoc project, I would resist pressure to make
>>> it (or any other project) the sole driver for decisions about what goes
>>> in the Guidelines.
>>>
>>> The particular phenomenon we're dealing with here can be -- has been --
>>> dealt with in several different ways -- there are thousands of cases
>>> also of texts in which the encoder has chosen to treat this phenomenon
>>> in a completely different way! In the Bibliotheque Virtuelle des
>>> Humanistes, for example, they mark the word-fragments introduced by the
>>> presence of the <lb/> explicitly. So they would have something like
>>>
>>> <caes full="imperator">imp</caes><lb/><caes>erator</caes>
>>>
>>> ("caes" is short for "caesura" which means something different in
>>> French, apparently)
>>> In other projects (probably even more numerous), they decided to just
>>> move the <lb> to the end of the nearest word: imperator<lb/>
>>>
>>> My objection to wordDiv, wordDivision, vel sim is just that it's
>>> ambiguous between "division between words" and "division within a
>>> word". Since the whole point of this attribute is to specify exactly
>>> which of those two is the case, this seems a bad idea. With all
>>> humility, I still think that "nobreak" is less ambiguous -- it implies
>>> that although the name of the element bearing it implies some kind of
>>> "break", in this particular case, the break isn't considered to be
>>> there.  I am perfectly amenable to other suggestions, but the only one
>>> I've seen so far is David's. "intraword" is certainly unambiguous (at
>>> least to those who've been properly educated) but does seem a bit
>>> long-winded. Remember that we'd like these values to be comprehensible
>>> to native speakers of non-Latin languages as well if possible.
>>>
>>>
>>> Gabriel BODARD wrote:
>>>     
>>>       
>>>> Sure it doesn't terribly matter what the attribute value is, since it's
>>>> not constrained, but aren't these examples supposed where possible to be
>>>> based on real usage? Why then would you invent an attribute value that
>>>> no one's using, rather than using the value that has been used in tens
>>>> of thousands of examples in the real world?
>>>>
>>>> G
>>>>
>>>> Lou Burnard wrote:
>>>>
>>>>       
>>>>         
>>>>> We considered that, but it's a bit latinate, don't you think?
>>>>>
>>>>> I agree with Dan that there's no available time to sweat this further
>>>>> (despite the weather :-). If people want to make further changes to
>>>>> wording (I'm assuming everyone has actually looked at the newly revised
>>>>> examples and discussion?) they will go into the mix for next time, but
>>>>> we need to get this error fixing release out the door today.
>>>>>
>>>>>
>>>>>
>>>>> David Sewell wrote:
>>>>>
>>>>>         
>>>>>           
>>>>>> As a naive non-epigraphist, I would find this unambiguous, for what it's
>>>>>> worth:
>>>>>>
>>>>>>   <lb type="intraword"/>
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Wed, 1 Jul 2009, Dot Porter wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>           
>>>>>>             
>>>>>>> Dan, I don't think anyone is suggesting the value be technically
>>>>>>> controlled, but we want an example in the Guidelines. And as people
>>>>>>> tend to take the Guidelines suggestions quite seriously, it's worth
>>>>>>> considering what the suggested value be.
>>>>>>>
>>>>>>> Dot
>>>>>>>
>>>>>>> On Wed, Jul 1, 2009 at 5:45 PM, O'Donnell, Dan<daniel.odonnell at uleth.ca> wrote:
>>>>>>>
>>>>>>>
>>>>>>>             
>>>>>>>               
>>>>>>>> I also don't understand why we are sweating the att value. Are we really interested in controlling this vocabulary? Why?
>>>>>>>>
>>>>>>>> -----------
>>>>>>>> Daniel O'Donnell
>>>>>>>> University of Lethbridge
>>>>>>>> (From my mobile telephone)
>>>>>>>>
>>>>>>>> --- original message ---
>>>>>>>> From: "Dot Porter" <dot.porter at gmail.com>
>>>>>>>> Subject: Re: [tei-council] word-dividing
>>>>>>>> Date: July 1, 2009
>>>>>>>> Time: 10:17:9
>>>>>>>>
>>>>>>>> I don't really understand the concern here. An lb (or cb, or pb) that
>>>>>>>> appears in the middle of a word physically divides that word, hence
>>>>>>>> "worddiv". As long as this usage is defined clearly in the Guidelines
>>>>>>>> ("use @type='worddiv' to mark lb, pb or cb that physically divide
>>>>>>>> words") I don't think there will be any confusion on the part of
>>>>>>>> users. It's clear. And there's a history of usage, since EpiDoc is
>>>>>>>> already doing this, and has been. Why mess with something that works?
>>>>>>>>
>>>>>>>> Dot
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2009 at 5:08 PM, Gabriel Bodard<gabriel.bodard at kcl.ac.uk> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>               
>>>>>>>>                 
>>>>>>>>> Right. I guess my only objection is that it sounds more like a
>>>>>>>>> processing instruction than a description of the text. But I take your
>>>>>>>>> point. Let's see if anyone comes up with any suggestions better than
>>>>>>>>> either of ours. :-) (It would be nice if what we suggested in the
>>>>>>>>> example was something that is actually being used... and if we come to a
>>>>>>>>> consensus I'll recommend changing EpiDoc usage to whatever we use in the
>>>>>>>>> example in the guidelines.)
>>>>>>>>>
>>>>>>>>> (If we don't come to a consensus, as you say, no problem.)
>>>>>>>>>
>>>>>>>>> G
>>>>>>>>>
>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                 
>>>>>>>>>                   
>>>>>>>>>> Sorry, but I do not follow your logic. "nobreak" says something about
>>>>>>>>>> the type of <lb> -- it is a "non-breaking" line break.  The implication
>>>>>>>>>> is that other <lb>s (or <cb>s etc.) are "breaking", i.e. they are
>>>>>>>>>> understood not only to mark the start of a line, column etc., but also to
>>>>>>>>>> break a word, so that foo<lb/>bar should be considered to be two words.
>>>>>>>>>>
>>>>>>>>>> There are breaks between your words conceptually, I hope? If not, what
>>>>>>>>>> is the point of trying to distinguish types of <lb> anyway?
>>>>>>>>>>
>>>>>>>>>> If epidockers don't like this though they can always make up their own
>>>>>>>>>> terminology -- the type value is not constrained by the schema.
>>>>>>>>>>
>>>>>>>>>> Gabriel Bodard wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                   
>>>>>>>>>>                     
>>>>>>>>>>> I'm not sure I like "nobreak", as it doesn't really say anything about
>>>>>>>>>>> the status of the lb (or, as Dot points out, cb, pb, etc.); especially
>>>>>>>>>>> since there are never (or rarely) breaks _between_ words in our texts.
>>>>>>>>>>> The idea behind "worddiv" was that this is a linebreak that appears
>>>>>>>>>>> mid-word, splitting it atwain, as Dan has it. Let me canvass the EpiDoc
>>>>>>>>>>> markup list, and see if people there have opinions one way or the other
>>>>>>>>>>> to contribute to this...
>>>>>>>>>>>
>>>>>>>>>>> G
>>>>>>>>>>>
>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                     
>>>>>>>>>>>                       
>>>>>>>>>>>> After much head scratching here in Oxford, we've decided on "nobreak".
>>>>>>>>>>>>
>>>>>>>>>>>> I added a couple more examples and a bit more discussion, taking
>>>>>>>>>>>> examples from some real projects too. Affected are the definition for
>>>>>>>>>>>> <lb> and the discussion of milestones in CO.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Daniel Paul O'Donnell wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                       
>>>>>>>>>>>>                         
>>>>>>>>>>>>> I think "word-dividing" in this case means "splitting individual words
>>>>>>>>>>>>> atwain" rather than "demarcating their boundaries" ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my edition of Cædmon's Hymn I needed to encode space and lb
>>>>>>>>>>>>> similarly explicitly: i.e. indicating whether it fell within the word
>>>>>>>>>>>>> or between words: the stylesheets (such as they were in those days)
>>>>>>>>>>>>> handled them differently depending on the value of @type (which I'd
>>>>>>>>>>>>> made universal). White space wouldn't have done it for me, because I
>>>>>>>>>>>>> was reformatting the data with and without the word-internal spaces
>>>>>>>>>>>>> and lines depending on the view the user selected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -dan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                         
>>>>>>>>>>>>>                           
>>>>>>>>>>>>>> Gabriel BODARD wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                           
>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>>                               
>>>>>>>>>>>>>>                           
>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>>>> (9) lb: should we add an example of the usage of
>>>>>>>>>>>>>>>>> lb/type=word-dividing, which currently sits a little uncomfortably
>>>>>>>>>>>>>>>>> in the note? I suggest "Cae<lb type="worddiv"/>sari".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                                 
>>>>>>>>>>>>>>>>>                                   
>>>>>>>>>>>>>>>> Don't know what note you're referring to. Don't see the point of
>>>>>>>>>>>>>>>> the @type attribute. Haven't done anything.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                               
>>>>>>>>>>>>>>>>                                 
>>>>>>>>>>>>>>> This was discussed some months ago, and is the reason @type was
>>>>>>>>>>>>>>> allowed on <lb> in the first place. There is currently a note at the
>>>>>>>>>>>>>>> bottom of LB that says: "The type attribute may be used to
>>>>>>>>>>>>>>> characterize the linebreak in any respect, for example as
>>>>>>>>>>>>>>> word-breaking or not." We have literally thousands of examples of
>>>>>>>>>>>>>>> this in EpiDoc files, where words are not always tagged explicitly
>>>>>>>>>>>>>>> and it's the only way we can be sure to tokenize correctly. I just
>>>>>>>>>>>>>>> thought an example would help to clarify the use-case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (If people feel strongly that [e.g.] "wordDividing" would be a
>>>>>>>>>>>>>>> better recommended value than "worddiv", I'm happy to make that part
>>>>>>>>>>>>>>> of our P5 upgrade script.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>>                               
>>>>>>>>>>>>>> I don't mind adding examples, but this one confuses me. Isn't the
>>>>>>>>>>>>>> point that the <lb/> in your example does NOT divide the word? So
>>>>>>>>>>>>>> both "wordDividing" and "worddiv" seem exactly the opposite of what
>>>>>>>>>>>>>> you want here. How about "nowordbreak" or "nwb"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know I lost this argument last time, but I still think in practice
>>>>>>>>>>>>>> I'd deal with this by putting in whitespace where the <lb> coincided
>>>>>>>>>>>>>> with a word boundary and leaving it out where it didn't!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>>                               
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>> --
>>>>>>>>> Dr Gabriel BODARD
>>>>>>>>> (Epigrapher & Digital Classicist)
>>>>>>>>>
>>>>>>>>> Centre for Computing in the Humanities
>>>>>>>>> King's College London
>>>>>>>>> 26-29 Drury Lane
>>>>>>>>> London WC2B 5RL
>>>>>>>>> Email: gabriel.bodard at kcl.ac.uk
>>>>>>>>> Tel: +44 (0)20 7848 1388
>>>>>>>>> Fax: +44 (0)20 7848 2980
>>>>>>>>>
>>>>>>>>> http://www.digitalclassicist.org/
>>>>>>>>> http://www.currentepigraphy.org/
>>>>>>>> --
>>>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>>>> Dot Porter (MA, MSLS)          Metadata Manager
>>>>>>>> Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
>>>>>>>> Pembroke Street, Dublin 2, Ireland
>>>>>>>> -- A Project of the Royal Irish Academy --
>>>>>>>> Phone: +353 1 234 2444        Fax: +353 1 234 2400
>>>>>>>> http://dho.ie          Email: dot.porter at gmail.com
>>>>>>>> *~*~*~*~*~*~*~*~*~*~*
>   
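
For what it's worth, the tokenizing behaviour being argued over in the thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name tokenize and the MIDWORD_TYPES set are hypothetical, the list of type values is simply the candidates floated in this thread (the schema constrains none of them), and an untyped <lb/> is treated as a word boundary, following Lou's reading of foo<lb/>bar as two words.

```python
import re

# Hypothetical set of @type values marking a line break that falls inside a word.
# These are the candidates floated in the thread; the TEI schema enforces none of them.
MIDWORD_TYPES = {"nobreak", "worddiv", "intraword", "midword"}

# Matches an empty <lb/> element and captures its optional type attribute.
LB = re.compile(r'<lb(?:\s+type="([^"]*)")?\s*/>')


def tokenize(fragment):
    """Split a TEI text fragment into words, rejoining words split by a mid-word <lb/>."""
    def replace(match):
        # A mid-word break is dropped so the two halves rejoin;
        # any other break (including an untyped one) becomes a word boundary.
        return "" if match.group(1) in MIDWORD_TYPES else " "
    return LB.sub(replace, fragment).split()


print(tokenize('Cae<lb type="worddiv"/>sari'))                # ['Caesari']
print(tokenize('foo<lb/>bar'))                                # ['foo', 'bar']
print(tokenize('imp<lb type="nobreak"/>erator Cae<lb/>sar'))  # ['imperator', 'Cae', 'sar']
```

A real stylesheet would work on the XML tree rather than on a regular expression; the point is only that, whichever value is chosen, it has to tell a processor whether to rejoin the two halves of the word.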

-- 
Daniel Paul O'Donnell
Associate Professor of English
University of Lethbridge

Chair and CEO, Text Encoding Initiative (http://www.tei-c.org/)
Co-Chair, Digital Initiatives Advisory Board, Medieval Academy of America
President-elect (English), Society for Digital Humanities/Société pour l'étude des médias interactifs (http://sdh-semi.org/)
Founding Director (2003-2009), Digital Medievalist Project (http://www.digitalmedievalist.org/)

Vox: +1 403 329-2377
Fax: +1 403 382-7191 (non-confidential)
Home Page: http://people.uleth.ca/~daniel.odonnell/



