[tei-council] word-dividing

Lou Burnard lou.burnard at oucs.ox.ac.uk
Thu Jul 2 12:48:43 EDT 2009


Thank you Gabby! I can read the list in digest form, I discover, though 
not (yet) post to it.
When I can, I shall be posting the following comment:

"midword" is definitely good. How about "inWord"? (shorter, doesn't make 
me think of a cricketing position)


Gabriel Bodard wrote:
> Sorry, Lou: I thought I remembered that you had taken part in 
> conversations on the EpiDoc Markup list several years ago, and so 
> assumed that you were still on it.
>
> http://lsv.uky.edu/archives/markup.html
>
> That wasn't supposed to be an obscure reference.
>
> G
>
> Lou Burnard a écrit :
>> What is the "markup" list? I'm not on it, and no-one who is saw fit to
>> make the suggestion here or I'd have probably seized on it with
>> gratitude.  I really must protest at the implication that we're making
>> arbitrary decisions here. There was  a long discussion when the @type
>> attribute was added to these elements ages ago. No one proposed any
>> suggested values at that time. In the first message in the thread below,
>> you see Gabriel saying he's open to suggestions for them, but again no
>> discussion occurred. My report that we'd decided on "nobreak" here at
>> Oxford is at the most recent end of the thread, not the most distant
>> one. Give us a break Dot, we're trying to get the job done, and the
>> weather's not helping!
>>
>> Incidentally, I just received mail from someone else concerned with how
>> this textual phenomenon is to be encoded. Their practice is to record
>> *two* <lb>s -- one for the "facs" view of the document (faithful to its
>> appearance) and the other for the "editorial" view (in which the word is
>> reassembled). So if they found a word "foobar" hyphenated at a
>> linebreak, it would be recorded like this
>>
>> foo<lb type="facs"/>bar<lb type="edit"/>
>>
>> I don't know how many thousands of cases they've got marked up like that
>> already...
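
The two-<lb> practice described above could be consumed roughly as follows. This is only a sketch: the function name render and the view names are hypothetical, and it assumes the breaks are serialised exactly as in the example (type="facs" and type="edit", self-closing), which a real project may not guarantee.

```python
import re

# Assumed serialisations, taken from the single example in the message above.
FACS_LB = re.compile(r'<lb type="facs"\s*/>')
EDIT_LB = re.compile(r'<lb type="edit"\s*/>')


def render(text: str, view: str) -> str:
    """Return plain text with only the line breaks belonging to one view."""
    if view == "facs":
        keep, drop = FACS_LB, EDIT_LB
    elif view == "edit":
        keep, drop = EDIT_LB, FACS_LB
    else:
        raise ValueError(f"unknown view: {view!r}")
    # Discard the breaks of the other view, then turn the kept ones into newlines.
    return keep.sub("\n", drop.sub("", text))


sample = 'foo<lb type="facs"/>bar<lb type="edit"/> baz'
print(render(sample, "facs"))  # break falls inside the word, as on the page
print(render(sample, "edit"))  # word reassembled, break moved after it
```

A regular expression is used here only to keep the sketch short; a real stylesheet or an XML parser would be more robust against attribute order and whitespace.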
>>
>>
>>  Dot Porter wrote:
>>> "midword" was suggested on the Markup list. It is not overly Latinate,
>>> it is clear, I think, to most people (a line break in the middle of
>>> the word), and in my opinion it's better than "nobreak". I'd still
>>> rather see worddiv given that 1) (in my opinion again) it's not as
>>> ambiguous as some seem to think, 2) it effectively describes the lb
>>> (dividing a word) and 3) it has a longstanding history of usage (since
>>> at least 2002, and currently used in something in the order of 60,000
>>> existing EpiDoc TEI documents). I really do think that should count
>>> for something.
>>>
>>> More generally, again in my opinion, this entire discussion represents
>>> what I hope is a one-off problem. In one of his earliest messages in
>>> this thread, Lou said, "After much head scratching here in Oxford,
>>> we've decided on "nobreak"." No real room left for discussion, just a
>>> decision made. No request for suggestions. David did make a
>>> suggestion, but it wasn't asked for. The argument Lou sent today,
>>> which sets out his thoughts in some detail, should have come before
>>> the final text went out in the Guidelines.
>>>
>>> I think it's dangerous to give a single person, or group of people,
>>> the power to override everyone else who should be involved in the
>>> decision making process even if, as in this case, it's not a very
>>> important matter. That's all.
>>>
>>> Dot
>>>
>>> On Thu, Jul 2, 2009 at 12:34 PM, Lou Burnard <lou.burnard at oucs.ox.ac.uk> wrote:
>>>
>>>> With a few exceptions, the examples throughout the Guidelines are chosen
>>>> as examples of textual phenomena, not of ways those textual phenomena
>>>> have been treated in particular encoding projects. Since @type on <lb>
>>>> or <pb> was only recently introduced it's unsurprising that there aren't
>>>> that many existing TEI precedents to follow and, while I yield to no-one
>>>> in my admiration for the epidoc project, I would resist pressure to make
>>>> it (or any other project) the sole driver for decisions about what goes
>>>> in the Guidelines.
>>>>
>>>> The particular phenomenon we're dealing with here can be -- has been --
>>>> dealt with in several different ways -- there are thousands of cases
>>>> also of texts in which the encoder has chosen to treat this phenomenon
>>>> in a completely different way! In the Bibliotheque Virtuelle des
>>>> Humanistes, for example, they mark the word-fragments introduced by the
>>>> presence of the <lb/> explicitly. So they would have something like
>>>>
>>>> <caes full="imperator">imp</caes><lb/><caes>erator</caes>
>>>>
>>>> ("caes" is short for "caesura" which means something different in
>>>> French, apparently)
>>>> In other projects (probably even more numerous), they decided to just
>>>> move the <lb> to the end of the nearest word: imperator<lb/>
>>>>
>>>> My objection to wordDiv, wordDivision, vel sim is just that it's
>>>> ambiguous as between "division between words" or "division within a
>>>> word". Since the whole point of this attribute is to specify exactly
>>>> which of those two is the case, this seems a bad idea. With all
>>>> humility, I still think that "nobreak" is less ambiguous -- it implies
>>>> that although the name of the element bearing it implies some kind of
>>>> "break", in this particular case, the break isn't considered to be
>>>> there.  I am perfectly amenable to other suggestions, but the only one
>>>> I've seen so far is David's. "intraword" is certainly unambiguous (at
>>>> least to those who've been properly educated) but does seem a bit
>>>> long-winded. Remember that we'd like these values to be comprehensible
>>>> to native speakers of non-Latin languages as well if possible.
>>>>
>>>>
>>>> Gabriel BODARD wrote:
>>>>
>>>>> Sure it doesn't terribly matter what the attribute value is, since it's
>>>>> not constrained, but aren't these examples supposed where possible to be
>>>>> based on real usage? Why then would you invent an attribute value that
>>>>> no one's using, rather than using the value that has been used in tens
>>>>> of thousands of examples in the real world?
>>>>>
>>>>> G
>>>>>
>>>>> Lou Burnard wrote:
>>>>>
>>>>>
>>>>>> We considered that, but it's a bit latinate, don't you think?
>>>>>>
>>>>>> I agree with Dan that there's no available time to sweat this further
>>>>>> (despite the weather :-). If people want to make further changes to
>>>>>> wording (I'm assuming everyone has actually looked at the newly revised
>>>>>> examples and discussion?) they will go into the mix for next time, but
>>>>>> we need to get this error fixing release out the door today.
>>>>>>
>>>>>>
>>>>>>
>>>>>> David Sewell wrote:
>>>>>>
>>>>>>
>>>>>>> As a naive non-epigraphist, I would find this unambiguous, for what it's
>>>>>>> worth:
>>>>>>>
>>>>>>>   <lb type="intraword"/>
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Wed, 1 Jul 2009, Dot Porter wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Dan, I don't think anyone is suggesting the value be technically
>>>>>>>> controlled, but we want an example in the Guidelines. And as people
>>>>>>>> tend to take the Guidelines' suggestions quite seriously, it's worth
>>>>>>>> considering what the suggested value should be.
>>>>>>>>
>>>>>>>> Dot
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2009 at 5:45 PM, O'Donnell, Dan <daniel.odonnell at uleth.ca> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> I also don't understand why we are sweating the att value. Are 
>>>>>>>>> we really interested in controlling this vocabulary? Why?
>>>>>>>>>
>>>>>>>>> -----------
>>>>>>>>> Daniel O'Donnell
>>>>>>>>> University of Lethbridge
>>>>>>>>> (From my mobile telephone)
>>>>>>>>>
>>>>>>>>> --- original message ---
>>>>>>>>> From: "Dot Porter" <dot.porter at gmail.com>
>>>>>>>>> Subject: Re: [tei-council] word-dividing
>>>>>>>>> Date: July 1, 2009
>>>>>>>>> Time: 10:17:9
>>>>>>>>>
>>>>>>>>> I don't really understand the concern here. An lb (or cb, or pb) that
>>>>>>>>> appears in the middle of a word physically divides that word, hence
>>>>>>>>> "worddiv". As long as this usage is defined clearly in the Guidelines
>>>>>>>>> ("use @type='worddiv' to mark lb, pb or cb that physically divide
>>>>>>>>> words") I don't think there will be any confusion on the part of
>>>>>>>>> users. It's clear. And there's a history of usage, since EpiDoc is
>>>>>>>>> already doing this, and has been. Why mess with something that works?
>>>>>>>>>
>>>>>>>>> Dot
>>>>>>>>>
>>>>>>>>> On Wed, Jul 1, 2009 at 5:08 PM, Gabriel Bodard <gabriel.bodard at kcl.ac.uk> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Right. I guess my only objection is that it sounds more like a
>>>>>>>>>> processing instruction than a description of the text. But I take your
>>>>>>>>>> point. Let's see if anyone comes up with any suggestions better than
>>>>>>>>>> either of ours. :-) (It would be nice if what we suggested in the
>>>>>>>>>> example was something that is actually being used... and if we come to a
>>>>>>>>>> consensus I'll recommend changing EpiDoc usage to whatever we use in the
>>>>>>>>>> example in the guidelines.)
>>>>>>>>>>
>>>>>>>>>> (If we don't come to a consensus, as you say, no problem.)
>>>>>>>>>>
>>>>>>>>>> G
>>>>>>>>>>
>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Sorry, but I do not follow your logic. "nobreak" says something about
>>>>>>>>>>> the type of <lb> -- it is a "non-breaking" line break. The implication
>>>>>>>>>>> is that other <lb>s (or <cb>s etc.) are "breaking", i.e. they are
>>>>>>>>>>> understood not only to mark the start of a line, column etc., but also to
>>>>>>>>>>> break a word, so that foo<lb/>bar should be considered to be two words.
>>>>>>>>>>>
>>>>>>>>>>> There are breaks between your words conceptually, I hope? If not, what
>>>>>>>>>>> is the point of trying to distinguish types of <lb> anyway?
>>>>>>>>>>>
>>>>>>>>>>> If epidockers don't like this though they can always make up their own
>>>>>>>>>>> terminology -- the type value is not constrained by the schema.
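
The distinction drawn above (a plain <lb/> separates two words, while a specially typed one leaves the word intact) can be illustrated with a small tokenizer. This is a minimal sketch under stated assumptions: the function name tokenize and the set IN_WORD_TYPES are hypothetical, the set simply collects the values floated in this thread (none is mandated by the schema), the input carries no TEI namespace, and only breaks that are direct children of the fragment are examined.

```python
import xml.etree.ElementTree as ET

# Hypothetical value set: the candidate names discussed in this thread.
IN_WORD_TYPES = {"nobreak", "worddiv", "midword", "inWord", "intraword"}
BREAK_ELEMENTS = {"lb", "cb", "pb"}


def tokenize(fragment: str) -> list[str]:
    """Split a TEI-like fragment into words, honouring @type on breaks."""
    # Wrap the fragment so that mixed content parses as one element.
    root = ET.fromstring(f"<ab>{fragment}</ab>")
    pieces = [root.text or ""]
    for child in root:
        if child.tag in BREAK_ELEMENTS:
            # An untyped (or otherwise typed) break is treated as a word boundary.
            if child.get("type") not in IN_WORD_TYPES:
                pieces.append(" ")
        else:
            # Any other element contributes its text content unchanged.
            pieces.append("".join(child.itertext()))
        pieces.append(child.tail or "")
    return "".join(pieces).split()


print(tokenize("foo<lb/>bar"))                   # two tokens
print(tokenize('Cae<lb type="nobreak"/>sari'))   # one token
```

Because the value is unconstrained, a project would substitute its own list for IN_WORD_TYPES; the logic stays the same whichever name is finally recommended.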
>>>>>>>>>>>
>>>>>>>>>>> Gabriel Bodard wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure I like "nobreak", as it doesn't really say anything about
>>>>>>>>>>>> the status of the lb (or, as Dot points out, cb, pb, etc.); especially
>>>>>>>>>>>> since there are never (or rarely) breaks _between_ words in our texts.
>>>>>>>>>>>> The idea behind "worddiv" was that this is a linebreak that appears
>>>>>>>>>>>> mid-word, splitting it atwain, as Dan has it. Let me canvass the EpiDoc
>>>>>>>>>>>> markup list, and see if people there have opinions one way or the other
>>>>>>>>>>>> to contribute to this...
>>>>>>>>>>>>
>>>>>>>>>>>> G
>>>>>>>>>>>>
>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> After much head scratching here in Oxford, we've decided on "nobreak"
>>>>>>>>>>>>>
>>>>>>>>>>>>> I added a couple more examples and a bit more discussion, taking
>>>>>>>>>>>>> examples from some real projects too. Affected are the definition for
>>>>>>>>>>>>> <lb> and the discussion of milestones in CO.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Daniel Paul O'Donnell wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think "word-dividing" in this case means "splitting individual words
>>>>>>>>>>>>>> atwain" rather than "demarcating their boundaries" ;)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my edition of Cædmon's Hymn I needed to encode space and lb
>>>>>>>>>>>>>> similarly explicitly: i.e. indicating whether it fell within the word
>>>>>>>>>>>>>> or between words: the stylesheets (such as they were in those days)
>>>>>>>>>>>>>> handled them differently depending on the value of @type (which I'd
>>>>>>>>>>>>>> made universal). White space wouldn't have done it for me, because I
>>>>>>>>>>>>>> was reformatting the data with and without the word-internal spaces
>>>>>>>>>>>>>> and lines depending on the view the user selected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -dan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Gabriel BODARD wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (9) lb: should we add an example of the usage of
>>>>>>>>>>>>>>>>>> lb/type=word-dividing, which currently sits a little uncomfortably
>>>>>>>>>>>>>>>>>> in the note. I suggest "Cae<lb type="worddiv"/>sari".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Don't know what note you're referring to. Don't see the point of
>>>>>>>>>>>>>>>>> the @type attribute. Haven't done anything.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This was discussed some months ago, and is the reason @type was
>>>>>>>>>>>>>>>> allowed on <lb> in the first place. There is currently a note at the
>>>>>>>>>>>>>>>> bottom of LB that says: "The type attribute may be used to
>>>>>>>>>>>>>>>> characterize the linebreak in any respect, for example as
>>>>>>>>>>>>>>>> word-breaking or not." We have literally thousands of examples of
>>>>>>>>>>>>>>>> this in EpiDoc files, where words are not always tagged explicitly
>>>>>>>>>>>>>>>> and it's the only way we can be sure to tokenize correctly. I just
>>>>>>>>>>>>>>>> thought an example would help to clarify the use-case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (If people feel strongly that [e.g.] "wordDividing" would be a
>>>>>>>>>>>>>>>> better recommended value than "worddiv", I'm happy to make that part
>>>>>>>>>>>>>>>> of our P5 upgrade script.)
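
A value rename of the kind offered above is mechanically simple. The following is a hedged sketch, not the EpiDoc upgrade script itself (which is not shown in this thread): the function name rename_break_type is hypothetical, and the choice of lb, pb and cb as the affected elements is taken from the discussion rather than from any actual script.

```python
import xml.etree.ElementTree as ET

BREAK_ELEMENTS = {"lb", "pb", "cb"}


def rename_break_type(xml_text: str, old: str, new: str) -> str:
    """Replace one @type value with another on lb, pb and cb elements."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Compare local names so the check works with or without a namespace.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name in BREAK_ELEMENTS and el.get("type") == old:
            el.set("type", new)
    return ET.tostring(root, encoding="unicode")


before = '<ab>Cae<lb type="worddiv"/>sari <lb/>imp<lb type="worddiv"/>erator</ab>'
print(rename_break_type(before, "worddiv", "nobreak"))
```

Note that ElementTree reserialises the document, so whitespace inside tags and namespace prefixes may differ from the input; a production script would more likely be an XSLT identity transform.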
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't mind adding examples, but this one confuses me. Isn't the
>>>>>>>>>>>>>>> point that the <lb/> in your example does NOT divide the word? So
>>>>>>>>>>>>>>> both "wordDividing" and "worddiv" seem exactly the opposite of what
>>>>>>>>>>>>>>> you want here. How about "nowordbreak" or "nwb"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know I lost this argument last time, but I still think in practice
>>>>>>>>>>>>>>> I'd deal with this by putting in whitespace where the <lb> coincided
>>>>>>>>>>>>>>> with a word boundary and leaving it out where it didn't!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> G
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Dr Gabriel BODARD
>>>>>>>>>> (Epigrapher & Digital Classicist)
>>>>>>>>>>
>>>>>>>>>> Centre for Computing in the Humanities
>>>>>>>>>> King's College London
>>>>>>>>>> 26-29 Drury Lane
>>>>>>>>>> London WC2B 5RL
>>>>>>>>>> Email: gabriel.bodard at kcl.ac.uk
>>>>>>>>>> Tel: +44 (0)20 7848 1388
>>>>>>>>>> Fax: +44 (0)20 7848 2980
>>>>>>>>>>
>>>>>>>>>> http://www.digitalclassicist.org/
>>>>>>>>>> http://www.currentepigraphy.org/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>>>>> Dot Porter (MA, MSLS)          Metadata Manager
>>>>>>>>> Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
>>>>>>>>> Pembroke Street, Dublin 2, Ireland
>>>>>>>>> -- A Project of the Royal Irish Academy --
>>>>>>>>> Phone: +353 1 234 2444        Fax: +353 1 234 2400
>>>>>>>>> http://dho.ie          Email: dot.porter at gmail.com
>>>>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>


