[tei-council] word-dividing

Dot Porter dot.porter at gmail.com
Thu Jul 2 12:14:39 EDT 2009


"midword" was suggested on the Markup list. It is not overly Latinate,
it is clear, I think, to most people (a line break in the middle of
the word), and in my opinion it's better than "nobreak". I'd still
rather see worddiv given that 1) (in my opinion again) it's not as
ambiguous as some seem to think, 2) it effectively describes the lb
(dividing a word) and 3) it has a longstanding history of usage (since
at least 2002, and currently used in something in the order of 60,000
existing EpiDoc TEI documents). I really do think that should count
for something.

More generally, again in my opinion, this entire discussion represents
what I hope is a one-off problem. In one of his earliest messages in
this thread, Lou said, "After much head scratching here in Oxford,
we've decided on "nobreak"." No real room left for discussion, just a
decision made. No request for suggestions. David did make a
suggestion, but it wasn't asked for. The argument Lou sent today,
which sets out his thoughts in some detail, should have come before
the final text went out in the Guidelines.

I think it's dangerous to give a single person, or group of people,
the power to override everyone else who should be involved in the
decision-making process even if, as in this case, it's not a very
important matter. That's all.

Dot

On Thu, Jul 2, 2009 at 12:34 PM, Lou Burnard<lou.burnard at oucs.ox.ac.uk> wrote:
> With a few exceptions, the examples throughout the Guidelines are chosen
> as examples of textual phenomena, not of ways those textual phenomena
> have been treated in particular encoding projects. Since @type on <lb>
> or <pb> was only recently introduced it's unsurprising that there aren't
> that many existing TEI precedents to follow and, while I yield to no-one
> in my admiration for the epidoc project, I would resist pressure to make
> it (or any other project) the sole driver for decisions about what goes
> in the Guidelines.
>
> The particular phenomenon we're dealing with here can be -- has been --
> dealt with in several different ways -- there are thousands of cases
> also of texts in which the encoder has chosen to treat this phenomenon
> in a completely different way! In the Bibliotheque Virtuelle des
> Humanistes, for example, they mark the word-fragments introduced by the
> presence of the <lb/> explicitly. So they would have something like
>
> <caes full="imperator">imp</caes><lb/><caes>erator</caes>
>
> ("caes" is short for "caesura" which means something different in
> French, apparently)
> In other projects (probably even more numerous), they decided to just
> move the <lb> to the end of the nearest word:  imperator<lb/>
>
> My objection to wordDiv, wordDivision, vel sim is just that it's
> ambiguous between "division between words" and "division within a
> word". Since the whole point of this attribute is to specify exactly
> which of those two is the case, this seems a bad idea. With all
> humility, I still think that "nobreak" is less ambiguous -- it implies
> that although the name of the element bearing it implies some kind of
> "break", in this particular case, the break isn't considered to be
> there.  I am perfectly amenable to other suggestions, but the only one
> I've seen so far is David's. "intraword" is certainly unambiguous (at
> least to those who've been properly educated) but does seem a bit
> long-winded. Remember that we'd like these values to be comprehensible
> to native speakers of non-Latin languages as well if possible.
>
>
> Gabriel BODARD wrote:
>> Sure it doesn't terribly matter what the attribute value is, since it's
>> not constrained, but aren't these examples supposed where possible to be
>> based on real usage? Why then would you invent an attribute value that
>> no one's using, rather than using the value that has been used in tens
>> of thousands of examples in the real world?
>>
>> G
>>
>> Lou Burnard wrote:
>>
>>> We considered that, but it's a bit latinate, don't you think?
>>>
>>> I agree with Dan that there's no available time to sweat this further
>>> (despite the weather :-). If people want to make further changes to
>>> wording (I'm assuming everyone has actually looked at the newly revised
>>> examples and discussion?) they will go into the mix for next time, but
>>> we need to get this error-fixing release out the door today.
>>>
>>>
>>>
>>> David Sewell wrote:
>>>
>>>> As a naive non-epigraphist, I would find this unambiguous, for what it's
>>>> worth:
>>>>
>>>>   <lb type="intraword"/>
>>>>
>>>> David
>>>>
>>>> On Wed, 1 Jul 2009, Dot Porter wrote:
>>>>
>>>>
>>>>
>>>>> Dan, I don't think anyone is suggesting the value be technically
>>>>> controlled, but we want an example in the Guidelines. And as people
>>>>> tend to take the Guidelines suggestions quite seriously, it's worth
>>>>> considering what the suggested value should be.
>>>>>
>>>>> Dot
>>>>>
>>>>> On Wed, Jul 1, 2009 at 5:45 PM, O'Donnell, Dan<daniel.odonnell at uleth.ca> wrote:
>>>>>
>>>>>
>>>>>> I also don't understand why we are sweating the att value. Are we really interested in controlling this vocabulary? Why?
>>>>>>
>>>>>> -----------
>>>>>> Daniel O'Donnell
>>>>>> University of Lethbridge
>>>>>> (From my mobile telephone)
>>>>>>
>>>>>> --- original message ---
>>>>>> From: "Dot Porter" <dot.porter at gmail.com>
>>>>>> Subject: Re: [tei-council] word-dividing
>>>>>> Date: July 1, 2009
>>>>>> Time: 10:17:9
>>>>>>
>>>>>> I don't really understand the concern here. An lb (or cb, or pb) that
>>>>>> appears in the middle of a word physically divides that word, hence
>>>>>> "worddiv". As long as this usage is defined clearly in the Guidelines
>>>>>> ("use @type='worddiv' to mark lb, pb or cb that physically divide
>>>>>> words") I don't think there will be any confusion on the part of
>>>>>> users. It's clear. And there's a history of usage, since EpiDoc is
>>>>>> already doing this, and has been. Why mess with something that works?
>>>>>>
>>>>>> Dot
>>>>>>
>>>>>> On Wed, Jul 1, 2009 at 5:08 PM, Gabriel Bodard<gabriel.bodard at kcl.ac.uk> wrote:
>>>>>>
>>>>>>
>>>>>>> Right. I guess my only objection is that it sounds more like a
>>>>>>> processing instruction than a description of the text. But I take your
>>>>>>> point. Let's see if anyone comes up with any suggestions better than
>>>>>>> either of ours. :-) (It would be nice if what we suggested in the
>>>>>>> example was something that is actually being used... and if we come to a
>>>>>>> consensus I'll recommend changing EpiDoc usage to whatever we use in the
>>>>>>> example in the guidelines.)
>>>>>>>
>>>>>>> (If we don't come to a consensus, as you say, no problem.)
>>>>>>>
>>>>>>> G
>>>>>>>
>>>>>>> Lou Burnard wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Sorry, but I do not follow your logic. "nobreak" says something about
>>>>>>>> the type of <lb> -- it is a "non-breaking" line break.  The implication
>>>>>>>> is that other <lb>s (or <cb>s etc.) are "breaking", i.e. they are
>>>>>>>> understood not only to mark the start of a line, column etc., but also to
>>>>>>>> break a word, so that foo<lb/>bar should be considered to be two words.
>>>>>>>>
>>>>>>>> There are breaks between your words conceptually, I hope? If not, what
>>>>>>>> is the point of trying to distinguish types of <lb> anyway?
>>>>>>>>
>>>>>>>> If epidockers don't like this, though, they can always make up their own
>>>>>>>> terminology -- the type value is not constrained by the schema.
>>>>>>>>
>>>>>>>> Gabriel Bodard wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm not sure I like "nobreak", as it doesn't really say anything about
>>>>>>>>> the status of the lb (or, as Dot points out, cb, pb, etc.); especially
>>>>>>>>> since there are never (or rarely) breaks _between_ words in our texts.
>>>>>>>>> The idea behind "worddiv" was that this is a linebreak that appears
>>>>>>>>> mid-word, splitting it atwain, as Dan has it. Let me canvass the EpiDoc
>>>>>>>>> markup list, and see if people there have opinions one way or the other
>>>>>>>>> to contribute to this...
>>>>>>>>>
>>>>>>>>> G
>>>>>>>>>
>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> After much head scratching here in Oxford, we've decided on "nobreak"
>>>>>>>>>>
>>>>>>>>>> I added a couple more examples and a bit more discussion, taking
>>>>>>>>>> examples from some real projects too. Affected are the definition for
>>>>>>>>>> <lb> and the discussion of milestones in CO.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Daniel Paul O'Donnell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I think "word-dividing" in this case means "splitting individual words
>>>>>>>>>>> atwain" rather than "demarcating their boundaries" ;)
>>>>>>>>>>>
>>>>>>>>>>> In my edition of Cædmon's Hymn I needed to encode space and lb
>>>>>>>>>>> similarly explicitly: i.e. indicating whether it fell within the word
>>>>>>>>>>> or between words: the stylesheets (such as they were in those days)
>>>>>>>>>>> handled them differently depending on the value of @type (which I'd
>>>>>>>>>>> made universal). White space wouldn't have done it for me, because I
>>>>>>>>>>> was reformatting the data with and without the word-internal spaces
>>>>>>>>>>> and lines depending on the view the user selected.
>>>>>>>>>>>
>>>>>>>>>>> -dan
>>>>>>>>>>>
>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Gabriel BODARD wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>> (9) lb: should we add an example of the usage of
>>>>>>>>>>>>>>> lb/type=word-dividing, which currently sits a little uncomfortably
>>>>>>>>>>>>>>> in the note. I suggest "Cae<lb type="worddiv"/>sari".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Don't know what note you're referring to. Don't see the point of
>>>>>>>>>>>>>> the @type attribute. Haven't done anything.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> This was discussed some months ago, and is the reason @type was
>>>>>>>>>>>>> allowed on <lb> in the first place. There is currently a note at the
>>>>>>>>>>>>> bottom of LB that says: "The type attribute may be used to
>>>>>>>>>>>>> characterize the linebreak in any respect, for example as
>>>>>>>>>>>>> word-breaking or not." We have literally thousands of examples of
>>>>>>>>>>>>> this in EpiDoc files, where words are not always tagged explicitly
>>>>>>>>>>>>> and it's the only way we can be sure to tokenize correctly. I just
>>>>>>>>>>>>> thought an example would help to clarify the use-case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (If people feel strongly that [e.g.] "wordDividing" would be a
>>>>>>>>>>>>> better recommended value than "worddiv", I'm happy to make that part
>>>>>>>>>>>>> of our P5 upgrade script.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> I don't mind adding examples, but this one confuses me. Isn't the
>>>>>>>>>>>> point that the <lb/> in your example does NOT divide the word? So
>>>>>>>>>>>> both "wordDividing" and "worddiv" seem exactly the opposite of what
>>>>>>>>>>>> you want here. How about "nowordbreak" or "nwb"?
>>>>>>>>>>>>
>>>>>>>>>>>> I know I lost this argument last time, but I still think in practice
>>>>>>>>>>>> I'd deal with this by putting in whitespace where the <lb> coincided
>>>>>>>>>>>> with a word boundary and leaving it out where it didn't!
>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> G
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>> --
>>>>>>> Dr Gabriel BODARD
>>>>>>> (Epigrapher & Digital Classicist)
>>>>>>>
>>>>>>> Centre for Computing in the Humanities
>>>>>>> King's College London
>>>>>>> 26-29 Drury Lane
>>>>>>> London WC2B 5RL
>>>>>>> Email: gabriel.bodard at kcl.ac.uk
>>>>>>> Tel: +44 (0)20 7848 1388
>>>>>>> Fax: +44 (0)20 7848 2980
>>>>>>>
>>>>>>> http://www.digitalclassicist.org/
>>>>>>> http://www.currentepigraphy.org/
>>>>>> --
>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>> Dot Porter (MA, MSLS)          Metadata Manager
>>>>>> Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
>>>>>> Pembroke Street, Dublin 2, Ireland
>>>>>> -- A Project of the Royal Irish Academy --
>>>>>> Phone: +353 1 234 2444        Fax: +353 1 234 2400
>>>>>> http://dho.ie          Email: dot.porter at gmail.com
>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>
>>>>>
>>>>
>>
>>
>



-- 
*~*~*~*~*~*~*~*~*~*~*
Dot Porter (MA, MSLS)          Metadata Manager
Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
Pembroke Street, Dublin 2, Ireland
-- A Project of the Royal Irish Academy --
Phone: +353 1 234 2444        Fax: +353 1 234 2400
http://dho.ie          Email: dot.porter at gmail.com
*~*~*~*~*~*~*~*~*~*~*

