[tei-council] word-dividing

Dot Porter dot.porter at gmail.com
Thu Jul 2 12:37:56 EDT 2009


I don't mean to imply the decisions being made at Oxford are
arbitrary, but it did seem to me, in this instance, the decision was
made in a rush and with little input from anyone other than those at
Oxford. All I'm saying is that in the future we should be a little
more careful that the decisions made by council on behalf of the
entire TEI community be more transparent and made with a bit more care
and input.

Dot

On Thu, Jul 2, 2009 at 5:28 PM, Lou Burnard<lou.burnard at oucs.ox.ac.uk> wrote:
> What is the "markup" list? I'm not on it, and no-one who is saw fit to make
> the suggestion here or I'd have probably seized on it with gratitude.  I
> really must protest at the implication that we're making arbitrary decisions
> here. There was a long discussion when the @type attribute was added to
> these elements ages ago. No one proposed any suggested values at that time.
> In the first message in the thread below, you see Gabriel saying he's open
> to suggestions for them, but again no discussion occurred. My report that
> we'd decided on "nobreak" here at Oxford is at the most recent end of the
> thread, not the most distant one. Give us a break, Dot, we're trying to get
> the job done, and the weather's not helping!
>
> Incidentally, I just received mail from someone else concerned with how this
> textual phenomenon is to be encoded. Their practice is to record *two* <lb>s
> -- one for the "facs" view of the document (faithful to its appearance) and
> the other for the "editorial" view (in which the word is reassembled). So if
> they found a word "foobar" hyphenated at a linebreak, it would be recorded
> like this
>
> foo<lb type="facs"/>bar<lb type="edit"/>
>
> I don't know how many thousands of cases they've got marked up like that
> already...
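
One way to read the two-<lb> practice described above is that each view keeps
only the line breaks of its own type. The Python sketch below illustrates that
reading only; it is not the correspondent's actual processing. The function
name render_view, the wrapping <ab> element, and the trailing word "baz" are
all hypothetical.

```python
import xml.etree.ElementTree as ET


def render_view(xml_text, view):
    """Return the element's text with a newline at each <lb> whose type
    equals `view`; <lb>s of any other type are dropped.

    Assumes a single flat element whose only children are <lb> milestones.
    """
    root = ET.fromstring(xml_text)
    out = [root.text or ""]
    for child in root:
        if child.tag == "lb" and child.get("type") == view:
            out.append("\n")  # this break belongs to the requested view
        out.append(child.tail or "")  # text that follows the milestone
    return "".join(out)


# Hypothetical sample built around the foo/bar example in the message.
sample = '<ab>foo<lb type="facs"/>bar<lb type="edit"/> baz</ab>'
print(repr(render_view(sample, "facs")))
print(repr(render_view(sample, "edit")))
```

With this sample, the "facs" view breaks the line after "foo", while the "edit"
view keeps "foobar" whole and breaks after it.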
>
>
> Dot Porter wrote:
>>
>> "midword" was suggested on the Markup list. It is not overly Latinate,
>> it is clear, I think, to most people (a line break in the middle of
>> the word), and in my opinion it's better than "nobreak". I'd still
>> rather see worddiv given that 1) (in my opinion again) it's not as
>> ambiguous as some seem to think, 2) it effectively describes the lb
>> (dividing a word) and 3) it has a longstanding history of usage (since
>> at least 2002, and currently used in something in the order of 60,000
>> existing EpiDoc TEI documents). I really do think that should count
>> for something.
>>
>> More generally, again in my opinion, this entire discussion represents
>> what I hope is a one-off problem. In one of his earliest messages in
>> this thread, Lou said, "After much head scratching here in Oxford,
>> we've decided on "nobreak"." No real room left for discussion, just a
>> decision made. No request for suggestions. David did make a
>> suggestion, but it wasn't asked for. The argument Lou sent today,
>> which sets out his thoughts in some detail, should have come before
>> the final text went out in the Guidelines.
>>
>> I think it's dangerous to give a single person, or group of people,
>> the power to override everyone else who should be involved in the
>> decision making process even if, as in this case, it's not a very
>> important matter. That's all.
>>
>> Dot
>>
>> On Thu, Jul 2, 2009 at 12:34 PM, Lou Burnard<lou.burnard at oucs.ox.ac.uk>
>> wrote:
>>
>>>
>>> With a few exceptions, the examples throughout the Guidelines are chosen
>>> as examples of textual phenomena, not of ways those textual phenomena
>>> have been treated in particular encoding projects. Since @type on <lb>
>>> or <pb> was only recently introduced it's unsurprising that there aren't
>>> that many existing TEI precedents to follow and, while I yield to no-one
>>> in my admiration for the epidoc project, I would resist pressure to make
>>> it (or any other project) the sole driver for decisions about what goes
>>> in the Guidelines.
>>>
>>> The particular phenomenon we're dealing with here can be -- has been --
>>> dealt with in several different ways -- there are thousands of cases
>>> also of texts in which the encoder has chosen to treat this phenomenon
>>> in a completely different way! In the Bibliotheque Virtuelle des
>>> Humanistes, for example, they mark the word-fragments introduced by the
>>> presence of the <lb/> explicitly. So they would have something like
>>>
>>> <caes full="imperator">imp</caes><lb/><caes>erator</caes>
>>>
>>> ("caes" is short for "caesura" which means something different in
>>> French, apparently)
>>> In other projects (probably even more numerous), they decided to just
>>> move the <lb> to the end of the nearest word:  imperator<lb/>
>>>
>>> My objection to wordDiv, wordDivision, vel sim is just that it's
>>> ambiguous as between "division between words" or "division within a
>>> word". Since the whole point of this attribute is to specify exactly
>>> which of those two is the case, this seems a bad idea. With all
>>> humility, I still think that "nobreak" is less ambiguous -- it implies
>>> that although the name of the element bearing it implies some kind of
>>> "break", in this particular case, the break isn't considered to be
>>> there.  I am perfectly amenable to other suggestions, but the only one
>>> I've seen so far is David's. "intraword" is certainly unambiguous (at
>>> least to those who've been properly educated) but does seem a bit
>>> long-winded. Remember that we'd like these values to be comprehensible
>>> to native speakers of non-Latin languages as well if possible.
>>>
>>>
>>> Gabriel BODARD wrote:
>>>
>>>>
>>>> Sure it doesn't terribly matter what the attribute value is, since it's
>>>> not constrained, but aren't these examples supposed where possible to be
>>>> based on real usage? Why then would you invent an attribute value that
>>>> no one's using, rather than using the value that has been used in tens
>>>> of thousands of examples in the real world?
>>>>
>>>> G
>>>>
>>>> Lou Burnard wrote:
>>>>
>>>>
>>>>>
>>>>> We considered that, but it's a bit latinate, don't you think?
>>>>>
>>>>> I agree with Dan that there's no available time to sweat this further
>>>>> (despite the weather :-). If people want to make further changes to
>>>>> wording (I'm assuming everyone has actually looked at the newly revised
>>>>> examples and discussion?) they will go into the mix for next time, but
>>>>> we need to get this error fixing release out the door today.
>>>>>
>>>>>
>>>>>
>>>>> David Sewell wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> As a naive non-epigraphist, I would find this unambiguous, for what
>>>>>> it's worth:
>>>>>>
>>>>>>  <lb type="intraword"/>
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Wed, 1 Jul 2009, Dot Porter wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Dan, I don't think anyone is suggesting the value be technically
>>>>>>> controlled, but we want an example in the Guidelines. And as people
>>>>>>> tend to take the Guidelines suggestions quite seriously, it's worth
>>>>>>> considering what the suggested value be.
>>>>>>>
>>>>>>> Dot
>>>>>>>
>>>>>>> On Wed, Jul 1, 2009 at 5:45 PM, O'Donnell,
>>>>>>> Dan<daniel.odonnell at uleth.ca> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I also don't understand why we are sweating the att value. Are we
>>>>>>>> really interested in controlling this vocabulary? Why?
>>>>>>>>
>>>>>>>> -----------
>>>>>>>> Daniel O'Donnell
>>>>>>>> University of Lethbridge
>>>>>>>> (From my mobile telephone)
>>>>>>>>
>>>>>>>> --- original message ---
>>>>>>>> From: "Dot Porter" <dot.porter at gmail.com>
>>>>>>>> Subject: Re: [tei-council] word-dividing
>>>>>>>> Date: July 1, 2009
>>>>>>>> Time: 10:17:9
>>>>>>>>
>>>>>>>> I don't really understand the concern here. An lb (or cb, or pb) that
>>>>>>>> appears in the middle of a word physically divides that word, hence
>>>>>>>> "worddiv". As long as this usage is defined clearly in the Guidelines
>>>>>>>> ("use @type='worddiv' to mark lb, pb or cb that physically divide
>>>>>>>> words") I don't think there will be any confusion on the part of
>>>>>>>> users. It's clear. And there's a history of usage, since EpiDoc is
>>>>>>>> already doing this, and has been. Why mess with something that works?
>>>>>>>>
>>>>>>>> Dot
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2009 at 5:08 PM, Gabriel
>>>>>>>> Bodard<gabriel.bodard at kcl.ac.uk> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right. I guess my only objection is that it sounds more like a
>>>>>>>>> processing instruction than a description of the text. But I take your
>>>>>>>>> point. Let's see if anyone comes up with any suggestions better than
>>>>>>>>> either of ours. :-) (It would be nice if what we suggested in the
>>>>>>>>> example was something that is actually being used... and if we come to a
>>>>>>>>> consensus I'll recommend changing EpiDoc usage to whatever we use in the
>>>>>>>>> example in the guidelines.)
>>>>>>>>>
>>>>>>>>> (If we don't come to a consensus, as you say, no problem.)
>>>>>>>>>
>>>>>>>>> G
>>>>>>>>>
>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sorry, but I do not follow your logic. "nobreak" says something about
>>>>>>>>>> the type of <lb> -- it is a "non-breaking" line break. The implication
>>>>>>>>>> is that other <lb>s (or <cb>s etc.) are "breaking", i.e. they are
>>>>>>>>>> understood not only to mark the start of a line, column etc., but also to
>>>>>>>>>> break a word, so that foo<lb/>bar should be considered to be two words.
>>>>>>>>>>
>>>>>>>>>> There are breaks between your words conceptually, I hope? If not, what
>>>>>>>>>> is the point of trying to distinguish types of <lb> anyway?
>>>>>>>>>>
>>>>>>>>>> If epidockers don't like this though they can always make up their own
>>>>>>>>>> terminology -- the type value is not constrained by the schema.
>>>>>>>>>>
>>>>>>>>>> Gabriel Bodard wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure I like "nobreak", as it doesn't really say anything about
>>>>>>>>>>> the status of the lb (or, as Dot points out, cb, pb, etc.); especially
>>>>>>>>>>> since there are never (or rarely) breaks _between_ words in our texts.
>>>>>>>>>>> The idea behind "worddiv" was that this is a linebreak that appears
>>>>>>>>>>> mid-word, splitting it atwain, as Dan has it. Let me canvass the EpiDoc
>>>>>>>>>>> markup list, and see if people there have opinions one way or the other
>>>>>>>>>>> to contribute to this...
>>>>>>>>>>>
>>>>>>>>>>> G
>>>>>>>>>>>
>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After much head scratching here in Oxford, we've decided on "nobreak".
>>>>>>>>>>>>
>>>>>>>>>>>> I added a couple more examples and a bit more discussion, taking
>>>>>>>>>>>> examples from some real projects too. Affected are the definition for
>>>>>>>>>>>> <lb> and the discussion of milestones in CO.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Daniel Paul O'Donnell wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think "word-dividing" in this case means "splitting individual words
>>>>>>>>>>>>> atwain" rather than "demarcating their boundaries" ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my edition of Cædmon's Hymn I needed to encode space and lb
>>>>>>>>>>>>> similarly explicitly: i.e. indicating whether it fell within the word
>>>>>>>>>>>>> or between words: the stylesheets (such as they were in those days)
>>>>>>>>>>>>> handled them differently depending on the value of @type (which I'd
>>>>>>>>>>>>> made universal). White space wouldn't have done it for me, because I
>>>>>>>>>>>>> was reformatting the data with and without the word-internal spaces
>>>>>>>>>>>>> and lines depending on the view the user selected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -dan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gabriel BODARD wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lou Burnard wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (9) lb: should we add an example of the usage of
>>>>>>>>>>>>>>>>> lb/type=word-dividing, which currently sits a little uncomfortably
>>>>>>>>>>>>>>>>> in the note? I suggest "Cae<lb type="worddiv"/>sari".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Don't know what note you're referring to. Don't see the point of
>>>>>>>>>>>>>>>> the @type attribute. Haven't done anything.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This was discussed some months ago, and is the reason @type was
>>>>>>>>>>>>>>> allowed on <lb> in the first place. There is currently a note at the
>>>>>>>>>>>>>>> bottom of LB that says: "The type attribute may be used to
>>>>>>>>>>>>>>> characterize the linebreak in any respect, for example as
>>>>>>>>>>>>>>> word-breaking or not." We have literally thousands of examples of
>>>>>>>>>>>>>>> this in EpiDoc files, where words are not always tagged explicitly
>>>>>>>>>>>>>>> and it's the only way we can be sure to tokenize correctly. I just
>>>>>>>>>>>>>>> thought an example would help to clarify the use-case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (If people feel strongly that [e.g.] "wordDividing" would be a
>>>>>>>>>>>>>>> better recommended value than "worddiv", I'm happy to make that part
>>>>>>>>>>>>>>> of our P5 upgrade script.)
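
The tokenization use-case mentioned above can be sketched roughly as follows.
This is a minimal Python illustration, not EpiDoc's actual tooling: the
function name tokenize_ab, the <ab> wrapper, and the sample text are
hypothetical, and "worddiv" is just one of the candidate values under
discussion, passed in as a parameter so another value could be swapped in.

```python
import xml.etree.ElementTree as ET


def tokenize_ab(xml_text, midword_value="worddiv"):
    """Split the text of a flat element into words.

    An <lb> counts as a word boundary unless its type attribute equals
    `midword_value`, in which case the fragments on either side are rejoined.
    Assumes the element's only children are <lb> milestones.
    """
    root = ET.fromstring(xml_text)
    pieces = [root.text or ""]
    for child in root:
        if child.tag == "lb" and child.get("type") != midword_value:
            pieces.append(" ")  # ordinary line break: also separates words
        # a mid-word <lb> adds nothing, so the surrounding fragments join up
        pieces.append(child.tail or "")
    return "".join(pieces).split()


# Hypothetical sample: two words, each split across a line.
sample = '<ab>imp<lb type="worddiv"/>erator<lb/>Cae<lb type="worddiv"/>sari</ab>'
print(tokenize_ab(sample))
```

Without the type value, the same split word would come out as two tokens, which
is the failure the attribute is meant to prevent when words are not tagged
explicitly.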
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't mind adding examples, but this one confuses me. Isn't the
>>>>>>>>>>>>>> point that the <lb/> in your example does NOT divide the word? So
>>>>>>>>>>>>>> both "wordDividing" and "worddiv" seem exactly the opposite of what
>>>>>>>>>>>>>> you want here. How about "nowordbreak" or "nwb"?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know I lost this argument last time, but I still think in practice
>>>>>>>>>>>>>> I'd deal with this by putting in whitespace where the <lb> coincided
>>>>>>>>>>>>>> with a word boundary and leaving it out where it didn't!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> tei-council mailing list
>>>>>>>>>>>>>> tei-council at lists.village.Virginia.EDU
>>>>>>>>>>>>>> http://lists.village.Virginia.EDU/mailman/listinfo/tei-council
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dr Gabriel BODARD
>>>>>>>>> (Epigrapher & Digital Classicist)
>>>>>>>>>
>>>>>>>>> Centre for Computing in the Humanities
>>>>>>>>> King's College London
>>>>>>>>> 26-29 Drury Lane
>>>>>>>>> London WC2B 5RL
>>>>>>>>> Email: gabriel.bodard at kcl.ac.uk
>>>>>>>>> Tel: +44 (0)20 7848 1388
>>>>>>>>> Fax: +44 (0)20 7848 2980
>>>>>>>>>
>>>>>>>>> http://www.digitalclassicist.org/
>>>>>>>>> http://www.currentepigraphy.org/
>>>>>>>>
>>>>>>>> --
>>>>>>>> *~*~*~*~*~*~*~*~*~*~*
>>>>>>>> Dot Porter (MA, MSLS)          Metadata Manager
>>>>>>>> Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
>>>>>>>> Pembroke Street, Dublin 2, Ireland
>>>>>>>> -- A Project of the Royal Irish Academy --
>>>>>>>> Phone: +353 1 234 2444        Fax: +353 1 234 2400
>>>>>>>> http://dho.ie          Email: dot.porter at gmail.com
>>>>>>>> *~*~*~*~*~*~*~*~*~*~*



-- 
*~*~*~*~*~*~*~*~*~*~*
Dot Porter (MA, MSLS)          Metadata Manager
Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
Pembroke Street, Dublin 2, Ireland
-- A Project of the Royal Irish Academy --
Phone: +353 1 234 2444        Fax: +353 1 234 2400
http://dho.ie          Email: dot.porter at gmail.com
*~*~*~*~*~*~*~*~*~*~*

