[tei-council] quotation marks, quotes, etc.

Syd Bauman Syd_Bauman at Brown.edu
Wed Apr 18 05:44:24 EDT 2007


In the following, I'm going to use "said" as the name of the proposed
element for direct speech or thought only. The name is not carved in
stone, but I think Sebastian, Dan, Julia and I all like it better
than anything else that has been suggested.


> I really do not agree with Syd on this issue 
> As I understand it, ...

I'm curious then, Lou, whether you've changed your mind or it's just
that I did such a bad job explaining the proposal, you didn't
recognize it as the one to which we had already agreed! No matter,
really, but I suspect I must have done a very poor job of explaining
it, for your recap does not match the proposal well at all. Let me try
again.


There were actually 2 proposals rolled up in my previous e-mail, so I
will separate them explicitly here.

In all cases here, typographic distinction is no more important for
these elements than it is for <soCalled> et al. That is, it is not
relevant in the abstract, although it is important to some encoders &
projects.


--- 1 ---
Lou: this first proposal is exactly what we spoke at length about on
the phone ~3 weeks ago. You were behind this proposal as long as we
leave <q> as the general-purpose element.

<said> is for direct speech (or its discursive equivalents: e.g.
       reported thought or speech, dialog, etc.), whether real or
       contrived, typically as part of the current text, although I
       suppose one could imagine otherwise. Most common usage is
       likely to be a character's spoken words in a novel or a
       person's spoken words reported in a non-fiction article. In
       English prose it will very often be associated with phrases
       like "he said", or "she asked". <said> would not be a viable
       child of <cit>.

<quote> is for material that is quoted from sources outside the text[1],
        whether correctly or not, whether real or contrived, whether
        originally spoken or written. Most common usage is likely to
        be quoting passages from other documents. May be used in a
        dictionary for real or contrived examples of usage. <quote>
        would continue to be a viable child of <cit>.

<q> is for passages quoted from elsewhere; in narrative, either
    direct or indirect speech or something being quoted from outside
    the text; in dictionaries, real or contrived examples of usage.
    <q> would continue to be a viable child of <cit>, for those who
    don't use the more specific <quote>.

That is, <q> remains exactly as it was in P4; <quote> remains as it
was in P4; <said> takes the role of only the "direct speech or thought"
subset of what <q> handles.

There are of course cases where people speak quotations or quotations
include direct speech, in which case <quote> can go in <said> or
vice-versa. 
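
A minimal sketch of both nestings follows. The surrounding sentences
are invented for illustration (they are not from any corpus); only the
element names come from the proposal above.

   <p><said>As Hamlet puts it, <quote>To be, or not to be</quote>,</said>
   she said.</p>
   <p>The report states, <quote>The witness shouted <said>Stop!</said>
   and ran.</quote></p>

The first paragraph is a spoken quotation (<quote> inside <said>); the
second is a quoted passage that itself contains direct speech (<said>
inside <quote>).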

Some people assert that they can't see the distinction between <quote>
and the proposed <said>, and I'm sorry to say I simply can't
understand this point of view. While there must be some difficult edge
cases, as a general rule I do not think it is difficult to tell these
two phenomena apart at all. I just picked up a mid-20th century
science fiction novel and flipped through the pages. I had no trouble
at all differentiating (the vast majority were <said>). I read an
article in the NY Times over the weekend -- again, no trouble at all
differentiating. I just asked Julia, and she said that in the creation
of the entire WWP corpus so far there has never been a case where an
encoder (most of whom are undergraduates) has had any difficulty making
the distinction. There are 17,104 occurrences of <q> or <quote>
start-tags in the distributed WWP corpus[2]. She said "it's harder to
tell what is and isn't a <list> than it is to [differentiate between
<said> and <quote>]".

There is one definite hole in this proposal: as it is worded, the
encoding of indirect speech (e.g. the bit about grapes in "He said
that he doesn't like grapes") is forced into <q>. The only two
reasonable solutions I see are:
A. add indirect speech to semantics of <said>
B. remove indirect speech from semantics of <q>
Personally, I really don't care. I have never seen anyone encode
indirect speech ever, but that doesn't mean there aren't cases out
there.
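
To make the two options concrete, here is a sketch using the grapes
sentence. Exactly which words count as the indirect speech is my
assumption, and the untagged result under B is only what I would
expect, since no other element is proposed for it.

   As currently worded (indirect speech falls to <q>):
   <p>He said that <q>he doesn't like grapes</q>.</p>

   Under A (<said> also covers indirect speech):
   <p>He said that <said>he doesn't like grapes</said>.</p>

   Under B (indirect speech is no longer in <q>; presumably untagged):
   <p>He said that he doesn't like grapes.</p>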

Here is a passage from an article about an anti-war protest that was
held in Boston while we were in Sofia at the 2005 MM.

   <p>One demonstrator carried a sign that read, <quote>Bush Wants
   Your Children For Cannon Fodder,</quote> and another ...</p>
   <p>[Cindy Sheehan] mentioned a woman who had once e-mailed her
   after she cursed the Bush administration.</p>
   <p><said>She said, <said>Cindy, don't you want to use a little
   nicer language, because you know there might be people sitting on
   the fence that you offend,</said></said> Sheehan told the crowd.
   <said>And do you know what I said? I said, <said>Damn it, why is
   anybody on that fence still?</said></said></p>
   <p><said>A lot of people will come up to me and say, <said>My
   country right or wrong,</said></said> Sheehan added later.
   <said>And you know what I say? When my country is wrong, it is so
   wrong, and it is mandatory for us to stop it, to stop the killing,
   to stop the people in power.</said></p>
   -- http://www.commondreams.org/headlines05/1030-05.htm

I will also note that off the top of my head I can't really see why an
example of usage in a dictionary would be <q> instead of <quote>.
Laurent?


--- 2 ---
The second proposal is essentially an offshoot or amplification of the
first. It stems from my observation that there are people out there
who not only don't want to differentiate between <said> and <quote>,
but who don't want to make the distinctions between <soCalled>,
<mentioned>, <term>, etc., either. In many cases these encoders don't
use any element (they just transcribe the quotation marks), but IIRC
at least one library project used <q> for all of them. This second
proposal leaves <said> and <quote> as they were in the first proposal,
but expands the catchment of <q> to include all of these often-
enclosed-in-quotation-marks sorts of phrase-level phenomena:

<q> is for any of a number of features when differentiating among
    them is not desired, e.g. because it is economically not feasible
    or simply not of interest for the current purpose. Items that may
    be encoded this way include
    - representation of speech or thought
    - quotation
    - technical terms and glosses
    - passages mentioned, not used
    - authorial distance
    and perhaps even
    - from a foreign language
    - linguistically distinct
    - emphasized
    - any other use of quotation marks in the source
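
As a rough sketch of the difference, here is one invented sentence
(not from any real source) encoded first with the specific elements
and then with <q> as the undifferentiated catch-all:

   <p><said>They call it <soCalled>progress</soCalled>,</said> he
   said, pointing to the report's claim that <quote>costs fell by
   half</quote>.</p>

   <p><q>They call it <q>progress</q>,</q> he said, pointing to the
   report's claim that <q>costs fell by half</q>.</p>

The second encoding records only that each phrase was set off in some
way; it does not say whether it was speech, authorial distance, or
quotation.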


Notes
-----
[1] "Outside the text" here does not necessarily mean "not a
    descendant of my ancestor <text> element", but rather something
    quite a bit less precise, more along the lines of "not from 'round
    here". I.e., something in chapter 1 may be a <quote> even if the
    thing being quoted is a passage from chapter 3 of the same
    document. This, of course, blurs the line a bit, and may be worth
    consideration. 
[2] I did not say "elements" because many of these <q> and <quote>
    elements may be partial elements which, via next= and prev=, are
    only part of a complete aggregate element.
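
    For illustration, a single utterance interrupted by a reporting
    clause might look like the sketch below. The sentence and the
    identifiers are invented, and I am assuming P5-style xml:id
    values with "#" pointers; in P4 these would be id= and bare
    IDREFs.

       <p><q xml:id="q1a" next="#q1b">I am not sure,</q> she said,
       <q xml:id="q1b" prev="#q1a">that I agree.</q></p>

    This contributes two start-tags to the count but only one
    aggregate element.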



