[tei-council] Chapter 15 - Language Corpora

Lou's Laptop lou.burnard at oucs.ox.ac.uk
Mon Jan 28 13:30:12 EST 2008


Brett Zamir wrote:
> If the corpus chapter could be used for corpora besides language 
> corpora, any thought to rename the chapter to just Corpora or Textual 
> and Language Corpora?
>
Well, you could argue that all of TEI P5 is about textual corpora in 
some sense! This chapter is specifically about how you use it for 
language corpora; it was introduced largely because we were told that 
many of those working with language corpora needed to have the relevant 
features of TEI picked out for them. Maybe "Language and other corpora"?

>
> The definitions for <interaction> and <langUsage>, (not in source) I 
> think ought to have a comma added before their "etc."
>
agreed


> The definition for <locale> (not in source), I think ought to have 
> commas surrounding "for example".
>
yes: reworded it a bit too
> I changed "It may refer simply to any collection of linguistic data 
> (e.g., written, spoken, or a mixture of the two)" to include the 
> "e.g.,", since besides potentially including sign language, there are 
> also more obscure forms of "linguistic data" such as signals/code, 
> knots, etc.
>
OK
> *15.1 Varieties of Composite Text*
>
> If one were managing a collection which included a document which were 
> published both as part of a larger work (e.g., within a corpus) as 
> well as independently, would there be some particular means for that 
> project to correlate the two together to indicate that they were 
> identical? Sorry if this were already covered somewhere and I just 
> missed it... It's been a while since I read through this, and I've 
> only had time for a one-time through...
I think you would get this information from the headers of the 
respective documents.

>
> *15.2.3 The Setting Description*
>
> When the docs say "it is not possible to encode different settings for 
> the same participant: a participant is deemed to be a person within a 
> specific setting", why is this so?
That is how we define things.

> Might not a person move to another setting (or even the same setting 
> as another existing participant's) within one interaction?

If a participant moves to a different setting, we treat them as a 
different participant. That's common practice so far as I know. Of 
course, at a later stage we might decide that participants X and Y are 
actually the same person, but that's in general quite difficult to do 
for most real-life language corpora I've seen. Remember that a 
"participant" may be "unidentified voice no 2" -- definitely not the 
same as "unidentified voice no 2" in a different setting.
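
A minimal sketch of that convention in markup, not taken from the 
Guidelines: the xml:id values and the prose descriptions are made up for 
illustration, and it assumes the who attribute on <setting> is used to 
tie each setting to its participant. The same unidentified voice heard 
in two places is simply recorded as two participants:

```xml
<profileDesc>
  <particDesc>
    <listPerson>
      <!-- hypothetical identifiers: one participant per setting -->
      <person xml:id="voice2-kitchen">
        <p>Unidentified voice no 2, as heard in the kitchen.</p>
      </person>
      <person xml:id="voice2-garden">
        <p>Unidentified voice no 2, as heard in the garden.</p>
      </person>
    </listPerson>
  </particDesc>
  <settingDesc>
    <!-- each setting points at exactly the participants found in it -->
    <setting who="#voice2-kitchen">
      <locale>kitchen</locale>
    </setting>
    <setting who="#voice2-garden">
      <locale>garden</locale>
    </setting>
  </settingDesc>
</profileDesc>
```

Whether the two <person> elements are the same individual is left 
undecided here; that judgement can be recorded later if it is ever made.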

> If so, would <settingDesc> be added to the list of declarable elements 
> in 15.3.2?
>
It probably should be, but not for this reason.


> *15.3.2 Declarable Elements*
>
> 1) Though perhaps this might seem a stretch at this point, might 
> <scriptStmt> also be expanded to refer as applying to not only 
> "spoken" texts but also any live interactions such as internet chats?
>
Not really. The script underlying a spoken interaction is distinct from 
it. In the case of an IRC chat, the script *is* the interaction.

> 2) For the line "Each element specified, explicitly or implicitly, by 
> the list of identifiers must be of a different type." referring to 
> identifiers specified by @decls, might "type" be changed to "kind" to 
> prevent ambiguity with the @type attribute?
>
if you like! though I don't think there is much ambiguity here.
> *15.5 Recommendations for the Encoding of Large Corpora*
>
> In addition to "required", "recommended" and "optional", perhaps a 
> category of "prohibited" or "removed" might be appropriate (it can 
> help sometimes as much to know what a project does not need or want 
> encoded)...
OK, added "proscribed".


