[tei-council] on regularizing names
Syd Bauman
Syd_Bauman at Brown.edu
Thu Sep 21 08:57:55 EDT 2006
Back in May I posted a discussion of regularization of names
[http://lists.village.virginia.edu/pipermail/tei-council/2006/001353.html].
I reproduce that list of possible solutions, with four more added
here. One of the new additions to the list is that which Julia &
Perry recommended in their work of 2005-07
[http://lists.village.virginia.edu/pipermail/tei-council/2005/000600.html],
another two are simplifications of that.
I will then go through the list, and show that many suggestions are
problematic at best, ending with a set of 7 choices for Council to
consider.
some possibilities
---- -------------
a) <reg> on a par w/ the PCDATA inside name:
<persName>Syd
<reg>Bauman, Sydney D.</reg>
</persName>
b) <reg> with a sister element inside name:
<persName>
<ZZZ>Syd</ZZZ>
<reg>Bauman, Sydney D.</reg>
</persName>
where ZZZ could be "literal", "asIs", "diplomatic", "transcribed"
or some such -- if it is "orig", then this is same as (e)
c) names *in* <choice>:
<choice>
<persName>Syd</persName>
<reg>Bauman, Sydney D.</reg>
</choice>
d) <choice> in names:
<persName>
<choice>
<orig>Syd</orig>
<reg>Bauman, Sydney D.</reg>
</choice>
</persName>
e) name *is* <choice>, as it were:
<persName>
<orig>Syd</orig>
<reg>Bauman, Sydney D.</reg>
</persName>
f) Sorry, no gaiji and no other languages in your
regularizations:
<persName reg="Bauman, Sydney D.">Syd</persName>
g) Sorry, no gaiji, but use another attribute to represent a
different language:
<persName regLang="es" xml:lang="en"
reg="Bia, Alejandro">Alex</persName>
h) Pointer to a regularization and/or a pointer:
This is the method Julia & Perry recommended.
<persName reg="#reg.sb">Syd</persName>
<!-- meanwhile, in header or elsewhere: -->
<regName xml:id="reg.sb"
authority="LCNAF"
target="http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=56..."
>Bauman, Sydney D.</regName>
i) Pointer to a regularization, optional key=:
Use the basic gist of Julia & Perry's suggestion, but rather than
permit a pointer to an authority instead of content (which makes
the distinction between regularizing a name and disambiguating a
person a bit fuzzy), say that, like <persName>, the content of
<regName> is required, and is a *name*. Furthermore, like
<persName>, <regName> can bear a key= attribute. E.g.:
<p>In the 1940s he was known as
<persName reg="#reg25">Ritchie</persName>, but most of us
know him as <persName reg="#reg26">Ringo</persName>.</p>
<!-- meanwhile, in header or elsewhere: -->
<regName key="url:http://www.imdb.com/name/nm0823592/"
xml:id="reg25">Starkey, Richard</regName>
<regName key="url:http://www.imdb.com/name/nm0823592/"
xml:id="reg26">Starr, Ringo</regName>
j) Pointer to a regularization, simple case rules:
As above, but don't permit key= on <regName>
k) Pointer to a regularization, using <persName>: Rather than create
a special element <regName>, use <persName> inside some special
element in the TEI header (<nameList>, <list type="regularNames">,
or some such).
analyses
--------
(a) rubs most people (including me) the wrong way, because the
information is not in parallel structures. In some way, we think
of the PCDATA content of an element as being different than, on a
different level than, the nested element's content.
But more importantly, this method (like several others) makes it
all but impossible for software to reliably extract either the
source name or the regularized name. This is because of the
inherent difficulty differentiating
<persName>Barr., J<reg>Barrington, Jonathan U.</reg></persName>
from
<persName>Barr., <reg>J</reg></persName>
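To make that ambiguity concrete, here is a minimal Python sketch (not part of the original proposal) using the standard xml.etree.ElementTree module. It assumes un-namespaced elements as in the two examples just given; split_name is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def split_name(xml_text):
    """Return (text outside <reg>, text inside <reg>) for one <persName>."""
    name = ET.fromstring(xml_text)
    reg = name.find("reg")
    # Mixed content: text before <reg> is name.text, text after it is reg.tail.
    outside = name.text or ""
    if reg is not None:
        outside += reg.tail or ""
    inside = reg.text if reg is not None else None
    return outside.strip(), inside

whole = split_name(
    "<persName>Barr., J<reg>Barrington, Jonathan U.</reg></persName>")
part = split_name("<persName>Barr., <reg>J</reg></persName>")

print(whole)
print(part)
```

Both calls return the same kind of pair, a string outside <reg> and a string inside it. Nothing in the markup itself says whether <reg> regularizes the whole name or only a fragment of it, so the software would have to guess.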
(b) makes the above differentiation possible, if painful:
<xsl:if test="./ZZZ">
<xsl:value-of select="./reg"/>
</xsl:if>
(c) seems perfectly reasonable to me. It does bring us back to the
content of <choice> problem, though. (What would
<choice><name/><orig/><corr/><reg/></choice>
mean?)
(d) seems cumbersome, but requires no change to our schemas, just to
our prose & examples. However, it would be quite hard if not
impossible for software to differentiate
<persName>
<choice>
<orig>John</orig>
<reg>Barrington, John U.</reg>
</choice>
</persName>
from
<persName>
<choice>
<orig>Iohn</orig>
<reg>John</reg>
</choice>
</persName>
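A small Python sketch of why software cannot tell these two apart, assuming un-namespaced elements as in the examples (shape is a hypothetical helper, not anything in the Guidelines): once the text is set aside, the two encodings have exactly the same element structure.

```python
import xml.etree.ElementTree as ET

def shape(elem):
    """Reduce an element to its tag structure, ignoring all text."""
    return (elem.tag, tuple(shape(child) for child in elem))

# <reg> gives a regularized form of the whole name.
name_form = ET.fromstring(
    "<persName><choice><orig>John</orig>"
    "<reg>Barrington, John U.</reg></choice></persName>")

# <reg> only modernizes the spelling of what was written.
spelling = ET.fromstring(
    "<persName><choice><orig>Iohn</orig>"
    "<reg>John</reg></choice></persName>")

same_markup = shape(name_form) == shape(spelling)
print(same_markup)
```

The only difference is in the character data, so a program would need to interpret the text itself to decide which kind of regularization it is looking at.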
(e) runs into trouble because both <orig> and <reg> are already
permitted as children of <name>. It would be quite hard to
differentiate
<persName>
<orig>Barr., J. V.</orig>
<reg>Barrington, Jonathan U.</reg>
</persName>
from
<persName>Barr., <orig>J</orig>. <reg>V</reg>.</persName>
(Not that anyone actually uses <orig> and <reg> like that, but we
don't want to rely on no one wanting to do that.)
(f) is unacceptable. The main reasons to move something from an
attribute to an element are to be able to use gaiji within it and
to be able to say what natural language it's in. There is no
excuse for wanting a gaiji in a regularized name (if it's not in
Unicode, it's not a regularization, e.g., it couldn't be sorted
by any standard algorithm). However, there is every reason to
want to have regularizations in a different language than the
source. So (f) is out.
(g) tries to solve the problem (f) runs into, but this violates the
explicit semantics of xml:lang=. This is a limitation we tied
ourselves to when we agreed to use xml:lang= and not tei:lang=,
and here we pay the consequences of that decision by not being
able to use (g).
(h) has some strong advantages. However, the optional dual-pronged
approach both leaves the "am I pointing to a name or a person"
question a little fuzzy (but that can be dealt with by defining
the semantics clearly) and makes it harder for software to
actually find the regularized name.
(i) solves the fuzziness problem. reg= always points to a *name*,
which is nothing more than a regularization of a *name*. key=
always refers to a database record (possibly by pointing), which
is about a *person*; it is quite possible that said record has no
information other than a regularized name, of course. Note that
software needs to look in 2 places to try to find this database
record key, though: key= of <persName>, and if not there then the
key= of the <regName> pointed to by the reg= of <persName>.
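The two-place lookup might be sketched as follows in Python, using the Ringo example from (i). This is only an illustration under stated assumptions: the <doc> wrapper element and the resolve function are hypothetical, elements are un-namespaced, and reg= is assumed to be a bare "#id" reference into the same document.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<doc>
<p>In the 1940s he was known as
<persName reg="#reg25">Ritchie</persName>, but most of us
know him as <persName reg="#reg26">Ringo</persName>.</p>
<regName key="url:http://www.imdb.com/name/nm0823592/"
 xml:id="reg25">Starkey, Richard</regName>
<regName key="url:http://www.imdb.com/name/nm0823592/"
 xml:id="reg26">Starr, Ringo</regName>
</doc>"""

def resolve(root, pers_name):
    """Return (regularized name, key) for one <persName>."""
    regs = {r.get(XML_ID): r for r in root.iter("regName")}
    ref = pers_name.get("reg", "")
    target = regs.get(ref[1:]) if ref.startswith("#") else None
    reg_text = target.text if target is not None else None
    # First place to look: key= on the <persName> itself.
    key = pers_name.get("key")
    # Second place: key= on the <regName> that reg= points to.
    if key is None and target is not None:
        key = target.get("key")
    return reg_text, key

root = ET.fromstring(SAMPLE)
resolved = [resolve(root, p) for p in root.iter("persName")]
for reg_text, key in resolved:
    print(reg_text, key)
```

Here neither <persName> carries its own key=, so both fall through to the <regName>; the two different regularized names end up with the same key, i.e., the same person.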
(j) solves the "two places to look" problem for the programmer by
forcing the encoder to put key= on each occurrence of a <persName>
she wants keyed, rather than allowing the indirection of
specifying the key= once on the <regName>.
(k) takes advantage of the fact that <persName> already has the
content model you would want if you care about the inner details
of the name, and already has a key= attribute. The disadvantage is
that we would still have to create a special element, and that
<persName> would also bear a reg= attribute, which would be silly
when it was used in this context.
Thus, I think there are only 7 viable solutions for Council to
consider, listed here in my (current) personal order of preference:
(c): names *in* <choice>
(i): pointer to a regularization
(j): pointer to a regularization, no key=
(b): <reg> with a sister element inside name
(h): pointer to a regularization and/or a pointer
(k): pointer to another <persName>
(e): name *is* <choice>, as it were