[tei-council] on regularizing names

Thu Sep 21 08:57:55 EDT 2006

Back in May I posted a discussion of regularization of names
[http://lists.village.virginia.edu/pipermail/tei-council/2006/001353.html].

I reproduce that list of possible solutions here, with four more
added. One of the new additions is the one Julia & Perry recommended
in their work of 2005-07
[http://lists.village.virginia.edu/pipermail/tei-council/2005/000600.html];
another two are simplifications of it.

I will then go through the list, and show that many suggestions are
problematic at best, ending with a set of 7 choices for Council to
consider.

some possibilities
---- -------------
a) <reg> on a par w/ the PCDATA inside name:
     <persName>Syd
       <reg>Bauman, Sydney D.</reg>
     </persName>

b) <reg> with a sister element inside name:
     <persName>
       <ZZZ>Syd</ZZZ>
       <reg>Bauman, Sydney D.</reg>
     </persName>
   where ZZZ could be "literal", "asIs", "diplomatic", "transcribed"
   or some such -- if it is "orig", then this is same as (e)

c) names *in* <choice>:
     <choice>
       <persName>Syd</persName>
       <reg>Bauman, Sydney D.</reg>
     </choice>

d) <choice> in names:
     <persName>
       <choice>
         <orig>Syd</orig>
         <reg>Bauman, Sydney D.</reg>
       </choice>
     </persName>

e) name *is* <choice>, as it were:
     <persName>
       <orig>Syd</orig>
       <reg>Bauman, Sydney D.</reg>
     </persName>

f) Sorry, no gaiji and no other languages in your
   regularizations:
   <persName reg="Bauman, Sydney D.">Syd</persName>

g) Sorry, no gaiji, but use another attribute to represent a
   different language:
   <persName regLang="es" xml:lang="en"
             reg="Bia, Alejandro">Alex</persName>

h) Pointer to a regularization and/or a pointer:
   This is the method Julia & Perry recommended.
   <persName reg="#reg.sb">Syd</persName>
   <!-- meanwhile, in header or elsewhere: -->
   <regName xml:id="reg.sb"
            authority="LCNAF"
            target="http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=56..."
            >Bauman, Sydney D.</regName>

i) Pointer to a regularization, optional key=:
   Use the basic gist of Julia & Perry's suggestion, but rather than
   permit a pointer to an authority instead of content (which makes
   the distinction between regularizing a name and disambiguating a
   person a bit fuzzy), say that, like <persName>, the content of
   <regName> is required, and is a *name*. Furthermore, like
   <persName>, <regName> can bear a key= attribute. E.g.:
      <p>In the 1940s he was known as
      <persName reg="#reg25">Ritchie</persName>, but most of us
      know him as <persName reg="#reg26">Ringo</persName>.</p>
      <!-- meanwhile, in header or elsewhere: -->
      <regName key="url:http://www.imdb.com/name/nm0823592/"
               xml:id="reg25">Starkey, Richard</regName>
      <regName key="url:http://www.imdb.com/name/nm0823592/"
               xml:id="reg26">Starr, Ringo</regName>

j) Pointer to a regularization, simple case rules:
   As above, but don't permit key= on <regName>

k) Pointer to a regularization, using <persName>: Rather than create
   a special element <regName>, use <persName> inside some special
   element in the TEI header (<nameList>, <list type="regularNames">,
   or some such).

analyses
--------
(a) rubs most people (including me) the wrong way, because the
    information is not in parallel structures. In some way, we think
    of the PCDATA content of an element as being different than, on a
    different level than, the nested element's content.
    But more importantly, this method (like several others) makes it
    all but impossible for software to reliably extract either the
    source name or the regularized name. This is because of the
    inherent difficulty differentiating
      <persName>Barr., J<reg>Barrington, Jonathan U.</reg></persName>
    from
      <persName>Barr., <reg>J</reg></persName>
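
To make the extraction problem in (a) concrete, here is a minimal
Python sketch using the standard library's xml.etree.ElementTree. The
function name naive_split is made up for illustration; it is a guess
at how a simple program might approach the task, not anyone's actual
tool. It returns the same kind of pair for both encodings, and nothing
in that pair says whether <reg> stands for the whole name or only for
a piece of it.

```python
import xml.etree.ElementTree as ET

# The two encodings from the analysis of (a), verbatim.
WHOLE = "<persName>Barr., J<reg>Barrington, Jonathan U.</reg></persName>"
PART = "<persName>Barr., <reg>J</reg></persName>"


def naive_split(xml_text):
    """Return (text before <reg>, content of <reg>) for one <persName>.

    Hypothetical helper: it assumes exactly one <reg> child.
    """
    name = ET.fromstring(xml_text)
    reg = name.find("reg")
    return (name.text or "", reg.text or "")


if __name__ == "__main__":
    # Both samples yield a (text, reg) pair of identical shape.
    for sample in (WHOLE, PART):
        print(naive_split(sample))
```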

(b) makes the above differentiation possible, if painful:
    <xsl:if test="./ZZZ">
      <xsl:value-of select="./reg"/>
    </xsl:if>
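
The same test as the XSLT above, sketched in Python with
xml.etree.ElementTree. "ZZZ" is still only the placeholder name from
option (b), and whole_name_reg is a hypothetical helper. It assumes
that a <reg> with no <ZZZ> sibling is not a regularization of the
whole name.

```python
import xml.etree.ElementTree as ET

# Option (b) as given above; "ZZZ" is a placeholder element name.
SAMPLE = """<persName>
  <ZZZ>Syd</ZZZ>
  <reg>Bauman, Sydney D.</reg>
</persName>"""


def whole_name_reg(name):
    """Return the regularized whole name, or None.

    Assumption: only a <reg> that has a <ZZZ> sibling regularizes
    the whole name; without <ZZZ> we report nothing.
    """
    if name.find("ZZZ") is None:
        return None
    reg = name.find("reg")
    return reg.text if reg is not None else None


if __name__ == "__main__":
    print(whole_name_reg(ET.fromstring(SAMPLE)))
```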

(c) seems perfectly reasonable to me. Does bring us back to the
    content of <choice> problem, though. (What would
    <choice><name/><orig/><corr/><reg/></choice>
    mean?)

(d) seems cumbersome, but requires no change to our schemas, just to
    our prose & examples. However, it would be quite hard if not
    impossible for software to differentiate
      <persName>
        <choice>
          <orig>John</orig>
          <reg>Barrington, John U.</reg>
        </choice>
      </persName>
    from
      <persName>
        <choice>
          <orig>Iohn</orig>
          <reg>John</reg>
        </choice>
      </persName>
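
A small Python sketch of why software has nothing to go on here. The
helper shape (a made-up name) reduces each encoding to its nesting of
element names, ignoring all text. Under that assumption the two
examples above come out identical, so only the text itself could tell
a name regularization from a spelling regularization.

```python
import xml.etree.ElementTree as ET

# The two encodings from the analysis of (d), whitespace removed.
NAME_REG = ("<persName><choice><orig>John</orig>"
            "<reg>Barrington, John U.</reg></choice></persName>")
SPELLING_REG = ("<persName><choice><orig>Iohn</orig>"
                "<reg>John</reg></choice></persName>")


def shape(elem):
    """Return a nested tuple of element names, ignoring all text."""
    return (elem.tag, tuple(shape(child) for child in elem))


if __name__ == "__main__":
    same = shape(ET.fromstring(NAME_REG)) == shape(ET.fromstring(SPELLING_REG))
    print("same markup shape:", same)
```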

(e) runs into trouble because both <orig> and <reg> are already
    permitted as children of <name>. It would be quite hard to
    differentiate 
      <persName>
        <orig>Barr., J. V.</orig>
        <reg>Barrington, Jonathan U.</reg>
      </persName>
    from
      <persName>Barr., <orig>J</orig>. <reg>V</reg>.</persName>
    (Not that anyone actually uses <orig> and <reg> like that, but we
    don't want to rely on no one wanting to do that.)

(f) is unacceptable. The main reasons to move something from an
    attribute to an element are to be able to use gaiji within it and
    to be able to say what natural language it's in. There is no
    excuse for wanting a gaiji in a regularized name (if it's not in
    Unicode, it's not a regularization, e.g., it couldn't be sorted
    by any standard algorithm). However, there is every reason to
    want to have regularizations in a different language than the
    source. So (f) is out.

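(g) tries to solve the problem (f) runs into, but this violates the
    explicit semantics of xml:lang=, which applies to the attribute
    values of an element as well as to its content (so in the example
    reg= would be declared English, whatever regLang= says). This is
    a limitation we tied ourselves to when we agreed to use xml:lang=
    and not tei:lang=, and here we pay the consequences of that
    decision by not being able to use (g).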

(h) has some strong advantages. However, the optional dual-pronged
    approach both leaves the "am I pointing to a name or a person"
    question a little fuzzy (but that can be dealt with by defining
    the semantics clearly) and makes it harder for software to
    actually find the regularized name.

(i) solves the fuzziness problem. reg= always points to a *name*,
    which is nothing more than a regularization of a *name*. key=
    always refers to a database record (possibly by pointing), which
    is about a *person*; it is quite possible that said record has no
    information other than a regularized name, of course. Note that
    software needs to look in 2 places to try to find this database
    record key, though: key= of <persName>, and if not there then the
    key= of the <regName> pointed to by the reg= of <persName>.
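
The two-place lookup can be sketched in Python roughly as follows.
This is only an illustration under assumptions: the <doc> wrapper and
the function name find_key are invented, and reg= is assumed to hold
a bare "#id" pointer into the same document.

```python
import xml.etree.ElementTree as ET

# ElementTree expands xml:id to this namespaced attribute name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The example from option (i), inside a hypothetical <doc> wrapper.
DOC = """<doc>
  <p>In the 1940s he was known as
  <persName reg="#reg25">Ritchie</persName>, but most of us
  know him as <persName reg="#reg26">Ringo</persName>.</p>
  <regName key="url:http://www.imdb.com/name/nm0823592/"
           xml:id="reg25">Starkey, Richard</regName>
  <regName key="url:http://www.imdb.com/name/nm0823592/"
           xml:id="reg26">Starr, Ringo</regName>
</doc>"""


def find_key(root, name):
    """Return the key for a <persName>, or None.

    First place: key= on the <persName> itself. Second place: key=
    on the <regName> that its reg= points to.
    """
    key = name.get("key")
    if key is not None:
        return key
    pointer = name.get("reg", "")
    if not pointer.startswith("#"):
        return None  # assumption: only same-document pointers
    target_id = pointer[1:]
    for reg_name in root.iter("regName"):
        if reg_name.get(XML_ID) == target_id:
            return reg_name.get("key")
    return None


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    for name in root.iter("persName"):
        print(name.text, find_key(root, name))
```

Under (j) the second half of find_key would simply go away, at the
cost of the encoder repeating key= on every <persName>.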

(j) solves the "two places to look" problem for the programmer by
    forcing the encoder to put key= on each occurrence of a <persName>
    she wants keyed, rather than allowing the indirection of
    specifying the key= once on the <regName>.

(k) takes advantage of the fact that <persName> already has the
    content model you would want if you care about the inner details
    of the name, and already has a key= attribute. The disadvantages
    are that we would still have to create a special element, and
    that <persName> would also bear a reg= attribute, which would be
    silly when it was used in this context.

Thus, I think there are only 7 viable solutions for Council to
consider, listed here in my (current) personal order of preference:
  (c): names *in* <choice>
  (i): pointer to a regularization, optional key=
  (j): pointer to a regularization, no key=
  (b): <reg> with a sister element inside name
  (h): pointer to a regularization and/or a pointer
  (k): pointer to another <persName>
  (e): name *is* <choice>, as it were