[tei-council] datatype issues (part 1) continued,,,

Syd Bauman Syd_Bauman at Brown.edu
Tue Sep 20 10:23:43 EDT 2005


> The only places [tei.data.regexp] is used at present is for
> attributes like metDecl which have as their value a string of
> gobbledegook in some special syntax defined by the TEI.

Huh?? Even in P3 the pattern= attribute of <metDecl> was defined as a
regular expression. In P3 and P4 the regular expression language used
was that created by the TEI for its extended pointer syntax. In P5 we
have made the move to using the W3C regular expression language
instead. I think using a regexp here is and was a good idea,
switching to W3C regexps is a good move, and this attribute should
most certainly remain as is.


> This [string of gobbledegook in some special syntax defined by the
> TEI] seems a useful category of information.

If there are any such attributes, then yes, I agree, it's a useful
category. But don't put pattern= of <metDecl> in it; it doesn't
belong there.

Since we use regexps as an attribute's type very rarely (only 2 or 3
occurrences, I think), I don't really care whether we abstract them
into a datatype or not, although such a datatype may be a useful
place to explain the regexp language.


> I hesitated a long time over NMTOKEN and NCName. The former allows
> hyphens but not underscore; the latter allows underscore but not
> hyphen.

This is simply untrue. xsd:NMTOKEN maps to an XML NMTOKEN; xsd:NCName
maps to an XML Namespaces NCName, which maps to an XML name except
that colon is not allowed. Both allow hyphen, both allow underscore.
xsd:NCName does not permit the string to *start* with a digit or
punctuation character other than underscore.
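
To make the difference concrete, here is a rough sketch in Python.
It is an ASCII-only approximation: the real XML productions also
admit many non-ASCII letters, digits, combining characters, and
extenders. The names is_nmtoken and is_ncname are made up for
illustration and are not part of any TEI or W3C tool.

```python
import re

# ASCII-only approximations of the XML productions (an assumption made
# for brevity; the real definitions cover far more of Unicode).
# NMTOKEN: one or more name characters, colon included.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:-]+")
# NCName: must start with a letter or underscore; no colon anywhere.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_nmtoken(value: str) -> bool:
    """True if value matches the ASCII approximation of NMTOKEN."""
    return NMTOKEN_ASCII.fullmatch(value) is not None


def is_ncname(value: str) -> bool:
    """True if value matches the ASCII approximation of NCName."""
    return NCNAME_ASCII.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["a-b_c", "1abc", "-abc", "a:b", "damaged/deliberate"]:
        print(sample, is_nmtoken(sample), is_ncname(sample))
```

Under this approximation a value containing both hyphen and
underscore passes both checks; the two types differ only in how the
string may start and in whether a colon is allowed.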


> Syd's proposed pattern

Credit where credit is due: it was your proposal, Lou.

> allows [hyphen or underscore] and also comma.

And also semicolon, circled plus sign, curly braces, the Euro
symbol, the copyright symbol, etc.


> I am open to persuasion that both tei.data.key and tei.data.ident
> should have the same mapping; less to defining something which is
> not either NMTOKEN or NCName.

In P4 most of these things were CDATA. I think making them a single
token (normal sense of the word) is a really good idea -- if we could
do so in P4 we should (we can't). I even think forbidding some kinds
of non-whitespace characters would be a really good idea (e.g.,
control characters, PUA characters, etc. See [1] for list). However,
I'm not really sure of the advantage of telling people who would like
to have things like "damaged/deliberate" and "damaged/accidental" as
their values for reason= of <gap> that they cannot, just because W3C
says that names of elements etc. can't have a slash. 


> So in order of ascending tightness of constraint, we have
> 
> tei.data.enumerated : the value is defined by a valList (type=closed)
> tei.data.code : the value is defined by a pointer to something which 
>                 must exist
> tei.data.key : the value is defined by an enumeration elsewhere e.g. a 
>                database key
> tei.data.ident : the value is a name or identifier of some kind but not 
>                  necessarily enumerated or enumeratable

Where in your scheme do <valList>s of type= "open" and "semi" fit?
I.e., are they still assigned the datatype tei.data.enumerated?

Note
----
[1] A first cut at Unicode categories that should and should not be
    permitted in tei.data.[token,ident,name,term]. I am not a Unicode
    expert, nor have I taken the time to examine each category
    carefully, nor to see into what categories the characters
    permitted in xsd:NMTOKEN fall.

    Abbr.  Description                    Should be OK?
    -----  -----------                    ------ -- ---
    Lu     Letter, Uppercase              yes
    Ll     Letter, Lowercase              yes
    Lt     Letter, Titlecase              yes
    Lm     Letter, Modifier               yes?
    Lo     Letter, Other                  yes
    Mn     Mark, Nonspacing               ?
    Mc     Mark, Spacing Combining        ?
    Me     Mark, Enclosing                ?
    Nd     Number, Decimal Digit          yes
    Nl     Number, Letter                 yes
    No     Number, Other                  yes
    Pc     Punctuation, Connector         yes
    Pd     Punctuation, Dash              yes
    Ps     Punctuation, Open              yes
    Pe     Punctuation, Close             yes
    Pi     Punctuation, Initial quote     yes
    Pf     Punctuation, Final quote       yes
    Po     Punctuation, Other             yes
    Sm     Symbol, Math                   yes
    Sc     Symbol, Currency               yes
    Sk     Symbol, Modifier               yes
    So     Symbol, Other                  yes
    Zs     Separator, Space               NO
    Zl     Separator, Line                NO
    Zp     Separator, Paragraph           NO
    Cc     Other, Control                 NO
    Cf     Other, Format                  NO
    Cs     Other, Surrogate               NO
    Co     Other, Private Use             NO
    Cn     Other, Not Assigned            NO
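
A minimal sketch of how the table above could be applied, in Python,
using the standard unicodedata module. The function name
is_permitted_token and the allow_marks switch are hypothetical; the
switch exists only because the three Mark categories are still "?"
in the table, and Lm ("yes?") is treated as permitted here. Which
code points count as Cn depends on the Unicode version that ships
with the Python in use.

```python
import unicodedata

# Categories marked "NO" in the table above.
FORBIDDEN = {"Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"}
# Categories marked "?" in the table; undecided, so made switchable.
UNDECIDED = {"Mn", "Mc", "Me"}


def is_permitted_token(value: str, allow_marks: bool = True) -> bool:
    """Check every character of value against the category table.

    An empty string is rejected, since a token needs at least one
    character (an assumption; the table itself does not say).
    """
    if not value:
        return False
    for ch in value:
        category = unicodedata.category(ch)
        if category in FORBIDDEN:
            return False
        if category in UNDECIDED and not allow_marks:
            return False
    return True


if __name__ == "__main__":
    for sample in ["damaged/deliberate", "two words", "caf\u0065\u0301"]:
        print(repr(sample), is_permitted_token(sample))
```

With these categories "damaged/deliberate" is accepted, because the
slash falls in Po, while anything containing a space, a control
character, or a private-use character is rejected.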



