[tei-council] datatype issues (part 1) continued,,,
Syd Bauman
Syd_Bauman at Brown.edu
Tue Sep 20 10:23:43 EDT 2005
> The only places [tei.data.regexp] is used at present is for
> attributes like metDecl which have as their value a string of
> gobbledegook in some special syntax defined by the TEI.
Huh?? Even in P3 the pattern= attribute of <metDecl> was defined as a
regular expression. In P3 and P4 the regular expression language used
was that created by the TEI for its extended pointer syntax. In P5 we
have made the move to using the W3C regular expression language
instead. I think using a regexp here is and was a good idea,
switching to W3C regexps is a good move, and this attribute should
most certainly remain as is.
> This [string of gobbledegook in some special syntax defined by the
> TEI] seems a useful category of information.
If there are any such attributes, then yes, I agree, it's a useful
category. But don't put pattern= of <metDecl> in it, it doesn't
belong there.
Since we use regexps as an attribute's type very rarely (only 2 or 3
occurrences, I think), I don't really care whether we abstract them
into a datatype or not, although it may be a useful place to explain
it.
> I hesitated a long time over NMTOKEN and NCName. The former allows
> hyphens but not underscore; the latter allows underscore but not
> hyphen.
This is simply untrue. xsd:NMTOKEN maps to an XML NMTOKEN; xsd:NCName
maps to an XML Namespaces NCName, which maps to an XML name except
that colon is not allowed. Both allow hyphen, both allow underscore.
xsd:NCName does not permit the string to *start* with a digit or
punctuation character other than underscore.
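A quick way to check the hyphen/underscore point, as a minimal sketch
only: the two patterns below are ASCII-only approximations of the XML
Nmtoken and Namespaces NCName productions (the real productions also
admit non-ASCII letters, digits, combining characters, and extenders).
The names NMTOKEN_ASCII, NCNAME_ASCII, is_nmtoken, and is_ncname are
made up for this illustration.

```python
import re

# ASCII-only approximations (assumption: non-ASCII name characters are ignored).
# Nmtoken: one or more name characters, in any order.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:-]+")
# NCName: a letter or underscore, then name characters other than colon.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_nmtoken(s: str) -> bool:
    """True if s matches the ASCII approximation of Nmtoken."""
    return NMTOKEN_ASCII.fullmatch(s) is not None


def is_ncname(s: str) -> bool:
    """True if s matches the ASCII approximation of NCName."""
    return NCNAME_ASCII.fullmatch(s) is not None


# Print how each sample value fares against both approximations.
for value in ["a-b", "a_b", "_ab", "1ab", "-ab", "a:b"]:
    print(f"{value!r:8} NMTOKEN={is_nmtoken(value)!s:5} NCName={is_ncname(value)}")
```

Under these approximations, hyphen and underscore are accepted inside
both; the two differ only on the first character and on colon.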
> Syd's proposed pattern
Credit where credit is due: it was your proposal, Lou.
> allows [hyphen or underscore] and also comma.
And also semicolon, circled plus sign, curly braces, the Euro
symbol, the copyright symbol, etc.
> I am open to persuasion that both tei.data.key and tei.data.ident
> should have the same mapping; less to defining something which is
> not either NMTOKEN or NCName.
In P4 most of these things were CDATA. I think making them a single
token (normal sense of the word) is a really good idea -- if we could
do so in P4 we should (we can't). I even think forbidding some kinds
of non-whitespace characters would be a really good idea (e.g.,
control characters, PUA characters, etc. See [1] for list). However,
I'm not really sure of the advantage of telling people who would like
to have things like "damaged/deliberate" and "damaged/accidental" as
their values for reason= of <gap> that they cannot, just because W3C
says that names of elements etc. can't have a slash.
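To make the trade-off concrete, here is a rough sketch comparing a
plain "single token" rule with an NMTOKEN-style rule. Both patterns
are assumptions for illustration: SINGLE_TOKEN just means non-empty
with no whitespace, and NMTOKEN_ASCII is an ASCII-only approximation
of xsd:NMTOKEN. The classify() helper is hypothetical.

```python
import re

# Assumption: "single token" here just means non-empty with no whitespace.
SINGLE_TOKEN = re.compile(r"\S+")
# ASCII-only approximation of xsd:NMTOKEN (non-ASCII name characters ignored).
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:-]+")


def classify(value: str) -> tuple:
    """Return (is_single_token, is_ascii_nmtoken) for a candidate attribute value."""
    return (
        SINGLE_TOKEN.fullmatch(value) is not None,
        NMTOKEN_ASCII.fullmatch(value) is not None,
    )


# Print the verdict of both rules for a few candidate values.
for value in ["damaged/deliberate", "damaged/accidental", "damaged", "damaged deliberate"]:
    print(value, classify(value))
```

The slash-separated values pass the single-token rule but fail the
NMTOKEN-style one, which is the restriction being questioned here.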
> So in order of ascending tightness of constraint, we have
>
> tei.data.enumerated : the value is defined by a valList (type=closed)
> tei.data.code : the value is defined by a pointer to something which
> must exist
> tei.data.key : the value is defined by an enumeration elsewhere e.g. a
> database key
> tei.data.ident : the value is a name or identifier of some kind but not
> necessarily enumerated or enumeratable
Where in your scheme do <valList>s of type= "open" and "semi" fit?
I.e., are they still assigned the datatype tei.data.enumerated?
Note
----
[1] A first cut at Unicode categories that should and should not be
permitted in tei.data.[token,ident,name,term]. I am not a Unicode
expert, nor have I taken the time to examine each category
carefully, nor to see into what categories the characters
permitted in xsd:NMTOKEN fall.
    Abbr.  Description                  Should be OK?
    -----  -----------                  ------ -- ---
    Lu     Letter, Uppercase            yes
    Ll     Letter, Lowercase            yes
    Lt     Letter, Titlecase            yes
    Lm     Letter, Modifier             yes?
    Lo     Letter, Other                yes
    Mn     Mark, Nonspacing             ?
    Mc     Mark, Spacing Combining      ?
    Me     Mark, Enclosing              ?
    Nd     Number, Decimal Digit        yes
    Nl     Number, Letter               yes
    No     Number, Other                yes
    Pc     Punctuation, Connector       yes
    Pd     Punctuation, Dash            yes
    Ps     Punctuation, Open            yes
    Pe     Punctuation, Close           yes
    Pi     Punctuation, Initial quote   yes
    Pf     Punctuation, Final quote     yes
    Po     Punctuation, Other           yes
    Sm     Symbol, Math                 yes
    Sc     Symbol, Currency             yes
    Sk     Symbol, Modifier             yes
    So     Symbol, Other                yes
    Zs     Separator, Space             NO
    Zl     Separator, Line              NO
    Zp     Separator, Paragraph         NO
    Cc     Other, Control               NO
    Cf     Other, Format                NO
    Cs     Other, Surrogate             NO
    Co     Other, Private Use           NO
    Cn     Other, Not Assigned          NO
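
The table above could be checked mechanically along the following
lines. This is only a sketch using Python's standard unicodedata
module. The names ALLOWED, UNDECIDED_MARKS, is_permitted, and the
allow_marks switch are hypothetical, and two points are assumptions:
"yes?" is treated as "yes", and the empty string is rejected. The
three Mark categories shown as "?" are left to the caller.

```python
import unicodedata

# Categories marked "yes" (or "yes?") in the table above.
ALLOWED = {
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
}
# Categories marked "?" in the table: undecided, so left to the caller.
UNDECIDED_MARKS = {"Mn", "Mc", "Me"}


def is_permitted(value: str, allow_marks: bool = False) -> bool:
    """True if value is non-empty and every character is in a permitted category."""
    allowed = (ALLOWED | UNDECIDED_MARKS) if allow_marks else ALLOWED
    return bool(value) and all(unicodedata.category(ch) in allowed for ch in value)


# Print the verdict for a few sample values (repr keeps control characters visible).
for value in ["damaged/deliberate", "a,b;c{d}", "two words", "bell\x07"]:
    print(repr(value), is_permitted(value))
```

Anything in a Separator or Other category, such as a space or a
control character, makes the whole value unacceptable, while slash,
comma, semicolon, and curly braces pass.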