[tei-council] Chapter 1
Syd Bauman
Syd_Bauman at Brown.edu
Wed Feb 6 12:42:04 EST 2008
> > 1.3.1.1.1 In reference to @n, it is said "Its value may be any
> > string of characters". Should this be stated as being limited to
> > non-whitespace characters? I see the definition in the schema as
> > @n being of type "data.word",
No, we shouldn't say the value of n= is limited to non-whitespace
characters, because it's not :-)
The value of n= is declared as 1 or more occurrences of 'data.word',
separated by whitespace. Thus, whitespace is permitted inside the
value. (And I believe the whitespace, although normalized before
comparison against the schema for validity, is reported to the
application w/o normalization, but I'm not 100% sure.)
> > but I'm not familiar with the regular expression components which
> > define it ((\p{L}|\p{N}|\p{P}|\p{S})+).
For our purposes, the outer parens can be ignored, leaving us with
the following (whitespace added for readability):
( \p{L} | \p{N} | \p{P} | \p{S} )+
The construction "\p{X}" is called a 'category escape', and means
"all characters with Unicode property X" (roughly). As you probably
already know, the vertical bar is a disjunction, the plus sign is for
"one or more of the preceding pattern", and the parens group patterns
as expected.
So all that's left is to decode the Unicode properties:
L = all letters
N = all numbers
P = all punctuation
S = all symbols
So this regular expression says "one or more of letters, numbers,
punctuation, and symbols".
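
For anyone who wants to poke at this, here is a minimal Python sketch of the same test. It is only an illustration, not how any TEI validator actually works: Python's standard `re` module does not support `\p{...}`, so the sketch checks each character's Unicode general category with `unicodedata` instead. The names `ALLOWED_MAJOR`, `is_data_word`, and `is_valid_n` are made up for this example, and the results depend on the Unicode version shipped with your Python, which may differ from the one your validator uses.

```python
import re
import unicodedata

# Major Unicode categories accepted by (\p{L}|\p{N}|\p{P}|\p{S})+
ALLOWED_MAJOR = {"L", "N", "P", "S"}


def is_data_word(token: str) -> bool:
    """True if token is non-empty and made only of letters, numbers,
    punctuation, and symbols (hypothetical helper, not a TEI API)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR for ch in token
    )


def is_valid_n(value: str) -> bool:
    """True if value is one or more data.word tokens separated by
    XML whitespace (space, tab, CR, LF)."""
    tokens = [t for t in re.split(r"[ \t\r\n]+", value) if t]
    return bool(tokens) and all(is_data_word(t) for t in tokens)
```

With this, `is_data_word("3.1.4")` is true, `is_data_word("a b")` is false (the space is a separator), and `is_valid_n("chapter 1")` is true because the value is treated as two tokens.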
Another way to look at it is to ask what characters are *not*
allowed:
no M = marks (includes the combining characters)
no Z = separators (includes whitespace)
no C = other (control characters like ESC, BEL, and NUL, plus format,
surrogate, private-use, and unassigned code points)
So almost any character you'd ever actually want in a string is
allowed, and probably a lot you wouldn't want (like curly quotes, or
MATHEMATICAL SANS-SERIF DIGIT ZERO :-).
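
If you are curious which bucket a particular character falls into, a quick way to check (again just a sketch, using Python's `unicodedata`; the `samples` table and the `major_category` helper are invented for this example) is to look at the first letter of its general category:

```python
import unicodedata

# A few sample characters, one per major category discussed above.
samples = {
    "left curly quote": "\u201c",
    "sans-serif zero": unicodedata.lookup("MATHEMATICAL SANS-SERIF DIGIT ZERO"),
    "combining acute": "\u0301",
    "space": " ",
    "ESC": "\x1b",
}


def major_category(ch: str) -> str:
    """Return the major Unicode category letter (L, N, P, S, M, Z, or C)."""
    return unicodedata.category(ch)[0]


for name, ch in samples.items():
    major = major_category(ch)
    verdict = "allowed" if major in "LNPS" else "not allowed"
    print(f"{name}: U+{ord(ch):04X} is {unicodedata.category(ch)} ({verdict})")
```

The curly quote (P) and the mathematical digit (N) come out as allowed; the combining accent (M), the space (Z), and ESC (C) do not.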