[tei-council] Chapter 1

Wed Feb 6 12:42:04 EST 2008

> > 1.3.1.1.1 In reference to @n, it is said "Its value may be any
> > string of characters". Should this be stated as being limited to
> > non-whitespace characters? I see the definition in the schema as
> > @n being of type "data.word",

No, we shouldn't say the value of n= is limited to non-whitespace
characters, because it's not :-)

The value of n= is declared as one or more occurrences of 'data.word',
separated by whitespace. Thus, whitespace is permitted inside the
value. (And I believe the whitespace, although normalized before the
value is checked against the schema for validity, is reported to the
application without normalization, but I'm not 100% sure.)

> > but I'm not familiar with the regular expression components which
> > define it ((\p{L}|\p{N}|\p{P}|\p{S})+).

For our purposes, the outer parens can be ignored, leaving us with
the following (whitespace added for readability):

   ( \p{L} | \p{N} | \p{P} | \p{S} )+

The construction "\p{X}" is called a 'category escape', and means
"all characters in Unicode general category X" (roughly). As you
probably already know, the vertical bar is a disjunction, the plus
sign means "one or more of the preceding pattern", and the parens
group patterns as expected.

So all that's left is to decode the Unicode general categories:

 L = all letters
 N = all numbers
 P = all punctuation
 S = all symbols

So this regular expression says "one or more characters, each of
which is a letter, number, punctuation mark, or symbol".
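
For anyone who wants to poke at this, here is a rough sketch of the
same test in Python. Python's built-in 're' module has no \p{...}
escapes, so the sketch checks the first letter of each character's
general category via the standard 'unicodedata' module instead. The
names ALLOWED_MAJOR and is_data_word are made up for illustration, and
the Unicode tables bundled with Python may differ slightly from those
your schema validator uses.

```python
import unicodedata

# Major general categories accepted by the pattern:
# L (letters), N (numbers), P (punctuation), S (symbols).
ALLOWED_MAJOR = {"L", "N", "P", "S"}


def is_data_word(token: str) -> bool:
    # Approximates ( \p{L} | \p{N} | \p{P} | \p{S} )+ for one token.
    # The '+' requires at least one character, so "" is rejected.
    return bool(token) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR for ch in token
    )


print(is_data_word("3.14"))  # True: digits are N, the full stop is P
print(is_data_word("a b"))   # False: the space is in category Z
```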

Another way to look at it is to ask what characters are *not*
allowed:

 no M = marks      (includes the combining characters)
 no Z = separators (includes whitespace)
 no C = other      (mostly control characters like ESC, BEL, and NUL)
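
To see which characters in a given string fall into those excluded
categories, something like the following sketch may help. It is only
an approximation built on Python's 'unicodedata' module; the function
name 'rejected' is hypothetical, not part of any TEI tool.

```python
import unicodedata

# Major general categories the pattern accepts; M, Z, and C are excluded.
ALLOWED_MAJOR = {"L", "N", "P", "S"}


def rejected(text: str) -> list[tuple[str, str]]:
    # Return (code point, general category) for every character whose
    # major category is not one of L, N, P, S.
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in text
        if unicodedata.category(ch)[0] not in ALLOWED_MAJOR
    ]


print(rejected("e\u0301"))   # [('U+0301', 'Mn')]: combining acute accent
print(rejected("a b"))       # [('U+0020', 'Zs')]: space separator
print(rejected("\x1b"))      # [('U+001B', 'Cc')]: ESC control character
print(rejected("caf\u00e9")) # []: precomposed e-acute is a letter
```

Note the practical consequence of excluding M: an "e" followed by a
combining acute accent is rejected, while the precomposed character
U+00E9 is accepted.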

So most any character you'd ever actually want in a string is
allowed, and probably a lot you wouldn't (like curly quotes, or
MATHEMATICAL SANS-SERIF DIGIT ZERO :-).
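
Putting this together with the earlier point that n= takes one or more
'data.word' tokens separated by whitespace, a whole n= value could be
checked roughly as below. This is a sketch under assumptions: it
splits only on the four XML whitespace characters (space, tab, CR,
LF), and the names is_data_word and is_valid_n are invented for
illustration. A real validator is the authority here.

```python
import re
import unicodedata

ALLOWED_MAJOR = {"L", "N", "P", "S"}


def is_data_word(token: str) -> bool:
    # One token: one or more characters from categories L, N, P, S.
    return bool(token) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR for ch in token
    )


def is_valid_n(value: str) -> bool:
    # Split on runs of XML whitespace only; drop the empty strings that
    # leading or trailing whitespace produces.
    tokens = [t for t in re.split(r"[ \t\r\n]+", value) if t]
    # At least one token is required, and each must be a data.word.
    return bool(tokens) and all(is_data_word(t) for t in tokens)


print(is_valid_n("chapter 1"))  # True: two tokens separated by a space
print(is_valid_n("   "))        # False: no tokens at all
```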