18.560 acronyms, text, writing

From: Humanist Discussion Group (by way of Willard McCarty <willard.mccarty_at_kcl.ac.uk>)
Date: Thu, 3 Feb 2005 08:01:50 +0000

               Humanist Discussion Group, Vol. 18, No. 560.
       Centre for Computing in the Humanities, King's College London
                   www.kcl.ac.uk/humanities/cch/humanist/
                        www.princeton.edu/humanist/
                     Submit to: humanist_at_princeton.edu

   [1] From: Norman Hinton <hinton_at_springnet1.com> (10)
         Subject: Re: 18.550 acronyms (was plain text)

   [2] From: Patrick Sahle <sahle_at_uni-koeln.de> (179)
         Subject: Re: 18.543 thoughts on representing (text) [was:
                 thoughts on writing (plain text)]

   [3] From: Wendell Piez <wapiez_at_mulberrytech.com> (62)
         Subject: Re: 18.555 thoughts on writing (plain text)

--[1]------------------------------------------------------------------
         Date: Thu, 03 Feb 2005 07:43:17 +0000
         From: Norman Hinton <hinton_at_springnet1.com>
         Subject: Re: 18.550 acronyms (was plain text)

I don't quite know what you mean. Acronyms used as words were very common
in 20th century English well before the invention of computers -- Cf. the
WW II ones WAAC, SNAFU, COMSUBPAC, CIC, USAF, etc. There were plenty more,
from business and advertising -- I recall people drinking "ojay" well before
the football player/accused was born -- and one branch of my in-laws never
pronounced the words "toilet paper". They thought it was impolite and
called it "teepee" instead. I don't think computers have changed the
acronym situation one whit.

>Good point. But don't these acronymic practices represent a shift in the
>way acronyms are conceived through mainstream culture?

--[2]------------------------------------------------------------------
         Date: Thu, 03 Feb 2005 07:43:42 +0000
         From: Patrick Sahle <sahle_at_uni-koeln.de>
         Subject: Re: 18.543 thoughts on representing (text) [was: thoughts
on writing (plain text)]

[I (*) comment on Alexandre Enkerli (>), Mon, 31 Jan 2005; sorry for the
arbitrary selection of issues]

> [Disclaimer: I mostly work on dynamic *oral* traditions and think of
writing as only a specific mode of language transmission.]

* [I mostly work on historical documents and think of writing as a largely
autonomous system relative to language (in the sense of speech). The following is
restricted to the "representation" of already existing texts.]

> As was probably clear from my message, by "Plain Text" I mean any type
of human-readable computer file format for textual content, including
markup formats.

* Oh! The definitions I was thinking of were something like "only
ASCII characters (in no modified usage)" or "what is left when you strip
off all markup". I agree, this would leave all problems to the notion of
"markup". To me even the Wikipedia definition is somewhat unclear: "plain
text files are files with generally a one-to-one correspondence between the
bytes and ordinary readable characters such as letters and digits". But
what does "ordinary readable" mean in this case? Is the "<" sign in an XML
file to be ordinarily read? Or is it to be read in an extraordinary way? I
think the latter holds true. Thus I would say that a marked-up text (in
the sense of XML) is not a plain text. But this leads to the wider
questions ...
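
[A minimal illustration, not from the original post: an XML parser treats the
"<" that opens a tag as a markup delimiter, while a literal "<" must be escaped
as the entity &lt; to count as character data. A short Python sketch:]

    import xml.etree.ElementTree as ET

    # The "<" opening <p> is read as a delimiter (extraordinarily);
    # the escaped &lt; is read "ordinarily", i.e. as character data.
    element = ET.fromstring("<p>3 &lt; 5</p>")
    print(element.tag)   # p      -- markup
    print(element.text)  # 3 < 5  -- character data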

> As "Patrick" said, text comprises both markup data and character data.=20
Willard's anecdote on a fellow French-Canadian's reaction to the=20
elimination of "accented characters" (including, I assume, c-cedilla and=20
such) carries this point forward. Capital letters *are* markup. So is=20
punctuation. The history of typography has a lot to say about this and what=
=20
we're witnessing now is another major step in the history of=20
"writing." Parallels abound in music transcription and notation. Current=20
computer technologies (a major theme on this list) do encourage us to think=
=20
of text in new ways. Not to *limit* text. To expand it.
In fact, thinking of simplistic compression algorithms may help in the=20
discussion. What *is* the minimal information requirements for text? As we=
=20
all know, text is extremely redundant in terms of pure information=20
processing. Thinking of "Plain Text" (ASCII or other encodings) might work:=
=20
we need "character data" and "markup data." (There doesn't seem to be a=20
significant difference between "markup data" and metadata.) We use=20
different methods to separate this type of data from "character data" but=20
we still use parts of the same character set. This practice is the basis of=
=20
some technical issues, certainly, but we can think of this in the abstract.
[...]
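
[An illustrative aside, not from the original post: the redundancy claim can be
made concrete with a general-purpose compressor such as zlib; the repetition in
the sample below is exaggerated on purpose.]

    import zlib

    # Deliberately repetitive sample text, to exaggerate the redundancy
    # that ordinary prose has to a lesser degree.
    sample = ("Text is extremely redundant in terms of pure information "
              "processing. ") * 20
    packed = zlib.compress(sample.encode("utf-8"))
    print(len(sample), "characters ->", len(packed), "compressed bytes")
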
Character data can itself be reduced, and we certainly all have note-taking
practices (fewer vowels, for one thing) which considerably reduce the
number of characters we need to type. While some may frown upon these
practices when used in more formal communication, they certainly have an
impact on the way *people* think of text. Current computer users probably
write on average much more than scribes of old. The "intrinsic quality" of
their writing isn't the issue, nor is the "intrinsic quality" of what they
read. People *do* read and write. Our goal could be to understand how they
do it. Instructors in composition are now acknowledging these "new methods"
of writing and may more easily help their students think of different sets
of rules for different forms of writing.

* I mostly agree with these points. Text can be represented by a mixture of
character data and markup data. The problem is: which markup do you need to
represent a (given) text properly? Some people say that the distinction
between upper and lower case is markup, some call the punctuation system
markup, some even call spaces markup [and just today Dino Buzzetti called
every diacritical sign markup]. If you strip off all these kinds of markup,
you obviously get a plain text which is nearly unreadable (in the
communicative sense and function of text). And you don't get a reasonable
theory of text. A reasonable theory of text would have to define to what
extent (and which kind of) markup has to be used to represent (a certain,
already existing) text.
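
[A small sketch, not part of Sahle's post, of the point just made: treating
capitalisation, punctuation and spaces as markup and stripping them off leaves
character data that is technically "plain" but nearly unreadable.]

    import re

    sample = "Text, as we know it, is more than character data."
    # Strip the features some would call markup: case, punctuation, spaces.
    stripped = re.sub(r"[^a-z]", "", sample.lower())
    print(stripped)  # textasweknowitismorethancharacterdata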

> If one fully separates written form and content, there could be new ways
to write. LaTeX is a clear example of the separation of form and content.
[...] LaTeX has kept challenging the idea of "word processors" and WYSIWYG.
XML formats for textual content are also on that side of the equation. But
the revolution hasn't happened yet. It's coming, though. Whether it's
through a specific format (possibly XML-based like DocBook, OPML, RSS,
TEI...) or through changes in the way people *write* is uncertain at this
point.

* The concept of separation of form and content may work for "writing"
(where you are free to base your writing on this distinction). But what
about "representing" text from historical media (like printed books)? Form
"is" content in the sense that it conveys meaning. But that's nothing more
than a truism, because it would lead to the unrealistic demand to recode
every visual (and material) aspect of a given document. Obviously we still
need a sound theory of text which draws a line between those aspects of
text which have to be recoded to guarantee the identity of the text and
those aspects which can be ignored (I am still trying to work out that theory).
This is somewhat complicated because it has to include factors like the
historical conditions of text production (including social, discursive and
technical circumstances), the kind (genre) of text to be represented and
the (intended) audience of a text representation.

*[I now start to comment on a mail by Michael Hart, Sat, 29 Jan - Re:
18.536 plain text]

   * From a text-theoretical point of view Project Gutenberg seems to be
rather easy to describe:
The texts in the project are the result of the application of a perception
filter. The project is not about "representing the text" but about
"representing the text as it can be seen through the filtering glass of the
ASCII code". The texts in the project are "performances" of the originals
according to a certain theory of text: "Text is what can be expressed by
the ASCII code; everything else is not essential to the text and can be
ignored without damage to the identity of the text".
Maybe it simply cannot be better said than in the words of Michael Hart:

> It's not that plain text offers EVERYTHING. . .it's just that a plain
text file offers over 99% of what most authors wrote in a
library of books currently freely available for download.
How much more effort is it worth to get from 99% to 99.5% ???
For some, it's worth the moon and the stars, while for others a
plain text eBook provides all they ever wanted.

* There are books which can be represented by using nothing but the ASCII
code, and you (or a certain reader) still have the impression of reading the
same book (the same text) as in the original document. But there are others
too. Think of Laurence Sterne's "Tristram Shandy" (with its Greek passages,
black pages, "filling characters", textual ornaments, drawings, font
changes etc.). Some readers will still be satisfied with an ASCII
representation (depending on their theory of text) which ignores all of
these features (like the ASCII Tristram Shandy in Project Gutenberg, text
no. 1079). But others (like me) would say: the text in Project Gutenberg
misses everything which is crucial for this text (work). If I read the
Project Gutenberg version of this text I will miss everything which really
constitutes the work (most of it - but not all - is in the HTML version of
Project Gutenberg 1079, which really isn't bad).
This is not a criticism of Project Gutenberg! There are readings (ways of
reading) and analytical attitudes within which the texts still "work" as
substitutes for the original documents. But in a more global theory of text
we would have to say that these electronic plain texts are not the "text"
itself (whatever that is!) but merely "extracts" according to a certain
filtering tool (or perception): the ASCII code.
Back to Laurence Sterne: as author he used a variety of media/communication
channels to express his thoughts as "text". Project Gutenberg('s
ASCII version) is (and has to be) blind to most of them. It's not blind to
those channels which are supported by the ASCII code: alphabetic characters
and some other characters which some people would call markup (more
precisely: some of the textual signs which some people call markup are
supported by the ASCII code because they can be represented (often:
simulated) as characters in the sense of the ASCII code). But Project
Gutenberg ignores all the other textual information channels which were
consciously used by the author to express his "text".
The idea of "text identity" (in the electronic representation of text) can
be based on (1) authorial intention, (2) readers' reception or (3) a wider
theory of text which includes not only these two positions but also other
aspects of text and textuality. Project Gutenberg and plain text fail the
first and the third approach but maybe partially fulfill the second (as
Michael Hart says, with 99% of the readers) - depending on the text theory
of the reader.
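
[An illustrative aside, not from Sahle's post: the "filtering glass of the
ASCII code" can be made quite literal. Encoding a string to ASCII with errors
ignored simply drops every channel the code does not support; the Greek below
is an arbitrary sample phrase, not a quotation from Sterne.]

    # An arbitrary sample mixing Greek, an accented letter and plain ASCII.
    sample = "Ἐγὼ δὲ λέγω -- a Greek passage, with an é for good measure"
    filtered = sample.encode("ascii", errors="ignore").decode("ascii")
    print(filtered)  # prints "   -- a Greek passage, with an  for good measure"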

Patrick Sahle

University of Cologne
Humanities Computing (Historisch-Kulturwissenschaftliche
Informationsverarbeitung)
Albertus-Magnus-Platz
50923 Köln

--[3]------------------------------------------------------------------
         Date: Thu, 03 Feb 2005 07:44:01 +0000
         From: Wendell Piez <wapiez_at_mulberrytech.com>
         Subject: Re: 18.555 thoughts on writing (plain text)

At 01:33 AM 2/2/2005, Dino Buzzetti wrote:
> >Contrary to punctuation and capital letters, accented characters are *not*
> >markup, at least not structural markup (they may work for morphological and
> >syntactic markup, though). They simply *work* as other elements in the
> >character set and do not typically represent meta-data.
>
>I am not so sure. Take this:
>
> Le texte récite...
> Le texte récité...
>
>Isn't there a structural difference here? Well, it depends on how
>you define structure, but I would take it to comprise also logical
>structure, or form. In my opinion every diacritical sign can be
>thought of as markup.

I agree with this, but mainly because it points to how the text/markup
distinction is itself questionable, drawing a line that can be usefully
challenged.
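
[A side note, not part of the exchange: at the level of Unicode the "diacritic
as markup" reading can be taken quite literally; in decomposed (NFD) form the
accent is a separate combining mark applied to a base letter.]

    import unicodedata

    for ch in unicodedata.normalize("NFD", "récité"):
        print("U+%04X" % ord(ch), unicodedata.name(ch))
    # The accented letters come out as LATIN SMALL LETTER E followed by
    # COMBINING ACUTE ACCENT: base character data plus a mark applied to it.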

Is not markup also text? Isn't text, in a sense, also markup? Which came
first? When a hunter in the wilderness reads the tracks of his quarry, is
he reading a text, or markup? Isn't a sign itself, in the structuralist
view, a form of markup, a representation "in line" of something "out of line"?

This question isn't meant to try to dissolve the distinction, merely to
note that it's relative. As Prof Buzzetti says, "it depends on how you
define structure". What has form at one layer may be formless in the view
of the next one up or down. A digital processor may have to work at several
of these levels: there is the bitstream, which may include "markup" such as
checksums or compression encodings. There is the stream of characters
derived from those bits, which from one point of view is merely a sequence
of alphanumerics, but from another (a parser's) can be divided into
unreserved characters and markup delimiters. There is the sequence of
"tags" and "text" (note that a binary encoding may skip this stage, which
is one reason we don't like most WYSIWYG). There is the abstract model we
derive from this sequence, be that a series of parser events, or a
higher-order model such as "the tree" in XML systems. There may be yet
other higher-order models that we derive from such bare abstractions, once
we have provided semantics to tags such as <head> and <body>.
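
[A toy sketch, not from the original mail, of the layers just described: the
same few bytes viewed as a byte stream, as a character sequence, and finally as
an abstract tree once the parser has separated delimiters from character data.
The <head> element here is only an illustrative tag, not a claim about any
particular schema.]

    import xml.etree.ElementTree as ET

    raw = b"<head>Tristram Shandy</head>"

    print(list(raw)[:6])              # [60, 104, 101, 97, 100, 62] -- the byte layer
    chars = raw.decode("ascii")       # the character layer: "<" is just one more character
    tree = ET.fromstring(chars)       # the parser separates delimiters from character data
    print(tree.tag, "->", tree.text)  # head -> Tristram Shandy -- the abstract model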

Study, for example, the manuscripts of a working playwright, and it becomes
hard to say with any certainty what's text and what's markup. Revisions
scrawled in the margins? Stage directions, blocking notes, lists of props?
(Like property lists appearing in comments in a programmer's code.) The
text/markup distinction is useful -- even invaluable, if one wants to
construct a hierarchy from data to information to knowledge to whatever
comes next -- each layer of the stack providing the next one down with the
context it needs to be more than it is in itself -- but it can also appear
to be fairly arbitrary. This doesn't mean that mixing up our layers is a
good idea. It just means that considering what's text and what's markup may
present a kind of uncertainty principle: we decide, for the purposes of the
layer we are working at, what's text and what's text-that-is-more-than-text
(more because it is less), like a diacritical mark. All metadata is someone
else's data.

Best regards,
Wendell Piez

======================================================================
Wendell Piez mailto:wapiez_at_mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
    Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================