-- -MarkV.

--------------------

From: Tim Bray <uunet!watsol.waterloo.edu!tbray>
Subject: Suggested Database Standards

Tim Bray at the New OED Project here; have enjoyed the discussions about
format standardization. Some feedback:

>Hmmm... I would suggest that text be the primary standard.

There is no other sane choice, for reasons of portability.

>The next
>question of course, is what to use for graphic information? I hesitate
>to suggest any of the present standards, although I suspect that
>at present PostScript may be the best choice, on the grounds that there
>are a very large number of installed PostScript devices *and conversion
>utilities*.

Yes, and another great virtue of PostScript is that any text incorporated
in a graphic can be stored *as text* and then is available for searching
in the normal course of affairs. Furthermore, the conversion utilities
include excellent programs such as Adobe Illustrator for going from
scanned graphics to PostScript as well as the other way around.

The fellow who was arguing for a standard format like GIF or TIFF also had
some good points. But those formats in general handle *text* poorly.

>1. Storage of information.
>2. Searching of information.
>3. Display of information.

The first is (relatively) easy; put everything in a flat byte-stream style
with descriptive markup. The second and third are both very hard to do
well, and the problem is that everybody who has any kind of a solution
wants to make lots of money selling it. For example, us: we have a pretty
good toolkit and are commercializing it with great success.

Having each database include its own search capability isn't very helpful
because nobody can tolerate having to deal with N different search
systems. I'd say just concentrate on getting #1 right, and punt the other
two. That way, if people have nothing, they can at least use grep. If
they have something like a NeXT, they can accomplish quite a bit with the
Digital Librarian software.
If they have some big-league high-powered search/display software like we
and some other people sell, they can really make good use of the database.
I think OBI has to concentrate on making the stuff available in as generic
and flexible a fashion as possible, and leave the how-to-use problem to
the consumer.

>Tying ourselves to a mark-up language sounds like a bad idea;
>better to have the individual databases decide

I suggest that while markup *syntax* is unimportant, text which is stored
with descriptive as opposed to typographical markup will be much, much
more useful for our purposes. Descriptive markup doesn't mean you have to
go all the way to full SGML. For example, if you're typing in some text,
it's much more useful in general to put in explicit markers for the
*structural* components than to waste your time trying to mimic the exact
typography of the document. *BUT*, most of the stuff that's going to be
available is probably going to be available only with typographical markup
of some sort, so we'll just have to live with that to a certain extent.

>(I.E., server cycles
>are cheap, client cycles are expensive, and Network bandwidth is like
>doritos - 'Go ahead, use it, we'll make more').

And memory/disk is real, real cheap. Buying flexibility with somewhat
less compact markup is always a win. No binary magic cookies of any type
in the text!

>Are there databases for nontextual records?

There aren't even good databases for *textual* material available
off-the-shelf at this time!

>It would be nice to have something which has cross-referencing (in the
>hypertext manner, for instance) inherent. This capability could
>obviously be used for query capability as well (pretty easy to collect
>and index links).

Cross referencing is a Good Thing, but unfortunately I don't believe there
is a single problem here. It seems that the cross-reference problem is
heavily tied up with the semantics and knowledge base of each different
document.
This is one of the reasons the hypertext people have trouble dealing with
large existing textbases. On the other hand, a good fast full-text search
system does most of what you need.

> I was trying to sell people (for a while) on the idea of
>developing a standard marked-up-text protocol (like SGML but without
>the COBOL goo) and using that to extend into a communications
>protocol, query language, and display driver protocol. I'm talking
>low-level, though. In other words, someone might make a query of a
>database server:
>
><QUERY>RELEVANCE TO "FOO" .GT. 90%</QUERY>
><QUERY>RELEVANCE TO "FOO" IN <H1> .GT. 90%</QUERY>
>
>(or something like that)
>And the server might send back an answer coded up in some simple
>mark-up that could then be displayed in a device-dependent manner
>on the user's display.

Absolutely. One of the fundamental principles we've been applying with
great success here on the OED project is:

1. All information, including software control files etc, must be stored
   as tagged text unless you can prove it's impossible, and
2. All software modules must communicate with each other using streams of
   tagged text (exactly as described here) unless you can prove it's
   impossible.

The benefits, in terms of network independence and max flexibility, are
huge.

But then the same person goes on to say:

>...it is NOT going to work very well
>if the base form of everything is ASCII text. Some kind of higher-level
>language for representing structure/comments/format will be needed.

Wrong. Yes, you want to represent the structure and so on, but the right
way to do it is with embedded descriptive markup in the text. You're
right that 7-bit ASCII ain't gonna do the job though; there are
interesting non-English languages.

Cheers, Tim Bray
New Oxford English Dictionary Project, U of Waterloo
and Open Text Systems, Inc.

--------------------

From: uunet!flash.bellcore.com!amsler (Robert A Amsler)
Subject: Re: `pure' text

And what is `pure text' format?
Having spent many days now trying to convert a key-punched text from the
1960s into some semblance of contemporary keyboarding practice, I am
curious where this guide on how to keyboard `pure text' exists.

For example, the text in question contained footnotes. The keyboarders
didn't know what to do with them, so they put them in the text at exactly
the point on the physical page where the footnote started, far removed
from the point of citation, in mid-sentence at the bottom of the
appropriate column. There was no mark in the text nor on the footnote to
note that these were footnotes at all.

Headings are likewise just typed in as text. No conventions on line
breaks, blank lines, etc. were followed. Determining that these were
headings is thus not at all mechanical.

Then there is the punctuation. " stood for opening and closing quotes;
but to help in subsequent analysis all punctuation was typed in separated
by blanks from the surrounding words. For commas, periods, etc this is
not a problem--but for quotes it is impossible to tell whether they are
attached to the preceding or following text. This `pure' text also
deleted all --'s; a small loss, but without any indication these
parenthetical comments lose their distinguishability.

Then there are symbols and foreign letters. C cedillas, acute and grave
accents on foreign words. What is the `pure text' version of these? Does
one translate u umlaut into u", ue, {u"}, @Ovp{"}u or just u? How should
one encode the degree symbol, as in 32 degrees? Perhaps everything should
be spelled out rather than special symbols used? They did this with %,
but not with degrees, nor fractions.

So... where is the keyboarder's guide to `pure text'?

Perhaps `pure text' is what an OCR system would produce from a
document.... That is of course just another big problem. How does the
OCR system scan columns? What about thin lines between different stories
on a newspaper page? Captions for photos?
I guess I just don't know what `pure text' looks like. Does `pure text'
mean we translate % into `per cent'?

I think I'd prefer anything BUT `pure' text. I'd prefer some type of
well-documented format, with all the conventions noted for anything that
was outside ASCII, and with stated conventions for super/sub scripts,
fractions, formulae, headings in differing point sizes (esp. where the
point size indicated the level of heading), listing of special symbols,
notes on footnotes, and the dozens of other things that I haven't
specified.

Please no more `pure' text. It is too non-standard.
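
As an illustration of the "well-documented format" asked for above, here is
a minimal sketch of a convention table that travels with the text, so that
every character outside ASCII has exactly one stated spelling. Only the
{u"} spelling comes from the message; the other codes, the table name
CONVENTIONS, and the functions decode and encode are hypothetical choices
made for this example, not an existing standard.

```python
import re

# Hypothetical convention table. Only '{u"}' appears in the message above;
# every other code here is an assumption made for illustration.
CONVENTIONS = {
    '{u"}': "\u00fc",   # u umlaut
    "{c,}": "\u00e7",   # c cedilla
    "{e'}": "\u00e9",   # e acute
    "{e`}": "\u00e8",   # e grave
    "{deg}": "\u00b0",  # degree sign
    "{1/2}": "\u00bd",  # fraction one half
}

# A code is anything between a pair of braces with no nested braces.
_CODE = re.compile(r"\{[^{}]*\}")


def decode(text):
    """Replace each braced code with its character; reject unknown codes."""
    def lookup(match):
        code = match.group(0)
        if code not in CONVENTIONS:
            raise ValueError(f"undocumented code: {code}")
        return CONVENTIONS[code]
    return _CODE.sub(lookup, text)


def encode(text):
    """Spell out documented characters; reject anything without a convention."""
    reverse = {char: code for code, char in CONVENTIONS.items()}
    pieces = []
    for ch in text:
        if ch in reverse:
            pieces.append(reverse[ch])
        elif ch in "{}":
            # Literal braces would be ambiguous in this simple scheme.
            raise ValueError("literal braces are not covered by the table")
        elif ord(ch) < 128:
            pieces.append(ch)
        else:
            raise ValueError(f"no documented convention for {ch!r}")
    return "".join(pieces)
```

With this table, decode('M{u"}nchen, 32{deg}') returns "München, 32°", and
encode reverses it. The point of the sketch is that anything not listed in
the table raises an error instead of being silently guessed at, which is
the failure the keyboarded text described above suffered from.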