[tei-council] Fwd: Re: heuristics for document structure

Fri Apr 22 11:05:10 EDT 2011

For those interested in "white space XML" ...

-------- Original Message --------
Subject: 	Re: heuristics for document structure
Date: 	Fri, 22 Apr 2011 10:52:42 -0400
From: 	Burns, John <John.Burns at proquest.com>
To: 	Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>

Good Morning Kevin,

My goodness -- where to start?  It depends on what you want to do ;-)
Not a lot of end-user tools are openly available, but a lot of
systems exist either as proofs of concept or behind closed doors doing
things like legal discovery, news extraction, &c.

A good place to start would be the proceedings of ICDAR, the work of
Thomas Breuel et al. at Kaiserslautern/DFKI, and the LAMP folks at
Maryland (Language and Media Processing Group).

Kris West did some lovely work on the automatic extraction of items from
auction catalogs, and the system is freely available since it was funded
by the Mellon.  Clare Llewellyn (now at Edinburgh;
<clarellewellyn at yahoo.com>) would be a good place to start, since
she is a good friend of Kris and managed the project in a previous life.
The system is trained to do catalogs, but could, in principle, do any
corpus that is moderately consistent.  The SEASR project also has tools
to do similar things, and would probably be the best academic partner.

There is a lot of work in Europe.  Off the top of my head: NaCTeM in
the UK, Claire Grover et al. at Edinburgh, plus all the FP7 funding.
Paul Watry at Liverpool might be worth a ping since he has an
encyclopedic knowledge of who is doing what.

I'll poke around and see what I can find that is openly available, but
it would be worth thinking through what level of refinement you want:
entity labeling vs. structural labeling vs. document labeling (topic,
sentiment, tone, complexity).

-john

John Burns
Director, Platform Research
ProQuest
501 North 34th St # 400
Seattle, WA 98103
Cell: (206) 450 0329   email: john.burns at proquest.com

On Apr 21, 2011, at 6:53 PM, Kevin Hawkins wrote:

Hi John,

[. . .]

I'm writing to see if you can point me to some literature on heuristics
for deducing structure in documents.  I remember in Belfast you said
that Decapod uses some of the techniques developed at places like Xerox
or HP and now used by Amazon to convert PDFs to Kindle files.  There's
lots of interest in doing this sort of thing in the TEI community, and
people keep looking for ways of doing this on semi- and unstructured
documents.

Thanks,

Kevin