[tei-council] Fwd: Re: heuristics for document structure
Kevin Hawkins
kevin.s.hawkins at ultraslavonic.info
Fri Apr 22 11:05:10 EDT 2011
For those interested in "white space XML" ...
-------- Original Message --------
Subject: Re: heuristics for document structure
Date: Fri, 22 Apr 2011 10:52:42 -0400
From: Burns, John <John.Burns at proquest.com>
To: Kevin Hawkins <kevin.s.hawkins at ultraslavonic.info>
Good Morning Kevin,
My goodness -- where to start? It depends on what you want to do ;-)
Not a lot of end-user tools are openly available -- but a lot of
systems exist either as proofs of concept or behind closed doors doing
things like legal discovery, news extraction &c.
A good place to start would be the proceedings of ICDAR, the work of
Thomas Breuel et al. at Kaiserslautern/DFKI, and the LAMP folks at
Maryland (Language and Media Processing Group).
Kris West did some lovely work on the automatic extraction of items from
auction catalogs, and the system is freely available since it was funded
by the Mellon. Clare Llewellyn (now at Edinburgh;
<clarellewellyn at yahoo.com>) would be
a good place to start, since she is a good friend of Kris and managed
the project in a previous life. The system is trained to do catalogs,
but could, in principle, do any corpus that is moderately consistent.
The SEASR project also has tools to do similar things, and would
probably be the best academic partner.
There is a lot of work in Europe. Off the top of my head: NaCTeM in
the UK, Claire Grover et al. at Edinburgh, plus all the FP7 funding.
Paul Watry at Liverpool might be worth a ping since he has an
encyclopedic knowledge of who is doing what.
I'll poke around and see what I can find that is openly available, but
it would be worth thinking through what level of refinement you want --
entity labeling vs. structural labeling vs. document labeling (topic,
sentiment, tone, complexity).
-john
John Burns
Director, Platform Research
ProQuest
501 North 34th St # 400
Seattle, WA 98103
Cell: (206) 450 0329 email: john.burns at proquest.com
On Apr 21, 2011, at 6:53 PM, Kevin Hawkins wrote:
Hi John,
[. . .]
I'm writing to see if you can point me to some literature on heuristics
for deducing structure in documents. I remember in Belfast you said
that Decapod uses some of the techniques developed at places like Xerox
or HP and now used by Amazon to convert PDFs to Kindle files. There's
lots of interest in doing this sort of thing in the TEI community, and
people keep looking for ways of doing this on semi- and unstructured
documents.
Thanks,
Kevin