Summarization of Imaged Documents without OCR

Francine R. Chen and Dan S. Bloomberg

A system is presented for creating a summary indicating the contents of an imaged document. The summary is composed of selected regions extracted from the imaged document. The regions may include sentences, keyphrases, headings, and figures. The extracts are identified without the use of optical character recognition. The imaged document is first processed to identify the word bounding boxes, the reading order of words, and the locations of sentence and paragraph boundaries in the text. The word bounding boxes are grouped into equivalence classes to mimic the terms in a text document. Equivalence classes representing content words are identified, and keyphrases are identified from the set of content words. Summary sentences are selected using a statistically based classifier applied to a set of discrete sentence features. Sentence selection is evaluated against a set of abstracts created by a professional abstracting company.
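
The grouping of word images into equivalence classes can be illustrated with a minimal sketch. The abstract does not say how word images are compared, so everything here is an assumption made for illustration: the `Bitmap` representation, the pixel-mismatch measure, the `max_mismatch` threshold, and the greedy assignment to the first matching class are hypothetical and are not the authors' method.

```python
from typing import List, Sequence, Tuple

# Hypothetical word image: rows of '#' (ink) and '.' (background),
# standing in for the pixels cropped from one word bounding box.
Bitmap = Tuple[str, ...]


def mismatch(a: Bitmap, b: Bitmap) -> float:
    """Fraction of differing pixels; 1.0 if the two images differ in size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def group_equivalence_classes(
    words: Sequence[Bitmap], max_mismatch: float = 0.15
) -> List[List[int]]:
    """Greedily assign each word image to the first class whose
    representative (its first member) is within max_mismatch.

    Returns lists of word indices, one list per equivalence class.
    The threshold is an assumed value, not one taken from the paper.
    """
    classes: List[List[int]] = []
    for i, word in enumerate(words):
        for members in classes:
            if mismatch(words[members[0]], word) <= max_mismatch:
                members.append(i)
                break
        else:
            # No existing class is close enough: start a new one.
            classes.append([i])
    return classes


if __name__ == "__main__":
    a = ("##.#", "#.##", "##.#")
    a_noisy = ("##.#", "#.##", "##..")  # one pixel differs from a
    b = ("....", "####", "....")
    print(group_equivalence_classes([a, a_noisy, b, a]))
```

Each resulting class plays the role of a term in a text document, so class sizes can serve as term frequencies without any characters being recognized.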