UbiText

 

UbiText is a new process for taking text from page images and turning it into reflowable content that can be shown on small-screen devices, in Web browsers, or on printed pages.

The core idea of UbiText is to segment the text in page images (scanned pages, TIFF files, PDF or PostScript documents, etc.) into individual word images, along with non-text images such as illustrations or signatures, and then string these images together in reading order in a more useful format, such as an HTML or XML file. The resulting file contains no text, just a long sequence of embedded images. In this way, ebooks suitable for use on PDAs can be created without OCR errors and with the original typography preserved. The same reflowable version of the document can also be retargeted to Web browsers or printers.
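
The two steps described above (cutting a text line into word images, then emitting a file that is only a sequence of images) can be illustrated with a minimal Python sketch. It is not the UbiText implementation: the page does not say how words are segmented, so the blank-column gap rule in `split_columns`, the `WordImage` record, the `min_gap` threshold, and the file names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple


@dataclass
class WordImage:
    """One cropped image in reading order (hypothetical record, not from UbiText)."""
    src: str                        # path or URL of the cropped image file
    width: int                      # pixel width of the crop
    height: int                     # pixel height of the crop
    starts_paragraph: bool = False  # True if this image begins a new paragraph


def split_columns(ink: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Split one text line into word spans using blank-column gaps.

    `ink` holds the number of dark pixels in each pixel column of the line.
    A run of at least `min_gap` blank columns separates two words; shorter
    runs are treated as gaps between letters of the same word (an assumed
    heuristic). Returns half-open (start, end) column ranges, left to right.
    """
    spans = []
    start = None      # first column of the word currently being built
    last_ink = None   # most recent column that contained ink
    for x, count in enumerate(ink):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            spans.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans


def to_html(words: List[WordImage]) -> str:
    """Emit an HTML page whose body contains only inline images, no text."""
    parts = ["<!DOCTYPE html>", "<html><body>"]
    open_p = False
    for w in words:
        if w.starts_paragraph or not open_p:
            if open_p:
                parts.append("</p>")
            parts.append("<p>")
            open_p = True
        parts.append('<img src="%s" width="%d" height="%d" alt="">'
                     % (escape(w.src, quote=True), w.width, w.height))
    if open_p:
        parts.append("</p>")
    parts.append("</body></html>")
    # The newline between <img> tags acts as inter-word whitespace, so the
    # viewer is free to break lines between word images when it reflows.
    return "\n".join(parts)


if __name__ == "__main__":
    print(split_columns([0, 1, 1, 0, 1, 0, 0, 0, 2, 2, 0], min_gap=2))
    page = [WordImage("w0.png", 40, 12, True), WordImage("w1.png", 22, 12)]
    print(to_html(page))
```

Because each word is an inline image separated from its neighbours by whitespace, the viewer does the line breaking at display time, which is what lets the same file adapt to a PDA screen, a browser window, or a printed page.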

Figures: an original page image; a UbiText-reflowed view of that page


Papers:
Paper to PDA (from ICPR 2002)


Commercial contact: jschen@parc.com


updated: $Date: 2002/09/17 02:45:17 $ GMT by $Author: janssen $