Triage of OCR Results Using 'Confidence' Scores

Prateek Sarkar, Henry S. Baird, John Henderson

Abstract

We describe a technique for modeling the character recognition accuracy of an OCR system, treated as a "black box", on a particular page of printed text. The model examines only the system's output: the top-choice character classifications and, for each, a "confidence score" such as many commercial OCR systems supply. In our experience, latent conditional independence (LCI) models perform better on this task than naive uniform thresholding methods. Given a sufficiently large and representative dataset of errorful OCR output and manually "proofed" (correct) text, we can automatically infer LCI models that exhibit a useful degree of reliability. A collaboration between a PARC research group and a Xerox legacy conversion service bureau has demonstrated that such models can significantly improve the productivity of human proofing staff by "triaging" pages, that is, selecting for bypass of manual inspection those pages whose estimated OCR accuracy exceeds a threshold. The threshold is chosen to ensure that a customer-specified per-page accuracy target will be met with sufficient confidence. We report experimental results on over 1400 pages. Our triage software tools are running in production and will be applied to more than 5 million pages of multi-lingual text.
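
To make the triage decision concrete, here is a minimal Python sketch of the naive uniform-thresholding baseline mentioned in the abstract, not the LCI model, whose details are not given on this page. It assumes that confidence scores lie in [0, 1] and that a page's accuracy can be crudely estimated as the fraction of characters whose score meets a single cutoff. The function names, parameters, and that estimator are all hypothetical illustrations.

```python
from typing import Dict, List


def estimate_page_accuracy(confidences: List[float], char_threshold: float) -> float:
    """Naive estimate (an assumption, not the paper's LCI model): the fraction
    of characters whose confidence score meets a single uniform cutoff."""
    if not confidences:
        # An empty page gives no evidence of accuracy, so never auto-accept it.
        return 0.0
    trusted = sum(1 for c in confidences if c >= char_threshold)
    return trusted / len(confidences)


def triage(pages: Dict[str, List[float]],
           char_threshold: float,
           page_threshold: float) -> List[str]:
    """Return the ids of pages whose estimated accuracy meets page_threshold,
    i.e. the pages selected to bypass manual inspection."""
    return [
        page_id
        for page_id, confidences in pages.items()
        if estimate_page_accuracy(confidences, char_threshold) >= page_threshold
    ]
```

In this sketch, `page_threshold` stands in for the threshold derived from the customer-specified per-page accuracy target. How that threshold is chosen so that the target is met with sufficient confidence is the subject of the paper and is not modeled here.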

BibTeX entry

@InProceedings{Sarkar2002:triage
,author = {Prateek Sarkar and Henry S. Baird and John Henderson}
,title = {Triage of OCR Results Using 'Confidence' Scores}
,booktitle = {[accepted for publication in] Proceedings of SPIE/IS&T 2002 Document Recognition & Retrieval IX Conf. (DR&R IX)}
,year = 2002
,address = {San Jose, California, USA}
,month = {January 20-25}
,http = {http://parcweb.parc.xerox.com/istl/members/psarkar/PUBLICATIONS/SPIE2002/download.html}
,psinternal = {papers/ps/Sarkar2002_triage.ps}
,pdfinternal = {papers/pdf/Sarkar2002_triage.pdf}
}
Prateek Sarkar
Last modified: Wed Nov 7 17:08:20 PST 2001