"Bounding the probability of error for high precision optical character recognition" by Gary B. Huang

Selected Works of Erik G Learned-Miller

Article

Bounding the probability of error for high precision optical character recognition

Journal of Machine Learning Research (2012)

Gary B. Huang
Andrew Kae
Carl Doersch
Erik G Learned-Miller, University of Massachusetts - Amherst

Abstract

We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors.

We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.

Disciplines

Computer Sciences

Publication Date

February 12, 2012

Citation Information

Gary B. Huang, Andrew Kae, Carl Doersch and Erik G Learned-Miller. "Bounding the probability of error for high precision optical character recognition" Journal of Machine Learning Research Vol. 13 (2012)
Available at: http://works.bepress.com/erik_learned_miller/47/