We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors.
We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.
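The selection step can be illustrated with a minimal sketch. This is not the paper's analysis: the union bound and the threshold value below are illustrative stand-ins for the approximate worst-case bound, and the per-character error bounds are assumed inputs from an initial OCR pass.

```python
def word_error_bound(char_error_bounds):
    """Crude upper bound via the union bound:
    P(word is wrong) <= sum of per-character error bounds."""
    return min(1.0, sum(char_error_bounds))

def clean_set(candidates, threshold=0.01):
    """Keep only words whose worst-case error bound falls below a
    tight threshold, trading recall for very high precision."""
    return [word for word, bounds in candidates
            if word_error_bound(bounds) < threshold]

# Hypothetical (word, per-character error bound) pairs.
candidates = [
    ("the", [0.001, 0.002, 0.001]),              # bound 0.004: kept
    ("quick", [0.05, 0.01, 0.02, 0.01, 0.03]),   # bound 0.12: rejected
]
print(clean_set(candidates))  # ['the']
```

The point of thresholding a bound rather than a posterior is that any word passing the test carries a guarantee on its error probability, which is what makes the clean set safe to use as training data.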
Available at: http://works.bepress.com/andrew_kae/1/