Article
Bounding the probability of error for high precision recognition
UMass Amherst Technical Report (2009)
  • Andrew Kae
  • Gary B. Huang
  • Erik G Learned-Miller, University of Massachusetts - Amherst
Abstract

We consider models for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low rates of recall. If some variables can be identified with near certainty, then they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While many current OCR systems produce measures of confidence for the identity of each letter or word, thresholding these confidence values, even at very high values, still produces some errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect under very general assumptions, using an approximate worst-case analysis. As a result, the parameters of the model are nearly irrelevant, and we are able to identify a subset of words, even in noisy documents, of which we are highly confident. On our set of 10 documents, we are able to identify about 6% of the words on average without making a single error. This ability to produce word lists with very high precision allows us to use a family of models which depends upon such clean word lists.
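
The trade-off the abstract describes, high precision at low recall, can be made concrete with a small sketch. The example below is not the report's bounding technique; it only shows the confidence-thresholding baseline the abstract contrasts against, and how precision and recall of a candidate "clean set" would be measured. The data, the function name `clean_set_stats`, and the threshold values are all hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# Hypothetical OCR output: (recognized word, confidence, is_correct).
# These values are invented for illustration; they are not from the report.
SAMPLE: List[Tuple[str, float, bool]] = [
    ("the", 0.99, True),
    ("recognition", 0.97, True),
    ("modcl", 0.96, False),   # a confident but wrong word
    ("error", 0.90, True),
    ("bonnd", 0.60, False),
    ("word", 0.55, True),
]


def clean_set_stats(
    results: List[Tuple[str, float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) of the set of words at or above threshold."""
    selected = [r for r in results if r[1] >= threshold]
    total_correct = sum(1 for r in results if r[2])
    hits = sum(1 for r in selected if r[2])
    # An empty selection makes no errors, so treat its precision as 1.0.
    precision = hits / len(selected) if selected else 1.0
    recall = hits / total_correct if total_correct else 0.0
    return precision, recall


if __name__ == "__main__":
    for t in (0.95, 0.98):
        p, r = clean_set_stats(SAMPLE, t)
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy data a threshold of 0.95 still admits one wrong word, which mirrors the abstract's point that thresholding at high confidence can still produce errors; raising the threshold to 0.98 removes the error but keeps only a quarter of the correct words.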

Publication Date
2009
Citation Information
Andrew Kae, Gary B. Huang and Erik G Learned-Miller. "Bounding the probability of error for high precision recognition" UMass Amherst Technical Report (2009)
Available at: http://works.bepress.com/andrew_kae/2/