Unpublished Paper
Nearest Neighbor based Collection OCR
(2010)
  • Pramod Sankar K.
  • C. V. Jawahar
  • R. Manmatha, University of Massachusetts - Amherst
Abstract

Conventional optical character recognition (OCR) systems operate on individual characters and words, and do not normally exploit document or collection context. We describe a Collection OCR that takes advantage of the fact that multiple examples of the same word (often in the same font) may occur in a document or collection. In the first stage, an OCR or a reCAPTCHA-like process generates a partial set of recognized words. In the second stage, a nearest neighbor algorithm compares the remaining word-images to those already recognized and propagates labels from the nearest neighbors. We show that an approximate fast nearest neighbor algorithm based on Hierarchical K-Means (HKM) does this accurately and efficiently. We also show that profile-based features perform much better than SIFT and Pyramid Histogram of Gradient (PHOG) features. We believe that this is because profile features are more robust to word degradations, which are common in our documents. The approach is applied to a collection of books in Telugu, a language for which no commercial OCR exists. On a selection of 33 Telugu books, we show that starting with OCR labels for only 30% of the collection, this approach recognizes the remaining 70% of the words in the collection with 70% accuracy. Since the approach makes no language-specific assumptions, it should be applicable to a large number of languages. In particular, we are interested in its applicability to Indic languages and scripts.
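
The second stage described in the abstract, propagating labels from already recognized word-images to unrecognized ones, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses an exact brute-force nearest neighbor search in place of the approximate HKM-based search, and the function name `propagate_labels`, the toy two-dimensional feature vectors, and the placeholder labels are all assumptions made for illustration. In practice each word-image would be represented by a much longer feature vector, such as the profile features the abstract mentions.

```python
import math

def propagate_labels(labeled, unlabeled):
    """Give each unlabeled feature vector the label of its nearest labeled one.

    labeled:   list of (feature_vector, label) pairs from the first stage
    unlabeled: list of feature vectors for the remaining word-images
    Returns a list of (label, distance) pairs, one per unlabeled vector.
    NOTE: exact brute-force search, used here as a stand-in for the
    approximate HKM-based search named in the abstract.
    """
    results = []
    for feat in unlabeled:
        # Find the labeled example at the smallest Euclidean distance.
        nearest_feat, nearest_label = min(
            labeled, key=lambda pair: math.dist(feat, pair[0])
        )
        results.append((nearest_label, math.dist(feat, nearest_feat)))
    return results

# Hypothetical toy data: two recognized words and two unrecognized word-images.
labeled = [([0.0, 0.0], "word_a"), ([10.0, 10.0], "word_b")]
unlabeled = [[1.0, 0.0], [9.0, 11.0]]
predictions = propagate_labels(labeled, unlabeled)
```

Returning the distance alongside the label would allow a caller to reject matches that are too far from any recognized word, although the abstract does not say whether such a threshold is used.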

Publication Date
2010
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Pramod Sankar K., C. V. Jawahar and R. Manmatha. "Nearest Neighbor based Collection OCR" (2010)
Available at: http://works.bepress.com/r_manmatha/40/