Unpublished Paper
Creating an Improved Version Using Noisy OCR from Multiple Editions
(2013)
  • David Wemhoener
  • Ismet Zeki
  • R. Manmatha, University of Massachusetts - Amherst
Abstract

This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that the scans come from the same copy of the book, or even from the same edition. The three OCR outputs are combined using an algorithm that builds on a technique for aligning two sequences at a time. The algorithm generates a multiple sequence alignment of the scans by zipping together pairwise alignments, then uses that alignment to construct a corrected text. It can remove an OCR error as long as the same error does not occur in multiple scans. The alignment works even if one of the editions includes an extra-long introduction or additional footnotes. The scheme is used to generate improved versions from OCR texts taken from the Internet Archive. The accuracy of the original scans and of the composite text is evaluated by comparing them to the version available from Project Gutenberg.
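
The combination step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes word-level tokens, uses Python's standard `difflib.SequenceMatcher` as a stand-in for the pairwise aligner, and "zips" the pairwise alignments by aligning the second and third scans to the first scan as a pivot. The names `align_to_pivot` and `combine` are hypothetical. Each position is then decided by majority vote, so an error is corrected only when it does not recur in another scan.

```python
from collections import Counter
from difflib import SequenceMatcher

GAP = None  # marks a pivot position with no counterpart in the other scan


def align_to_pivot(pivot, other):
    """Return a list, as long as `pivot`, holding the token of `other`
    aligned to each pivot position (or GAP where nothing lines up).

    Assumption: difflib stands in for the paper's pairwise aligner.
    """
    aligned = [GAP] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Pair tokens position by position; an equal-length "replace"
            # is treated as a run of candidate OCR substitutions.
            for k in range(i2 - i1):
                aligned[i1 + k] = other[j1 + k]
        # Insertions, deletions, and unequal replacements stay as GAP,
        # so extra material in another edition is simply skipped.
    return aligned


def combine(pivot, second, third):
    """Zip two pairwise alignments on the pivot and vote per position."""
    col2 = align_to_pivot(pivot, second)
    col3 = align_to_pivot(pivot, third)
    result = []
    for p, a, b in zip(pivot, col2, col3):
        votes = Counter(t for t in (p, a, b) if t is not GAP)
        token, count = votes.most_common(1)[0]
        # Accept a token only if at least two scans agree; otherwise
        # fall back to the pivot's reading.
        result.append(token if count >= 2 else p)
    return result


scan_a = "the qu1ck brown fox jumps".split()
scan_b = "preface the quick brown f0x jumps".split()
scan_c = "the quick brovvn fox jumps over".split()
print(" ".join(combine(scan_a, scan_b, scan_c)))
```

In this toy input each scan carries a different error, so the vote recovers the clean text, and the extra tokens in the second and third scans are ignored because they have no counterpart in the pivot. A pivot-based merge of this kind drops any text that is missing from the pivot scan, which is a limitation of the sketch rather than a claim about the paper's method.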

Keywords
  • OCR error correction
  • sequence alignment
  • scanned book collections
Publication Date
2013
Comments
This is the pre-published version harvested from CIIR.
Citation Information
David Wemhoener, Ismet Zeki and R. Manmatha. "Creating an Improved Version Using Noisy OCR from Multiple Editions" (2013)
Available at: http://works.bepress.com/r_manmatha/53/