Unpublished Paper
Indexing of Handwritten Historical Documents - Recent Progress
(2003)
  • R. Manmatha, University of Massachusetts - Amherst
  • Toni M. Rath
Abstract

Indexing and searching collections of handwritten archival documents and manuscripts has always been a challenge because handwriting recognizers perform poorly on such noisy documents. For a collection of documents written by a single author (or a few authors), one can apply a technique called word spotting: word images are segmented from the documents and then clustered by visual appearance. Annotation can then be performed on clusters rather than on documents.
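The clustering step can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used, so the greedy "leader" scheme below, the function name `leader_cluster`, and the `threshold` parameter are all assumptions chosen for brevity. The `distance` argument stands in for any word-image matching score.

```python
def leader_cluster(items, distance, threshold):
    """Greedy leader clustering (an illustrative assumption, not the paper's method).

    Each item joins the first cluster whose leader (its first member) lies
    within `threshold`; otherwise the item starts a new cluster.
    """
    clusters = []
    for item in items:
        for cluster in clusters:
            if distance(item, cluster[0]) <= threshold:
                cluster.append(item)
                break
        else:
            # No existing leader was close enough: start a new cluster.
            clusters.append([item])
    return clusters


# Toy usage: numbers stand in for word images, absolute difference for the
# image-matching distance.
groups = leader_cluster([1.0, 1.1, 5.0, 5.2, 0.9], lambda x, y: abs(x - y), 0.5)
```

Once such groups exist, a label assigned to one group applies to every word image in it, which is the saving the abstract refers to.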

Even given segmented pages, matching handwritten word images in historical documents is difficult because of variations in handwriting and noise in the images. We describe investigations into several matching techniques for word images: shape context matching, SSD (sum of squared differences) correlation, Euclidean Distance Mapping, and dynamic time warping. Experimental results show that dynamic time warping performs best, giving an average precision of around 70% on a test set of 2000 word images (from ten pages) from the George Washington corpus.
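For readers unfamiliar with dynamic time warping, here is a minimal sketch of the textbook recurrence. It assumes each word image has already been reduced to a sequence of per-column feature vectors; the abstract does not specify the features or the local cost, so the squared Euclidean cost and the name `dtw_distance` are illustrative assumptions.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW cost between two sequences of equal-dimension feature vectors."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: squared Euclidean distance (an assumption).
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match
                                 cost[i][j - 1],      # b[j-1] repeats a match
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


# A stretched copy of the same profile aligns at zero cost.
same = dtw_distance([(0.0,), (1.0,), (2.0,)], [(0.0,), (1.0,), (1.0,), (2.0,)])
```

The warping lets one column of a word image match several columns of another, which is why the method tolerates the horizontal stretching typical of handwriting.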

Dynamic time warping is relatively expensive computationally, and we will describe ways to speed up the computation so that the approach scales. Our immediate goal is to process a set of 100 page images, with a longer-term goal of processing all 6000 available pages.
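The abstract does not say which speed-ups are used. One widely used option, shown here purely as an assumption, is to restrict the warping path to a band around the diagonal (a Sakoe-Chiba band), which reduces the number of cells evaluated from n·m to roughly n·(2·band + 1). The name `dtw_banded` and the scalar features are illustrative.

```python
import math


def dtw_banded(a, b, band):
    """DTW over scalar sequences, restricted to |i - j| <= band."""
    n, m = len(a), len(b)
    # Widen the band if needed so the final cell (n, m) stays reachable.
    band = max(band, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


stretched = dtw_banded([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0], band=1)
diagonal_only = dtw_banded([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], band=0)
```

A narrow band trades accuracy for speed: alignments that need more stretching than the band allows are scored as worse matches than unconstrained warping would give.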

Publication Date
2003
Comments
This is the pre-published version harvested from CIIR.
Citation Information
R. Manmatha and Toni M. Rath. "Indexing of Handwritten Historical Documents - Recent Progress" (2003)
Available at: http://works.bepress.com/r_manmatha/22/