"A Statistical Approach to Retrieving Historical Manuscript Images without Recognition" by Toni M. Rath

Selected Works of R. Manmatha

Unpublished Paper

A Statistical Approach to Retrieving Historical Manuscript Images without Recognition

(2003)

Toni M. Rath
Victor Lavrenko
R. Manmatha, University of Massachusetts - Amherst

Abstract

Handwritten historical document collections in libraries and other archives are often of interest to researchers, students, or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as ASCII text). Several solutions are possible: manual annotation (very expensive), handwriting recognition (poor results), and word spotting, an image matching approach (computationally expensive).

In this work, we present a novel retrieval approach for historical document collections that does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we extract discrete feature vectors and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e., a word image), we can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well, with a mean average precision of 84% for 4-word queries on a subset of George Washington's manuscripts.
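
The estimation and retrieval steps described above can be illustrated with a minimal sketch. The abstract does not specify the estimator, so this example assumes a simple factorization, P(w, f) = P(w) ∏ P(f_i | w), with add-alpha smoothing; it is a stand-in for illustration, not the authors' model. The function names (`train`, `joint`, `conditional`, `rank`), the feature values ("tall", "wide", and so on), and the toy labels are all hypothetical.

```python
from collections import Counter, defaultdict

def train(samples, alpha=1.0):
    """Count labels and per-position feature values from labeled word images.

    samples: list of (features, label), where features is a tuple of
    discrete tokens. alpha is an assumed smoothing constant.
    """
    label_counts = Counter()
    feat_counts = defaultdict(Counter)   # label -> counts of (position, value)
    values = defaultdict(set)            # position -> values seen in training
    for features, label in samples:
        label_counts[label] += 1
        for pos, val in enumerate(features):
            feat_counts[label][(pos, val)] += 1
            values[pos].add(val)
    return label_counts, feat_counts, values, alpha

def joint(model, features, label):
    """Smoothed estimate of P(label, features) under the assumed factorization."""
    label_counts, feat_counts, values, alpha = model
    p = label_counts[label] / sum(label_counts.values())
    for pos, val in enumerate(features):
        num = feat_counts[label][(pos, val)] + alpha
        den = label_counts[label] + alpha * len(values[pos])
        p *= num / den
    return p

def conditional(model, features):
    """P(label | features) for every label in the training vocabulary."""
    scores = {w: joint(model, features, w) for w in model[0]}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

def rank(model, images, query_word):
    """Order unlabeled feature vectors by P(query_word | features), best first."""
    return sorted(images,
                  key=lambda f: conditional(model, f).get(query_word, 0.0),
                  reverse=True)

# Toy training set with hypothetical discretized features.
samples = [
    (("tall", "wide"), "Washington"),
    (("tall", "wide"), "Washington"),
    (("short", "narrow"), "the"),
    (("short", "wide"), "orders"),
]
model = train(samples)
posterior = conditional(model, ("tall", "wide"))
ranked = rank(model, [("short", "narrow"), ("tall", "wide")], "Washington")
```

Because each word image receives a probability for every vocabulary label, a text query can be answered by ranking images on the probability of the query word, with no hard recognition decision made for any image.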

Disciplines

Computer Sciences

Publication Date

2003

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Toni M. Rath, Victor Lavrenko and R. Manmatha. "A Statistical Approach to Retrieving Historical Manuscript Images without Recognition" (2003)
Available at: http://works.bepress.com/r_manmatha/21/