Unpublished Paper
Finding Translations in Scanned Book Collections
(2012)
  • Ismet Zeki Yalniz
  • R. Manmatha, University of Massachusetts - Amherst
Abstract

This paper describes an approach for identifying translations of books in large scanned book collections with OCR errors. The method is based on the idea that although individual sentences do not necessarily preserve the word order when translated, a book must preserve the linear progression of ideas for it to be a valid translation. Consider two books in two different languages, say English and German. The English book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. Similarly, the book in German is represented by its sequence of words which appear only once. An English-German dictionary is used to transform the word sequence of the English book into German by translating individual words in place. It is not necessary to translate all the words and this method works even with small dictionaries. Both sequences are now in German and can, therefore, be aligned using a Longest Common Subsequence (LCS) algorithm. We describe two scoring functions TRANS-cs and TRANS-its which account for both the LCS length and the lengths of the original word sequences. Experiments demonstrate that TRANS-its is particularly successful in finding translations of books and outperforms several baselines including metadata search based on matching titles and authors. Experiments performed on a Europarl parallel corpus for four language pairs, English-Finnish, English-French, English-German, English-Spanish, and a scanned book collection of 50K English-German books show that the proposed method retrieves translations of books with an average MAP score of 1.0 and a speed of 10K book pair comparisons per second on a single core.
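
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, and the toy dictionary are all assumptions. The abstract does not define TRANS-cs or TRANS-its, so `illustrative_score` is a hypothetical placeholder that simply divides the LCS length by the shorter sequence length. The quadratic LCS routine is chosen for clarity and is not tuned for speed.

```python
from collections import Counter
from typing import Dict, List


def unique_word_sequence(text: str) -> List[str]:
    """Return the words that occur exactly once, in order of appearance."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def translate_in_place(seq: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Translate word by word; words missing from the dictionary stay as-is."""
    return [dictionary.get(w, w) for w in seq]


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (textbook dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def illustrative_score(src_text: str, tgt_text: str,
                       dictionary: Dict[str, str]) -> float:
    """Hypothetical normalization; NOT the paper's TRANS-cs or TRANS-its."""
    src = translate_in_place(unique_word_sequence(src_text), dictionary)
    tgt = unique_word_sequence(tgt_text)
    if not src or not tgt:
        return 0.0
    return lcs_length(src, tgt) / min(len(src), len(tgt))


if __name__ == "__main__":
    toy_dictionary = {"cat": "katze", "house": "haus"}  # deliberately incomplete
    english = "the cat sees the house"
    german = "die katze sieht das haus"
    print(illustrative_score(english, german, toy_dictionary))
```

In this toy example only two of the three unique English words have dictionary entries, yet both still align in order with the German sequence, which mirrors the abstract's point that a small dictionary can be enough.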

Keywords
  • Information Storage and Retrieval
  • digital libraries
  • Translation detection
  • sequence alignment
  • unique words
  • book collections
Publication Date
2012
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Ismet Zeki Yalniz and R. Manmatha. "Finding Translations in Scanned Book Collections" (2012)
Available at: http://works.bepress.com/r_manmatha/51/