"Partial Duplicate Detection for Large Book Collections" by Ismet Zeki Yalniz

Selected Works of R. Manmatha

Unpublished Paper

Partial Duplicate Detection for Large Book Collections

(2011)

Ismet Zeki Yalniz
Ethem F. Can
R. Manmatha, University of Massachusetts - Amherst

Abstract

A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words that appear only once in the book, kept in the order in which they occur in the text. These words are referred to as "unique words" and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the sequences of unique words from two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that the framework, called DUPNIQ, is fast and more accurate than traditional duplicate detection methods such as shingling. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 minutes using 350 cores, with precision 0.996 and recall 0.833, compared to precision 0.992 and recall 0.720 for shingling. The technique works on other languages as well and is demonstrated on a French dataset.
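
The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The whitespace tokenization, the lowercasing, the function names, and the normalization by the shorter sequence are all assumptions made for illustration; the abstract does not specify how the LCS result is turned into a duplicate decision.

```python
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in order of appearance.

    Assumption: words are lowercased and split on whitespace.
    """
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Standard dynamic programming, keeping only the previous row.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(text_a, text_b):
    """Hypothetical score: LCS length divided by the shorter unique-word sequence."""
    ua = unique_word_sequence(text_a)
    ub = unique_word_sequence(text_b)
    shorter = min(len(ua), len(ub))
    if shorter == 0:
        return 0.0
    return lcs_length(ua, ub) / shorter
```

A score near 1 suggests that one text's unique words appear almost entirely, and in the same order, in the other. Any threshold used to declare a partial duplicate would have to be chosen empirically and is not given here.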

Keywords

Information storage and retrieval,
digital libraries,
partial duplicate detection,
sequence matching,
unique words

Disciplines

Computer Sciences

Publication Date

2011

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Ismet Zeki Yalniz, Ethem F. Can and R. Manmatha. "Partial Duplicate Detection for Large Book Collections" (2011)
Available at: http://works.bepress.com/r_manmatha/41/