"Approximative distance computation by random hashing" by Selim Mimaroglu

Selected Works of Dan Simovici

Follow Contact

Article

Approximative distance computation by random hashing

The Journal of Supercomputing (2011)

Selim Mimaroglu
Murat Yagci
Dan A. Simovici, University of Massachusetts Boston

Link

Abstract

We propose an approximate computation technique for inter-object distances of binary data sets. Our approach is based on locality sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, occurrences in the same bucket are counted and the exact Hamming distance is approximated based on the number of co-occurrences in all buckets. We parallelize the computation using mainly two schemes. The first assigns each random subspace to a processor for calculating the local co-occurrence matrix, where all the local co-occurrence matrices are combined into the final co-occurrence matrix. The second method provides the same distance approximation in longer runtimes by limiting the total message size in a parallel computing environment, which is especially useful for very large data sets generating immense message traffic. Our methods produce very accurate results, scale up well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and workstations with several processors demonstrate the usefulness of our methods.

Keywords

binary data sets,
hashing,
computation,
supercomputers,
workstations

Disciplines

Computer Sciences

Publication Date

May 26, 2011

Citation Information

Selim Mimaroglu, Murat Yagci and Dan A. Simovici. "Approximative distance computation by random hashing" The Journal of Supercomputing (2011)
Available at: http://works.bepress.com/dan_simovici/1/