Skip to main content
Article
Approximative distance computation by random hashing
The Journal of Supercomputing (2011)
  • Selim Mimaroglu
  • Murat Yagci
  • Dan A. Simovici, University of Massachusetts Boston
Abstract

We propose an approximate computation technique for inter-object distances of binary data sets. Our approach is based on locality sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, occurrences in the same bucket are counted and the exact Hamming distance is approximated based on the number of co-occurrences in all buckets. We parallelize the computation using mainly two schemes. The first assigns each random subspace to a processor for calculating the local co-occurrence matrix, where all the local co-occurrence matrices are combined into the final co-occurrence matrix. The second method provides the same distance approximation in longer runtimes by limiting the total message size in a parallel computing environment, which is especially useful for very large data sets generating immense message traffic. Our methods produce very accurate results, scale up well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and workstations with several processors demonstrate the usefulness of our methods.

Keywords
  • binary data sets,
  • hashing,
  • computation,
  • supercomputers,
  • workstations
Disciplines
Publication Date
May 26, 2011
Citation Information
Selim Mimaroglu, Murat Yagci and Dan A. Simovici. "Approximative distance computation by random hashing" The Journal of Supercomputing (2011)
Available at: http://works.bepress.com/dan_simovici/1/