Article
Parallel Hash-Based EST Clustering Algorithm for Gene Sequencing
DNA and Cell Biology
  • Rameshreddy Mudhireddy
  • Fikret Erçal, Missouri University of Science and Technology
  • Ronald L. Frank, Missouri University of Science and Technology
Abstract

EST (expressed sequence tag) clustering is a simple yet effective method for discovering the genes present in a variety of species. Although using ESTs is a cost-effective approach to gene discovery, the amount of data, and hence the computational resources required, make it a very challenging problem: the time and storage requirements of EST clustering are prohibitively expensive. Existing tools have quadratic time complexity resulting from all-against-all sequence comparisons. With the rapid growth of EST data, we need better and faster clustering tools. In this paper, we present HECT (Hash based EST Clustering Tool), a novel time- and memory-efficient algorithm for EST clustering. We report that HECT can cluster a dataset of 10,000 human ESTs (also used in benchmarking d2_cluster) in 207 minutes on a 1 GHz Pentium III processor, which is 36 times faster than the original d2_cluster algorithm. A parallel version of HECT (PECT) was also developed and used to cluster 269,035 soybean EST sequences on the IA-32 Linux cluster at the National Center for Supercomputing Applications at UIUC. The parallel algorithm exhibited excellent speedup over its sequential counterpart, and its memory requirements are almost negligible, making it suitable for datasets of virtually any size. The performance of the proposed clustering algorithms is compared against other known clustering techniques, and the results are reported in the paper.
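
The abstract does not describe HECT's internals, so the following is not the authors' algorithm. It is only a minimal Python sketch of the general idea the abstract contrasts with all-against-all comparison: index every sequence by its k-mers in a hash table, then compare only pairs that land in the same bucket. The function names, the parameters `k` and `min_shared`, and the union-find merging step are all assumptions made for illustration; reverse complements and real similarity scoring are ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_by_shared_kmers(seqs, k=8, min_shared=2):
    """Group sequences that share at least min_shared distinct k-mers.

    Illustrative only: k and min_shared are hypothetical parameters,
    not values taken from the paper.
    """
    # Hash index: k-mer -> ids of the sequences that contain it.
    index = defaultdict(set)
    for sid, seq in enumerate(seqs):
        for km in set(kmers(seq, k)):
            index[km].add(sid)

    # Count shared k-mers only for pairs that co-occur in some bucket,
    # instead of scoring every possible pair of sequences.
    shared = defaultdict(int)
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Merge pairs that pass the threshold (transitive closure).
    parent = list(range(len(seqs)))
    for (a, b), n in shared.items():
        if n >= min_shared:
            parent[find(parent, a)] = find(parent, b)

    clusters = defaultdict(list)
    for sid in range(len(seqs)):
        clusters[find(parent, sid)].append(sid)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["ACGTACGTGGA", "CGTACGTGGAT", "TTTTCCCCAAAA"]
    print(cluster_by_shared_kmers(toy, k=4, min_shared=2))
```

On the toy input, the first two sequences overlap heavily and fall into one cluster, while the third shares no 4-mer with either and stays alone. A bucket that holds many sequences (for example, a k-mer from a repeat) still costs quadratic work within that bucket, so a practical tool would need to bound or filter such buckets.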

Department(s)
Computer Science
Second Department
Biological Sciences
Keywords and Phrases
  • EST Clustering
  • Hash
  • Human EST Dataset
  • Genetic programming (Computer science)
  • Human gene mapping
Document Type
Article - Journal
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2004 Mary Ann Liebert, Inc., All rights reserved.
Publication Date
01 Oct 2004
PubMed ID
15585119
Citation Information
Rameshreddy Mudhireddy, Fikret Erçal and Ronald L. Frank. "Parallel Hash-Based EST Clustering Algorithm for Gene Sequencing" DNA and Cell Biology Vol. 23 Iss. 10 (2004) p. 615-623 ISSN: 1044-5498; 1557-7430
Available at: http://works.bepress.com/ronald-frank/3/