Article
Optimization of I/O Intensive Genome Assemblies on the Cori Supercomputer with Burst Buffer
Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (2016)
  • Joshua Pritchett, Joint Genome Institute, Lawrence Berkeley National Laboratory
  • Bill Andreopoulos, Joint Genome Institute, Lawrence Berkeley National Laboratory
Abstract
Since the development of next-generation sequencing technologies, genome assembly has become one of the most computationally and I/O-intensive analyses performed on genomic data. The flood of genomic sequence data has increased the demand for more efficient genome assembly workflows. One of the assemblers used for this purpose, Falcon, a long-fragment sequence assembler, has several stages that are very I/O intensive because they read and write many files to disk. While the computationally expensive parts of assembly are more efficient with Falcon than with other assemblers, the I/O involved in reading and writing files to disk is a bottleneck. It has been observed that the wall-clock runtime of the Falcon pipeline can be reduced by copying data files to local disk rather than relying on the Lustre parallel file system or NFS. Making higher bandwidth available to the application should allow it to read and write large amounts of data faster, so that I/O is no longer a bottleneck for genome assembly.

In this article, we investigate using NERSC's Cori, a Cray XC40 supercomputer, to improve the runtimes of genome assemblies at the Joint Genome Institute. We find that Cori gives a significant runtime improvement over the older high-performance computing cluster when running I/O-intensive genome assemblies, which confirms the benefit of running assemblies on Cori. Using a local disk on a cluster node also improves runtime, as expected. However, the Burst Buffer technology, which is meant to act as local disk for Cori, did not give a significant improvement. This was due to the nature of I/O with Falcon, which involves writing many small files rather than one very large file.
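The contrast between many small files and one large file can be explored with a minimal sketch like the one below. It is not the benchmark used in the article: the function names, file counts, and sizes are hypothetical, and it simply writes the same number of bytes in both layouts into temporary directories and times each. Every file that is created adds open, create, and close operations on top of the data transfer, so the small-file layout may be slower on file systems where metadata operations are expensive, but the actual timings depend entirely on the storage being tested.

```python
import os
import tempfile
import time


def write_many_small(directory, n_files, size):
    """Write n_files separate files of `size` bytes; return (bytes, seconds)."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(n_files):
        # Each iteration pays for a file create, open, and close.
        with open(os.path.join(directory, f"part_{i:05d}.dat"), "wb") as fh:
            fh.write(payload)
    return n_files * size, time.perf_counter() - start


def write_one_large(directory, n_files, size):
    """Write the same total number of bytes into a single file."""
    payload = b"x" * size
    start = time.perf_counter()
    with open(os.path.join(directory, "all_parts.dat"), "wb") as fh:
        for _ in range(n_files):
            fh.write(payload)
    return n_files * size, time.perf_counter() - start


def compare(n_files=2000, size=4096, base_dir=None):
    """Run both layouts under base_dir (hypothetical parameter; defaults to
    the system temp directory) and report bytes, file counts, and timings."""
    with tempfile.TemporaryDirectory(dir=base_dir) as small_dir, \
            tempfile.TemporaryDirectory(dir=base_dir) as large_dir:
        small_bytes, small_s = write_many_small(small_dir, n_files, size)
        large_bytes, large_s = write_one_large(large_dir, n_files, size)
        small_files = len(os.listdir(small_dir))
        large_files = len(os.listdir(large_dir))
    return {
        "small_bytes": small_bytes,
        "large_bytes": large_bytes,
        "small_files": small_files,
        "large_files": large_files,
        "small_seconds": small_s,
        "large_seconds": large_s,
    }


if __name__ == "__main__":
    for key, value in compare().items():
        print(f"{key}: {value}")
```

Pointing `base_dir` at a directory on the file system of interest (for example a scratch or local-disk path, which is an assumption about the reader's environment) lets the same comparison be repeated on different storage tiers.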
Publication Date
October 2, 2016
DOI
10.1145/2975167.2985685
Citation Information
Joshua Pritchett and Bill Andreopoulos. "Optimization of I/O Intensive Genome Assemblies on the Cori Supercomputer with Burst Buffer." Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (2016), pp. 554-561.
Available at: http://works.bepress.com/william-andreopoulos/26/