"Shared Data Science Infrastructure for Genomics Data" by Hamid Bagheri

Shared Data Science Infrastructure for Genomics Data

bioRxiv

Hamid Bagheri, Iowa State University
Usha Muppirala, Iowa State University
Andrew J. Severin, Iowa State University
Hridesh Rajan, Iowa State University

Document Type

Article

Publication Version

Submitted Manuscript

Publication Date

1-1-2018

DOI

10.1101/307777

Abstract

Building a scalable computational infrastructure to analyze the wealth of information in data repositories is difficult because of significant barriers to organizing, extracting, and analyzing the relevant data. Shared Data Science Infrastructures like Boa can be used to process and parse the data in large repositories more efficiently. Boa's main features are inspired by existing languages for data-intensive computing, and Boa can easily integrate data from biological data repositories. Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq's 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset, gain insight into the RefSeq genome assemblies and gene model annotations, and show that the quality of assemblies produced by the same assembler varies by species. Innovative methods are required to keep pace with our ability to produce biological data. The Shared Data Science Infrastructure BoaG can give researchers greater access to efficiently explore data in ways previously possible only for the most well-funded research groups. As a proof of concept for much larger datasets, we demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation.

Comments

This is a pre-print of the article Bagheri, Hamid, Usha Muppirala, Andrew J. Severin, and Hridesh Rajan. "Shared Data Science Infrastructure for Genomics Data." bioRxiv (2018): 307777. DOI: 10.1101/307777. Posted with permission.

Creative Commons License

Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International

The Authors

2018

File Format

application/pdf

Citation Information

Hamid Bagheri, Usha Muppirala, Andrew J. Severin and Hridesh Rajan. "Shared Data Science Infrastructure for Genomics Data" bioRxiv (2018) p. 307777
Available at: http://works.bepress.com/andrew-severin/27/