"Comprehensive Analysis of Non Redundant Protein Database" by Hamid Bagheri

Comprehensive Analysis of Non Redundant Protein Database

Research Square

Hamid Bagheri, Iowa State University
Robert Dyer, University of Nebraska – Lincoln
Andrew J. Severin, Iowa State University
Hridesh Rajan, Iowa State University

Document Type

Article

Publication Version

Submitted Manuscript

Publication Date

8-19-2020

DOI

10.21203/rs.3.rs-54568/v1

Abstract

Background: Scientists around the world use NCBI’s non-redundant (NR) database, typically through BLAST, to identify the taxonomic origin and functional annotation of protein sequences of interest. Because this database has grown exponentially, many scientists do not have a good understanding of its contents. Tools are needed to explore the contents of large biological datasets such as NR, so that users can better understand the assumptions and limitations of the data they contain.

Results: Protein sequence data, protein functional annotations, and taxonomic assignments from NCBI’s NR database were loaded into a database built on BoaG, a domain-specific language and shared data science infrastructure for genomics, along with CD-HIT clusterings of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently query this large dataset to determine the average length of protein sequences and to identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the NR database has a considerable amount of annotation redundancy at the 95% similarity level.
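To make the kind of query described in the Results concrete, the following is a minimal Python sketch that computes the average sequence length and the most common taxonomic assignments from NR-style FASTA text. It is not BoaG code and not the authors' implementation. It assumes the usual NR defline layout, in which each entry ends with the organism name in square brackets and the deflines of identical sequences merged into one record are separated by the Ctrl-A character (`\x01`). The sample records, accession numbers, and function names (`parse_fasta`, `taxa_in_defline`, `summarize`) are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical NR-style records (made-up accessions and sequences).
# The second record merges two deflines with Ctrl-A (\x01), which is how
# NR lists several source entries that share one identical sequence.
SAMPLE_FASTA = """\
>WP_000001.1 hypothetical protein [Escherichia coli]
MKTAYIAKQR
>XP_000002.1 kinase [Homo sapiens]\x01NP_000003.1 kinase [Pan troglodytes]
MSEQ
>WP_000004.1 hypothetical protein [Escherichia coli]
MKV
LLAG
"""

# Assumption: the organism name is the last bracketed field of each entry.
TAXON_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def parse_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def taxa_in_defline(defline):
    """Return the organism names of all entries merged into one defline."""
    taxa = []
    for entry in defline.split("\x01"):
        match = TAXON_RE.search(entry)
        if match:
            taxa.append(match.group(1))
    return taxa


def summarize(text):
    """Return (average sequence length, Counter of taxonomic assignments)."""
    lengths = []
    taxa = Counter()
    for defline, sequence in parse_fasta(text):
        lengths.append(len(sequence))
        taxa.update(taxa_in_defline(defline))
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return average, taxa


if __name__ == "__main__":
    average_length, taxon_counts = summarize(SAMPLE_FASTA)
    print(f"average length: {average_length:.1f}")
    print("most common taxa:", taxon_counts.most_common(2))
```

A single-pass script like this only illustrates the logic of the query. The full NR database is far too large for this approach to be practical, which is the gap that shared infrastructure such as BoaG is meant to fill.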

Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others.

Availability and implementation: The web interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to log in. Source code and other documentation are also provided in a GitHub repository: https://github.com/boalang/NR_Dataset.

Comments

This is a pre-print of the article Bagheri, Hamid, Robert Dyer, Andrew Severin, and Hridesh Rajan. "Comprehensive Analysis of Non Redundant Protein Database." Research Square (2020). DOI: 10.21203/rs.3.rs-54568/v1. Posted with permission.

Creative Commons License

Creative Commons Attribution 4.0 International

The Author(s)

2020

File Format

application/pdf

Citation Information

Hamid Bagheri, Robert Dyer, Andrew J. Severin and Hridesh Rajan. "Comprehensive Analysis of Non Redundant Protein Database" Research Square (2020)
Available at: http://works.bepress.com/andrew-severin/32/