Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content
Biology Faculty Data Sets
- Zebra danio -- Genetics,
- Zebra danio -- Mitochondrial DNA -- Analysis,
- Zebra danio -- Development
Abstract
Zebrafish represents the third vertebrate with an officially completed genome, yet the genome remains incomplete: additions and corrections continue with the current release, GRCz10, in which 13% of zebrafish cDNA sequences are unmapped. This disparity may result from population differences, given that the reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million in/dels in the previous genome build, Zv9. Re-examination of this sequence dataset indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ~224 Mb of genomic sequence. Among these, 1,083 contigs were found to contain potential gene coding sequence. RNA-seq data comparison confirmed that 362 contigs contained transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in zebrafish. Applying the bioinformatics pipeline developed in this study will bolster the zebrafish genome as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.
Citation Information
Faber-Hammond, Joshua J. and Brown, Kim H., "Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content" (2015). Dataset. https://doi.org/10.15760/data.2