Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content
Biology Faculty Datasets
  • Joshua J. Faber-Hammond, Portland State University
  • Kim H. Brown, Portland State University
Document Type
Publication Date
  • Zebra danio -- Genetics
  • Zebra danio -- Mitochondrial DNA -- Analysis
  • Zebra danio -- Development

Zebrafish is the third vertebrate with an officially completed genome, yet the assembly remains incomplete: additions and corrections continue, and in the current release, GRCz10, 13% of zebrafish cDNA sequences remain unmapped. This disparity may result from population differences, given that the reference was generated from clonal individuals with limited genetic diversity. This explanation is supported by a recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million indels relative to the previous genome build, Zv9. Re-examination of this sequence dataset indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ~224 Mb of genomic sequence. Among these, 1,083 contigs were found to contain potential gene coding sequence. Comparison with RNA-seq data confirmed that 362 contigs contained transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in zebrafish. The bioinformatics pipeline developed in this study will bolster the zebrafish genome as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome, and will ultimately aid in the completion of additional genomes for a broad range of species.


This dataset is associated with a manuscript published in Zebrafish, March 2016, 13(2): 95-102.

The supplementary data sets in this file contain Excel tables (.xlsx), text file tables (.txt), sequence files (.fasta), and sequence chromatogram files. Chromatogram files must be viewed in programs such as "Sequence Scanner Software" or "TraceViewer", which are available for free download on the internet.

For preservation purposes, the .xlsx files (Supplementary Tables 2-5) were converted to OpenDocument Spreadsheet (.ods) files. The converted files are available and marked accordingly.
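The sequence files (.fasta) described above can be inspected without specialized software. The following is a minimal sketch, not part of the published pipeline, that counts the records in FASTA-formatted text and sums their lengths, the same kind of summary as the contig count and total sequence length reported in the abstract. The function names and the toy records are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence length in bases)."""
    count = 0
    total = 0
    for _, sequence in parse_fasta(lines):
        count += 1
        total += len(sequence)
    return count, total


# Toy records invented for illustration; not taken from the dataset.
example = """>contig_1
ACGTACGT
ACGT
>contig_2
GGCC
""".splitlines()

print(summarize_fasta(example))  # (2, 16)
```

To apply the same summary to one of the supplementary sequence files, open the file and pass the handle directly, for example `with open("contigs.fasta") as handle: print(summarize_fasta(handle))`, where `contigs.fasta` is a placeholder for the actual file name.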


This work is marked with CC0 1.0 Universal

Persistent Identifier
Citation Information
Faber-Hammond, Joshua J. and Brown, Kim H., "Data From: Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveals Novel Gene Content" (2015). Dataset.