Dataset
Data From: Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation from Unmapped Sequence Reads
Biology Faculty Data Sets
  • Joshua J. Faber-Hammond, Portland State University
  • Kim H. Brown, Portland State University
Document Type
Dataset
Publication Date
1-1-2015
Subjects
  • Human genome
Disciplines
Abstract
Completion of the Human Genome Reference (HGR) marked the beginning of the genomics era, yet despite its utility, universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads that fail to map within the HGR. Sequences failing to map generally represent 2-5% of total reads and may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual, then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals, with ~40% showing high sequence complexity. Genomic coordinates were generated for 99.9%, with 52.5% exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments, and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly, our data highlight that, with this method, low-coverage (~10-20X) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease, and functionality for personal genomic medicine.
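The "exports unmappable reads" step in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the alignments are available as SAM text and relies only on the SAM convention that FLAG bit 0x4 marks an unmapped segment. The function name `unmapped_reads_as_fastq` and the toy records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads_as_fastq(sam_lines):
    """Yield FASTQ records for reads whose SAM FLAG has the unmapped bit set.

    Hypothetical helper for illustration; a real workflow would normally
    use an established toolkit rather than hand-parsing SAM text.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & UNMAPPED_FLAG:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


# Toy input (made up): one mapped read and one unmapped read.
demo_sam = [
    "@HD\tVN:1.6\n",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tFFFF\n",
]
records = list(unmapped_reads_as_fastq(demo_sam))
```

Reads exported this way would then be passed to a de novo assembler per individual, as the abstract describes; that step is outside the scope of this sketch.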
Description

The data supports a manuscript published in Human Genetics titled "Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation From Unmapped Sequence Reads" (2016). https://doi.org/10.1007/s00439-016-1667-5

The supplementary data sets in this file contain Excel tables (.xlsx), tab-style text tables (.txt), sequence files (.fasta), and a compressed text file (.gz).
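As a starting point for working with the sequence files, the following sketch counts contigs and sums their lengths from FASTA text. It assumes standard FASTA layout (a `>` header line followed by sequence lines); the helper names `open_text` and `fasta_lengths` and the demo records are hypothetical and do not refer to actual file names in this dataset.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_lengths(handle):
    """Return {record_name: sequence_length} for FASTA text in `handle`."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first token of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy FASTA text (made up) standing in for a real .fasta file.
demo = io.StringIO(">contig_1 sample=demo\nACGT\nACGT\n>contig_2\nGGA\n")
lengths = fasta_lengths(demo)
total_bases = sum(lengths.values())
```

For a file on disk, `fasta_lengths(open_text(path))` would give the same per-record summary, with `path` pointing at whichever downloaded file is being inspected.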

DOI
10.15760/data.1
Persistent Identifier
http://archives.pdx.edu/ds/psu/16928
Citation Information
Faber-Hammond, Joshua J. and Brown, Kim H., "Data From: Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation from Unmapped Sequence Reads" (2015). Dataset. https://doi.org/10.15760/data.1