Data From: Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation from Unmapped Sequence Reads
Biology Faculty Data Sets
- Human genome
Abstract
The completion of the Human Genome Reference (HGR) marked the beginning of the genomics era, yet despite its utility, its universal application is limited by the small number of individuals used in its development. This limitation is highlighted by high-quality sequence reads that fail to map to the HGR. Such reads generally represent 2-5% of total reads and may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports the unmappable reads, assembles these reads de novo for each individual, and then combines the assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse individuals from the 1000 Genomes Project, we identified 351,361 contigs covering 195.5 Mb of sequence not incorporated in GRCh38. Of these, 30,879 contigs are represented in multiple individuals, with ~40% showing high sequence complexity. Genomic coordinates were generated for 99.9% of contigs, with 52.5% exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments, and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR. More importantly, our data highlight that with this method, low-coverage (~10-20X) next-generation sequencing can still be used to identify novel unmapped sequences and to explore biological functions contributing to human phenotypic variation, disease, and functionality for personal genomic medicine.
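The "export unmappable reads" step of the pipeline described in the abstract can be illustrated with a minimal sketch. The record does not name the aligner or tools the authors used, so this is only an assumption-laden illustration: it supposes the alignments are available as SAM-format text, where bit 0x4 of the FLAG field marks an unmapped read. The function names `unmapped_reads` and `to_fastq` are hypothetical and are not part of the dataset or the authors' pipeline.

```python
from typing import Iterable, Iterator, Tuple

# SAM FLAG bit 0x4 means "segment unmapped" (per the SAM format specification).
UNMAPPED_FLAG = 0x4


def unmapped_reads(sam_lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each unmapped SAM record.

    Hypothetical helper: assumes tab-separated SAM text in which reads
    were aligned as single-end reads, as the abstract describes.
    """
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & UNMAPPED_FLAG:
            # Columns: 1 = QNAME, 10 = SEQ, 11 = QUAL (1-based).
            yield fields[0], fields[9], fields[10]


def to_fastq(records: Iterable[Tuple[str, str, str]]) -> str:
    """Format records as FASTQ text, ready for a de novo assembler."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)
```

In a workflow of this kind, the resulting FASTQ text would be written out per individual and passed to an assembler of choice; which assembler the authors used is not stated in this record.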
Citation Information
Faber-Hammond, Joshua J. and Brown, Kim H., "Data From: Anchored Pseudo-De Novo Assembly of Human Genomes Identifies Extensive Sequence Variation from Unmapped Sequence Reads" (2015). Dataset. https://doi.org/10.15760/data.1