"Data From: Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs" by Lindsay Adrian Holden

Selected Works of Kim H. Brown


Dataset

Data From: Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs

Biology Faculty Datasets

Lindsay Adrian Holden, Portland State University
Meharji Arumilli, University of Helsinki
Marjo K. Hytönen, University of Helsinki
Sruthi Hundi, University of Helsinki
Jarkko Salojärvi, University of Helsinki
Kim H Brown, Portland State University
Hannes Lohi, University of Helsinki

Document Type

Dataset

Publication Date

7-1-2018

Subjects

Dogs -- Genetics,
Genomics,
Genomes

Disciplines

Genomics

Abstract

Dogs are excellent animal models for human disease. They have extensive veterinary histories, pedigrees, and a unique genetic system due to breeding practices. Despite these advantages, one factor limiting their usefulness is the canine genome reference (CGR), which was assembled using a single purebred Boxer. Although a common practice, this results in many high-quality reads remaining unmapped. To address this, whole-genome sequence data from three breeds, Border Collie (n=26), Bearded Collie (n=7), and Entlebucher Sennenhund (n=8), were analyzed to identify novel, non-CGR genomic contigs using the previously validated pseudo-de novo assembly pipeline. We identified 256,957 novel contigs, and paired-end relationships together with BLAT scores provided 126,555 (49%) high-quality contigs with genomic coordinates containing 4.6 Mb of novel sequence absent from the CGR. These contigs close 12,503 known gaps, including 2.4 Mb containing partially missing sequences for 11.5% of Ensembl, 16.4% of RefSeq and 12.2% of canFam3.1+ CGR annotated genes and 1,748 unmapped contigs containing 2,366 novel gene variants. Examples for six disease-associated genes (SCARF2, RD3, COL9A3, FAM161A, RASGRP1 and DLX6) containing gaps or alternate splice variants missing from the CGR are also presented. These findings from non-reference breeds support the need for improvement of the current Boxer-only CGR to avoid missing important biological information. The inclusion of the missing gene sequences into the CGR will facilitate identification of putative disease mutations across diverse breeds and phenotypes.

Description

The data support a manuscript published in Scientific Reports: Holden, L. A., Arumilli, M., Hytönen, M. K., Hundi, S., Salojärvi, J., Brown, K. H., & Lohi, H. (2018). Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs. Scientific Reports, 8(1), 10862 (https://doi.org/10.1038/s41598-018-29190-3). The article is available in PDXScholar at https://archives.pdx.edu/ds/psu/26154

Data Description:

Data 1 File Type: .txt for use in Notepad or Excel. Description: List of predicted loci for secondary assembly contigs based on mapping one-end anchored read pairs in both the genome and assembly.
Data 2 File Type: .txt for use in Notepad or Excel. Description: List of high-quality predicted loci for secondary assembly contigs based on Bowtie2 mapping.
Data 3 File Type: .txt for use in Notepad or Excel. Description: Gene, contig, and clone annotation of the gaps in the CGR.
Data 4 File Type: .fasta for use in sequence-mapping programs such as Integrative Genomics Viewer and others. Description: Zipped file containing .fasta files of sequence contigs.
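
For readers who want to inspect the zipped .fasta contigs programmatically rather than in a viewer, the following is a minimal Python sketch that assumes only the standard FASTA layout (a header line starting with ">" followed by sequence lines). The function names and the archive name used in the usage note are hypothetical and are not taken from the dataset; the actual file names and header conventions inside the archive should be checked after download.

```python
import io
import zipfile
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an open FASTA text handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_lengths_from_zip(zip_path) -> Dict[str, int]:
    """Map each contig ID to its sequence length for every FASTA file in a zip.

    Assumption: FASTA members end in .fasta or .fa; other members are ignored.
    If two records share an ID, the later one overwrites the earlier one.
    """
    lengths: Dict[str, int] = {}
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.lower().endswith((".fasta", ".fa")):
                continue
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="ascii", errors="replace")
                for name, seq in read_fasta(text):
                    lengths[name] = len(seq)
    return lengths
```

Calling `contig_lengths_from_zip("contigs.zip")` with the path of the downloaded archive (the name here is a placeholder) returns a dictionary of contig lengths, which can be summed to get the total sequence length contained in the files.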

Rights

This work is marked with CC0 1.0 Universal

DOI

10.15760/data.3

Persistent Identifier

https://archives.pdx.edu/ds/psu/26103

Citation Information

Holden, Lindsay Adrian; Arumilli, Meharji; Hytönen, Marjo K.; Hundi, Sruthi; Salojärvi, Jarkko; Brown, Kim H.; and Lohi, Hannes, "Data From: Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs" (2018). Biology Faculty Datasets. 3. https://doi.org/10.15760/data.3