Are differences in genomic data sets due to true biological variants or errors in genome assembly: an example from two chloroplast genomes.

DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second and third generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipe...

Bibliographic Details
Main Authors:	Zhiqiang Wu, Luke R Tembrock, Song Ge
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2015-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC4320078?pdf=render

Description:	DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second- and third-generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipelines have made rapid de novo genome assemblies possible. However, high-quality data are critically important for all investigations in the genomic era. We used two chloroplast genomes of one Oryza species (O. australiensis) to compare differences in sequence quality: one genome (GU592209) was obtained through Illumina sequencing and reference-guided assembly, and the other (KJ830774) was obtained via target enrichment libraries and shotgun sequencing. Based on the whole-genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330), with 99.2% sequence identity (SI), than KJ830774, with 98.8% SI; the opposite result was obtained when the SI values of the coding and noncoding regions of GU592209 and KJ830774 were compared. Additionally, the junctions between the two single-copy regions and the repeat copies in the chloroplast genome exhibited differences between the two genomes. Phylogenetic analyses were conducted using these sequences, and the different data sets yielded dissimilar topologies: the phylogenetic placements of the two individuals differed markedly depending on whether whole-genome sequence, SNP, or insertion and deletion (indel) data were used. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and noncoding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.
Collection:	DOAJ
Institution:	Directory Open Access Journal
ISSN:	1932-6203
Volume (Issue):	10 (2)
Article Number:	e0118019
DOI:	10.1371/journal.pone.0118019
Record ID:	doaj.art-e93d655d42b6498e9b0e75e25cd6e0e4
