Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences b...

Full description

Bibliographic Details
Main Authors:	Arthur W Pightling, Nicholas Petronella, Franco Pagotto
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2014-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC4140716?pdf=render

_version_	1811194254928642048
author	Arthur W Pightling Nicholas Petronella Franco Pagotto
author_facet	Arthur W Pightling Nicholas Petronella Franco Pagotto
author_sort	Arthur W Pightling
collection	DOAJ
description	The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should test a variety of conditions to achieve optimal results.
first_indexed	2024-04-12T00:22:57Z
format	Article
id	doaj.art-03f4b591cbf74164b1c8e38a88823106
institution	Directory Open Access Journal
issn	1932-6203
language	English
last_indexed	2024-04-12T00:22:57Z
publishDate	2014-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
spelling	doaj.art-03f4b591cbf74164b1c8e38a888231062022-12-22T03:55:37ZengPublic Library of Science (PLoS)PLoS ONE1932-62032014-01-0198e10457910.1371/journal.pone.0104579Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.Arthur W PightlingNicholas PetronellaFranco PagottoThe wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should test a variety of conditions to achieve optimal results.http://europepmc.org/articles/PMC4140716?pdf=render
spellingShingle	Arthur W Pightling Nicholas Petronella Franco Pagotto Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses. PLoS ONE
title	Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.
title_full	Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.
title_fullStr	Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.
title_full_unstemmed	Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.
title_short	Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.
title_sort	choice of reference sequence and assembler for alignment of listeria monocytogenes short read sequence data greatly influences rates of error in snp analyses
url	http://europepmc.org/articles/PMC4140716?pdf=render
work_keys_str_mv	AT arthurwpightling choiceofreferencesequenceandassemblerforalignmentoflisteriamonocytogenesshortreadsequencedatagreatlyinfluencesratesoferrorinsnpanalyses AT nicholaspetronella choiceofreferencesequenceandassemblerforalignmentoflisteriamonocytogenesshortreadsequencedatagreatlyinfluencesratesoferrorinsnpanalyses AT francopagotto choiceofreferencesequenceandassemblerforalignmentoflisteriamonocytogenesshortreadsequencedatagreatlyinfluencesratesoferrorinsnpanalyses

Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

Similar Items