ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data

Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-ba...


Bibliographic Details
Main Authors: Andrew J. Low, Adam G. Koziol, Paul A. Manninger, Burton Blais, Catherine D. Carrillo
Format: Article
Language: English
Published: PeerJ Inc. 2019-05-01
Series: PeerJ
Subjects: Whole Genome Sequence; Contamination; Bacteria; Quality; Bioinformatic; ConFindr
Online Access: https://peerj.com/articles/6995.pdf
collection DOAJ
description Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0–20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.
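The core idea in the description above, that a single-copy gene should show only one allele in an uncontaminated sample, can be illustrated with a toy sketch. This is not ConFindr's actual implementation: the function names, the pre-aligned equal-length read strings, and the `min_support` and `min_fraction` thresholds are all assumptions made for illustration.

```python
from collections import Counter


def multi_allelic_sites(aligned_reads, min_support=2, min_fraction=0.05):
    """Return column indices where a second base is well supported.

    Toy model (hypothetical, not ConFindr's code): `aligned_reads` are
    equal-length strings already aligned to one single-copy gene, with
    '-' marking positions a read does not cover.
    """
    sites = []
    if not aligned_reads:
        return sites
    for col in range(len(aligned_reads[0])):
        # Count only real bases in this column; skip gaps and ambiguity codes.
        counts = Counter(r[col] for r in aligned_reads if r[col] in "ACGT")
        depth = sum(counts.values())
        ranked = counts.most_common(2)
        if len(ranked) < 2:
            continue  # only one base observed: consistent with a single strain
        minor = ranked[1][1]
        # Require several supporting reads so isolated sequencing errors
        # are not mistaken for a second allele.
        if minor >= min_support and minor / depth >= min_fraction:
            sites.append(col)
    return sites


def looks_contaminated(aligned_reads, **thresholds):
    """Flag a sample if any position in the single-copy gene is multi-allelic."""
    return bool(multi_allelic_sites(aligned_reads, **thresholds))


clean = ["ACGTACGT"] * 20
mixed = ["ACGTACGT"] * 18 + ["ACGTTCGT"] * 2   # 10% of reads carry a second allele
noisy = ["ACGTACGT"] * 19 + ["ACGAACGT"]       # one read with a likely error
```

In this sketch, `mixed` is flagged because two reads agree on a different base at the same position, while `noisy` is not, since a single discordant read falls below the assumed support threshold.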
first_indexed 2024-03-09T06:25:45Z
format Article
id doaj.art-4a0def97e062450bac42d18214198c15
institution Directory Open Access Journal
issn 2167-8359
spelling doaj.art-4a0def97e062450bac42d18214198c15; 2023-12-03T11:21:33Z; eng
PeerJ Inc.; PeerJ; ISSN 2167-8359; 2019-05-01; volume 7; article e6995; DOI 10.7717/peerj.6995
Andrew J. Low, Adam G. Koziol, Paul A. Manninger, Burton Blais, Catherine D. Carrillo
Affiliation (all authors): Ottawa Laboratory (Carling), Canadian Food Inspection Agency, Ottawa, Ontario, Canada
topic Whole Genome Sequence
Contamination
Bacteria
Quality
Bioinformatic
ConFindr