cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries

Bibliographic Details
Main Authors: Meifang Qi, Utthara Nayar, Leif S. Ludwig, Nikhil Wagle, Esther Rheinbay
Format: Article
Language: English
Published: BMC 2021-12-01
Series: BMC Bioinformatics
Subjects: Contamination, Genomics, Software, Quality control, cDNA
Online Access: https://doi.org/10.1186/s12859-021-04529-2
collection DOAJ
description Abstract
Background: Exogenous cDNA introduced into an experimental system, either intentionally or accidentally, can appear as added read coverage over that gene in next-generation sequencing (NGS) libraries derived from this system. If not properly recognized and managed, this cross-contamination with exogenous signal can lead to incorrect interpretation of research results. Yet this problem is not routinely addressed in current sequence processing pipelines.
Results: We present cDNA-detector, a computational tool to identify and remove exogenous cDNA contamination in DNA sequencing experiments. We demonstrate that cDNA-detector can identify cDNAs quickly and accurately from alignment files. A source inference step attempts to separate endogenous cDNAs (retrocopied genes) from potential cloned, exogenous cDNAs. cDNA-detector also provides a mechanism to remove detected cDNAs from the alignment. Simulation studies show that cDNA-detector is highly sensitive and specific, outperforming existing tools. We apply cDNA-detector to several highly cited public databases (TCGA, ENCODE, NCBI SRA) and show that contaminant genes appear in sequencing experiments, where they lead to incorrect coverage peak calls.
Conclusions: cDNA-detector is a user-friendly and accurate tool to detect and remove cDNA contamination in NGS libraries. Its two-step design (detection, then removal) reduces the risk of removing true variants, since it allows for manual review of candidates. We find that contamination with intentionally and accidentally introduced cDNAs is an underappreciated problem even in widely used consortium datasets, where it can lead to spurious results. Our findings highlight the importance of sensitive detection and removal of contaminant cDNA from NGS libraries before downstream analysis.
first_indexed 2024-12-20T14:19:29Z
format Article
id doaj.art-e63abeda4c1d400b9d4b0c88a680ac57
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-20T14:19:29Z
publishDate 2021-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-e63abeda4c1d400b9d4b0c88a680ac57; 2022-12-21T19:37:58Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2021-12-01; vol. 22, no. 1, pp. 1-14; 10.1186/s12859-021-04529-2
Author affiliations:
Meifang Qi (0): Center for Cancer Research, Massachusetts General Hospital
Utthara Nayar (1): Harvard Medical School
Leif S. Ludwig (2): Harvard Medical School
Nikhil Wagle (3): Harvard Medical School
Esther Rheinbay (4): Center for Cancer Research, Massachusetts General Hospital
title cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries
topic Contamination
Genomics
Software
Quality control
cDNA
url https://doi.org/10.1186/s12859-021-04529-2