Analysis of error profiles in deep next-generation sequencing data

Bibliographic Details
Main Authors: Xiaotu Ma, Ying Shao, Liqing Tian, Diane A. Flasch, Heather L. Mulder, Michael N. Edmonson, Yu Liu, Xiang Chen, Scott Newman, Joy Nakitandwe, Yongjin Li, Benshang Li, Shuhong Shen, Zhaoming Wang, Sheila Shurtleff, Leslie L. Robison, Shawn Levy, John Easton, Jinghui Zhang
Format: Article
Language: English
Published: BMC 2019-03-01
Series: Genome Biology
Subjects: Deep sequencing, Error rate, Substitution, Subclonal, Detection, Hotspot mutation
Online Access: http://link.springer.com/article/10.1186/s13059-019-1659-6
collection DOAJ
description Abstract. Background: Sequencing errors are key confounding factors in detecting low-frequency genetic variants, which are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, the errors introduced at the various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing, are not comprehensively understood. In this study, we use current NGS technology to systematically investigate these error sources. Results: By evaluating read-specific error distributions, we find that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than the rate generally considered achievable (10^-3) in the current literature. We then quantify the substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.01% to 0.1% with current NGS technology by applying in silico error suppression. Conclusions: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.
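To illustrate why the background substitution error rate matters for detecting variants at very low frequency, the following is a minimal sketch, not the authors' method. It assumes errors at a site are independent and binomially distributed, and the names `binom_sf`, `min_alt_reads`, the depth of 100,000 reads, and the significance threshold `alpha` are all hypothetical choices made for illustration. For a given sequencing depth it finds the smallest number of variant-supporting reads that would be unlikely to arise from errors alone.

```python
from math import exp, lgamma, log, log1p


def log_binom_pmf(i: int, n: int, p: float) -> float:
    """Log of P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p))


def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        term = exp(log_binom_pmf(i, n, p))
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if i > n * p and term < total * 1e-17:
            break
    return min(1.0, total)


def min_alt_reads(depth: int, error_rate: float, alpha: float = 1e-6) -> int:
    """Smallest alt-read count whose probability under errors alone is < alpha.

    Assumption (illustrative only): errors at a site are independent and
    occur at a single fixed rate, which real data need not satisfy.
    """
    for k in range(depth + 2):
        if binom_sf(k, depth, error_rate) < alpha:
            return k
    return depth + 1  # not reached: binom_sf(depth + 1, ...) is 0.0


# Compare the generally assumed rate (1e-3) with the suppressed rates.
depth = 100_000  # hypothetical sequencing depth
for rate in (1e-3, 1e-4, 1e-5):
    k = min_alt_reads(depth, rate)
    print(f"error rate {rate:.0e}: need >= {k} alt reads "
          f"(allele frequency >= {100 * k / depth:.3f}%)")
```

Under these assumptions, a lower background error rate lowers the minimum number of supporting reads, and therefore the lowest allele frequency that can be distinguished from noise at a fixed depth. Because the abstract reports that error rates differ by substitution type, a per-type rate could be passed as `error_rate` rather than a single global value.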
first_indexed 2024-12-19T08:17:37Z
id doaj.art-691d279d3c374eabafe2b420cf356568
institution Directory Open Access Journal
issn 1474-760X
last_indexed 2024-12-19T08:17:37Z
Citation: Genome Biology (BMC), ISSN 1474-760X, 2019-03-01, volume 20, issue 1, pages 1-15, doi:10.1186/s13059-019-1659-6
Author affiliations:
Department of Computational Biology, St. Jude Children’s Research Hospital: Xiaotu Ma, Ying Shao, Liqing Tian, Diane A. Flasch, Heather L. Mulder, Michael N. Edmonson, Yu Liu, Xiang Chen, Scott Newman, Yongjin Li, Zhaoming Wang, John Easton, Jinghui Zhang
Department of Pathology, St. Jude Children’s Research Hospital: Joy Nakitandwe, Sheila Shurtleff
Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital: Leslie L. Robison
Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine: Benshang Li, Shuhong Shen
HudsonAlpha Institute for Biotechnology: Shawn Levy
topic Deep sequencing
Error rate
Substitution
Subclonal
Detection
Hotspot mutation