Analysis of error profiles in deep next-generation sequencing data
Abstract. Background: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors...
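Main Authors: | Xiaotu Ma, Ying Shao, Liqing Tian, Diane A. Flasch, Heather L. Mulder, Michael N. Edmonson, Yu Liu, Xiang Chen, Scott Newman, Joy Nakitandwe, Yongjin Li, Benshang Li, Shuhong Shen, Zhaoming Wang, Sheila Shurtleff, Leslie L. Robison, Shawn Levy, John Easton, Jinghui Zhang |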
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2019-03-01 |
Series: | Genome Biology |
Subjects: | Deep sequencing, Error rate, Substitution, Subclonal, Detection, Hotspot mutation |
Online Access: | http://link.springer.com/article/10.1186/s13059-019-1659-6 |
_version_ | 1818856005841190912 |
---|---|
author | Xiaotu Ma, Ying Shao, Liqing Tian, Diane A. Flasch, Heather L. Mulder, Michael N. Edmonson, Yu Liu, Xiang Chen, Scott Newman, Joy Nakitandwe, Yongjin Li, Benshang Li, Shuhong Shen, Zhaoming Wang, Sheila Shurtleff, Leslie L. Robison, Shawn Levy, John Easton, Jinghui Zhang |
author_sort | Xiaotu Ma |
collection | DOAJ |
description | Abstract. Background: Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions. Results: By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10⁻⁵ to 10⁻⁴, which is 10- to 100-fold lower than generally considered achievable (10⁻³) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10⁻⁵ for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10⁻⁴ for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR led to a ~6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at 0.1% to 0.01% frequency with the current NGS technology by applying in silico error suppression. Conclusions: We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing. |
first_indexed | 2024-12-19T08:17:37Z |
format | Article |
id | doaj.art-691d279d3c374eabafe2b420cf356568 |
institution | Directory Open Access Journal |
issn | 1474-760X |
language | English |
last_indexed | 2024-12-19T08:17:37Z |
publishDate | 2019-03-01 |
publisher | BMC |
record_format | Article |
series | Genome Biology |
spelling | doaj.art-691d279d3c374eabafe2b420cf356568; 2022-12-21T20:29:27Z; eng; BMC; Genome Biology; 1474-760X; 2019-03-01; vol. 20, no. 1, pp. 1-15; 10.1186/s13059-019-1659-6; Analysis of error profiles in deep next-generation sequencing data. Authors and affiliations: Xiaotu Ma, Ying Shao, Liqing Tian, Diane A. Flasch, Heather L. Mulder, Michael N. Edmonson, Yu Liu, Xiang Chen, Scott Newman, Yongjin Li, Zhaoming Wang, John Easton, Jinghui Zhang (Department of Computational Biology, St. Jude Children’s Research Hospital); Joy Nakitandwe, Sheila Shurtleff (Department of Pathology, St. Jude Children’s Research Hospital); Benshang Li, Shuhong Shen (Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine); Leslie L. Robison (Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital); Shawn Levy (HudsonAlpha Institute for Biotechnology). Online access: http://link.springer.com/article/10.1186/s13059-019-1659-6. Keywords: Deep sequencing, Error rate, Substitution, Subclonal, Detection, Hotspot mutation |
title | Analysis of error profiles in deep next-generation sequencing data |
topic | Deep sequencing, Error rate, Substitution, Subclonal, Detection, Hotspot mutation |
url | http://link.springer.com/article/10.1186/s13059-019-1659-6 |