Optimized quantification of intra-host viral diversity in SARS-CoV-2 and influenza virus sequence data

ABSTRACT High error rates of viral RNA-dependent RNA polymerases lead to diverse intra-host viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. However, accurate detection of minority varia...

Bibliographic Details
Main Authors: A. E. Roder, K. E. E. Johnson, M. Knoll, M. Khalfan, B. Wang, S. Schultz-Cherry, S. Banakis, A. Kreitman, C. Mederos, J.-H. Youn, R. Mercado, W. Wang, M. Chung, D. Ruchnewitz, M. I. Samanovic, M. J. Mulligan, M. Lässig, M. Luksza, S. Das, D. Gresham, E. Ghedin
Format: Article
Language: English
Published: American Society for Microbiology 2023-08-01
Series: mBio
Subjects: SARS-CoV-2, influenza, genomics, bioinformatics
Online Access: https://journals.asm.org/doi/10.1128/mbio.01046-23
_version_ 1827855207106609152
author A. E. Roder
K. E. E. Johnson
M. Knoll
M. Khalfan
B. Wang
S. Schultz-Cherry
S. Banakis
A. Kreitman
C. Mederos
J.-H. Youn
R. Mercado
W. Wang
M. Chung
D. Ruchnewitz
M. I. Samanovic
M. J. Mulligan
M. Lässig
M. Luksza
S. Das
D. Gresham
E. Ghedin
author_sort A. E. Roder
collection DOAJ
description ABSTRACT High error rates of viral RNA-dependent RNA polymerases lead to diverse intra-host viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. However, accurate detection of minority variants in viral sequence data is complicated by errors introduced during sample preparation and data analysis. We used synthetic RNA controls and simulated data to test seven variant-calling tools across a range of allele frequencies and simulated coverages. We show that choice of variant caller and use of replicate sequencing have the most significant impact on single-nucleotide variant (SNV) discovery and demonstrate how both allele frequency and coverage thresholds impact both false discovery and false-negative rates. When replicates are not available, using a combination of multiple callers with more stringent cutoffs is recommended. We use these parameters to find minority variants in sequencing data from SARS-CoV-2 clinical specimens and provide guidance for studies of intra-host viral diversity using either single replicate data or data from technical replicates. Our study provides a framework for rigorous assessment of technical factors that impact SNV identification in viral samples and establishes heuristics that will inform and improve future studies of intra-host variation, viral diversity, and viral evolution.
IMPORTANCE When viruses replicate inside a host cell, the virus replication machinery makes mistakes. Over time, these mistakes create mutations that result in a diverse population of viruses inside the host. Mutations that are neither lethal to the virus nor strongly beneficial can lead to minority variants that are minor members of the virus population. However, preparing samples for sequencing can also introduce errors that resemble minority variants, resulting in the inclusion of false-positive data if not filtered correctly. In this study, we aimed to determine the best methods for identification and quantification of these minority variants by testing the performance of seven commonly used variant-calling tools. We used simulated and synthetic data to test their performance against a true set of variants and then used these studies to inform variant identification in data from SARS-CoV-2 clinical specimens. Together, analyses of our data provide extensive guidance for future studies of viral diversity and evolution.
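The filtering logic the abstract describes (allele-frequency and coverage cutoffs, agreement between replicates or between callers, and scoring against a known truth set) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `SNV`, `filter_calls`, `consensus`, and `error_rates` are hypothetical, the cutoffs (1% allele frequency, 200x coverage) are placeholders rather than values taken from the study, and the toy calls are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SNV:
    pos: int  # 1-based genome position
    ref: str  # reference base
    alt: str  # variant base


def filter_calls(calls, min_af, min_cov):
    """Keep SNVs whose allele frequency and coverage meet both cutoffs.

    calls maps SNV -> (allele_frequency, coverage).
    """
    return {snv for snv, (af, cov) in calls.items()
            if af >= min_af and cov >= min_cov}


def consensus(call_sets, min_support):
    """Keep SNVs reported in at least min_support call sets.

    The call sets may come from technical replicates or from
    different variant callers run on the same sample.
    """
    counts = {}
    for call_set in call_sets:
        for snv in call_set:
            counts[snv] = counts.get(snv, 0) + 1
    return {snv for snv, n in counts.items() if n >= min_support}


def error_rates(called, truth):
    """Return (false discovery rate, false-negative rate) against a truth set."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return fdr, fnr


# Toy truth set and two replicate call sets (invented values).
truth = {SNV(100, "A", "G"), SNV(200, "C", "T"), SNV(300, "G", "A")}
rep1 = {
    SNV(100, "A", "G"): (0.050, 500),
    SNV(200, "C", "T"): (0.020, 800),
    SNV(300, "G", "A"): (0.008, 900),  # below the frequency cutoff
    SNV(400, "T", "C"): (0.011, 150),  # below the coverage cutoff
}
rep2 = {
    SNV(100, "A", "G"): (0.060, 450),
    SNV(200, "C", "T"): (0.025, 700),
    SNV(500, "A", "T"): (0.012, 300),  # false positive in one replicate only
}

MIN_AF, MIN_COV = 0.01, 200  # placeholder cutoffs, not from the study
kept1 = filter_calls(rep1, MIN_AF, MIN_COV)
kept2 = filter_calls(rep2, MIN_AF, MIN_COV)
both = consensus([kept1, kept2], min_support=2)

single_fdr, single_fnr = error_rates(kept2, truth)
both_fdr, both_fnr = error_rates(both, truth)
```

In this toy data, requiring agreement between the two replicates drops the one false positive seen only in the second replicate, while the low-frequency true variant at position 300 is still missed. That is the trade-off between false discovery and false-negative rates that the abstract refers to.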
first_indexed 2024-03-12T11:41:48Z
format Article
id doaj.art-960cd1ab0a3e4d508008c3865eb0df2a
institution Directory Open Access Journal
issn 2150-7511
language English
last_indexed 2024-03-12T11:41:48Z
publishDate 2023-08-01
publisher American Society for Microbiology
record_format Article
series mBio
spelling doaj.art-960cd1ab0a3e4d508008c3865eb0df2a
2023-08-31T15:04:20Z
eng
American Society for Microbiology
mBio, ISSN 2150-7511, vol. 14, no. 4, 2023-08-01
10.1128/mbio.01046-23
Optimized quantification of intra-host viral diversity in SARS-CoV-2 and influenza virus sequence data
A. E. Roder (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
K. E. E. Johnson (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
M. Knoll (Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA)
M. Khalfan (Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA)
B. Wang (Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA)
S. Schultz-Cherry (Department of Infectious Diseases, St. Jude Children's Research Hospital, Memphis, Tennessee, USA)
S. Banakis (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
A. Kreitman (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
C. Mederos (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
J.-H. Youn (Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA)
R. Mercado (Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA)
W. Wang (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
M. Chung (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
D. Ruchnewitz (Institute for Biological Physics, University of Cologne, Cologne, Germany)
M. I. Samanovic (Department of Medicine, New York University Langone Vaccine Center, New York, New York, USA)
M. J. Mulligan (Department of Medicine, New York University Langone Vaccine Center, New York, New York, USA)
M. Lässig (Institute for Biological Physics, University of Cologne, Cologne, Germany)
M. Luksza (Department of Oncological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA)
S. Das (Department of Laboratory Medicine, NIH, Bethesda, Maryland, USA)
D. Gresham (Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, USA)
E. Ghedin (Systems Genomics Section, Laboratory of Parasitic Diseases, DIR, NIAID, NIH, Bethesda, Maryland, USA)
https://journals.asm.org/doi/10.1128/mbio.01046-23
SARS-CoV-2
influenza
genomics
bioinformatics
title Optimized quantification of intra-host viral diversity in SARS-CoV-2 and influenza virus sequence data
topic SARS-CoV-2
influenza
genomics
bioinformatics
url https://journals.asm.org/doi/10.1128/mbio.01046-23