Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation

Background Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. T...

Bibliographic Details
Main Authors: Akbar Adjie Pratama, Benjamin Bolduc, Ahmed A. Zayed, Zhi-Ping Zhong, Jiarong Guo, Dean R. Vik, Maria Consuelo Gazitúa, James M. Wainaina, Simon Roux, Matthew B. Sullivan
Format: Article
Language: English
Published: PeerJ Inc. 2021-06-01
Series: PeerJ
Subjects: Benchmarks; Standard operating procedure; Viruses; Viromics; Ecology
Online Access: https://peerj.com/articles/11447.pdf
author Akbar Adjie Pratama
Benjamin Bolduc
Ahmed A. Zayed
Zhi-Ping Zhong
Jiarong Guo
Dean R. Vik
Maria Consuelo Gazitúa
James M. Wainaina
Simon Roux
Matthew B. Sullivan
collection DOAJ
description Background Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tools, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). Results The in silico benchmarking of five commonly used virus identification tools shows that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k-mer- and BLAST-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k-mer- and BLAST-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% for bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignments ranging from ∼95% (whole genomes) down to ∼80% (3 kbp-sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments rather than full genomes. Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. Conclusion Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses ‘hidden’ in diverse sequence datasets.
first_indexed 2024-03-09T06:33:24Z
format Article
id doaj.art-cf9912e52ab448e0a4015084874e068b
institution Directory Open Access Journal
issn 2167-8359
language English
last_indexed 2024-03-09T06:33:24Z
publishDate 2021-06-01
publisher PeerJ Inc.
record_format Article
series PeerJ
spelling doaj.art-cf9912e52ab448e0a4015084874e068b
2023-12-03T11:02:12Z
eng
PeerJ Inc.
PeerJ
2167-8359
2021-06-01
9
e11447
10.7717/peerj.11447
Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation
Akbar Adjie Pratama (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Benjamin Bolduc (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Ahmed A. Zayed (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Zhi-Ping Zhong (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Jiarong Guo (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Dean R. Vik (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Maria Consuelo Gazitúa (Viromica Consulting, Santiago, Chile)
James M. Wainaina (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
Simon Roux (DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America)
Matthew B. Sullivan (Department of Microbiology, Ohio State University, Columbus, OH, United States of America)
https://peerj.com/articles/11447.pdf
Benchmarks
Standard operating procedure
Viruses
Viromics
Ecology
title Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation
topic Benchmarks
Standard operating procedure
Viruses
Viromics
Ecology
url https://peerj.com/articles/11447.pdf