Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life

Abstract Background It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training dat...


Bibliographic Details
Main Authors: Zhengqiao Zhao, Alexandru Cristian, Gail Rosen
Format: Article
Language: English
Published: BMC 2020-09-01
Series: BMC Bioinformatics
Subjects:
Online Access: http://link.springer.com/article/10.1186/s12859-020-03744-7
author Zhengqiao Zhao
Alexandru Cristian
Gail Rosen
collection DOAJ
description Abstract. Background: Current metagenomic classifiers struggle computationally to keep up with the pace of training data generated by genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to the training data, statically trained classifiers must be retrained on all data, which is highly inefficient. The rich literature on “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, runs in 1/4th of the non-incremental time with no accuracy loss. Conclusions: Classification evidently improves when the classifier has the most current knowledge at its disposal. It is therefore of utmost importance to make classifiers computationally tractable so that they can keep up with the data deluge. The incremental learning classifier can be updated efficiently without the cost of reprocessing and without access to the existing database, and it therefore saves storage as well as computational resources.
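The incremental update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multinomial naïve Bayes model over k-mer counts with add-one smoothing and a uniform class prior, and the names `IncrementalNB`, `update`, `classify`, and the k-mer length `K = 3` are all hypothetical choices made for illustration. The point it shows is that adding a new genome only touches the stored per-label counts, so earlier training sequences do not need to be kept or reprocessed.

```python
import math
from collections import Counter, defaultdict

K = 3  # illustrative k-mer length (assumption, not the paper's setting)


def kmers(seq, k=K):
    """Yield overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


class IncrementalNB:
    """Hypothetical count-based naive Bayes classifier with incremental updates."""

    def __init__(self, k=K):
        self.k = k
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total number of k-mers

    def update(self, label, sequence):
        """Fold one new reference sequence into the model.

        Only the stored counts change; previously seen sequences are not needed.
        """
        new_counts = Counter(kmers(sequence, self.k))
        self.counts[label].update(new_counts)
        self.totals[label] += sum(new_counts.values())

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood (uniform prior)."""
        vocab = 4 ** self.k
        read_counts = Counter(kmers(read, self.k))
        best_label, best_score = None, -math.inf
        for label, total in self.totals.items():
            score = sum(
                n * math.log((self.counts[label][kmer] + 1) / (total + vocab))
                for kmer, n in read_counts.items()
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy usage: train on two made-up "genomes", then add a third later.
model = IncrementalNB()
model.update("A", "ACGTACGTACGT")
model.update("B", "GGGCCCGGGCCC")
first = model.classify("ACGTAC")

model.update("C", "TTTTTTTTTTTT")  # new snapshot: no retraining on A or B
second = model.classify("TTTTT")
```

Because the model state is just per-label counts, an update costs time proportional to the new data only, which is the property the abstract attributes to incremental learning.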
first_indexed 2024-12-20T21:10:07Z
format Article
id doaj.art-e847785e00c340b587bc1bd2e52b027b
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-20T21:10:07Z
publishDate 2020-09-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-e847785e00c340b587bc1bd2e52b027b | 2022-12-21T19:26:33Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2020-09-01 | 211123 | 10.1186/s12859-020-03744-7 | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life | Zhengqiao Zhao (0); Alexandru Cristian (1); Gail Rosen (2) | (0) Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University; (1) Department of Computer Science, Drexel University; (2) Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University | http://link.springer.com/article/10.1186/s12859-020-03744-7 | Incremental learning; Naïve Bayes taxonomic classifier; RefSeq; Metagenomics
title Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life
topic Incremental learning
Naïve Bayes taxonomic classifier
RefSeq
Metagenomics
url http://link.springer.com/article/10.1186/s12859-020-03744-7