geneRFinder: gene finding in distinct metagenomic data complexities

Bibliographic Details
Main Authors: Raíssa Silva, Kleber Padovani, Fabiana Góes, Ronnie Alves
Format: Article
Language: English
Published: BMC 2021-02-01
Series: BMC Bioinformatics
Subjects: Gene prediction, Machine learning, Metagenomics
Online Access: https://doi.org/10.1186/s12859-021-03997-w
author Raíssa Silva
Kleber Padovani
Fabiana Góes
Ronnie Alves
collection DOAJ
description Abstract. Background: Microbes play a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (complex communities) and their interactions. The way they act is usually estimated by looking at the functions they perform in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. Many gene predictors are available to aid the gene annotation process, though they do not appropriately handle the complexities of metagenomic data. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across a novel benchmark by using only one pre-trained Random Forest model. On high-complexity metagenomes, the average prediction rates of geneRFinder differed by 54% and 64% from those of Prodigal and FragGeneScan, respectively. The specificity of geneRFinder showed the largest gap against FragGeneScan, 79 percentage points, and was 66 percentage points higher than that of Prodigal. According to McNemar's test, all percentage differences between the predictors' performances are statistically significant for all datasets at a 99% confidence level.
Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/ . We also provide novel, comprehensive benchmark data for gene prediction, based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark .
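The abstract compares predictors by specificity and tests the differences with McNemar's test. The sketch below illustrates how those two quantities can be computed for a pair of predictors evaluated on the same labeled sequences. It is not the authors' code: the function names `specificity` and `mcnemar_exact` are hypothetical, the toy labels and predictions are invented for illustration, and the exact (binomial) form of McNemar's test is assumed here because the record does not say which variant the paper uses.

```python
from math import comb


def specificity(labels, preds):
    """Share of true negatives (label 0) that the predictor also calls 0."""
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tn / (tn + fp)


def mcnemar_exact(labels, preds_a, preds_b):
    """Two-sided exact McNemar test on paired predictions.

    Only discordant pairs matter: b counts items predictor A gets right
    and B gets wrong, c counts the opposite case.
    """
    b = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b)
            if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b)
            if pa != y and pb == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the predictors never disagree on correctness
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (illustrative only): 1 = gene, 0 = non-gene.
labels  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
preds_b = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0]

spec_a = specificity(labels, preds_a)
spec_b = specificity(labels, preds_b)
b, c, p_value = mcnemar_exact(labels, preds_a, preds_b)
print(f"specificity A={spec_a:.3f} B={spec_b:.3f}; b={b}, c={c}, p={p_value:.5f}")
```

A difference would be called significant at the 99% level when the p-value falls below 0.01; with only ten toy items, the example is far too small to reach that threshold.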
format Article
id doaj.art-f8fdb0cb92414d48b53e2bf28b324477
institution Directory Open Access Journal
issn 1471-2105
language English
publishDate 2021-02-01
publisher BMC
series BMC Bioinformatics
spelling doaj.art-f8fdb0cb92414d48b53e2bf28b324477 | 2022-12-21T21:56:17Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2021-02-01 | volume 22, issue 1, pages 1-17 | 10.1186/s12859-021-03997-w | geneRFinder: gene finding in distinct metagenomic data complexities | Raíssa Silva (Vale Institute of Technology); Kleber Padovani (PPGCC, Federal University of Pará); Fabiana Góes (ICMC, University of São Paulo); Ronnie Alves (Vale Institute of Technology) | https://doi.org/10.1186/s12859-021-03997-w | Gene prediction; Machine learning; Metagenomics
title geneRFinder: gene finding in distinct metagenomic data complexities
topic Gene prediction
Machine learning
Metagenomics
url https://doi.org/10.1186/s12859-021-03997-w