An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems

Bibliographic Details
Main Authors: Nalinikanta Choudhury, Tanmaya Kumar Sahu, Atmakuri Ramakrishna Rao, Ajaya Kumar Rout, Bijay Kumar Behera
Format: Article
Language: English
Published: MDPI AG 2023-05-01
Series: Genes
Subjects: metagenomics; K-Means clustering; support vector machine; binning; river sediment
Online Access: https://www.mdpi.com/2073-4425/14/5/1082
collection DOAJ
description Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify microbial communities with high accuracy in many ecosystems. Conventionally, rule-based binning techniques are used to classify contigs or scaffolds based on either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify the newly identified unknown microbes. Clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train MLAs and develop prediction models for classifying unknown metagenomic sequences. We used metagenomic datasets of samples collected from the Ganga (Kanpur and Farakka) and the Yamuna (Delhi) rivers in India for clustering and for training the MLA models. The performance of the MLAs was evaluated by 10-fold cross-validation. The results showed that the Random Forest model outperformed the other learning algorithms considered. The proposed method can be used to annotate metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. Offline predictor source code with the best prediction model is available at https://github.com/Nalinikanta7/metagenomics.
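As a rough illustration of the binning and evaluation steps named in the description, the sketch below clusters toy scaffolds by dinucleotide composition with plain K-Means and builds 10-fold cross-validation splits. The feature choice (2-mer frequencies), the farthest-first initialization, the unshuffled folds, the toy sequences, and all function names are assumptions made for this example. The record does not say how the authors' iterative K-Means or their classifiers are implemented; their own code is at the GitHub URL given in the description.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of seq over the ACGT alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain K-Means with deterministic farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def kfold_splits(n, folds=10):
    """Yield (train_idx, test_idx) pairs; fold f tests on indices f, f+folds, ..."""
    for f in range(folds):
        test = list(range(f, n, folds))
        train = [i for i in range(n) if i % folds != f]
        yield train, test


# Hypothetical toy scaffolds: three AT-rich and three GC-rich sequences.
scaffolds = [
    "ATATTAATATTA", "TATAATTATAAT", "AATATATTATAT",
    "GCGCCGGCGCCG", "CGCGGCCGCGGC", "GGCGCGCCGCGC",
]
profiles = [kmer_profile(s, k=2) for s in scaffolds]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

In a workflow like the one described, the resulting cluster labels would be replaced by BLAST-derived class annotations before training a classifier, and each `(train, test)` pair from `kfold_splits` would drive one round of the 10-fold evaluation.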
first_indexed 2024-03-11T03:42:21Z
format Article
id doaj.art-257c17b637984e6e9879bd2eb3ebe330
institution Directory of Open Access Journals
issn 2073-4425
language English
last_indexed 2024-03-11T03:42:21Z
publishDate 2023-05-01
publisher MDPI AG
record_format Article
series Genes
doi 10.3390/genes14051082
volume 14
issue 5
article_number 1082
author_affiliation Nalinikanta Choudhury: ICAR—Indian Agricultural Research Institute, New Delhi 110012, India
Tanmaya Kumar Sahu: ICAR—Indian Grassland and Fodder Research Institute, Jhansi 284003, India
Atmakuri Ramakrishna Rao: ICAR—Indian Agricultural Statistics Research Institute, New Delhi 110012, India
Ajaya Kumar Rout: ICAR—Central Inland Fisheries Research Institute, West Bengal 700120, India
Bijay Kumar Behera: ICAR—Central Inland Fisheries Research Institute, West Bengal 700120, India
topic metagenomics
K-Means clustering
support vector machine
binning
river sediment
url https://www.mdpi.com/2073-4425/14/5/1082