Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier

Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be utilized to predict the geographical location based on city-specific microbio...

Bibliographic Details
Main Authors: Samuel Anyaso-Samuel, Archie Sachdeva, Subharup Guha, Somnath Datta
Format: Article
Language: English
Published: Frontiers Media S.A. 2021-04-01
Series: Frontiers in Genetics
Subjects: metagenomics, machine learning, ensemble classifier, microbiome, geolocation
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2021.642282/full
author Samuel Anyaso-Samuel
Archie Sachdeva
Subharup Guha
Somnath Datta
collection DOAJ
description Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be utilized to predict the geographical location based on city-specific microbiome samples. We implemented this idea by first utilizing standard bioinformatics procedures to pre-process the raw metagenomics samples provided by the CAMDA organizers. We then trained several component classifiers and a robust ensemble classifier with data generated from taxonomy-dependent and taxonomy-free approaches. We also implemented class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, we observed that the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of restricting the classification of metagenomic samples by source origin to a single classification algorithm. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as the best-performing component classifier.
first_indexed 2024-12-24T04:36:25Z
format Article
id doaj.art-4b3a846c050043d0a2e1aad008363dbf
institution Directory of Open Access Journals
issn 1664-8021
language English
last_indexed 2024-12-24T04:36:25Z
publishDate 2021-04-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Genetics
spelling doaj.art-4b3a846c050043d0a2e1aad008363dbf 2022-12-21T17:15:09Z eng Frontiers Media S.A. Frontiers in Genetics 1664-8021 2021-04-01 12 10.3389/fgene.2021.642282 642282 Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier Samuel Anyaso-Samuel Archie Sachdeva Subharup Guha Somnath Datta Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be utilized to predict the geographical location based on city-specific microbiome samples. We implemented this idea by first utilizing standard bioinformatics procedures to pre-process the raw metagenomics samples provided by the CAMDA organizers. We then trained several component classifiers and a robust ensemble classifier with data generated from taxonomy-dependent and taxonomy-free approaches. We also implemented class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, we observed that the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of restricting the classification of metagenomic samples by source origin to a single classification algorithm. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as the best-performing component classifier. https://www.frontiersin.org/articles/10.3389/fgene.2021.642282/full metagenomics machine learning ensemble classifier microbiome geolocation
title Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier
topic metagenomics
machine learning
ensemble classifier
microbiome
geolocation
url https://www.frontiersin.org/articles/10.3389/fgene.2021.642282/full