Machine learning approaches in microbiome research: challenges and best practices

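Microbiome data predictive analysis within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. It is demonstrated that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, the multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.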
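The abstract of this record refers to compositional transformations as a preprocessing step for microbiome data. As a rough illustration only, the sketch below applies a centered log-ratio (CLR) transform to one sample. The record does not say which transformation the authors used, so the choice of CLR, the function name `clr`, the pseudocount of 0.5, and the example counts are all assumptions made for this sketch.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value, not taken from the record) is added
    to every count so that zero counts do not produce log(0).
    """
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


# Hypothetical counts for four taxa in a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print(transformed)
```

CLR values for a sample sum to zero by construction, and the relative ordering of the taxa is preserved. Whether such a transform improves a downstream model is an empirical question; the abstract reports that it does not always do so.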

Bibliographic Details
Main Authors: Georgios Papoutsoglou, Sonia Tarazona, Marta B. Lopes, Thomas Klammsteiner, Eliana Ibrahimi, Julia Eckenberger, Pierfrancesco Novielli, Alberto Tonda, Andrea Simeon, Rajesh Shigdel, Stéphane Béreux, Giacomo Vitali, Sabina Tangaro, Leo Lahti, Andriy Temko, Marcus J. Claesson, Magali Berland
Format: Article
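Language: English
Published: Frontiers Media S.A. 2023-09-01
Series: Frontiers in Microbiology
Subjects: microbiome data analysis; machine learning methods; preprocessing; feature selection; predictive modeling; model selection
Online Access: https://www.frontiersin.org/articles/10.3389/fmicb.2023.1261889/full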
author Georgios Papoutsoglou
Sonia Tarazona
Marta B. Lopes
Thomas Klammsteiner
Eliana Ibrahimi
Julia Eckenberger
Pierfrancesco Novielli
Alberto Tonda
Andrea Simeon
Rajesh Shigdel
Stéphane Béreux
Giacomo Vitali
Sabina Tangaro
Leo Lahti
Andriy Temko
Marcus J. Claesson
Magali Berland
author_sort Georgios Papoutsoglou
collection DOAJ
description Microbiome data predictive analysis within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. It is demonstrated that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, the multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
first_indexed 2024-03-11T22:37:05Z
format Article
id doaj.art-a35769259f1946d3b9af729120cec4fd
institution Directory Open Access Journal
issn 1664-302X
language English
last_indexed 2024-03-11T22:37:05Z
publishDate 2023-09-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Microbiology
spelling doaj.art-a35769259f1946d3b9af729120cec4fd
2023-09-22T13:10:54Z
eng
Frontiers Media S.A.
Frontiers in Microbiology
1664-302X
2023-09-01
14
10.3389/fmicb.2023.1261889
1261889
Machine learning approaches in microbiome research: challenges and best practices
Georgios Papoutsoglou: Department of Computer Science, University of Crete, Heraklion, Greece; JADBio Gnosis DA S.A., Science and Technology Park of Crete, Heraklion, Greece
Sonia Tarazona: Department of Applied Statistics and Operations Research and Quality, Polytechnic University of Valencia, Valencia, Spain
Marta B. Lopes: Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal; Research and Development Unit for Mechanical and Industrial Engineering (UNIDEMI), Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
Thomas Klammsteiner: Department of Ecology, Universität Innsbruck, Innsbruck, Austria; Department of Microbiology, Universität Innsbruck, Innsbruck, Austria
Eliana Ibrahimi: Department of Biology, University of Tirana, Tirana, Albania
Julia Eckenberger: School of Microbiology, University College Cork, Cork, Ireland; APC Microbiome Ireland, Cork, Ireland
Pierfrancesco Novielli: Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy; National Institute for Nuclear Physics, Bari Division, Bari, Italy
Alberto Tonda: UMR 518 MIA-PS, INRAE, Paris-Saclay University, Palaiseau, France; Complex Systems Institute of Paris Ile-de-France (ISC-PIF) - UAR 3611 CNRS, Paris, France
Andrea Simeon: BioSense Institute, University of Novi Sad, Novi Sad, Serbia
Rajesh Shigdel: Department of Clinical Science, University of Bergen, Bergen, Norway
Stéphane Béreux: MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France; MaIAGE, INRAE, Paris-Saclay University, Jouy-en-Josas, France
Giacomo Vitali: MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France
Sabina Tangaro: Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy; National Institute for Nuclear Physics, Bari Division, Bari, Italy
Leo Lahti: Department of Computing, University of Turku, Turku, Finland
Andriy Temko: Department of Electrical and Electronic Engineering, University College Cork, Cork, Ireland
Marcus J. Claesson: School of Microbiology, University College Cork, Cork, Ireland; APC Microbiome Ireland, Cork, Ireland
Magali Berland: MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France
https://www.frontiersin.org/articles/10.3389/fmicb.2023.1261889/full
microbiome data analysis
machine learning methods
preprocessing
feature selection
predictive modeling
model selection
title Machine learning approaches in microbiome research: challenges and best practices
topic microbiome data analysis
machine learning methods
preprocessing
feature selection
predictive modeling
model selection
url https://www.frontiersin.org/articles/10.3389/fmicb.2023.1261889/full