Gene-based microbiome representation enhances host phenotype classification

ABSTRACT With the concomitant advances in both the microbiome and machine learning fields, the gut microbiome has become of great interest for the potential discovery of biomarkers to be used in the classification of the host health status. Shotgun metagenomics data derived from the human microbiome...

Bibliographic Details
Main Authors: Thomas Deschênes, Fred Wilfried Elom Tohoundjona, Pier-Luc Plante, Vincenzo Di Marzo, Frédéric Raymond
Format: Article
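Language: English
Published: American Society for Microbiology 2023-08-01
Series: mSystems
Subjects: microbiome, machine learning, metagenomics, shotgun microbiome, feature selection, gene clusters
Online Access: https://journals.asm.org/doi/10.1128/msystems.00531-23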
author Thomas Deschênes
Fred Wilfried Elom Tohoundjona
Pier-Luc Plante
Vincenzo Di Marzo
Frédéric Raymond
collection DOAJ
description ABSTRACT With concurrent advances in the microbiome and machine learning fields, the gut microbiome has attracted great interest as a potential source of biomarkers for classifying host health status. Shotgun metagenomics data derived from the human microbiome comprise a high-dimensional set of microbial features. Using such complex data to model host-microbiome interactions remains a challenge because retaining de novo content yields a highly granular set of microbial features. In this study, we compared the prediction performance of machine learning approaches across different types of data representations derived from shotgun metagenomics. These representations include the commonly used taxonomic and functional profiles and the more granular gene cluster approach. For the five case-control datasets used in this study (type 2 diabetes, obesity, liver cirrhosis, colorectal cancer, and inflammatory bowel disease), gene-based approaches, whether used alone or in combination with reference-based data types, yielded classification performance similar to or better than that of the taxonomic and functional profiles. In addition, we show that using subsets of gene families from specific functional categories highlights the importance of these functions for the host phenotype. This study demonstrates that both reference-free microbiome representations and curated metagenomic annotations can provide relevant representations for machine learning based on metagenomic data.
IMPORTANCE Data representation is an essential part of machine learning performance when using metagenomic data. In this work, we show that different microbiome representations yield varying host phenotype classification performance depending on the dataset. In classification tasks, untargeted microbiome gene content can provide similar or improved classification compared to taxonomic profiling. Feature selection based on biological function also improves classification performance for some pathologies. Function-based feature selection combined with interpretable machine learning algorithms can generate new hypotheses that can potentially be tested mechanistically. This work thus proposes new approaches to representing microbiome data for machine learning that can potentiate the findings associated with metagenomic data.
first_indexed 2024-03-12T11:42:17Z
format Article
id doaj.art-18e7e7187a95493599fc20eac3097c7f
institution Directory of Open Access Journals
issn 2379-5077
language English
last_indexed 2024-03-12T11:42:17Z
publishDate 2023-08-01
publisher American Society for Microbiology
record_format Article
series mSystems
spelling doaj.art-18e7e7187a95493599fc20eac3097c7f
2023-08-31T13:00:43Z
eng
American Society for Microbiology
mSystems
2379-5077
2023-08-01
Volume 8, Issue 4
10.1128/msystems.00531-23
Author affiliation (all five authors): Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
title Gene-based microbiome representation enhances host phenotype classification
topic microbiome
machine learning
metagenomics
shotgun microbiome
feature selection
gene clusters
url https://journals.asm.org/doi/10.1128/msystems.00531-23