A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health out...

Bibliographic Details
Main Authors:	Begüm D. Topçuoğlu, Nicholas A. Lesniak, Mack T. Ruffin, Jenna Wiens, Patrick D. Schloss
Format:	Article
Language:	English
Published:	American Society for Microbiology 2020-06-01
Series:	mBio
Subjects:	16S rRNA gene colon cancer machine learning microbial ecology microbiome
Online Access:	https://journals.asm.org/doi/10.1128/mBio.00434-20

collection	DOAJ
description	ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods call the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.
IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
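The abstract reports each model's performance as an AUROC with an interquartile range. As a rough illustration of how such a summary can be computed, the sketch below calculates AUROC from labels and predicted scores using the rank-based (Mann-Whitney) definition, then summarizes a list of AUROC values by median and IQR. It is not the authors' pipeline: the function names `auroc` and `summarize` are hypothetical, the choice of the median as the central value is an assumption, and the record does not say how the reported values were aggregated.

```python
import statistics


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen case (label 1)
    scores higher than a randomly chosen control (label 0).
    Ties count as half a win. Hypothetical helper, not from the paper."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def summarize(aurocs):
    """Return (median, first quartile, third quartile) of a list of
    AUROC values, e.g. one value per repeated evaluation (assumed)."""
    q1, median, q3 = statistics.quantiles(aurocs, n=4, method="inclusive")
    return median, q1, q3


if __name__ == "__main__":
    # Toy data only; these numbers are made up for illustration.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(summarize([0.60, 0.65, 0.70, 0.75, 0.80]))
```

Reporting a spread alongside the central value, as the abstract does, makes it easier to judge whether a difference such as 0.695 versus 0.680 is large relative to the variation between evaluations.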
first_indexed	2024-12-20T19:50:33Z
id	doaj.art-47b3a3d71e7741b48a1bbd42dbb2aeb9
institution	Directory of Open Access Journals
issn	2150-7511
last_indexed	2024-12-20T19:50:33Z
spelling	doaj.art-47b3a3d71e7741b48a1bbd42dbb2aeb9; 2022-12-21T19:28:19Z; eng; American Society for Microbiology; mBio; 2150-7511; 2020-06-01; vol. 11, no. 3; 10.1128/mBio.00434-20
Author affiliations:
Begüm D. Topçuoğlu: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
Nicholas A. Lesniak: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
Mack T. Ruffin: Department of Family Medicine and Community Medicine, Penn State Hershey Medical Center, Hershey, Pennsylvania, USA
Jenna Wiens: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
Patrick D. Schloss: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA

Similar Items