Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data

Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approac...

Bibliographic Details
Main Authors: Yilin Gao, Zifan Zhu, Fengzhu Sun
Format: Article
Language: English
Published: KeAi Communications Co., Ltd. 2022-03-01
Series: Synthetic and Systems Biotechnology
Subjects: Microbiome; Colorectal cancer; Metagenomic shotgun sequencing; Random forests
Online Access: http://www.sciencedirect.com/science/article/pii/S2405805X22000059
_version_ 1797205002492575744
author Yilin Gao
Zifan Zhu
Fengzhu Sun
author_facet Yilin Gao
Zifan Zhu
Fengzhu Sun
author_sort Yilin Gao
collection DOAJ
description Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the microbial relative abundance profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
first_indexed 2024-04-24T08:44:12Z
format Article
id doaj.art-52bcb23ccb714555842dcc4d248d47a7
institution Directory Open Access Journal
issn 2405-805X
language English
last_indexed 2024-04-24T08:44:12Z
publishDate 2022-03-01
publisher KeAi Communications Co., Ltd.
record_format Article
series Synthetic and Systems Biotechnology
spelling doaj.art-52bcb23ccb714555842dcc4d248d47a7
2024-04-16T14:11:40Z
eng
KeAi Communications Co., Ltd.
Synthetic and Systems Biotechnology
2405-805X
2022-03-01
vol. 7, no. 1, pp. 574-585
Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
Yilin Gao (0); Zifan Zhu (1); Fengzhu Sun (2)
(0), (1) Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, RRI 201, Los Angeles, CA, United States
(2) Corresponding author; Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, RRI 201, Los Angeles, CA, United States
http://www.sciencedirect.com/science/article/pii/S2405805X22000059
Microbiome; Colorectal cancer; Metagenomic shotgun sequencing; Random forests
spellingShingle Yilin Gao
Zifan Zhu
Fengzhu Sun
Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
Synthetic and Systems Biotechnology
Microbiome
Colorectal cancer
Metagenomic shotgun sequencing
Random forests
title Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
title_sort increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
topic Microbiome
Colorectal cancer
Metagenomic shotgun sequencing
Random forests
url http://www.sciencedirect.com/science/article/pii/S2405805X22000059
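
The abstract above describes two evaluation designs, within-dataset cross-validation and cross-dataset prediction, applied to microbial relative abundance profiles. The following is a minimal Python sketch of that bookkeeping only, not the authors' pipeline. Several things here are assumptions made for illustration: relative abundance is taken as a simple fraction of a sample's total counts (tools such as Centrifuge, MetaPhlAn2, and Bracken estimate abundances in their own ways), the cross-dataset design is taken to mean every ordered train/test pair of datasets, and all function names, taxon labels, and dataset labels are hypothetical. No classifier is fitted; a random forests implementation (for example scikit-learn's `RandomForestClassifier`) would be trained on each training split.

```python
import random
from itertools import permutations


def relative_abundance(counts):
    """Scale one sample's per-taxon counts so they sum to 1 (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: c / total for taxon, c in counts.items()}


def to_vector(profile, taxa):
    """Order a profile by a shared taxon list; taxa absent from the sample get 0."""
    return [profile.get(t, 0.0) for t in taxa]


def kfold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for within-dataset k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train), sorted(test)


def cross_dataset_pairs(dataset_names):
    """Every ordered (train, test) pair of distinct datasets (assumed design)."""
    return list(permutations(dataset_names, 2))


# Hypothetical toy input: counts per taxon for a single sample.
sample = {"taxon_A": 30, "taxon_B": 10}
profile = relative_abundance(sample)
# A shared taxon list keeps feature columns aligned across datasets.
vector = to_vector(profile, ["taxon_A", "taxon_B", "taxon_C"])
splits = list(kfold_splits(n_samples=10, k=5))
pairs = cross_dataset_pairs(["D1", "D2", "D3"])
```

Aligning every sample to one shared taxon list matters for the cross-dataset case, because a model trained on one dataset can only be applied to another if both use the same feature columns.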