Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approac...
Main Authors: | Yilin Gao, Zifan Zhu, Fengzhu Sun |
---|---|
Format: | Article |
Language: | English |
Published: | KeAi Communications Co., Ltd. 2022-03-01 |
Series: | Synthetic and Systems Biotechnology |
Subjects: | Microbiome, Colorectal cancer, Metagenomic shotgun sequencing, Random forests |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2405805X22000059 |
_version_ | 1797205002492575744 |
---|---|
author | Yilin Gao Zifan Zhu Fengzhu Sun |
collection | DOAJ |
description | Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the microbial relative abundance profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes. |
first_indexed | 2024-04-24T08:44:12Z |
format | Article |
id | doaj.art-52bcb23ccb714555842dcc4d248d47a7 |
institution | Directory Open Access Journal |
issn | 2405-805X |
language | English |
last_indexed | 2024-04-24T08:44:12Z |
publishDate | 2022-03-01 |
publisher | KeAi Communications Co., Ltd. |
record_format | Article |
series | Synthetic and Systems Biotechnology |
spelling | doaj.art-52bcb23ccb714555842dcc4d248d47a7; 2024-04-16T14:11:40Z; eng; KeAi Communications Co., Ltd.; Synthetic and Systems Biotechnology; 2405-805X; 2022-03-01; vol. 7, no. 1, pp. 574-585; Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data; Yilin Gao, Zifan Zhu, Fengzhu Sun (corresponding author), all at Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, RRI 201, Los Angeles, CA, United States; http://www.sciencedirect.com/science/article/pii/S2405805X22000059; Microbiome; Colorectal cancer; Metagenomic shotgun sequencing; Random forests |
title | Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data |
topic | Microbiome Colorectal cancer Metagenomic shotgun sequencing Random forests |
url | http://www.sciencedirect.com/science/article/pii/S2405805X22000059 |
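
The evaluation design named in the description (relative abundance profiles, a random forests classifier, within-dataset cross-validation, and cross-dataset prediction) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' pipeline: the cohorts are synthetic, the function names are hypothetical, the "forest" is a simplified stand-in built from bagged one-split trees (stumps) rather than full decision trees, and AUC is used as the performance measure although the record does not say which metric the article reports.

```python
import random

def relative_abundance(counts):
    """Convert raw per-taxon read counts for one sample into proportions."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def fit_stump(X, y, features):
    """Find the (feature, threshold, sign) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            # sign=1 predicts a case when value > t; sign=-1 when value <= t
            err = sum((row[f] > t) != bool(label) for row, label in zip(X, y))
            for sign, e in ((1, err), (-1, len(y) - err)):
                if best is None or e < best[0]:
                    best = (e, f, t, sign)
    return best[1:]

def fit_forest(X, y, n_trees=50, seed=0):
    """Bagged stumps: bootstrap the samples, draw a random feature subset per tree."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(p), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_score(forest, row):
    """Fraction of trees voting 'case' for one abundance profile."""
    votes = sum((row[f] > t) == (sign == 1) for f, t, sign in forest)
    return votes / len(forest)

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cross_validate(X, y, k=5):
    """Within-dataset k-fold cross-validation; returns AUC over held-out scores."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        forest = fit_forest([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            scores[i] = predict_score(forest, X[i])
    return auc(scores, y)

def make_cohort(seed, n_samples=40, n_taxa=9, depth=1):
    """Synthetic cohort: taxon 0 is enriched in cases; depth scales read counts."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        counts = [rng.randint(50, 150) for _ in range(n_taxa)]
        if label:
            counts[0] += 300
        X.append(relative_abundance([c * depth for c in counts]))
        y.append(label)
    return X, y

cohort_a = make_cohort(seed=1)
cohort_b = make_cohort(seed=2, depth=3)  # deeper sequencing, same biology

within_auc = cross_validate(*cohort_a)
forest_a = fit_forest(*cohort_a)
cross_auc = auc([predict_score(forest_a, row) for row in cohort_b[0]], cohort_b[1])
print("within-dataset AUC:", within_auc)
print("cross-dataset AUC:", cross_auc)
```

Normalizing to relative abundance is what lets a model trained on one cohort be applied to another sequenced at a different depth. A real analysis would more likely use a full implementation such as scikit-learn's `RandomForestClassifier` on profiles produced by a taxonomic profiler.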