Robust Machine Learning for Colorectal Cancer Risk Prediction and Stratification

Bibliographic Details
Main Authors: Bradley J. Nartowt, Gregory R. Hart, Wazir Muhammad, Ying Liang, Gigi F. Stark, Jun Deng
Format: Article
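Language: English
Published: Frontiers Media S.A. 2020-03-01
Series: Frontiers in Big Data
Subjects: colorectal cancer; risk stratification; neural network; concordance; self-reportable health data; external validation
Online Access: https://www.frontiersin.org/article/10.3389/fdata.2020.00006/full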
author Bradley J. Nartowt
Gregory R. Hart
Wazir Muhammad
Ying Liang
Gigi F. Stark
Jun Deng
collection DOAJ
description While colorectal cancer (CRC) is third in prevalence and mortality among cancers in the United States, there is no effective method to screen the general public for CRC risk. In this study, to identify an effective mass screening method for CRC risk, we evaluated seven supervised machine learning algorithms: linear discriminant analysis, support vector machine, naive Bayes, decision tree, random forest, logistic regression, and artificial neural network. Models were trained and cross-tested with the National Health Interview Survey (NHIS) and the Prostate, Lung, Colorectal, Ovarian Cancer Screening (PLCO) datasets. Six imputation methods were used to handle missing data: mean, Gaussian, Lorentzian, one-hot encoding, Gaussian expectation-maximization, and listwise deletion. Across all combinations of model configuration and imputation method, the artificial neural network with expectation-maximization imputation emerged as the best, having a concordance of 0.70 ± 0.02, sensitivity of 0.63 ± 0.06, and specificity of 0.82 ± 0.04. In stratifying CRC risk in the NHIS and PLCO datasets, only 2% of negative cases were misclassified as high risk and 6% of positive cases were misclassified as low risk. In modeling the CRC-free probability with Kaplan-Meier estimators, the low-, medium-, and high-risk groups show statistically significant separation. Our results indicated that the trained artificial neural network can be used as an effective screening tool for early intervention and prevention of CRC in large populations.
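The description reports three performance measures: concordance, sensitivity, and specificity. As a rough illustration of how such measures can be computed for a binary risk model, here is a minimal Python sketch. It assumes that concordance means the pairwise c-statistic (the fraction of case/non-case pairs in which the case receives the higher score, ties counting one half). The function names, the 0.5 threshold, and the toy data are all hypothetical; none of them come from the article or from the NHIS or PLCO datasets.

```python
from itertools import product


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = CRC, 0 = no CRC)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases flagged
    return tp / (tp + fn), tn / (tn + fp)


def concordance(y_true, scores):
    """Pairwise c-statistic: share of (case, non-case) pairs ranked correctly.

    Tied scores contribute one half. Assumed definition, see the lead-in.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    total = 0.0
    for p, n in product(pos, neg):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos) * len(neg))


# Made-up labels and risk scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]

threshold = 0.5  # hypothetical cut-off between "flagged" and "not flagged"
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
c_stat = concordance(y_true, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} concordance={c_stat:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the concordance is computed from the raw scores and does not.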
first_indexed 2024-12-18T11:10:19Z
format Article
id doaj.art-700acff3a031459db666037f57ad6ac9
institution Directory Open Access Journal
issn 2624-909X
language English
last_indexed 2024-12-18T11:10:19Z
publishDate 2020-03-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Big Data
spelling doaj.art-700acff3a031459db666037f57ad6ac9 2022-12-21T21:10:00Z eng Frontiers Media S.A. Frontiers in Big Data 2624-909X 2020-03-01 3 10.3389/fdata.2020.00006 504048
Bradley J. Nartowt: Department of Therapeutic Radiology, Yale University, New Haven, CT, United States
Gregory R. Hart: Department of Therapeutic Radiology, Yale University, New Haven, CT, United States
Wazir Muhammad: Department of Therapeutic Radiology, Yale University, New Haven, CT, United States
Ying Liang: Department of Radiation Oncology, Medical College of Wisconsin, Milwaukee, WI, United States
Gigi F. Stark: Department of Statistics & Data Science, Yale University, New Haven, CT, United States
Jun Deng: Department of Therapeutic Radiology, Yale University, New Haven, CT, United States
title Robust Machine Learning for Colorectal Cancer Risk Prediction and Stratification
topic colorectal cancer
risk stratification
neural network
concordance
self-reportable health data
external validation
url https://www.frontiersin.org/article/10.3389/fdata.2020.00006/full