A Predictive Model for Pancreatic Cancer Diagnosis

Bibliographic Details
Main Author: Xiong, Thomas
Other Authors: Rinard, Martin
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/146664
collection MIT
description Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive models from a variety of model classes can use different indicators from electronic health record (EHR) data to predict PDAC diagnosis. We find that logistic regression, random forest, and XGBoost models perform best when using patients’ unique diagnoses, lab test frequencies, medication frequencies, and race and ethnicity as data, with our best logistic regression model achieving an AUROC of 0.801 on a held-out test set. To better approximate these models’ use case in practice, we construct a time-dependent regime for model evaluation. Overall, we find that model performance decreases in the time-dependent regime compared with the time-independent regime, suggesting the possibility of concept drift in our dataset. Moreover, through ℓ₀ regularization, we find that lab test frequencies tend to be the most important features in the best logistic regression model. The intended use for our deployed model is to serve as a prescreening tool that delivers an enriched population for further targeted PDAC screening. Our best model for this purpose delivers a sensitivity of 0.46 at a specificity of 0.9. According to our medical collaborators, this combination of sensitivity and specificity qualifies the model as suitable for the intended prescreening use. In this context, the ability of our model to work only with information derived from electronic health records, collected as part of routine medical care, is a significant advantage. We describe the steps taken to begin model deployment into an existing federated EHR database. In this scenario, we envision that our model would be integrated into hospital EHR systems and run routinely and automatically over broad patient populations, producing a history of risk scores for each patient as new EHR data is collected. Patient selection for further targeted PDAC screening can then consider both absolute scores and their evolution.
first_indexed 2024-09-23T14:02:52Z
format Thesis
id mit-1721.1/146664
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T14:02:52Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/146664 2022-12-01T03:17:46Z A Predictive Model for Pancreatic Cancer Diagnosis Xiong, Thomas Rinard, Martin Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2022-11-30T19:39:52Z 2022-11-30T19:39:52Z 2021-06 2021-06-17T20:14:59.711Z Thesis https://hdl.handle.net/1721.1/146664 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title A Predictive Model for Pancreatic Cancer Diagnosis
url https://hdl.handle.net/1721.1/146664
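
The description reports the prescreening operating point as a sensitivity at a fixed specificity (0.46 at 0.9). As a rough illustration of how such an operating point can be read off a set of risk scores, here is a minimal Python sketch. It is not the thesis's code: the function name `sensitivity_at_specificity`, the rule "flag a patient when score >= threshold", and the toy labels and scores are all assumptions made up for this example.

```python
def sensitivity_at_specificity(labels, scores, target_specificity):
    """Find the lowest score threshold whose specificity meets the target.

    A patient is flagged as high risk when score >= threshold (an assumed
    convention). Returns (threshold, sensitivity, specificity), or None if
    no threshold taken from the observed scores reaches the target.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    best = None
    # Lowering the threshold can only lower specificity, so scan from the
    # highest observed score downward and stop once the target is missed.
    for threshold in sorted(set(scores), reverse=True):
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity < target_specificity:
            break
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        best = (threshold, sensitivity, specificity)
    return best


# Toy data, invented for illustration only: 10 controls and 5 cases.
toy_labels = [0] * 10 + [1] * 5
toy_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80,
              0.30, 0.50, 0.60, 0.70, 0.90]

result = sensitivity_at_specificity(toy_labels, toy_scores, 0.9)
print(result)
```

Fixing specificity first reflects the prescreening use described above: the threshold caps the share of unaffected patients sent on for further screening, and sensitivity is whatever the model achieves under that cap.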