A Predictive Model for Pancreatic Cancer Diagnosis
Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2022
|
Online Access: | https://hdl.handle.net/1721.1/146664 |
_version_ | 1826208248517099520 |
---|---|
author | Xiong, Thomas |
author2 | Rinard, Martin |
author_facet | Rinard, Martin Xiong, Thomas |
author_sort | Xiong, Thomas |
collection | MIT |
description | Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive models from a variety of model classes can use different indicators from electronic health record (EHR) data in order to predict PDAC diagnosis. We find that logistic regression, random forest, and XGBoost models perform the best when using patients’ unique diagnoses, lab test frequencies, medication frequencies, and race and ethnicity as data, with our best logistic regression model achieving an AUROC of 0.801 on a held-out test set. To better approximate these models’ use case in practice, we construct a time-dependent regime for model evaluation. Overall, we found that model performance decreased in the time-dependent regime as compared to the time-independent regime, suggesting the possibility of concept drift in our dataset. Moreover, through ℓ₀ regularization, we found that lab test frequencies tended to be the most important features in the best logistic regression model. The intended use for our deployed model is to serve as a prescreening tool to deliver an enriched population for further targeted PDAC screening. Our best model for this purpose delivers a sensitivity of 0.46 at a specificity of 0.9. According to our medical collaborators, this combination of sensitivity and specificity qualifies this model as suitable for our intended prescreening use. In this context, the ability of our model to work only with information derived from electronic health records, collected as part of routine medical care, is a significant advantage. We describe the steps taken to begin to model deployment into an existing federated EHR database. In this scenario, we envision that our model would be integrated into hospital EHR systems and routinely and automatically run over broad patient populations as EHR data is collected over time to produce a history of patient risk scores as patient data becomes available. Patient selection for further targeted PDAC screening can then consider both absolute scores and their evolution. |
first_indexed | 2024-09-23T14:02:52Z |
format | Thesis |
id | mit-1721.1/146664 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T14:02:52Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1466642022-12-01T03:17:46Z A Predictive Model for Pancreatic Cancer Diagnosis Xiong, Thomas Rinard, Martin Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Pancreatic ductal adenocarcinoma (PDAC), a specific type of pancreatic cancer, has a five-year survival rate of 8.5% and is the third-deadliest cancer in the United States. However, earlier detection can raise survival rates dramatically. In this thesis, we investigate the hypothesis that predictive models from a variety of model classes can use different indicators from electronic health record (EHR) data in order to predict PDAC diagnosis. We find that logistic regression, random forest, and XGBoost models perform the best when using patients’ unique diagnoses, lab test frequencies, medication frequencies, and race and ethnicity as data, with our best logistic regression model achieving an AUROC of 0.801 on a held-out test set. To better approximate these models’ use case in practice, we construct a time-dependent regime for model evaluation. Overall, we found that model performance decreased in the time-dependent regime as compared to the time-independent regime, suggesting the possibility of concept drift in our dataset. Moreover, through ℓ₀ regularization, we found that lab test frequencies tended to be the most important features in the best logistic regression model. The intended use for our deployed model is to serve as a prescreening tool to deliver an enriched population for further targeted PDAC screening. Our best model for this purpose delivers a sensitivity of 0.46 at a specificity of 0.9. According to our medical collaborators, this combination of sensitivity and specificity qualifies this model as suitable for our intended prescreening use. In this context, the ability of our model to work only with information derived from electronic health records, collected as part of routine medical care, is a significant advantage. We describe the steps taken to begin to model deployment into an existing federated EHR database. In this scenario, we envision that our model would be integrated into hospital EHR systems and routinely and automatically run over broad patient populations as EHR data is collected over time to produce a history of patient risk scores as patient data becomes available. Patient selection for further targeted PDAC screening can then consider both absolute scores and their evolution. M.Eng. 2022-11-30T19:39:52Z 2022-11-30T19:39:52Z 2021-06 2021-06-17T20:14:59.711Z Thesis https://hdl.handle.net/1721.1/146664 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Xiong, Thomas A Predictive Model for Pancreatic Cancer Diagnosis |
title | A Predictive Model for Pancreatic Cancer Diagnosis |
title_full | A Predictive Model for Pancreatic Cancer Diagnosis |
title_fullStr | A Predictive Model for Pancreatic Cancer Diagnosis |
title_full_unstemmed | A Predictive Model for Pancreatic Cancer Diagnosis |
title_short | A Predictive Model for Pancreatic Cancer Diagnosis |
title_sort | predictive model for pancreatic cancer diagnosis |
url | https://hdl.handle.net/1721.1/146664 |
work_keys_str_mv | AT xiongthomas apredictivemodelforpancreaticcancerdiagnosis AT xiongthomas predictivemodelforpancreaticcancerdiagnosis |