Benchmarking AutoML frameworks for disease prediction using medical claims

Abstract. Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags...


Bibliographic Details
Main Authors: Roland Albert A. Romero, Mariefel Nicole Y. Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E. Sheils, Elisabetta Manduchi, Jason H. Moore
Format: Article
Language: English
Published: BMC 2022-07-01
Series: BioData Mining
Subjects: Automated machine learning; AutoML; Machine learning; Healthcare; Medical claims; Class imbalance
Online Access: https://doi.org/10.1186/s13040-022-00300-2
author Roland Albert A. Romero
Mariefel Nicole Y. Deypalan
Suchit Mehrotra
John Titus Jungao
Natalie E. Sheils
Elisabetta Manduchi
Jason H. Moore
collection DOAJ
description Abstract. Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results: The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion: Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.
first_indexed 2024-12-10T17:39:24Z
format Article
id doaj.art-60b62052fe7d4169b8640d51fd19c755
institution Directory Open Access Journal
issn 1756-0381
language English
last_indexed 2024-12-10T17:39:24Z
publishDate 2022-07-01
publisher BMC
record_format Article
series BioData Mining
spelling BioData Mining (BMC), ISSN 1756-0381, vol. 15, no. 1, pp. 1-13, 2022-07-01, https://doi.org/10.1186/s13040-022-00300-2. Author affiliations: Roland Albert A. Romero, Mariefel Nicole Y. Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E. Sheils (OptumLabs); Elisabetta Manduchi, Jason H. Moore (Department of Computational Biomedicine, Cedars-Sinai Medical Center).
title Benchmarking AutoML frameworks for disease prediction using medical claims
topic Automated machine learning
AutoML
Machine learning
Healthcare
Medical claims
Class imbalance
url https://doi.org/10.1186/s13040-022-00300-2