Benchmarking AutoML frameworks for disease prediction using medical claims

Abstract. Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags...

Bibliographic Details
Main Authors:	Roland Albert A. Romero, Mariefel Nicole Y. Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E. Sheils, Elisabetta Manduchi, Jason H. Moore
Format:	Article
Language:	English
Published:	BMC 2022-07-01
Series:	BioData Mining
Subjects:	Automated machine learning, AutoML, Machine learning, Healthcare, Medical claims, Class imbalance
Online Access:	https://doi.org/10.1186/s13040-022-00300-2

author	Roland Albert A. Romero, Mariefel Nicole Y. Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E. Sheils, Elisabetta Manduchi, Jason H. Moore
author_sort	Roland Albert A. Romero
collection	DOAJ
description	Abstract. Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results: The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion: Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.
first_indexed	2024-12-10T17:39:24Z
format	Article
id	doaj.art-60b62052fe7d4169b8640d51fd19c755
institution	Directory of Open Access Journals
issn	1756-0381
language	English
last_indexed	2024-12-10T17:39:24Z
publishDate	2022-07-01
publisher	BMC
record_format	Article
series	BioData Mining
spelling	doaj.art-60b62052fe7d4169b8640d51fd19c755; 2022-12-22T01:39:26Z; eng; BMC; BioData Mining; 1756-0381; 2022-07-01; vol. 15, no. 1, pp. 1-13; 10.1186/s13040-022-00300-2; Benchmarking AutoML frameworks for disease prediction using medical claims; Roland Albert A. Romero (OptumLabs), Mariefel Nicole Y. Deypalan (OptumLabs), Suchit Mehrotra (OptumLabs), John Titus Jungao (OptumLabs), Natalie E. Sheils (OptumLabs), Elisabetta Manduchi (Department of Computational Biomedicine, Cedars-Sinai Medical Center), Jason H. Moore (Department of Computational Biomedicine, Cedars-Sinai Medical Center); https://doi.org/10.1186/s13040-022-00300-2; Automated machine learning, AutoML, Machine learning, Healthcare, Medical claims, Class imbalance
title	Benchmarking AutoML frameworks for disease prediction using medical claims
topic	Automated machine learning, AutoML, Machine learning, Healthcare, Medical claims, Class imbalance
url	https://doi.org/10.1186/s13040-022-00300-2

Similar Items