The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data

Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. Therefore, w...


Bibliographic Details
Main Authors:	Benedikt Langenberger, Timo Schulte, Oliver Groene
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2023-01-01
Series:	PLoS ONE
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9847900/?tool=EBI

_version_	1828054979718414336
author	Benedikt Langenberger Timo Schulte Oliver Groene
author_facet	Benedikt Langenberger Timo Schulte Oliver Groene
author_sort	Benedikt Langenberger
collection	DOAJ
description	Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. To this end, we used routinely collected sickness fund claims and cost data from the years 2016, 2017 and 2018. Various specifications of each algorithm were trained and cross-validated on training data (n = 20,984) with claims and cost data from 2016 and outcomes from 2017. The best-performing specification of each algorithm was selected based on validation dataset performance. For performance comparison, the selected models were applied to unseen data with features from 2017 and outcomes from 2018 (n = 21,146). The RF was the best-performing algorithm as measured by the area under the receiver operating characteristic curve (AUC), with a value of 0.883 (95% confidence interval (CI): 0.872–0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867–0.889). The ANN (AUC = 0.846; 95% CI: 0.834–0.857) and LR (AUC = 0.839; 95% CI: 0.826–0.852) were significantly outperformed by the GBM and the RF. All ML algorithms and the LR performed ‘good’ (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning models that predict high-cost patients with ‘good’ performance using routinely collected sickness fund claims and cost data. We found that tree-based models performed best and outperformed the ANN and LR.
first_indexed	2024-04-10T20:31:20Z
format	Article
id	doaj.art-026e3fb06272447599da43d0ed7ebcd8
institution	Directory Open Access Journal
issn	1932-6203
language	English
last_indexed	2024-04-10T20:31:20Z
publishDate	2023-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
title	The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9847900/?tool=EBI
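
The description reports each model's AUC and classes a value as 'good' when 0.9 > AUC ≥ 0.8. The sketch below is only an illustration of those two ideas, not the authors' code: it computes AUC with the standard pairwise (Mann-Whitney) identity on a tiny made-up example, then applies the 'good' band to the four AUC values quoted in the description. The function names `auc` and `is_good`, and the toy labels and scores, are assumptions introduced here.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def is_good(auc_value):
    """'Good' band as stated in the description: 0.9 > AUC >= 0.8."""
    return 0.8 <= auc_value < 0.9


# Toy data (hypothetical): 1 = high-cost patient in the following year.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)

# Test-set AUC values as reported in the description.
reported_auc = {"RF": 0.883, "GBM": 0.878, "ANN": 0.846, "LR": 0.839}
good_models = [name for name, value in reported_auc.items() if is_good(value)]
best_model = max(reported_auc, key=reported_auc.get)

print(toy_auc, good_models, best_model)
```

The pairwise loop is quadratic in the number of patients, so for data of the size described (roughly 21,000 records) a rank-based or library implementation would be the practical choice; the loop is used here only because it makes the definition visible.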


Similar Items