The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data

Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. Therefore, w...


Bibliographic Details
Main Authors: Benedikt Langenberger, Timo Schulte, Oliver Groene
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9847900/?tool=EBI
collection DOAJ
description Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. To this end, we used routinely collected sickness fund claims and cost data from the years 2016, 2017 and 2018. Various specifications of each algorithm were trained and cross-validated on training data (n = 20,984) with claims and cost data from 2016 and outcomes from 2017. The best-performing specification of each algorithm was selected based on validation dataset performance. For performance comparison, the selected models were applied to unseen data with features from 2017 and outcomes from 2018 (n = 21,146). The RF was the best-performing algorithm as measured by the area under the receiver operating characteristic curve (AUC), with a value of 0.883 (95% confidence interval (CI): 0.872–0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867–0.889). The ANN (AUC = 0.846; 95% CI: 0.834–0.857) and the LR (AUC = 0.839; 95% CI: 0.826–0.852) were significantly outperformed by the GBM and the RF. All ML algorithms and the LR performed ‘good’ (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning models that predict high-cost patients with ‘good’ performance using routinely collected sickness fund claims and cost data. Tree-based models performed best, outperforming the ANN and the LR.
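The comparison metric named in the description can be illustrated with a minimal sketch. It computes an AUC from labels and predicted scores, attaches a percentile-bootstrap 95% interval, and checks the 'good' band (0.9 > AUC ≥ 0.8) quoted above. The record does not say how the authors derived their confidence intervals, so the bootstrap is an assumption, and the function names (auc, bootstrap_ci, is_good) and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the AUC (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains only one class, so the AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_good(auc_value):
    """The 'good' band quoted in the description: 0.9 > AUC >= 0.8."""
    return 0.8 <= auc_value < 0.9


# Toy data (hypothetical): 1 = high-cost in the following year, 0 = not.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
toy_auc = auc(toy_labels, toy_scores)
toy_ci = bootstrap_ci(toy_labels, toy_scores, n_boot=200)
```

The pairwise AUC above is quadratic in the sample size; for test sets of the size reported here (n = 21,146), a rank-based implementation such as the one in a standard statistics library would be the practical choice.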
first_indexed 2024-04-10T20:31:20Z
id doaj.art-026e3fb06272447599da43d0ed7ebcd8
institution Directory Open Access Journal
issn 1932-6203