Improving the performance of machine learning algorithms for health outcomes predictions in multicentric cohorts

Bibliographic Details
Main Authors:	Roberta Moreira Wichmann, Fernando Timoteo Fernandes, Alexandre Dias Porto Chiavegatto Filho, IACOV-BR Network
Format:	Article
Language:	English
Published:	Nature Portfolio 2023-01-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-022-26467-6

_version_	1797945950107336704
author	Roberta Moreira Wichmann, Fernando Timoteo Fernandes, Alexandre Dias Porto Chiavegatto Filho, IACOV-BR Network
author_sort	Roberta Moreira Wichmann
collection	DOAJ
description	Machine learning algorithms are being increasingly used in healthcare settings, but their generalizability between different regions is still unknown. This study aims to identify the strategy that maximizes predictive performance for identifying the risk of death by COVID-19 in different regions of a large and unequal country. This is a multicenter cohort study with data collected from patients with a positive RT-PCR test for COVID-19 from March to August 2020 (n = 8477) in 18 hospitals, covering all five Brazilian regions. Of all patients with a positive RT-PCR test during the period, 2356 (28%) died. Eight different strategies were used for training and evaluating the performance of three popular machine learning algorithms (extreme gradient boosting, LightGBM, and CatBoost). The strategies ranged from using training data from a single hospital only to aggregating patients by their geographic regions. The predictive performance of the algorithms was evaluated by the area under the ROC curve (AUROC) on the test set of each hospital. We found that the best overall predictive performances were obtained when using training data from the same hospital, which was the winning strategy for 11 (61%) of the 18 participating hospitals. In this study, the use of more patient data from other regions slightly decreased predictive performance. However, models trained in other hospitals still had acceptable performances and could be a solution while data for a specific hospital is being collected.
first_indexed	2024-04-10T21:03:09Z
format	Article
id	doaj.art-5efa086e908b432199861ac96016e18d
institution	Directory of Open Access Journals
issn	2045-2322
language	English
last_indexed	2024-04-10T21:03:09Z
publishDate	2023-01-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj.art-5efa086e908b432199861ac96016e18d; 2023-01-22T12:10:49Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2023-01-01; 13118; 10.1038/s41598-022-26467-6; Roberta Moreira Wichmann (0), Fernando Timoteo Fernandes (1), Alexandre Dias Porto Chiavegatto Filho (2), IACOV-BR Network (3); affiliations 0 to 3: School of Public Health, University of São Paulo; https://doi.org/10.1038/s41598-022-26467-6
title	Improving the performance of machine learning algorithms for health outcomes predictions in multicentric cohorts
url	https://doi.org/10.1038/s41598-022-26467-6
work_keys_str_mv	AT robertamoreirawichmann improvingtheperformanceofmachinelearningalgorithmsforhealthoutcomespredictionsinmulticentriccohorts AT fernandotimoteofernandes improvingtheperformanceofmachinelearningalgorithmsforhealthoutcomespredictionsinmulticentriccohorts AT alexandrediasportochiavegattofilho improvingtheperformanceofmachinelearningalgorithmsforhealthoutcomespredictionsinmulticentriccohorts AT iacovbrnetwork improvingtheperformanceofmachinelearningalgorithmsforhealthoutcomespredictionsinmulticentriccohorts
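
The evaluation described in the abstract (each training strategy scored by AUROC on a hospital's own test set, with the highest-scoring strategy counted as the winner for that hospital) can be illustrated with a minimal sketch. This is not the authors' code: the function names, strategy labels, and toy numbers are hypothetical, and the predicted risks are assumed to come from models that have already been trained.

```python
from collections import Counter


def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 0/1 outcomes (1 = death); scores: predicted risks.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def winning_strategy(test_labels, scores_by_strategy):
    """Return (strategy, AUROC) for the best strategy on one hospital's test set."""
    results = {name: auroc(test_labels, s) for name, s in scores_by_strategy.items()}
    best = max(results, key=results.get)
    return best, results[best]


# Toy, made-up test sets for two hypothetical hospitals (not study data).
hospital_tests = {
    "hospital_A": {
        "labels": [1, 0, 1, 0, 0],
        "scores": {
            "same_hospital": [0.9, 0.2, 0.7, 0.4, 0.1],
            "all_regions": [0.6, 0.5, 0.4, 0.3, 0.2],
        },
    },
    "hospital_B": {
        "labels": [0, 1, 0, 1],
        "scores": {
            "same_hospital": [0.3, 0.4, 0.5, 0.8],
            "all_regions": [0.1, 0.6, 0.2, 0.9],
        },
    },
}

# Count how many hospitals each strategy wins.
wins = Counter(
    winning_strategy(h["labels"], h["scores"])[0] for h in hospital_tests.values()
)

for name, h in hospital_tests.items():
    print(name, winning_strategy(h["labels"], h["scores"]))
print(dict(wins))
```

The pairwise AUROC above is quadratic in the number of patients, which is fine for illustration; a library routine would be the usual choice for cohorts of the size reported in the abstract.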
