Machine learning methods for propensity and disease risk score estimation in high-dimensional data: a plasmode simulation and real-world data cohort analysis

Introduction: Machine learning (ML) methods are promising and scalable alternatives for propensity score (PS) estimation, but their comparative performance in disease risk score (DRS) estimation remains unexplored. Methods: We used real-world data comparing antihypertensive users to non-users with 6...

Bibliographic Details
Main Authors:	Guo, Y, Strauss, VY, Català, M, Jödicke, AM, Khalid, S, Prieto-Alhambra, D
Format:	Journal article
Language:	English
Published:	Frontiers Media 2024

Institution:	University of Oxford
Identifier:	oxford-uuid:c1311bbd-bd65-4249-9602-e2061d33e1de
Description:	Introduction: Machine learning (ML) methods are promising and scalable alternatives for propensity score (PS) estimation, but their comparative performance in disease risk score (DRS) estimation remains unexplored. Methods: We used real-world data comparing antihypertensive users to non-users with 69 negative control outcomes, and plasmode simulations to study the performance of ML methods in PS and DRS estimation. We conducted a cohort study using UK primary care records. Further, we conducted a plasmode simulation with synthetic treatment and outcome mimicking empirical data distributions. We compared four PS and DRS estimation methods: 1. Reference: Logistic regression including clinically chosen confounders. 2. Logistic regression with L1 regularisation (LASSO). 3. Multi-layer perceptron (MLP). 4. Extreme Gradient Boosting (XGBoost). Covariate balance, coverage of the null effect of negative control outcomes (real-world data) and bias based on the absolute difference between observed and true effects (for plasmode) were estimated. 632,201 antihypertensive users and non-users were included. Results: ML methods outperformed the reference method for PS estimation in some scenarios, both in terms of covariate balance and coverage/bias. Specifically, XGBoost achieved the best performance. DRS-based methods performed worse than PS in all tested scenarios. Discussion: We found that ML methods could be reliable alternatives for PS estimation. ML-based DRS methods performed worse than PS ones, likely given the rarity of outcomes.
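
The description names two evaluation metrics: covariate balance, and bias as the absolute difference between observed and true effects. The sketch below illustrates how such metrics are commonly computed; it is not the authors' code. The function names, the toy data, the use of inverse probability of treatment weighting, and the pooled-standard-deviation form of the standardized mean difference are all assumptions made for illustration, and the record does not say which balance metric or weighting scheme the study used.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    """Weighted population variance of values."""
    mean = weighted_mean(values, weights)
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)


def standardized_mean_difference(x, treated, weights=None):
    """Absolute standardized mean difference of covariate x between arms.

    Uses the pooled standard deviation sqrt((var_treated + var_untreated) / 2).
    """
    if weights is None:
        weights = [1.0] * len(x)
    stats = {}
    for arm in (1, 0):
        vals = [v for v, t in zip(x, treated) if t == arm]
        wts = [w for w, t in zip(weights, treated) if t == arm]
        stats[arm] = (weighted_mean(vals, wts), weighted_variance(vals, wts))
    pooled_sd = math.sqrt((stats[1][1] + stats[0][1]) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(stats[1][0] - stats[0][0]) / pooled_sd


def iptw_weights(treated, ps):
    """Inverse probability of treatment weights from propensity scores."""
    return [1 / p if t == 1 else 1 / (1 - p) for t, p in zip(treated, ps)]


def absolute_bias(estimated_effect, true_effect):
    """Absolute difference between an estimated and a known true effect."""
    return abs(estimated_effect - true_effect)


# Hypothetical toy cohort: one binary covariate, imbalanced across arms.
x = [1, 1, 1, 1, 0, 0, 0, 0]
treated = [1, 1, 1, 0, 1, 0, 0, 0]
# Assumed propensity scores: the observed treated fraction in each stratum of x.
ps = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

smd_before = standardized_mean_difference(x, treated)
smd_after = standardized_mean_difference(x, treated, iptw_weights(treated, ps))
print(f"SMD before weighting: {smd_before:.3f}")
print(f"SMD after weighting:  {smd_after:.3f}")
```

A standardized mean difference below 0.1 is a commonly used rule of thumb for adequate balance. In a plasmode simulation the true effect is set by the analyst, so `absolute_bias` can be evaluated directly against it.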

Similar Items