A Calibrated Ensemble Algorithm to Address Data Heterogeneity in Machine Learning: An Application to Identify Severe SLE Flares in Lupus Patients

Motivated to address the inconsistency between the essential i.i.d. assumption in machine learning theory and the data heterogeneity in real-world applications, we propose a novel calibrated ensemble (CE) algorithm to facilitate learning with diverse data subgroups. Unlike the traditional ensemble f...

Bibliographic Details
Main Authors: Yijun Zhao, Man Qin, April Jorge
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Data heterogeneity; ensemble learning; machine learning; lupus; SLE
Online Access: https://ieeexplore.ieee.org/document/9706189/
author Yijun Zhao
Man Qin
April Jorge
collection DOAJ
description Motivated to address the inconsistency between the essential i.i.d. assumption in machine learning theory and the data heterogeneity in real-world applications, we propose a novel calibrated ensemble (CE) algorithm to facilitate learning with diverse data subgroups. Unlike the traditional ensemble framework in which each learner is trained independently using the entire dataset, our method exploits the strengths of various machine learning models by training them simultaneously and forming model-ergonomic data subgroups as part of the training process. Consequently, each learner is calibrated to a unique subset of data based on their individualized predictive strength. Clinically, we can interpret each model as an expert specializing in treating patients with particular disease manifestations. We evaluate the CE model in our motivating domain of identifying lupus patients with severe SLE flares using 1541 clinical encounters in the Mass General Brigham (MGB) Lupus Cohort. Our experimental results demonstrate the efficacy of our CE model across seven performance evaluation metrics compared to five individual machine learning models and regular ensemble approaches. We further utilize ANOVA and Tukey HSD post-hoc statistical analysis to discover characteristic features of individual model clusters for clinical interpretations.
first_indexed 2024-12-23T23:55:32Z
format Article
id doaj.art-00d2dc37b6f746cf999e753e8846637d
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-23T23:55:32Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-00d2dc37b6f746cf999e753e8846637d
2022-12-21T17:25:15Z
eng
IEEE
IEEE Access, ISSN 2169-3536, 2022-01-01, vol. 10, pp. 18720-18729
DOI 10.1109/ACCESS.2022.3149477, IEEE document 9706189
A Calibrated Ensemble Algorithm to Address Data Heterogeneity in Machine Learning: An Application to Identify Severe SLE Flares in Lupus Patients
Yijun Zhao (https://orcid.org/0000-0003-2424-5988), Department of Computer and Information Sciences, Fordham University, New York, NY, USA
Man Qin (https://orcid.org/0000-0003-4079-1487), Department of Computer and Information Sciences, Fordham University, New York, NY, USA
April Jorge (https://orcid.org/0000-0001-6935-880X), Division of Rheumatology, Allergy, and Immunology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
title A Calibrated Ensemble Algorithm to Address Data Heterogeneity in Machine Learning: An Application to Identify Severe SLE Flares in Lupus Patients
topic Data heterogeneity
ensemble learning
machine learning
lupus
SLE
url https://ieeexplore.ieee.org/document/9706189/