Generalized Estimating Equations Boosting (GEEB) machine for correlated data

Bibliographic Details
Main Authors: Yuan-Wey Wang, Hsin-Chou Yang, Yi-Hau Chen, Chao-Yu Guo
Format: Article
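Language: English
Published: SpringerOpen 2024-01-01
Series: Journal of Big Data
Subjects: Correlated data; Hierarchical data; Generalized Estimating Equations; Machine learning; Gradient boosting
Online Access: https://doi.org/10.1186/s40537-023-00875-5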
collection DOAJ
description Abstract: Rapid development in data science has made machine learning and artificial intelligence the most popular research tools across various disciplines. While numerous articles have reported decent predictive ability, little research has examined the impact of complex correlated data. We aim to develop a more accurate model for repeated measures or hierarchical data structures. This study therefore proposes a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, which integrates the gradient boosting technique into the benchmark statistical approach for correlated data, Generalized Estimating Equations (GEE). Unlike previous gradient boosting approaches, which use all input features, we randomly select a subset of input features when building the model to reduce predictive errors. The simulation study evaluates the predictive performance of GEEB, GEE, eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) across several hierarchical structures with different sample sizes. Results suggest that GEEB outperforms GEE and achieves higher predictive accuracy than SVM and XGBoost in most situations. An application to a real-world dataset, the Forest Fire Data, also showed that GEEB reduced mean squared errors by 4.5% to 25% compared to GEE, XGBoost, and SVM. This research also provides a freely available R function that implements the GEEB machine for longitudinal or hierarchical data.
first_indexed 2024-03-07T15:28:44Z
id doaj.art-6f7c6b8fb6e844b0b4e0aa1437e3fdfa
institution Directory of Open Access Journals
issn 2196-1115
last_indexed 2024-03-07T15:28:44Z
affiliation Yuan-Wey Wang: Division of Biostatistics and Data Science, Institute of Public Health, College of Medicine, National Yang Ming Chiao Tung University
Hsin-Chou Yang: Institute of Statistical Science, Academia Sinica
Yi-Hau Chen: Institute of Statistical Science, Academia Sinica
Chao-Yu Guo: Division of Biostatistics and Data Science, Institute of Public Health, College of Medicine, National Yang Ming Chiao Tung University
citation Journal of Big Data, vol. 11, no. 1, pp. 1-19 (2024)
topic Correlated data
Hierarchical data
Generalized Estimating Equations
Machine learning
Gradient boosting