Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach



Bibliographic Details
Main Authors: Matthew McTeer, Robin Henderson, Quentin M. Anstee, Paolo Missier
Format: Article
Language: English
Published: MDPI AG 2024-03-01
Series: Mathematics
Subjects: P-Spline; penalized regression; smoothing; asymmetric data; B-Spline; non-Parametric
Online Access: https://www.mdpi.com/2227-7390/12/5/777
author Matthew McTeer
Robin Henderson
Quentin M. Anstee
Paolo Missier
collection DOAJ
description Aims: Overlapping asymmetric data sets arise when a large cohort of observations has a small amount of information recorded and, within this group, a smaller cohort has extensive further information available. Missing-data imputation is unwise if the cohort sizes differ substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Starting from traditional once penalized P-Spline approximations, we create a second penalty term by observing discrepancies in the marginal value of covariates that exist in both cohorts. Our twice penalized P-Spline is designed firstly to prevent over- or under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit for a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual's risk of developing steatohepatitis, we report an improvement of over 65% over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.
first_indexed 2024-04-25T00:24:34Z
format Article
id doaj.art-936066c9cf9a4e22994fca26259b2eff
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-04-25T00:24:34Z
publishDate 2024-03-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-936066c9cf9a4e22994fca26259b2eff
2024-03-12T16:50:19Z
eng
MDPI AG
Mathematics
2227-7390
2024-03-01
vol. 12, no. 5, art. 777
10.3390/math12050777
Matthew McTeer (0)
Robin Henderson (1)
Quentin M. Anstee (2)
Paolo Missier (3)
(0) School of Computing, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
(1) School of Mathematics, Statistics and Physics, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
(2) Translational & Clinical Research Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
(3) School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
title Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach
topic P-Spline
penalized regression
smoothing
asymmetric data
B-Spline
non-Parametric
url https://www.mdpi.com/2227-7390/12/5/777
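Illustrative sketch (not part of the catalogued record): the description above outlines a P-Spline with two penalties, one controlling smoothness on the smaller cohort and one pulling the fit toward information from the larger cohort. The Python below is a minimal sketch of that general idea under stated assumptions and is not the authors' formulation. The first penalty is the standard second-order difference penalty on B-spline coefficients. The second penalty is assumed here to be a squared discrepancy between the fitted curve and a reference curve `m_ref` estimated from the larger cohort on a shared covariate. All function names, the reference curve, and the weights `lam` and `mu` are hypothetical.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    # Degree-0 functions are indicators of the half-open knot intervals.
    vals = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
            for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i]) * vals[i]
            right = ((knots[i + d + 1] - x)
                     / (knots[i + d + 1] - knots[i + 1]) * vals[i + 1])
            nxt.append(left + right)
        vals = nxt
    return vals


def gram(A, w=1.0):
    """Return w * A^T A as a list of lists."""
    n = len(A[0])
    return [[w * sum(r[i] * r[j] for r in A) for j in range(n)]
            for i in range(n)]


def atv(A, v, w=1.0):
    """Return w * A^T v."""
    return [w * sum(r[i] * vi for r, vi in zip(A, v))
            for i in range(len(A[0]))]


def second_diff(n):
    """Second-order difference matrix D with rows (1, -2, 1)."""
    D = []
    for i in range(n - 2):
        row = [0.0] * n
        row[i], row[i + 1], row[i + 2] = 1.0, -2.0, 1.0
        D.append(row)
    return D


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_twice_penalized(B, y, G, m, lam, mu):
    """Minimise ||y - B a||^2 + lam * ||D a||^2 + mu * ||m - G a||^2 over a.

    ASSUMPTION: the mu term is a stand-in for a second penalty; G is the
    basis evaluated where the reference values m are available.
    """
    n = len(B[0])
    BtB, DtD, GtG = gram(B), gram(second_diff(n), lam), gram(G, mu)
    lhs = [[BtB[i][j] + DtD[i][j] + GtG[i][j] for j in range(n)]
           for i in range(n)]
    rhs = [u + v for u, v in zip(atv(B, y), atv(G, m, mu))]
    return solve(lhs, rhs)


# Toy data (hypothetical): the small cohort follows 2x + 1, the reference
# curve from the larger cohort follows 2x + 3 at the same covariate values.
degree, nseg = 3, 10
knots = [(k - degree) / nseg for k in range(nseg + 2 * degree + 1)]
xs = [(i + 0.5) / 40 for i in range(40)]
y_small = [2 * x + 1 for x in xs]
m_ref = [2 * x + 3 for x in xs]

B = [bspline_basis(x, knots, degree) for x in xs]
coef = fit_twice_penalized(B, y_small, B, m_ref, lam=1.0, mu=1.0)
fitted = [sum(b * a for b, a in zip(row, coef)) for row in B]
coef_once = fit_twice_penalized(B, y_small, B, m_ref, lam=1.0, mu=0.0)
fitted_once = [sum(b * a for b, a in zip(row, coef_once)) for row in B]
```

In this toy setting both curves are straight lines, which the second-order difference penalty leaves unpenalised, so `mu` acts as a simple weight between the small-cohort data and the reference curve: with `mu=0.0` the fit reduces to an ordinary once penalized P-Spline, and larger values move the fit toward `m_ref`. How the second penalty is actually constructed and tuned is described in the article itself.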