Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach
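Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.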
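
The record's description says only that the second penalty comes from discrepancies in the marginal value of covariates shared by both cohorts; it does not give the penalty's form. The Python sketch below is therefore an illustration under stated assumptions, not the authors' model: one shared covariate `x`, one extra covariate `z` that is recorded only in the small cohort and enters linearly, `z` independent of `x`, and a quadratic second penalty that pulls the small-cohort model's marginal curve in `x` toward a once penalized P-spline fitted on the large cohort. All function names, parameter names, and the simulated data are hypothetical.

```python
import numpy as np


def bspline_basis(x, lo, hi, n_seg=10, degree=3):
    """Equally spaced B-spline basis on [lo, hi] via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    h = (hi - lo) / n_seg
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    # Degree 0: indicator of each half-open knot interval.
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[: -(d + 1)]) / (d * h)
        right = (knots[d + 1:] - x[:, None]) / (d * h)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B  # shape: (len(x), n_seg + degree)


def fit_twice_penalized(x_s, z_s, y_s, x_l, y_l, lam1, lam2, n_seg=10, degree=3):
    """Small-cohort fit y ~ f(x) + beta * z with two quadratic penalties.

    Penalty 1 (lam1): second-order differences of the spline coefficients.
    Penalty 2 (lam2): ASSUMED form, squared gap on a grid between the
    small-cohort marginal curve in x and a large-cohort P-spline fit.
    """
    lo = min(x_s.min(), x_l.min())
    hi = max(x_s.max(), x_l.max())
    k = n_seg + degree
    D = np.diff(np.eye(k), n=2, axis=0)  # second-order difference matrix
    P = D.T @ D

    # Large cohort: once penalized P-spline of y on the shared covariate.
    B_l = bspline_basis(x_l, lo, hi, n_seg, degree)
    a_l = np.linalg.solve(B_l.T @ B_l + lam1 * P, B_l.T @ y_l)

    # Target marginal curve evaluated on a grid of the shared covariate.
    grid = np.linspace(lo, hi, 50)
    B_g = bspline_basis(grid, lo, hi, n_seg, degree)
    target = B_g @ a_l

    # Small cohort: spline in x plus a linear term in centred z.
    z_c = z_s - z_s.mean()
    X = np.column_stack([bspline_basis(x_s, lo, hi, n_seg, degree), z_c])
    # Averaging over centred z contributes zero, so the marginal design
    # has a zero column for beta.
    M = np.column_stack([B_g, np.zeros(len(grid))])
    P1 = np.zeros((k + 1, k + 1))
    P1[:k, :k] = P  # beta itself is left unpenalized by penalty 1

    lhs = X.T @ X + lam1 * P1 + lam2 * (M.T @ M)
    rhs = X.T @ y_s + lam2 * (M.T @ target)
    theta = np.linalg.solve(lhs, rhs)  # first k entries: spline, last: beta
    return theta, grid, target


# Simulated overlapping asymmetric data (purely illustrative).
rng = np.random.default_rng(0)
n_large, n_small = 2000, 40
x_l = rng.uniform(0, 1, n_large)
y_l = (np.sin(2 * np.pi * x_l) + 0.5 * rng.normal(size=n_large)
       + 0.3 * rng.normal(size=n_large))  # z acts here but is unobserved
x_s = rng.uniform(0, 1, n_small)
z_s = rng.normal(size=n_small)
y_s = np.sin(2 * np.pi * x_s) + 0.5 * z_s + 0.3 * rng.normal(size=n_small)

theta, grid, target = fit_twice_penalized(x_s, z_s, y_s, x_l, y_l,
                                          lam1=1.0, lam2=5.0)
```

In this sketch, setting `lam2=0` reduces the fit to an ordinary once penalized P-spline on the small cohort alone, while a very large `lam2` forces the marginal curve onto the large-cohort fit. The values of `lam1` and `lam2` here are arbitrary; the description mentions penalty parameter tuning but not how it is done.
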
Main Authors: | Matthew McTeer, Robin Henderson, Quentin M. Anstee, Paolo Missier |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-03-01 |
Series: | Mathematics |
Subjects: | P-Spline; penalized regression; smoothing; asymmetric data; B-Spline; non-Parametric |
Online Access: | https://www.mdpi.com/2227-7390/12/5/777 |
_version_ | 1797264166055051264 |
---|---|
author | Matthew McTeer; Robin Henderson; Quentin M. Anstee; Paolo Missier |
author_sort | Matthew McTeer |
collection | DOAJ |
description | Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation. |
first_indexed | 2024-04-25T00:24:34Z |
format | Article |
id | doaj.art-936066c9cf9a4e22994fca26259b2eff |
institution | Directory Open Access Journal |
issn | 2227-7390 |
language | English |
last_indexed | 2024-04-25T00:24:34Z |
publishDate | 2024-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Mathematics |
spelling | doaj.art-936066c9cf9a4e22994fca26259b2eff; 2024-03-12T16:50:19Z; eng; MDPI AG; Mathematics; ISSN 2227-7390; 2024-03-01; vol. 12, no. 5, art. 777; DOI 10.3390/math12050777; Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach; Matthew McTeer (School of Computing, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK); Robin Henderson (School of Mathematics, Statistics and Physics, Faculty of Science, Agriculture & Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK); Quentin M. Anstee (Translational & Clinical Research Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK); Paolo Missier (School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK); Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.; https://www.mdpi.com/2227-7390/12/5/777; P-Spline; penalized regression; smoothing; asymmetric data; B-Spline; non-Parametric |
title | Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach |
topic | P-Spline; penalized regression; smoothing; asymmetric data; B-Spline; non-Parametric |
url | https://www.mdpi.com/2227-7390/12/5/777 |