On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures

In recent years, web surveys have established themselves as one of the main methods in empirical research. However, the effect of coverage and selection bias in such surveys has undercut their utility for statistical inference in finite populations. To compensate for these biases, researchers have...

Bibliographic Details
Main Authors: Luis Castro-Martín, María del Mar Rueda, Ramón Ferri-García, César Hernando-Tamayo
Format: Article
Language: English
Published: MDPI AG 2021-11-01
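Series: Mathematics
Subjects: nonprobability surveys; machine learning techniques; propensity score adjustment; survey sampling
Online Access: https://www.mdpi.com/2227-7390/9/23/2991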
_version_ 1797507421176856576
author Luis Castro-Martín
María del Mar Rueda
Ramón Ferri-García
César Hernando-Tamayo
collection DOAJ
description In recent years, web surveys have established themselves as one of the main methods in empirical research. However, the effect of coverage and selection bias in such surveys has undercut their utility for statistical inference in finite populations. To compensate for these biases, researchers have employed a variety of statistical techniques to adjust nonprobability samples so that they more closely match the population. In this study, we test the potential of the XGBoost algorithm in the most important estimation methods that integrate data from a probability survey and a nonprobability survey. We also compare the effectiveness of these methods in removing bias. The results show that the four proposed estimators based on gradient boosting frameworks can improve survey representativity relative to other classic prediction methods. The proposed methodology is also used to analyze a real nonprobability survey sample on the social effects of COVID-19.
first_indexed 2024-03-10T04:49:15Z
format Article
id doaj.art-47160649c3f64efd8d20d88a9f777391
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-10T04:49:15Z
publishDate 2021-11-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-47160649c3f64efd8d20d88a9f777391
2023-11-23T02:44:22Z
eng
MDPI AG
Mathematics 9(23), 2991
2227-7390
2021-11-01
10.3390/math9232991
On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures
Luis Castro-Martín, María del Mar Rueda, Ramón Ferri-García, César Hernando-Tamayo
Department of Statistics and Operational Research, University of Granada, 18011 Granada, Spain (all four authors)
https://www.mdpi.com/2227-7390/9/23/2991
nonprobability surveys
machine learning techniques
propensity score adjustment
survey sampling
title On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures
topic nonprobability surveys
machine learning techniques
propensity score adjustment
survey sampling
url https://www.mdpi.com/2227-7390/9/23/2991