Bias and efficiency loss in regression estimates due to duplicated observations: a Monte Carlo simulation

Recent studies documented that survey data contain duplicate records. We assess how duplicate records affect regression estimates, and we evaluate the effectiveness of solutions to deal with duplicate records. Results show that the chances of obtaining unbiased estimates when data contain 40 doubl...

Bibliographic Details
Main Authors: Francesco Sarracino, Malgorzata Mikucka
Format: Article
Language: English
Published: European Survey Research Association 2017-04-01
Series: Survey Research Methods
Online Access: https://ojs.ub.uni-konstanz.de/srm/article/view/7149
ISSN: 1864-3361
Volume/Issue: 11(1)
DOI: 10.18148/srm/2017.v11i1.7149
Subjects: duplicated observations; estimation bias; Monte Carlo simulation; inference
Affiliations: Francesco Sarracino (National Institute of Statistics of Luxembourg (STATEC) and National Research University Higher School of Economics); Malgorzata Mikucka (Université Catholique de Louvain and National Research University Higher School of Economics)
Collection: DOAJ (Directory Open Access Journal)
Record ID: doaj.art-d14b5f72cb57433fa6af14504ac27a01
Description: Recent studies documented that survey data contain duplicate records. We assess how duplicate records affect regression estimates, and we evaluate the effectiveness of solutions to deal with duplicate records. Results show that the chances of obtaining unbiased estimates when data contain 40 doublets (about 5% of the sample) range between 3.5% and 11.5% depending on the distribution of duplicates. If 7 quintuplets are present in the data (2% of the sample), then the probability of obtaining biased estimates ranges between 11% and 20%. Weighting the duplicate records by the inverse of their multiplicity, or dropping superfluous duplicates outperform other solutions in all considered scenarios. Our results illustrate the risk of using data in presence of duplicate records and call for further research on strategies to analyze affected data.
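
The two remedies that the abstract reports as performing best, weighting each record by the inverse of its multiplicity and dropping superfluous duplicates, can be illustrated with a minimal Python sketch. It is not the authors' simulation code. The sample size (1,500 unique records), the data-generating model (y = 1 + 2x + noise), the random seed, and the helper name `weighted_ols` are all assumptions made for illustration; only the "40 doublets" scenario is taken from the abstract.

```python
import random
from collections import Counter


def weighted_ols(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b).

    Hypothetical helper written for this sketch, not a library function.
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(0)  # fixed seed, assumed for reproducibility

# Assumed data-generating process: 1,500 unique records, y = 1 + 2x + noise.
records = []
for _ in range(1500):
    x = rng.gauss(0, 1)
    records.append((x, 1.0 + 2.0 * x + rng.gauss(0, 1)))

# Inject 40 doublets: 40 distinct records each appear exactly twice.
data = records + rng.sample(records, 40)

# Multiplicity of each record = number of identical copies in the data.
counts = Counter(data)
xs = [r[0] for r in data]
ys = [r[1] for r in data]

# Remedy 1: weight every record by the inverse of its multiplicity.
weights = [1.0 / counts[r] for r in data]
a_weighted, b_weighted = weighted_ols(xs, ys, weights)

# Remedy 2: drop superfluous duplicates, keeping one copy of each record.
unique = list(counts)
a_dropped, b_dropped = weighted_ols(
    [r[0] for r in unique], [r[1] for r in unique], [1.0] * len(unique)
)

# Naive fit that ignores the duplicates, for comparison.
a_naive, b_naive = weighted_ols(xs, ys, [1.0] * len(data))

print("weighted:", a_weighted, b_weighted)
print("dropped: ", a_dropped, b_dropped)
print("naive:   ", a_naive, b_naive)
```

In this sketch the two remedies yield the same coefficient estimates up to floating-point error, because k identical copies that each carry weight 1/k contribute exactly as much to the fit as a single copy. The sketch does not address standard errors or the other solutions the article compares.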