Bias and efficiency loss in regression estimates due to duplicated observations: a Monte Carlo simulation
Recent studies have documented that survey data contain duplicate records. We assess how duplicate records affect regression estimates, and we evaluate the effectiveness of solutions to deal with duplicate records. Results show that the chances of obtaining unbiased estimates when data contain 40 doubl...
Main Authors: | Francesco Sarracino, Malgorzata Mikucka |
---|---|
Format: | Article |
Language: | English |
Published: | European Survey Research Association, 2017-04-01 |
Series: | Survey Research Methods |
Subjects: | duplicated observations; estimation bias; Monte Carlo simulation; inference |
Online Access: | https://ojs.ub.uni-konstanz.de/srm/article/view/7149 |
_version_ | 1798037204916764672 |
---|---|
author | Francesco Sarracino; Malgorzata Mikucka |
author_facet | Francesco Sarracino; Malgorzata Mikucka |
author_sort | Francesco Sarracino |
collection | DOAJ |
description | Recent studies have documented that survey data contain duplicate records. We assess how duplicate records affect regression estimates, and we evaluate the effectiveness of solutions to deal with duplicate records. Results show that the chances of obtaining unbiased estimates when data contain 40 doublets (about 5% of the sample) range between 3.5% and 11.5% depending on the distribution of duplicates. If 7 quintuplets are present in the data (2% of the sample), then the probability of obtaining biased estimates ranges between 11% and 20%. Weighting the duplicate records by the inverse of their multiplicity or dropping superfluous duplicates outperforms the other solutions in all considered scenarios. Our results illustrate the risk of using data in the presence of duplicate records and call for further research on strategies to analyze affected data. |
first_indexed | 2024-04-11T21:23:25Z |
format | Article |
id | doaj.art-d14b5f72cb57433fa6af14504ac27a01 |
institution | Directory Open Access Journal |
issn | 1864-3361 |
language | English |
last_indexed | 2024-04-11T21:23:25Z |
publishDate | 2017-04-01 |
publisher | European Survey Research Association |
record_format | Article |
series | Survey Research Methods |
spelling | doaj.art-d14b5f72cb57433fa6af14504ac27a01; 2022-12-22T04:02:32Z; eng; European Survey Research Association; Survey Research Methods; 1864-3361; 2017-04-01; 11; 1; 10.18148/srm/2017.v11i1.7149; 6522; Bias and efficiency loss in regression estimates due to duplicated observations: a Monte Carlo simulation; Francesco Sarracino [0]; Malgorzata Mikucka [1]; [0] National Institute of Statistics of Luxembourg (STATEC) and National Research University Higher School of Economics; [1] Université Catholique de Louvain and National Research University Higher School of Economics; Recent studies have documented that survey data contain duplicate records. We assess how duplicate records affect regression estimates, and we evaluate the effectiveness of solutions to deal with duplicate records. Results show that the chances of obtaining unbiased estimates when data contain 40 doublets (about 5% of the sample) range between 3.5% and 11.5% depending on the distribution of duplicates. If 7 quintuplets are present in the data (2% of the sample), then the probability of obtaining biased estimates ranges between 11% and 20%. Weighting the duplicate records by the inverse of their multiplicity or dropping superfluous duplicates outperforms the other solutions in all considered scenarios. Our results illustrate the risk of using data in the presence of duplicate records and call for further research on strategies to analyze affected data.; https://ojs.ub.uni-konstanz.de/srm/article/view/7149; duplicated observations; estimation bias; Monte Carlo simulation; inference |
title | Bias and efficiency loss in regression estimates due to duplicated observations: a Monte Carlo simulation |
topic | duplicated observations; estimation bias; Monte Carlo simulation; inference |
url | https://ojs.ub.uni-konstanz.de/srm/article/view/7149 |
work_keys_str_mv | AT francescosarracino biasandefficiencylossinregressionestimatesduetoduplicatedobservationsamontecarlosimulation AT malgorzatamikucka biasandefficiencylossinregressionestimatesduetoduplicatedobservationsamontecarlosimulation |
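
The description names two remedies for duplicate records: weighting each record by the inverse of its multiplicity, and dropping superfluous duplicates. The sketch below is only an illustration of those two ideas on synthetic data, not the authors' simulation code. The sample size, the coefficients, the 7 quintuplets, and the helper names `inverse_multiplicity_weights` and `weighted_ols` are all assumptions made for this example.

```python
import numpy as np


def inverse_multiplicity_weights(records):
    """Return 1/m for every row, where m is how often that exact row occurs."""
    _, inverse, counts = np.unique(
        records, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy versions that add a dimension
    return 1.0 / counts[inverse]


def weighted_ols(X, y, w):
    """Weighted least squares via ordinary least squares on sqrt(w)-scaled data."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


# Hypothetical data: y = 1 + 2x + noise, 200 unique records.
rng = np.random.default_rng(0)
n_unique = 200
x = rng.normal(size=n_unique)
y = 1.0 + 2.0 * x + rng.normal(size=n_unique)
clean = np.column_stack([x, y])

# Contaminate the sample: the first 7 records each appear 5 times (quintuplets).
data = np.vstack([clean, np.repeat(clean[:7], 4, axis=0)])

# Remedy 1: weight each record by the inverse of its multiplicity.
w = inverse_multiplicity_weights(data)
X = np.column_stack([np.ones(len(data)), data[:, 0]])
beta_weighted = weighted_ols(X, data[:, 1], w)

# Remedy 2: drop the superfluous duplicates and fit unweighted OLS.
dedup = np.unique(data, axis=0)
X_dedup = np.column_stack([np.ones(len(dedup)), dedup[:, 0]])
beta_dedup = weighted_ols(X_dedup, dedup[:, 1], np.ones(len(dedup)))

print("weighted:", beta_weighted)
print("deduplicated:", beta_dedup)
```

When duplicates are exact copies, the two remedies yield the same point estimates, because m copies weighted by 1/m contribute to the normal equations exactly as one unweighted record does. Standard errors are a separate question that this sketch does not address.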