Barriers and facilitators to generating synthetic administrative data for research.

Objectives Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to synthetic administrative data in the UK based on our experience of generating, assessing, and evaluating the performance of different approaches. We aim to p...

Bibliographic Details
Main Authors: Theodora Kokosi, Bianca De Stavola, Robin Mitra, Andrew Copas, Katie Harron
Format: Article
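Language: English
Published: Swansea University 2022-08-01
Series: International Journal of Population Data Science
Subjects: synthetic data, administrative datasets, data linkage, statistical disclosure control, data utility, data confidentiality
Online Access: https://ijpds.org/article/view/1984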
collection DOAJ
description Objectives Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to synthetic administrative data in the UK based on our experience of generating, assessing, and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data.
Approach We generated synthetic versions of one large population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and National Pupil Database [NPD]). We used a range of methods based on the statistical techniques of sampling and prediction. In R, we implemented non-parametric (e.g., classification and regression trees) and parametric (e.g., generalised linear models) methods, as well as multiple imputation and Bayesian networks. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising marginal distributions of key variables, estimating the standardised propensity mean square error, and deriving standardised coefficient differences of model estimates and overlap of confidence intervals.
Results Our analysis highlighted facilitators related to low-fidelity synthetic data: they are quicker to generate, can retain the data types, format, and privacy, and could be used to support training and code development. Conversely, barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable, but only in the context of a specific research question and a limited number of variables. Results from the Natsal-3 data showed that parametric methods produced slightly better data utility than non-parametric methods. Results for HES and NPD will also be presented.
Conclusions Low-fidelity synthetic data can provide a useful resource to support users of administrative data, whilst minimising data access timelines and retaining the privacy and confidentiality of personal data. High-utility datasets can be generated but take considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data.
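
The utility measures named in the Approach (propensity mean square error, standardised coefficient differences, and confidence interval overlap) can be illustrated with a minimal Python sketch. This is not the authors' code, which was written in R. The function names are hypothetical, the formulas follow one commonly used set of definitions and are an assumption here, and the propensity function shows only the unstandardised pMSE computed from already-estimated propensity scores; the standardisation step is omitted.

```python
from typing import Sequence, Tuple


def pmse(propensity_scores: Sequence[float], synthetic_share: float) -> float:
    """Unstandardised propensity mean square error (assumed definition).

    propensity_scores: estimated probability that each record in the stacked
    original + synthetic data is synthetic (from any classifier).
    synthetic_share: proportion of stacked records that are synthetic.
    A value of 0 means the classifier cannot tell the two sources apart.
    """
    n = len(propensity_scores)
    return sum((p - synthetic_share) ** 2 for p in propensity_scores) / n


def std_coef_diff(beta_orig: float, beta_syn: float, se_orig: float) -> float:
    """Absolute coefficient difference in units of the original standard error."""
    return abs(beta_orig - beta_syn) / se_orig


def ci_overlap(ci_orig: Tuple[float, float], ci_syn: Tuple[float, float]) -> float:
    """Average share of each interval covered by their intersection.

    Returns 1.0 for identical intervals; the value is negative when the
    intervals do not intersect at all.
    """
    lo_o, hi_o = ci_orig
    lo_s, hi_s = ci_syn
    intersection = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (intersection / (hi_o - lo_o) + intersection / (hi_s - lo_s))
```

For example, under these assumed definitions two 95% intervals of (0, 2) and (1, 3) share half of each interval, so `ci_overlap((0.0, 2.0), (1.0, 3.0))` gives 0.5.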
first_indexed 2024-03-09T07:42:25Z
format Article
id doaj.art-a4a6b745c826467ca14cd20593713227
institution Directory Open Access Journal
issn 2399-4908
spelling doaj.art-a4a6b745c826467ca14cd20593713227 2023-12-03T04:22:43Z eng
Swansea University. International Journal of Population Data Science, ISSN 2399-4908, 2022-08-01, vol. 7, no. 3. DOI 10.23889/ijpds.v7i3.1984
Barriers and facilitators to generating synthetic administrative data for research.
Theodora Kokosi (UCL GOS Institute of Child Health)
Bianca De Stavola (UCL GOS Institute of Child Health)
Robin Mitra (School of Mathematics, Cardiff University, Cardiff UK)
Andrew Copas (UCL Institute for Global Health, UK)
Katie Harron (UCL GOS Institute of Child Health)
https://ijpds.org/article/view/1984
synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality
topic synthetic data
administrative datasets
data linkage
statistical disclosure control
data utility
data confidentiality