Barriers and facilitators to generating synthetic administrative data for research.

Bibliographic Details
Main Authors:	Theodora Kokosi, Bianca De Stavola, Robin Mitra, Andrew Copas, Katie Harron
Format:	Article
Language:	English
Published:	Swansea University 2022-08-01
Series:	International Journal of Population Data Science
Subjects:	synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality
Online Access:	https://ijpds.org/article/view/1984

_version_	1797423090039259136
author	Theodora Kokosi, Bianca De Stavola, Robin Mitra, Andrew Copas, Katie Harron
author_facet	Theodora Kokosi, Bianca De Stavola, Robin Mitra, Andrew Copas, Katie Harron
author_sort	Theodora Kokosi
collection	DOAJ
description	Objectives: Generating synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to generating synthetic administrative data in the UK, based on our experience of generating synthetic data and of assessing and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data.
Approach: We generated synthetic versions of one large-population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and the National Pupil Database [NPD]). We used a range of methods based on the statistical techniques of sampling and prediction. In R, we implemented non-parametric (e.g., classification and regression trees) and parametric (e.g., generalised linear models) methods, as well as multiple imputation and Bayesian networks. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising marginal distributions of key variables, estimating the standardised propensity mean squared error, and deriving standardised coefficient differences of model estimates and the overlap of confidence intervals.
Results: Our analysis highlighted several facilitators. Low-fidelity synthetic data are quicker to generate, retain the data types and format of the original data while preserving privacy, and could be used to support training and code development. The barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable, but only in the context of a specific research question and a limited number of variables. Results from the Natsal-3 data showed that parametric methods produced slightly better data utility than non-parametric methods. Results for HES and NPD will also be presented.
Conclusions: Low-fidelity synthetic data can provide a useful resource to support users of administrative data, minimising data access timelines while retaining the privacy and confidentiality of personal data. High-utility datasets can be generated but require considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data.
first_indexed	2024-03-09T07:42:25Z
format	Article
id	doaj.art-a4a6b745c826467ca14cd20593713227
institution	Directory Open Access Journal
issn	2399-4908
language	English
last_indexed	2024-03-09T07:42:25Z
publishDate	2022-08-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-a4a6b745c826467ca14cd20593713227 | 2023-12-03T04:22:43Z | eng | Swansea University | International Journal of Population Data Science | 2399-4908 | 2022-08-01 | 7 | 3 | 10.23889/ijpds.v7i3.1984 | Barriers and facilitators to generating synthetic administrative data for research. | Theodora Kokosi (0), Bianca De Stavola (1), Robin Mitra (2), Andrew Copas (3), Katie Harron (4) | (0) UCL GOS Institute of Child Health; (1) UCL GOS Institute of Child Health; (2) School of Mathematics, Cardiff University, Cardiff UK; (3) UCL Institute for Global Health, UK; (4) UCL GOS Institute of Child Health | Objectives: Generating synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to generating synthetic administrative data in the UK, based on our experience of generating synthetic data and of assessing and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data. Approach: We generated synthetic versions of one large-population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and the National Pupil Database [NPD]). We used a range of methods based on the statistical techniques of sampling and prediction. In R, we implemented non-parametric (e.g., classification and regression trees) and parametric (e.g., generalised linear models) methods, as well as multiple imputation and Bayesian networks. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising marginal distributions of key variables, estimating the standardised propensity mean squared error, and deriving standardised coefficient differences of model estimates and the overlap of confidence intervals. Results: Our analysis highlighted several facilitators. Low-fidelity synthetic data are quicker to generate, retain the data types and format of the original data while preserving privacy, and could be used to support training and code development. The barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable, but only in the context of a specific research question and a limited number of variables. Results from the Natsal-3 data showed that parametric methods produced slightly better data utility than non-parametric methods. Results for HES and NPD will also be presented. Conclusions: Low-fidelity synthetic data can provide a useful resource to support users of administrative data, minimising data access timelines while retaining the privacy and confidentiality of personal data. High-utility datasets can be generated but require considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data. | https://ijpds.org/article/view/1984 | synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality
title	Barriers and facilitators to generating synthetic administrative data for research.
title_sort	barriers and facilitators to generating synthetic administrative data for research
topic	synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality
url	https://ijpds.org/article/view/1984
work_keys_str_mv	AT theodorakokosi barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT biancadestavola barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT robinmitra barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT andrewcopas barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT katieharron barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch
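
Illustrative note: the abstract names two utility measures, standardised coefficient differences and the overlap of confidence intervals, without giving formulas. The Python sketch below shows one common way such measures are defined. It is not the authors' code (the abstract states that their work was done in R). The `Estimate` container, the function names, and the normal-approximation 95% interval (z = 1.96) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Estimate:
    """A model coefficient and its standard error (hypothetical container)."""
    coef: float
    se: float

    def interval(self, z: float = 1.96) -> Tuple[float, float]:
        # Normal-approximation confidence interval (an assumption of this sketch).
        return (self.coef - z * self.se, self.coef + z * self.se)


def standardised_difference(original: Estimate, synthetic: Estimate) -> float:
    """Absolute coefficient difference scaled by the original standard error."""
    return abs(synthetic.coef - original.coef) / original.se


def ci_overlap(original: Estimate, synthetic: Estimate, z: float = 1.96) -> float:
    """Average share of each interval covered by the intersection of the two.

    Equals 1.0 for identical intervals and becomes negative when the
    intervals do not intersect at all.
    """
    lo_o, hi_o = original.interval(z)
    lo_s, hi_s = synthetic.interval(z)
    shared = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (shared / (hi_o - lo_o) + shared / (hi_s - lo_s))


if __name__ == "__main__":
    # Made-up numbers, purely to show the calls; not results from the study.
    fitted_on_original = Estimate(coef=0.50, se=0.10)
    fitted_on_synthetic = Estimate(coef=0.55, se=0.10)
    print(standardised_difference(fitted_on_original, fitted_on_synthetic))
    print(ci_overlap(fitted_on_original, fitted_on_synthetic))
```

Under these definitions, a small standardised difference together with an overlap close to 1 suggests that a model fitted to the synthetic data leads to similar inference as the same model fitted to the original data.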

Similar Items