Barriers and facilitators to generating synthetic administrative data for research.
Objectives Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to synthetic administrative data in the UK based on our experience of generating, assessing, and evaluating the performance of different approaches. We aim to p...
Main Authors: | Theodora Kokosi, Bianca De Stavola, Robin Mitra, Andrew Copas, Katie Harron |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2022-08-01 |
Series: | International Journal of Population Data Science |
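Subjects: | synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality |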
Online Access: | https://ijpds.org/article/view/1984 |
_version_ | 1797423090039259136 |
---|---|
author | Theodora Kokosi; Bianca De Stavola; Robin Mitra; Andrew Copas; Katie Harron |
author_facet | Theodora Kokosi; Bianca De Stavola; Robin Mitra; Andrew Copas; Katie Harron |
author_sort | Theodora Kokosi |
collection | DOAJ |
description | Objectives
Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to generating synthetic administrative data in the UK, based on our experience of generating synthetic datasets and of assessing and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data.
Approach
We generated synthetic versions of one large population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and the National Pupil Database [NPD]). We used a range of methods based on the statistical techniques of sampling and prediction. We implemented non-parametric (e.g., classification and regression trees) and parametric (e.g., generalised linear models) methods, as well as multiple imputation and Bayesian networks, in R. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising the marginal distributions of key variables, estimating the standardised propensity mean square error, and deriving the standardised coefficient differences of model estimates and the overlap of their confidence intervals.
Results
Our analysis highlighted several facilitators: low-fidelity synthetic data are quicker to generate, can retain the data types and format of the original while preserving privacy, and could be used to support training and code development. Barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable, but only in the context of a specific research question and a limited number of variables. For the Natsal-3 data, parametric methods produced slightly better data utility than non-parametric methods. Results for HES and NPD will also be presented.
Conclusions
Low-fidelity synthetic data can provide a useful resource to support users of administrative data, while minimising data access timelines and retaining the privacy and confidentiality of personal data. High-utility datasets can be generated but require considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data.
|
first_indexed | 2024-03-09T07:42:25Z |
format | Article |
id | doaj.art-a4a6b745c826467ca14cd20593713227 |
institution | Directory of Open Access Journals |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T07:42:25Z |
publishDate | 2022-08-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-a4a6b745c826467ca14cd20593713227; 2023-12-03T04:22:43Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2022-08-01; vol. 7, no. 3; DOI 10.23889/ijpds.v7i3.1984; Barriers and facilitators to generating synthetic administrative data for research. Authors: Theodora Kokosi (UCL GOS Institute of Child Health); Bianca De Stavola (UCL GOS Institute of Child Health); Robin Mitra (School of Mathematics, Cardiff University, Cardiff UK); Andrew Copas (UCL Institute for Global Health, UK); Katie Harron (UCL GOS Institute of Child Health). URL: https://ijpds.org/article/view/1984. Keywords: synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality |
spellingShingle | Theodora Kokosi Bianca De Stavola Robin Mitra Andrew Copas Katie Harron Barriers and facilitators to generating synthetic administrative data for research. International Journal of Population Data Science synthetic data administrative datasets data linkage statistical disclosure control data utility data confidentiality |
title | Barriers and facilitators to generating synthetic administrative data for research. |
title_full | Barriers and facilitators to generating synthetic administrative data for research. |
title_fullStr | Barriers and facilitators to generating synthetic administrative data for research. |
title_full_unstemmed | Barriers and facilitators to generating synthetic administrative data for research. |
title_short | Barriers and facilitators to generating synthetic administrative data for research. |
title_sort | barriers and facilitators to generating synthetic administrative data for research |
topic | synthetic data; administrative datasets; data linkage; statistical disclosure control; data utility; data confidentiality |
url | https://ijpds.org/article/view/1984 |
work_keys_str_mv | AT theodorakokosi barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT biancadestavola barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT robinmitra barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT andrewcopas barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch AT katieharron barriersandfacilitatorstogeneratingsyntheticadministrativedataforresearch |
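
The utility measures named in the Approach section of the abstract (propensity mean square error, standardised coefficient differences, and confidence interval overlap) can be illustrated with a minimal sketch. The authors worked in R and the record does not give their code or exact formulas, so everything below is an assumption: the function names are hypothetical, the definitions are the commonly used ones, the propensity scores are assumed to come from some model fitted to the stacked original and synthetic records, and the numbers in the usage lines are made up for illustration. The sketch computes the raw propensity mean square error only; standardising it against its expectation under a correct synthesis model is not shown.

```python
def pmse(propensity_scores, synthetic_share):
    """Propensity mean square error (assumed common definition).

    propensity_scores: each record's estimated probability of being synthetic,
    from a model fitted to the stacked original + synthetic data.
    synthetic_share: proportion of synthetic records in the stacked data.
    A value near 0 means the model cannot tell the two sources apart.
    """
    n = len(propensity_scores)
    return sum((p - synthetic_share) ** 2 for p in propensity_scores) / n


def std_coef_diff(beta_orig, se_orig, beta_syn):
    """Synthetic-minus-original estimate, in units of the original standard error."""
    return (beta_syn - beta_orig) / se_orig


def ci_overlap(beta_orig, se_orig, beta_syn, se_syn, z=1.96):
    """Average share of each confidence interval covered by their intersection.

    Equals 1 for identical intervals and is zero or negative when the
    intervals do not overlap. z=1.96 assumes normal-based 95% intervals.
    """
    lo_o, hi_o = beta_orig - z * se_orig, beta_orig + z * se_orig
    lo_s, hi_s = beta_syn - z * se_syn, beta_syn + z * se_syn
    intersection = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (intersection / (hi_o - lo_o) + intersection / (hi_s - lo_s))


if __name__ == "__main__":
    # Illustrative values only; not results from the article.
    scores = [0.48, 0.52, 0.55, 0.45, 0.50, 0.51]
    print("pMSE:", pmse(scores, synthetic_share=0.5))
    print("std. coef. diff:", std_coef_diff(beta_orig=0.80, se_orig=0.10, beta_syn=0.85))
    print("CI overlap:", ci_overlap(beta_orig=0.80, se_orig=0.10, beta_syn=0.85, se_syn=0.12))
```

Under these assumed definitions, lower pMSE, standardised coefficient differences closer to zero, and confidence interval overlap closer to 1 would all indicate higher utility for a given analysis model.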