An Introductory Synthetic Data Tool

Objectives Synthetic data reproduces features of a dataset without disclosing sensitive information, allowing researchers to explore data structures and test code without requiring access to real, potentially sensitive, data. We produced a low-fidelity synthetic data generation tool, accompanied by...


Bibliographic Details
Main Authors: Iori Thomas, Bobby Stuijfzand
Format: Article
Language: English
Published: Swansea University 2023-09-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/2255
author Iori Thomas
Bobby Stuijfzand
collection DOAJ
description Objectives: Synthetic data reproduces features of a dataset without disclosing sensitive information, allowing researchers to explore data structures and test code without access to real, potentially sensitive data. We produced a low-fidelity synthetic data generation tool, accompanied by extensive documentation, that allows both novice and expert users to produce such data. Methods: Our tool, consisting of a Python notebook and a user guide, takes a dataset as input and produces a 'low-fidelity' synthetic copy of it, recreating the data fields (or columns), their data types, and the statistical relationships within each field, but not between fields. It was tested on real-world administrative datasets and with several users to assess the quality of the generated data, whether the data is indeed low-fidelity (i.e. statistical relationships between fields are not recreated), and the usability of the tool. Results: Our tool successfully created synthetic datasets from administrative datasets. Users were positive about its usability and the generated data. Tests indicated that available memory is a main constraint on the size of the data table that the tool can read in. We have since improved the memory efficiency of the tool to partially address this, and have added procedures for working from subsets instead of complete datasets, so that datasets which would otherwise have been too large can be used. Testing further indicated that, while the tool by design does not preserve any relationships between fields, such relationships can be reproduced by coincidence, and a limited disclosure process may be required when correlations from the original data are reproduced. Conclusions: The tool is easy to use and therefore a useful introduction to synthetic data, giving users a foundation before they move on to more sophisticated synthetic data tools such as Synthpop. Future work could include the development of a Python library and extension of the tool to handle linked data tables.
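The idea of "low-fidelity" generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' notebook: the function names, the toy data, and the choice of a normal distribution for numeric fields are all assumptions made for illustration. Each field is summarised on its own and then sampled independently, so per-field types and distributions are roughly kept while relationships between fields are not carried over. A correlation check at the end shows the kind of comparison that could flag relationships reproduced by coincidence.

```python
import random
import statistics
from collections import Counter


def synthesize_column(values, n_rows, rng):
    """Draw n_rows synthetic values for one field, using only that field."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        # Assumption: a normal distribution fitted to the field's mean and
        # standard deviation is a good enough stand-in for illustration.
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        draws = [rng.gauss(mu, sigma) for _ in range(n_rows)]
        if all(isinstance(v, int) for v in values):
            return [round(d) for d in draws]  # keep the integer data type
        return draws
    # Categorical field: sample categories in proportion to their frequency.
    counts = Counter(values)
    return rng.choices(list(counts), weights=list(counts.values()), k=n_rows)


def synthesize_low_fidelity(rows, n_rows, seed=0):
    """Return synthetic rows with the same fields as `rows` (a list of dicts)."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Every field is generated separately; this is what drops the
    # relationships between fields.
    sampled = {
        c: synthesize_column([r[c] for r in rows], n_rows, rng) for c in columns
    }
    return [{c: sampled[c][i] for c in columns} for i in range(n_rows)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Hypothetical toy data in which income depends exactly on age.
real = [
    {"age": a, "income": 1000 * a + 500, "region": "N" if a % 2 else "S"}
    for a in range(20, 70)
]
synthetic = synthesize_low_fidelity(real, n_rows=2000, seed=42)

r_real = pearson([r["age"] for r in real], [r["income"] for r in real])
r_synth = pearson([r["age"] for r in synthetic], [r["income"] for r in synthetic])
print(f"age/income correlation: real={r_real:.3f}, synthetic={r_synth:.3f}")
```

In this toy case the real fields are perfectly correlated while the synthetic ones should be close to uncorrelated. With small samples or many fields, a sizeable synthetic correlation can still appear by chance, which is the situation the abstract suggests may call for a limited disclosure check.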
first_indexed 2024-03-09T07:54:26Z
format Article
id doaj.art-3fdccd92cb9e4a4a9aad86a84e3ec4f5
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T07:54:26Z
publishDate 2023-09-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-3fdccd92cb9e4a4a9aad86a84e3ec4f5; 2023-12-03T01:18:17Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2023-09-01; 8; 2; 10.23889/ijpds.v8i2.2255; An Introductory Synthetic Data Tool; Iori Thomas (0); Bobby Stuijfzand (1); Behavioural Insights Team, London, United Kingdom; Behavioural Insights Team, London, United Kingdom; https://ijpds.org/article/view/2255
title An Introductory Synthetic Data Tool
url https://ijpds.org/article/view/2255