Generating high-fidelity synthetic time-to-event datasets to improve data transparency and accessibility

Abstract Background A lack of available data and statistical code being published alongside journal articles provides a significant barrier to open scientific discourse, and reproducibility of research. Information governance restrictions inhibit the active dissemination of individual level data to...


Bibliographic Details
Main Authors:	Aiden Smith, Paul C. Lambert, Mark J. Rutherford
Format:	Article
Language:	English
Published:	BMC 2022-06-01
Series:	BMC Medical Research Methodology
Subjects:	Simulation; Survival; Data accessibility; Flexible parametric survival models; Reproducible research; Time-to-event
Online Access:	https://doi.org/10.1186/s12874-022-01654-1

author	Aiden Smith, Paul C. Lambert, Mark J. Rutherford
collection	DOAJ
description	Abstract
Background: A lack of available data and statistical code being published alongside journal articles provides a significant barrier to open scientific discourse, and reproducibility of research. Information governance restrictions inhibit the active dissemination of individual level data to accompany published manuscripts. Realistic, high-fidelity time-to-event synthetic data can aid in the acceleration of methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to that which they were developed on.
Methods: We present methods to accurately emulate the covariate patterns and survival times found in real-world datasets using synthetic data techniques, without compromising patient privacy. We model the joint covariate distribution of the original data using covariate specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to generate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date information from the initial dataset. Metrics for evaluating the accuracy of the synthetic data, and the non-identifiability of individuals from the original dataset, are presented.
Results: We successfully create a synthetic version of an example colon cancer dataset consisting of 9064 patients which aims to show good similarity to both covariate distributions and survival times from the original data, without containing any exact information from the original data, therefore allowing them to be published openly alongside research.
Conclusions: We evaluate the effectiveness of the methods for constructing synthetic data, as well as providing evidence that there is minimal risk that a given patient from the original data could be identified from their individual unique patient information. Synthetic datasets using this methodology could be made available alongside published research without breaching data privacy protocols, and allow for data and code to be made available alongside methodological or applied manuscripts to greatly improve the transparency and accessibility of medical research.
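The pipeline outlined in the Methods paragraph (model covariates sequentially, draw survival times conditional on them, then apply administrative censoring) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes only two covariates (`sex`, `age`), swaps the flexible parametric survival model for a simple Weibull proportional hazards model so that survival times can be drawn by inverting the survival function, and uses a made-up stand-in for the original data. Every name and parameter value (`lam`, `gamma`, `log_hr`, `fu_min`, `fu_max`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2022)

# Stand-in for the original data (purely illustrative, not real patients).
n_orig = 1000
orig_sex = rng.binomial(1, 0.45, size=n_orig)            # 1 = female (assumed coding)
orig_age = 68 + 3 * orig_sex + rng.normal(0, 10, size=n_orig)


def fit_sequential(sex, age):
    """Step 1: model sex marginally, then age conditional on sex."""
    p_female = sex.mean()
    design = np.column_stack([np.ones_like(sex, dtype=float), sex])
    beta, *_ = np.linalg.lstsq(design, age, rcond=None)
    resid_sd = np.std(age - design @ beta, ddof=2)
    return p_female, beta, resid_sd


def synthesise(n, p_female, beta, resid_sd, lam, gamma, log_hr,
               fu_min, fu_max, rng):
    """Steps 1-3: draw covariates, survival times, and censoring."""
    # Draw covariates in the same order the conditional models were fitted.
    sex = rng.binomial(1, p_female, size=n)
    age = beta[0] + beta[1] * sex + rng.normal(0, resid_sd, size=n)

    # Weibull PH model (assumed): S(t | x) = exp(-lam * t**gamma * exp(x'b)).
    # Setting S(T | x) = U with U uniform on (0, 1] and solving for T.
    lin_pred = log_hr[0] * sex + log_hr[1] * (age - 70) / 10
    u = 1.0 - rng.uniform(size=n)
    t = (-np.log(u) / (lam * np.exp(lin_pred))) ** (1.0 / gamma)

    # Administrative censoring: each person has a potential follow-up
    # (time from entry to the study cut-off), here drawn uniformly.
    follow_up = rng.uniform(fu_min, fu_max, size=n)
    event = (t <= follow_up).astype(int)
    time = np.minimum(t, follow_up)
    return {"sex": sex, "age": age, "time": time,
            "event": event, "follow_up": follow_up}


p_female, beta, resid_sd = fit_sequential(orig_sex, orig_age)

# Survival parameters would come from a model fitted to the original data;
# the values below are placeholders chosen only to make the sketch run.
synthetic = synthesise(
    n=500, p_female=p_female, beta=beta, resid_sd=resid_sd,
    lam=0.15, gamma=0.8, log_hr=(-0.1, 0.3),
    fu_min=1.0, fu_max=10.0, rng=rng,
)
```

Because the synthetic rows are drawn from fitted models rather than copied or perturbed from individual records, no row corresponds to a specific original patient, which is the property the abstract relies on for open publication. A real application would still need the accuracy and non-identifiability checks the abstract mentions.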
first_indexed	2024-04-12T10:22:09Z
format	Article
id	doaj.art-000f1a1f9cde427d9e34f035f6cbb0d5
institution	Directory of Open Access Journals
issn	1471-2288
language	English
last_indexed	2024-04-12T10:22:09Z
publishDate	2022-06-01
publisher	BMC
record_format	Article
series	BMC Medical Research Methodology
affiliation	Department of Health Sciences, Centre for Medicine, University of Leicester (Aiden Smith, Paul C. Lambert, Mark J. Rutherford)
title	Generating high-fidelity synthetic time-to-event datasets to improve data transparency and accessibility
topic	Simulation; Survival; Data accessibility; Flexible parametric survival models; Reproducible research; Time-to-event
url	https://doi.org/10.1186/s12874-022-01654-1


Similar Items