Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis

Bibliographic Details
Main Authors: Imanol Isasa, Mikel Hernandez, Gorka Epelde, Francisco Londoño, Andoni Beristain, Xabat Larrea, Ane Alberdi, Panagiotis Bamidis, Evdokimos Konstantinidis
Format: Article
Language: English
Published: BMC 2024-01-01
Series: BMC Medical Informatics and Decision Making
Subjects: Time series; Synthetic data; Privacy-preserving data sharing; Health data
Online Access: https://doi.org/10.1186/s12911-024-02427-0
author Imanol Isasa
Mikel Hernandez
Gorka Epelde
Francisco Londoño
Andoni Beristain
Xabat Larrea
Ane Alberdi
Panagiotis Bamidis
Evdokimos Konstantinidis
author_sort Imanol Isasa
collection DOAJ
description Abstract
Background: Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether as a single tool or through its combination with other privacy enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while fulfilling pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects’ metadata (static) and a longitudinal component (set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts.
Methods: Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3).
Results: Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach fluctuates based on the assessed dimension and metric.
Conclusion: The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.
first_indexed 2024-03-07T14:58:00Z
format Article
id doaj.art-fa415233cda846b5bf8b77eeeaf389b0
institution Directory Open Access Journal
issn 1472-6947
language English
last_indexed 2024-03-07T14:58:00Z
publishDate 2024-01-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj.art-fa415233cda846b5bf8b77eeeaf389b0
2024-03-05T19:20:01Z
eng
BMC
BMC Medical Informatics and Decision Making
1472-6947
2024-01-01
24 1 1 14
10.1186/s12911-024-02427-0
Imanol Isasa: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Mikel Hernandez: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Gorka Epelde: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Francisco Londoño: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Andoni Beristain: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Xabat Larrea: Digital Health and Biomedical Technologies, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Ane Alberdi: Biomedical Engineering Department, Mondragon University
Panagiotis Bamidis: Laboratory of Medical Physics and Digital Innovation, School of Medicine, Aristotle University of Thessaloniki
Evdokimos Konstantinidis: Laboratory of Medical Physics and Digital Innovation, School of Medicine, Aristotle University of Thessaloniki
title Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis
topic Time series
Synthetic data
Privacy-preserving data sharing
Health data
url https://doi.org/10.1186/s12911-024-02427-0