Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in gener...


Bibliographic Details
Main Authors:	Khaled El Emam, Lucy Mosquera, Xi Fang, Alaa El-Hussuna
Format:	Article
Language:	English
Published:	JMIR Publications 2022-04-01
Series:	JMIR Medical Informatics
Online Access:	https://medinform.jmir.org/2022/4/e35734

_version_	1827858480507125760
author	Khaled El Emam, Lucy Mosquera, Xi Fang, Alaa El-Hussuna
author_sort	Khaled El Emam
collection	DOAJ
description	Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. Conclusions: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
first_indexed	2024-03-12T12:54:47Z
format	Article
id	doaj.art-a29d415257434ee8a2598b8cc3dc3ee2
institution	Directory Open Access Journal
issn	2291-9694
language	English
last_indexed	2024-03-12T12:54:47Z
publishDate	2022-04-01
publisher	JMIR Publications
record_format	Article
series	JMIR Medical Informatics
spelling	doaj.art-a29d415257434ee8a2598b8cc3dc3ee2; 2023-08-28T21:21:21Z; eng; JMIR Publications; JMIR Medical Informatics; 2291-9694; 2022-04-01; 10; 4; e35734; 10.2196/35734; Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study; Khaled El Emam (https://orcid.org/0000-0003-3325-4149); Lucy Mosquera (https://orcid.org/0000-0002-5289-8372); Xi Fang (https://orcid.org/0000-0002-5571-7004); Alaa El-Hussuna (https://orcid.org/0000-0002-0070-8362); https://medinform.jmir.org/2022/4/e35734
title	Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study
url	https://medinform.jmir.org/2022/4/e35734


Similar Items