Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in gener...

Bibliographic Details
Main Authors: Khaled El Emam, Lucy Mosquera, Xi Fang, Alaa El-Hussuna
Format: Article
Language: English
Published: JMIR Publications 2022-04-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2022/4/e35734
_version_ 1827858480507125760
author Khaled El Emam
Lucy Mosquera
Xi Fang
Alaa El-Hussuna
author_sort Khaled El Emam
collection DOAJ
description Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods.
Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research.
Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models.
Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
Conclusions: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
first_indexed 2024-03-12T12:54:47Z
format Article
id doaj.art-a29d415257434ee8a2598b8cc3dc3ee2
institution Directory Open Access Journal
issn 2291-9694
language English
last_indexed 2024-03-12T12:54:47Z
publishDate 2022-04-01
publisher JMIR Publications
record_format Article
series JMIR Medical Informatics
spelling doaj.art-a29d415257434ee8a2598b8cc3dc3ee2 2023-08-28T21:21:21Z eng JMIR Publications JMIR Medical Informatics 2291-9694 2022-04-01 10 4 e35734 10.2196/35734 Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study
Khaled El Emam https://orcid.org/0000-0003-3325-4149
Lucy Mosquera https://orcid.org/0000-0002-5289-8372
Xi Fang https://orcid.org/0000-0002-5571-7004
Alaa El-Hussuna https://orcid.org/0000-0002-0070-8362
https://medinform.jmir.org/2022/4/e35734
title Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study
url https://medinform.jmir.org/2022/4/e35734