Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study
Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in gener...
Main Authors: | Khaled El Emam, Lucy Mosquera, Xi Fang, Alaa El-Hussuna |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications 2022-04-01 |
Series: | JMIR Medical Informatics |
Online Access: | https://medinform.jmir.org/2022/4/e35734 |
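
The abstract for this record defines prediction performance as the difference in the area under the receiver operating characteristic curve (AUROC) between logistic regression models built on synthetic data and models built on real data, averaged over the synthetic data sets from one generative model. The following is a minimal sketch of that comparison, not the authors' implementation. It assumes the predicted scores have already been produced on a shared held-out real test set, and it takes the difference as real minus synthetic; both choices, and the function names `auroc` and `mean_auroc_difference`, are assumptions for illustration.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC via the Mann-Whitney rank statistic (tied scores share an average rank)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    # Replace the ranks of tied scores with their average rank.
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auroc_difference(y_test, real_scores, synthetic_scores_list):
    """Average of (real-model AUROC minus synthetic-model AUROC).

    Assumption: every model is scored on the same real test labels y_test,
    and the sign convention is real minus synthetic.
    """
    real_auc = auroc(y_test, real_scores)
    diffs = [real_auc - auroc(y_test, s) for s in synthetic_scores_list]
    return float(np.mean(diffs))
```

A smaller average difference indicates that models trained on the synthetic data predict almost as well as the model trained on real data. The same pattern would apply to the area under the precision-recall curve with a different scoring function.
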
_version_ | 1827858480507125760 |
---|---|
author | Khaled El Emam Lucy Mosquera Xi Fang Alaa El-Hussuna |
author_facet | Khaled El Emam Lucy Mosquera Xi Fang Alaa El-Hussuna |
author_sort | Khaled El Emam |
collection | DOAJ |
description |
Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods.
Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research.
Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models.
Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
Conclusions: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods. |
first_indexed | 2024-03-12T12:54:47Z |
format | Article |
id | doaj.art-a29d415257434ee8a2598b8cc3dc3ee2 |
institution | Directory Open Access Journal |
issn | 2291-9694 |
language | English |
last_indexed | 2024-03-12T12:54:47Z |
publishDate | 2022-04-01 |
publisher | JMIR Publications |
record_format | Article |
series | JMIR Medical Informatics |
spelling | doaj.art-a29d415257434ee8a2598b8cc3dc3ee2 2023-08-28T21:21:21Z eng JMIR Publications JMIR Medical Informatics 2291-9694 2022-04-01 10 4 e35734 10.2196/35734 Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study Khaled El Emam https://orcid.org/0000-0003-3325-4149 Lucy Mosquera https://orcid.org/0000-0002-5289-8372 Xi Fang https://orcid.org/0000-0002-5571-7004 Alaa El-Hussuna https://orcid.org/0000-0002-0070-8362 Background: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
Conclusions: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods. https://medinform.jmir.org/2022/4/e35734 |
spellingShingle | Khaled El Emam Lucy Mosquera Xi Fang Alaa El-Hussuna Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study JMIR Medical Informatics |
title | Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study |
title_full | Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study |
title_fullStr | Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study |
title_full_unstemmed | Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study |
title_short | Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study |
title_sort | utility metrics for evaluating synthetic health data generation methods validation study |
url | https://medinform.jmir.org/2022/4/e35734 |
work_keys_str_mv | AT khaledelemam utilitymetricsforevaluatingsynthetichealthdatagenerationmethodsvalidationstudy AT lucymosquera utilitymetricsforevaluatingsynthetichealthdatagenerationmethodsvalidationstudy AT xifang utilitymetricsforevaluatingsynthetichealthdatagenerationmethodsvalidationstudy AT alaaelhussuna utilitymetricsforevaluatingsynthetichealthdatagenerationmethodsvalidationstudy |
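
The description field above names a multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions. The sketch below shows one way such a distance could be computed; it is an illustration under stated assumptions, not the procedure from the article. It assumes all variables are numeric, maps each column to normal scores by rank, estimates a correlation matrix from those scores, and then applies the closed-form Hellinger distance between two zero-mean multivariate normal distributions. The function names are hypothetical, and the handling of categorical variables is not covered.

```python
import numpy as np
from statistics import NormalDist


def normal_scores(column):
    """Map a numeric column to normal scores via its ranks (ties broken by order)."""
    column = np.asarray(column, dtype=float)
    ranks = np.argsort(np.argsort(column, kind="mergesort")) + 1
    inv_cdf = NormalDist().inv_cdf
    # rank / (n + 1) keeps every probability strictly inside (0, 1).
    return np.array([inv_cdf(r / (len(column) + 1)) for r in ranks])


def copula_correlation(data):
    """Correlation matrix of the normal scores (assumes at least 2 numeric columns)."""
    data = np.asarray(data, dtype=float)
    scores = np.column_stack(
        [normal_scores(data[:, j]) for j in range(data.shape[1])]
    )
    return np.corrcoef(scores, rowvar=False)


def hellinger_gaussian(corr_p, corr_q):
    """Hellinger distance between two zero-mean Gaussians with the given covariances."""
    _, logdet_p = np.linalg.slogdet(corr_p)
    _, logdet_q = np.linalg.slogdet(corr_q)
    _, logdet_m = np.linalg.slogdet((corr_p + corr_q) / 2)
    # Log of the Bhattacharyya coefficient for equal (zero) means.
    log_bc = 0.25 * logdet_p + 0.25 * logdet_q - 0.5 * logdet_m
    return float(np.sqrt(max(0.0, 1.0 - np.exp(log_bc))))


def copula_hellinger(real, synthetic):
    """Distance between the Gaussian copula fits of a real and a synthetic table."""
    return hellinger_gaussian(copula_correlation(real), copula_correlation(synthetic))
```

The result lies between 0 and 1, with 0 meaning the two fitted copulas are identical. Following the abstract, the value for one SDG method would be averaged over the synthetic data sets drawn from the same generative model before methods are ranked.
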