The Problem of Fairness in Synthetic Healthcare Data

Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare dat...

Bibliographic Details
Main Authors: Karan Bhanot, Miao Qi, John S. Erickson, Isabelle Guyon, Kristin P. Bennett
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Entropy
Subjects: synthetic data, healthcare, fairness, covariate, temporal, time-series
Online Access: https://www.mdpi.com/1099-4300/23/9/1165
_version_ 1797519426193457152
author Karan Bhanot
Miao Qi
John S. Erickson
Isabelle Guyon
Kristin P. Bennett
author_facet Karan Bhanot
Miao Qi
John S. Erickson
Isabelle Guyon
Kristin P. Bennett
author_sort Karan Bhanot
collection DOAJ
description Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to base decisions and methods on realistic data. Healthcare data can include information about multiple inpatient and outpatient visits, making it a time-series dataset that is often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups so that the conclusions drawn from synthetic data are correct and the results generalize to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to assess the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models to create more equitable synthetic healthcare datasets.
first_indexed 2024-03-10T07:41:37Z
format Article
id doaj.art-752b40c99b834e19ae0bd0758d7bff44
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-10T07:41:37Z
publishDate 2021-09-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-752b40c99b834e19ae0bd0758d7bff44
2023-11-22T12:57:39Z
eng
MDPI AG
Entropy
1099-4300
2021-09-01
23
9
1165
10.3390/e23091165
The Problem of Fairness in Synthetic Healthcare Data
Karan Bhanot
Miao Qi
John S. Erickson
Isabelle Guyon
Kristin P. Bennett
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA
LISN, CNRS/INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
https://www.mdpi.com/1099-4300/23/9/1165
synthetic data
healthcare
fairness
covariate
temporal
time-series
spellingShingle Karan Bhanot
Miao Qi
John S. Erickson
Isabelle Guyon
Kristin P. Bennett
The Problem of Fairness in Synthetic Healthcare Data
Entropy
synthetic data
healthcare
fairness
covariate
temporal
time-series
title The Problem of Fairness in Synthetic Healthcare Data
title_full The Problem of Fairness in Synthetic Healthcare Data
title_fullStr The Problem of Fairness in Synthetic Healthcare Data
title_full_unstemmed The Problem of Fairness in Synthetic Healthcare Data
title_short The Problem of Fairness in Synthetic Healthcare Data
title_sort problem of fairness in synthetic healthcare data
topic synthetic data
healthcare
fairness
covariate
temporal
time-series
url https://www.mdpi.com/1099-4300/23/9/1165
work_keys_str_mv AT karanbhanot theproblemoffairnessinsynthetichealthcaredata
AT miaoqi theproblemoffairnessinsynthetichealthcaredata
AT johnserickson theproblemoffairnessinsynthetichealthcaredata
AT isabelleguyon theproblemoffairnessinsynthetichealthcaredata
AT kristinpbennett theproblemoffairnessinsynthetichealthcaredata
AT karanbhanot problemoffairnessinsynthetichealthcaredata
AT miaoqi problemoffairnessinsynthetichealthcaredata
AT johnserickson problemoffairnessinsynthetichealthcaredata
AT isabelleguyon problemoffairnessinsynthetichealthcaredata
AT kristinpbennett problemoffairnessinsynthetichealthcaredata