The Problem of Fairness in Synthetic Healthcare Data
Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare dat...
Main Authors: | Karan Bhanot, Miao Qi, John S. Erickson, Isabelle Guyon, Kristin P. Bennett |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-09-01 |
Series: | Entropy |
Subjects: | synthetic data; healthcare; fairness; covariate; temporal; time-series |
Online Access: | https://www.mdpi.com/1099-4300/23/9/1165 |
_version_ | 1797519426193457152 |
---|---|
author | Karan Bhanot; Miao Qi; John S. Erickson; Isabelle Guyon; Kristin P. Bennett |
author_facet | Karan Bhanot; Miao Qi; John S. Erickson; Isabelle Guyon; Kristin P. Bennett |
author_sort | Karan Bhanot |
collection | DOAJ |
description | Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy while enabling researchers and policymakers to base decisions and methods on realistic data. Healthcare data can include information about multiple inpatient and outpatient visits per patient, making it a time-series dataset that is often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups so that conclusions drawn from synthetic data are correct and the results generalize to real data. In this article, we develop two fairness metrics for synthetic data and examine all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models and the creation of more equitable synthetic healthcare datasets. |
first_indexed | 2024-03-10T07:41:37Z |
format | Article |
id | doaj.art-752b40c99b834e19ae0bd0758d7bff44 |
institution | Directory of Open Access Journals |
issn | 1099-4300 |
language | English |
last_indexed | 2024-03-10T07:41:37Z |
publishDate | 2021-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-752b40c99b834e19ae0bd0758d7bff44; 2023-11-22T12:57:39Z; eng; MDPI AG; Entropy; 1099-4300; 2021-09-01; vol. 23, no. 9, art. 1165; 10.3390/e23091165; The Problem of Fairness in Synthetic Healthcare Data; Karan Bhanot (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA); Miao Qi (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA); John S. Erickson (Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA); Isabelle Guyon (LISN, CNRS/INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France); Kristin P. Bennett (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA); https://www.mdpi.com/1099-4300/23/9/1165; synthetic data; healthcare; fairness; covariate; temporal; time-series |
spellingShingle | Karan Bhanot Miao Qi John S. Erickson Isabelle Guyon Kristin P. Bennett The Problem of Fairness in Synthetic Healthcare Data Entropy synthetic data healthcare fairness covariate temporal time-series |
title | The Problem of Fairness in Synthetic Healthcare Data |
title_full | The Problem of Fairness in Synthetic Healthcare Data |
title_fullStr | The Problem of Fairness in Synthetic Healthcare Data |
title_full_unstemmed | The Problem of Fairness in Synthetic Healthcare Data |
title_short | The Problem of Fairness in Synthetic Healthcare Data |
title_sort | problem of fairness in synthetic healthcare data |
topic | synthetic data; healthcare; fairness; covariate; temporal; time-series |
url | https://www.mdpi.com/1099-4300/23/9/1165 |
work_keys_str_mv | AT karanbhanot theproblemoffairnessinsynthetichealthcaredata AT miaoqi theproblemoffairnessinsynthetichealthcaredata AT johnserickson theproblemoffairnessinsynthetichealthcaredata AT isabelleguyon theproblemoffairnessinsynthetichealthcaredata AT kristinpbennett theproblemoffairnessinsynthetichealthcaredata AT karanbhanot problemoffairnessinsynthetichealthcaredata AT miaoqi problemoffairnessinsynthetichealthcaredata AT johnserickson problemoffairnessinsynthetichealthcaredata AT isabelleguyon problemoffairnessinsynthetichealthcaredata AT kristinpbennett problemoffairnessinsynthetichealthcaredata |