Privacy and Utility of Private Synthetic Data for Medical Data Analyses

The increasing availability and use of sensitive personal data raises a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as they are considered sensitive (according to most global regulations). Privacy Enhanc...


Bibliographic Details
Main Authors: Arno Appenzeller, Moritz Leitner, Patrick Philipp, Erik Krempel, Jürgen Beyerer
Format: Article
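Language: English
Published: MDPI AG 2022-12-01
Series: Applied Sciences
Subjects: synthetic data generation; differential privacy; secondary use; medical data; private data processing; open source framework
Online Access: https://www.mdpi.com/2076-3417/12/23/12320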
collection DOAJ
description The increasing availability and use of sensitive personal data raises a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as they are considered sensitive (according to most global regulations). Privacy Enhancing Technologies (PETs) attempt to protect the privacy of individuals whilst preserving the utility of data. One of the most popular recent technologies is Differential Privacy (DP), which was used for the 2020 U.S. Census. Another trend is to combine synthetic data generators with DP to create so-called private synthetic data generators. The objective is to preserve statistical properties as accurately as possible, while the generated data should differ as much as possible from the original data with regard to private features. While these technologies seem promising, there is a gap between academic research on DP and synthetic data and the practical application and evaluation of these techniques for real-world use cases. In this paper, we evaluate three different private synthetic data generators (MWEM, DP-CTGAN, and PATE-CTGAN) on their use-case-specific privacy and utility. For the use case, continuous heart rate measurements from different individuals are analyzed. This work shows that private synthetic data generators have tremendous advantages over traditional techniques, but also require in-depth analysis depending on the use case. Furthermore, each technology has different strengths, so there is no clear winner. However, DP-CTGAN often performs slightly better than the other technologies, so it can be recommended for a continuous medical data use case.
id doaj.art-9f8f1c29585842229bfe07e0892111a4
institution Directory Open Access Journal
issn 2076-3417
spelling Applied Sciences, vol. 12, no. 23, article 12320, MDPI AG, 2022-12-01. ISSN 2076-3417. DOI 10.3390/app122312320
Arno Appenzeller (Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany)
Moritz Leitner (Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany)
Patrick Philipp (Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, 76131 Karlsruhe, Germany)
Erik Krempel (Department of Computer Science and Mathematics, Hochschule München University of Applied Sciences, 80335 München, Germany)
Jürgen Beyerer (Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany)
topic synthetic data generation
differential privacy
secondary use
medical data
private data processing
open source framework