Privacy and Utility of Private Synthetic Data for Medical Data Analyses
The increasing availability and use of sensitive personal data raises a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as they are considered sensitive (according to most global regulations). Privacy Enhanc...
Main Authors: | Arno Appenzeller, Moritz Leitner, Patrick Philipp, Erik Krempel, Jürgen Beyerer |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-12-01 |
Series: | Applied Sciences |
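Subjects: | synthetic data generation; differential privacy; secondary use; medical data; private data processing; open source framework |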
Online Access: | https://www.mdpi.com/2076-3417/12/23/12320 |
_version_ | 1797463630911897600 |
---|---|
author | Arno Appenzeller; Moritz Leitner; Patrick Philipp; Erik Krempel; Jürgen Beyerer |
author_facet | Arno Appenzeller; Moritz Leitner; Patrick Philipp; Erik Krempel; Jürgen Beyerer |
author_sort | Arno Appenzeller |
collection | DOAJ |
description | The increasing availability and use of sensitive personal data raises a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as they are considered sensitive (according to most global regulations). Privacy Enhancing Technologies (PETs) attempt to protect the privacy of individuals whilst preserving the utility of data. One of the most popular technologies recently is Differential Privacy (DP), which was used for the 2020 U.S. Census. Another trend is to combine synthetic data generators with DP to create so-called private synthetic data generators. The objective is to preserve statistical properties as accurately as possible, while the generated data should be as different as possible compared to the original data regarding private features. While these technologies seem promising, there is a gap between academic research on DP and synthetic data and the practical application and evaluation of these techniques for real-world use cases. In this paper, we evaluate three different private synthetic data generators (MWEM, DP-CTGAN, and PATE-CTGAN) on their use-case-specific privacy and utility. For the use case, continuous heart rate measurements from different individuals are analyzed. This work shows that private synthetic data generators have tremendous advantages over traditional techniques, but also require in-depth analysis depending on the use case. Furthermore, it can be seen that each technology has different strengths, so there is no clear winner. However, DP-CTGAN often performs slightly better than the other technologies, so it can be recommended for a continuous medical data use case. |
first_indexed | 2024-03-09T17:53:30Z |
format | Article |
id | doaj.art-9f8f1c29585842229bfe07e0892111a4 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-09T17:53:30Z |
publishDate | 2022-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-9f8f1c29585842229bfe07e0892111a4; 2023-11-24T10:34:35Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2022-12-01; 12; 23; 12320; 10.3390/app122312320; Privacy and Utility of Private Synthetic Data for Medical Data Analyses; Arno Appenzeller (0), Moritz Leitner (1), Patrick Philipp (2), Erik Krempel (3), Jürgen Beyerer (4); Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany; Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany; Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, 76131 Karlsruhe, Germany; Department of Computer Science and Mathematics, Hochschule München University of Applied Sciences, 80335 München, Germany; Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany; The increasing availability and use of sensitive personal data raises a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as they are considered sensitive (according to most global regulations). Privacy Enhancing Technologies (PETs) attempt to protect the privacy of individuals whilst preserving the utility of data. One of the most popular technologies recently is Differential Privacy (DP), which was used for the 2020 U.S. Census. Another trend is to combine synthetic data generators with DP to create so-called private synthetic data generators. The objective is to preserve statistical properties as accurately as possible, while the generated data should be as different as possible compared to the original data regarding private features. While these technologies seem promising, there is a gap between academic research on DP and synthetic data and the practical application and evaluation of these techniques for real-world use cases. In this paper, we evaluate three different private synthetic data generators (MWEM, DP-CTGAN, and PATE-CTGAN) on their use-case-specific privacy and utility. For the use case, continuous heart rate measurements from different individuals are analyzed. This work shows that private synthetic data generators have tremendous advantages over traditional techniques, but also require in-depth analysis depending on the use case. Furthermore, it can be seen that each technology has different strengths, so there is no clear winner. However, DP-CTGAN often performs slightly better than the other technologies, so it can be recommended for a continuous medical data use case.; https://www.mdpi.com/2076-3417/12/23/12320; synthetic data generation; differential privacy; secondary use; medical data; private data processing; open source framework |
spellingShingle | Arno Appenzeller Moritz Leitner Patrick Philipp Erik Krempel Jürgen Beyerer Privacy and Utility of Private Synthetic Data for Medical Data Analyses Applied Sciences synthetic data generation differential privacy secondary use medical data private data processing open source framework |
title | Privacy and Utility of Private Synthetic Data for Medical Data Analyses |
title_full | Privacy and Utility of Private Synthetic Data for Medical Data Analyses |
title_fullStr | Privacy and Utility of Private Synthetic Data for Medical Data Analyses |
title_full_unstemmed | Privacy and Utility of Private Synthetic Data for Medical Data Analyses |
title_short | Privacy and Utility of Private Synthetic Data for Medical Data Analyses |
title_sort | privacy and utility of private synthetic data for medical data analyses |
topic | synthetic data generation; differential privacy; secondary use; medical data; private data processing; open source framework |
url | https://www.mdpi.com/2076-3417/12/23/12320 |
work_keys_str_mv | AT arnoappenzeller privacyandutilityofprivatesyntheticdataformedicaldataanalyses AT moritzleitner privacyandutilityofprivatesyntheticdataformedicaldataanalyses AT patrickphilipp privacyandutilityofprivatesyntheticdataformedicaldataanalyses AT erikkrempel privacyandutilityofprivatesyntheticdataformedicaldataanalyses AT jurgenbeyerer privacyandutilityofprivatesyntheticdataformedicaldataanalyses |