Privacy-preserving data sharing via probabilistic modeling

Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing...

Bibliographic Details
Main Authors:	Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski
Format:	Article
Language:	English
Published:	Elsevier 2021-07-01
Series:	Patterns
Subjects:	differential privacy; machine learning; probabilistic modeling; open data; synthetic data
Online Access:	http://www.sciencedirect.com/science/article/pii/S2666389921000970

_version_	1819156743403339776
author	Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski
author_facet	Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski
author_sort	Joonas Jälkö
collection	DOAJ
description	Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research. The bigger picture: Open data are a key component of open science. Unrestricted access to datasets would be necessary for the transparency and reproducibility that the scientific method requires. So far, openness has been at odds with privacy requirements, which has prohibited the opening up of sensitive data even after pseudonymization, which does not protect against privacy breaches using side information. A recent solution for the data-sharing problem is to release synthetic data drawn from privacy-preserving generative models. We propose to interpret privacy-preserving data sharing as a modeling task, allowing us to incorporate prior knowledge of the data-generation process into the generator model using modern probabilistic modeling methods. We demonstrate that this can significantly increase the utility of the generated data.
first_indexed	2024-12-22T15:57:43Z
format	Article
id	doaj.art-6621fd994528496fb25e007e44770528
institution	Directory of Open Access Journals
issn	2666-3899
language	English
last_indexed	2024-12-22T15:57:43Z
publishDate	2021-07-01
publisher	Elsevier
record_format	Article
series	Patterns
spelling	doaj.art-6621fd994528496fb25e007e44770528 | 2022-12-21T18:20:45Z | eng | Elsevier | Patterns | 2666-3899 | 2021-07-01 | 2 | 7 | 100271 | Privacy-preserving data sharing via probabilistic modeling | Joonas Jälkö (Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; corresponding author) | Eemil Lagerspetz (Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland) | Jari Haukka (Department of Public Health, University of Helsinki, Helsinki 00014, Finland) | Sasu Tarkoma (Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland) | Antti Honkela (Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland) | Samuel Kaski (Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; Department of Computer Science, University of Manchester, Manchester M13 9PL, UK; corresponding author) | http://www.sciencedirect.com/science/article/pii/S2666389921000970 | differential privacy; machine learning; probabilistic modeling; open data; synthetic data
title	Privacy-preserving data sharing via probabilistic modeling
topic	differential privacy; machine learning; probabilistic modeling; open data; synthetic data
url	http://www.sciencedirect.com/science/article/pii/S2666389921000970
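
Illustrative sketch (not part of the catalogued record): the description above frames private data release as choosing a probabilistic model, fitting it under differential privacy, and sampling synthetic data from it. The toy example below is a minimal illustration of that idea for a single binary attribute, under stated assumptions. It is not the authors' method: it swaps in a Bernoulli model with a Beta prior and the standard Laplace mechanism on the count of ones. The function names, the prior parameters `prior_a` and `prior_b`, and the choice to treat the dataset size as public are all hypothetical choices made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_bernoulli_synthetic(data, epsilon, n_synthetic,
                                prior_a=1.0, prior_b=1.0, seed=None):
    """Fit a Bernoulli model with epsilon-DP and sample synthetic records.

    Assumptions (illustrative only): records are 0/1 values, the dataset
    size n is public, and neighbouring datasets differ in one record.
    Python's `random` module is not a cryptographically secure noise
    source, so this is a teaching sketch, not a deployable mechanism.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    n = len(data)

    # The data are touched exactly once: changing one record changes the
    # count of ones by at most 1, so Laplace noise of scale 1/epsilon
    # makes the released count epsilon-differentially private.
    noisy_ones = sum(data) + laplace_noise(1.0 / epsilon, rng)

    # Everything below is post-processing of the noisy count and incurs
    # no further privacy loss.
    noisy_ones = min(max(noisy_ones, 0.0), float(n))

    # Prior knowledge enters as Beta(prior_a, prior_b) pseudo-counts;
    # theta is the posterior mean computed from the noisy count.
    theta = (prior_a + noisy_ones) / (prior_a + prior_b + n)

    # Sample as many synthetic records as desired from the fitted model.
    synthetic = [1 if rng.random() < theta else 0 for _ in range(n_synthetic)]
    return theta, synthetic


if __name__ == "__main__":
    sensitive = [1] * 700 + [0] * 300  # made-up example data
    theta, synthetic = private_bernoulli_synthetic(sensitive, epsilon=1.0,
                                                   n_synthetic=10)
    print(theta, synthetic)
```

Because the sensitive data are accessed only once, any number of analyses can be run on the synthetic records without adding to the privacy loss, which is the limitation of repeated access that the description refers to.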
