Privacy-preserving data sharing via probabilistic modeling

Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing...


Bibliographic Details
Main Authors: Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski
Format: Article
Language: English
Published: Elsevier 2021-07-01
Series: Patterns
Subjects: differential privacy, machine learning, probabilistic modeling, open data, synthetic data
Online Access: http://www.sciencedirect.com/science/article/pii/S2666389921000970
author Joonas Jälkö
Eemil Lagerspetz
Jari Haukka
Sasu Tarkoma
Antti Honkela
Samuel Kaski
collection DOAJ
description Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research.
The bigger picture: Open data are a key component of open science. Unrestricted access to datasets would be necessary for the transparency and reproducibility that the scientific method requires. So far, openness has been at odds with privacy requirements, which has prohibited the opening up of sensitive data even after pseudonymization, which does not protect against privacy breaches using side information. A recent solution for the data-sharing problem is to release synthetic data drawn from privacy-preserving generative models. We propose to interpret privacy-preserving data sharing as a modeling task, allowing us to incorporate prior knowledge of the data-generation process into the generator model using modern probabilistic modeling methods. We demonstrate that this can significantly increase the utility of the generated data.
first_indexed 2024-12-22T15:57:43Z
format Article
id doaj.art-6621fd994528496fb25e007e44770528
institution Directory Open Access Journal
issn 2666-3899
language English
last_indexed 2024-12-22T15:57:43Z
publishDate 2021-07-01
publisher Elsevier
record_format Article
series Patterns
spelling doaj.art-6621fd994528496fb25e007e44770528
2022-12-21T18:20:45Z
eng
Elsevier
Patterns
2666-3899
2021-07-01
Volume 2, issue 7, article 100271
Privacy-preserving data sharing via probabilistic modeling
Joonas Jälkö: Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; Corresponding author
Eemil Lagerspetz: Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
Jari Haukka: Department of Public Health, University of Helsinki, Helsinki 00014, Finland
Sasu Tarkoma: Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
Antti Honkela: Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
Samuel Kaski: Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; Department of Computer Science, University of Manchester, Manchester M13 9PL, UK; Corresponding author
title Privacy-preserving data sharing via probabilistic modeling
topic differential privacy
machine learning
probabilistic modeling
open data
synthetic data
url http://www.sciencedirect.com/science/article/pii/S2666389921000970