Privacy-preserving data sharing via probabilistic modeling
Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing...
Main Authors: | Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2021-07-01 |
Series: | Patterns |
Subjects: | differential privacy, machine learning, probabilistic modeling, open data, synthetic data |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2666389921000970 |
_version_ | 1819156743403339776 |
---|---|
author | Joonas Jälkö Eemil Lagerspetz Jari Haukka Sasu Tarkoma Antti Honkela Samuel Kaski |
author_facet | Joonas Jälkö Eemil Lagerspetz Jari Haukka Sasu Tarkoma Antti Honkela Samuel Kaski |
author_sort | Joonas Jälkö |
collection | DOAJ |
description | Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research. The bigger picture: Open data are a key component of open science. Unrestricted access to datasets would be necessary for the transparency and reproducibility that the scientific method requires. So far, openness has been at odds with privacy requirements, which has prohibited the opening up of sensitive data even after pseudonymization, which does not protect against privacy breaches using side information. A recent solution for the data-sharing problem is to release synthetic data drawn from privacy-preserving generative models. We propose to interpret privacy-preserving data sharing as a modeling task, allowing us to incorporate prior knowledge of the data-generation process into the generator model using modern probabilistic modeling methods. We demonstrate that this can significantly increase the utility of the generated data. |
first_indexed | 2024-12-22T15:57:43Z |
format | Article |
id | doaj.art-6621fd994528496fb25e007e44770528 |
institution | Directory of Open Access Journals |
issn | 2666-3899 |
language | English |
last_indexed | 2024-12-22T15:57:43Z |
publishDate | 2021-07-01 |
publisher | Elsevier |
record_format | Article |
series | Patterns |
spelling | doaj.art-6621fd994528496fb25e007e44770528 2022-12-21T18:20:45Z eng Elsevier Patterns 2666-3899 2021-07-01 2 7 100271 Privacy-preserving data sharing via probabilistic modeling Joonas Jälkö [0] Eemil Lagerspetz [1] Jari Haukka [2] Sasu Tarkoma [3] Antti Honkela [4] Samuel Kaski [5] [0] Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; Corresponding author [1] Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland [2] Department of Public Health, University of Helsinki, Helsinki 00014, Finland [3] Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland [4] Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki 00014, Finland [5] Helsinki Institute for Information Technology (HIIT), Department of Computer Science, Aalto University, Espoo, 00076, Finland; Department of Computer Science, University of Manchester, Manchester M13 9PL, UK; Corresponding author Summary: Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research. The bigger picture: Open data are a key component of open science. Unrestricted access to datasets would be necessary for the transparency and reproducibility that the scientific method requires. So far, openness has been at odds with privacy requirements, which has prohibited the opening up of sensitive data even after pseudonymization, which does not protect against privacy breaches using side information. A recent solution for the data-sharing problem is to release synthetic data drawn from privacy-preserving generative models. We propose to interpret privacy-preserving data sharing as a modeling task, allowing us to incorporate prior knowledge of the data-generation process into the generator model using modern probabilistic modeling methods. We demonstrate that this can significantly increase the utility of the generated data. http://www.sciencedirect.com/science/article/pii/S2666389921000970 differential privacy machine learning probabilistic modeling open data synthetic data |
spellingShingle | Joonas Jälkö Eemil Lagerspetz Jari Haukka Sasu Tarkoma Antti Honkela Samuel Kaski Privacy-preserving data sharing via probabilistic modeling Patterns differential privacy machine learning probabilistic modeling open data synthetic data |
title | Privacy-preserving data sharing via probabilistic modeling |
title_full | Privacy-preserving data sharing via probabilistic modeling |
title_fullStr | Privacy-preserving data sharing via probabilistic modeling |
title_full_unstemmed | Privacy-preserving data sharing via probabilistic modeling |
title_short | Privacy-preserving data sharing via probabilistic modeling |
title_sort | privacy preserving data sharing via probabilistic modeling |
topic | differential privacy machine learning probabilistic modeling open data synthetic data |
url | http://www.sciencedirect.com/science/article/pii/S2666389921000970 |
work_keys_str_mv | AT joonasjalko privacypreservingdatasharingviaprobabilisticmodeling AT eemillagerspetz privacypreservingdatasharingviaprobabilisticmodeling AT jarihaukka privacypreservingdatasharingviaprobabilisticmodeling AT sasutarkoma privacypreservingdatasharingviaprobabilisticmodeling AT anttihonkela privacypreservingdatasharingviaprobabilisticmodeling AT samuelkaski privacypreservingdatasharingviaprobabilisticmodeling |