A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms

Differential privacy has emerged as a popular model to provably limit privacy risks associated with a given data release. However, releasing high-dimensional synthetic data under differential privacy remains a challenging problem. In this paper, we study the problem of releasing synthetic data in the...


Bibliographic Details
Main Authors: Bai Li, Vishesh Karwa, Aleksandra Slavković, Rebecca Carter Steorts
Format: Article
Language: English
Published: Labor Dynamics Institute 2018-12-01
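Series: The Journal of Privacy and Confidentiality
Subjects: differential privacy; high dimensional sparse histograms; stability based algorithm; perturbed Gibbs sampler; Stability Based Hashed Gibbs Sampler
Online Access: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/657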
author Bai Li
Vishesh Karwa
Aleksandra Slavković
Rebecca Carter Steorts
collection DOAJ
description Differential privacy has emerged as a popular model to provably limit privacy risks associated with a given data release. However, releasing high-dimensional synthetic data under differential privacy remains a challenging problem. In this paper, we study the problem of releasing synthetic data in the form of a high-dimensional histogram under the constraint of differential privacy. We develop an $(\epsilon, \delta)$-differentially private categorical data synthesizer called Stability Based Hashed Gibbs Sampler (SBHG). SBHG works by combining a stability-based sparse histogram estimation algorithm with Gibbs sampling and feature selection to approximate the empirical joint distribution of a discrete dataset. SBHG offers a competitive alternative to state-of-the-art synthetic data generators while preserving the sparsity structure of the original dataset, which leads to improved statistical utility as illustrated on simulated data. Finally, to study the utility of the synthetic datasets generated by SBHG, we also perform logistic regression using the synthetic datasets and compare the classification accuracy with that obtained from the original dataset.
first_indexed 2024-12-11T12:37:04Z
format Article
id doaj.art-8b1c31f5e1994d14a0119e8634fd7acb
institution Directory Open Access Journal
issn 2575-8527
language English
last_indexed 2024-12-11T12:37:04Z
publishDate 2018-12-01
publisher Labor Dynamics Institute
record_format Article
series The Journal of Privacy and Confidentiality
spelling doaj.art-8b1c31f5e1994d14a0119e8634fd7acb
2022-12-22T01:07:06Z
eng
Labor Dynamics Institute
The Journal of Privacy and Confidentiality
ISSN 2575-8527
2018-12-01
Volume 8, Issue 1
DOI 10.29012/jpc.657
A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms
Bai Li (Duke University)
Vishesh Karwa (Temple University)
Aleksandra Slavković (Pennsylvania State University)
Rebecca Carter Steorts (Duke University)
title A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms
topic differential privacy
high dimensional sparse histograms
stability based algorithm
perturbed Gibbs sampler
Stability Based Hashed Gibbs Sampler
url https://journalprivacyconfidentiality.org/index.php/jpc/article/view/657