Summary: As we depend more heavily on data to power the insights produced by machine learning systems, it becomes imperative to design guarantees that protect the privacy of that data. Recent research has shown how easily attacks such as membership inference and model inversion can extract potentially sensitive training data given access to the model alone. To prevent curious or malevolent users from gleaning training data through these attacks, we propose generating private synthetic datasets to replace the original datasets when training and testing the model. These synthetic datasets retain the semantic and statistical distribution of the original dataset but are differentially private, preventing individuals in the dataset from being identified and thereby guaranteeing that no sensitive information from the original dataset can be extracted from the generated synthetic dataset. In contrast to related work that handled structured data and unstructured data separately, we developed a pipeline for generating synthetic datasets from a complex dataset consisting of structured and unstructured text as well as numerical data. We evaluated the generation pipeline using a number of metrics covering its statistical similarity to the original dataset, its utility, and its privacy. Our experiments focused on varying the degree of privacy across the sub-modules of the pipeline. We found that we can generate differentially private synthetic datasets whose structured and unstructured components each achieve good performance in similarity, utility, and privacy.
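For reference, the guarantee the summary invokes is standard (ε, δ)-differential privacy. The formulation below is not spelled out in the summary itself; the symbols (mechanism M, neighboring datasets D and D′, budget ε, slack δ) are the conventional ones and are given only as a sketch of the guarantee being claimed.

```latex
% Standard (epsilon, delta)-differential privacy (Dwork et al.).
% The notation here is illustrative, not taken from this paper:
% a randomized mechanism M is (epsilon, delta)-differentially private
% if, for all neighboring datasets D and D' (differing in one record)
% and every measurable set of outputs S,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .
\]
```

Smaller ε (and δ) means the mechanism's output distribution changes less when any single individual's record is added or removed, which is what limits how much an attacker can infer about that individual from the synthetic data.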