Generating Differentially Private Synthetic Text


Bibliographic Details
Main Author: Park, YeonHwan
Other Authors: Kagal, Lalana
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/144503
_version_ 1826212436204584960
author Park, YeonHwan
author2 Kagal, Lalana
author_facet Kagal, Lalana
Park, YeonHwan
author_sort Park, YeonHwan
collection MIT
description The advent of more powerful cloud compute over the past decade has made it possible to train the deep neural networks used today in applications across almost everything we do. However, the amount of existing data in private datasets, such as hospital records, remains scarce and will probably remain so for the foreseeable future. Without high-quality data, neural networks cannot perform high-quality inference. To aid in training models when existing information is limited, we aim to train existing deep neural network architectures to generate synthetic text that is similar to the text they were trained on, without memorizing one-to-one mappings or leaking any sensitive data. To achieve this goal, we fine-tune our models to adhere to a strong notion of differential privacy, a mathematical model bounding the extent to which an adversary can reconstruct the original dataset. Because we intend to use the differentially private models to generate mixed-type tabular datasets with unstructured text, we also perform a survey to gain a better understanding of how our algorithm might be used to supplement existing neural networks.
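The abstract's approach, fine-tuning a model under differential privacy, is typically realized with a DP-SGD-style update: clip each example's gradient to a fixed L2 norm, average, and add Gaussian noise. The sketch below illustrates that core step only; the function name, parameters, and plain-list gradients are illustrative assumptions, not the thesis's actual implementation.

```python
import math
import random

def clipped_noisy_mean(per_example_grads, clip_norm, noise_multiplier, seed=0):
    """Illustrative DP-SGD core step (not the thesis's code).

    Each example's gradient is rescaled so its L2 norm is at most
    `clip_norm`, the clipped gradients are summed, Gaussian noise with
    standard deviation `noise_multiplier * clip_norm` is added per
    coordinate, and the result is divided by the batch size.
    """
    rng = random.Random(seed)
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm
    return [(summed[i] + rng.gauss(0.0, sigma)) / n for i in range(dim)]
```

Clipping bounds each individual's influence on the update, which is what lets the added Gaussian noise translate into a formal privacy guarantee; the privacy budget itself would be tracked separately by an accountant over all training steps.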
first_indexed 2024-09-23T15:21:18Z
format Thesis
id mit-1721.1/144503
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T15:21:18Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/1445032022-08-30T03:10:07Z Generating Differentially Private Synthetic Text Park, YeonHwan Kagal, Lalana Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science The advent of more powerful cloud compute over the past decade has made it possible to train the deep neural networks used today in applications across almost everything we do. However, the amount of existing data in private datasets, such as hospital records, remains scarce and will probably remain so for the foreseeable future. Without high-quality data, neural networks cannot perform high-quality inference. To aid in training models when existing information is limited, we aim to train existing deep neural network architectures to generate synthetic text that is similar to the text they were trained on, without memorizing one-to-one mappings or leaking any sensitive data. To achieve this goal, we fine-tune our models to adhere to a strong notion of differential privacy, a mathematical model bounding the extent to which an adversary can reconstruct the original dataset. Because we intend to use the differentially private models to generate mixed-type tabular datasets with unstructured text, we also perform a survey to gain a better understanding of how our algorithm might be used to supplement existing neural networks. M.Eng. 2022-08-29T15:51:56Z 2022-08-29T15:51:56Z 2022-05 2022-05-27T16:18:21.479Z Thesis https://hdl.handle.net/1721.1/144503 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Park, YeonHwan
Generating Differentially Private Synthetic Text
title Generating Differentially Private Synthetic Text
title_full Generating Differentially Private Synthetic Text
title_fullStr Generating Differentially Private Synthetic Text
title_full_unstemmed Generating Differentially Private Synthetic Text
title_short Generating Differentially Private Synthetic Text
title_sort generating differentially private synthetic text
url https://hdl.handle.net/1721.1/144503
work_keys_str_mv AT parkyeonhwan generatingdifferentiallyprivatesynthetictext