Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
Main Author: | Kang, Wonjune |
---|---|
Other Authors: | Roy, Deb |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/144662 |
_version_ | 1811082096352952320 |
---|---|
author | Kang, Wonjune |
author2 | Roy, Deb |
author_facet | Roy, Deb Kang, Wonjune |
author_sort | Kang, Wonjune |
collection | MIT |
description | Spoken language is a rich medium of communication that combines words with various information about emotions, feelings, and excitation through modulations in tone and pitch. In discourse, this allows for maintaining a human element that is lacking in many other channels, such as writing or social media. However, a person's voice is a distinct biomarker, and there exist many settings in which it may need to be anonymized in order to protect the speaker's identity.
This thesis presents a framework for performing speaker anonymization using voice conversion (VC) methods. We first introduce a model for performing end-to-end zero-shot voice conversion by modifying the architecture of a neural vocoder. To the best of our knowledge, this is one of the first end-to-end approaches for zero-shot VC that has ever been proposed. Our model is able to maintain the clarity and intelligibility of transformed speech very well while also achieving good voice style transfer performance, an improvement over current state-of-the-art VC models, which exhibit a trade-off between audio quality and accurate voice style transfer.
Next, we present a method for extending targeted voice conversion to un-targeted voice anonymization. This is done by fitting a Gaussian mixture model (GMM) to the latent space of speaker embeddings that are fed into the VC model, and then sampling from the GMM to select the target voice for anonymization. This obviates the need for explicitly specifying a target speaker when performing VC-based anonymization.
We evaluate both our voice conversion and anonymization methods on publicly available data as well as real-world audio from conversations on the Local Voices Network (LVN) platform, demonstrating their applicability to "in-the-wild" settings. Finally, we provide a discussion of this work's potential applications and the ethical considerations of using voice conversion technologies in society. |
first_indexed | 2024-09-23T11:57:30Z |
format | Thesis |
id | mit-1721.1/144662 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T11:57:30Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/144662 2022-08-30T03:37:34Z Speaker Anonymization using End-to-End Zero-Shot Voice Conversion Kang, Wonjune Roy, Deb Program in Media Arts and Sciences (Massachusetts Institute of Technology) S.M. 2022-08-29T16:03:01Z 2022-08-29T16:03:01Z 2022-05 2022-06-07T17:53:35.418Z Thesis https://hdl.handle.net/1721.1/144662 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Kang, Wonjune Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title | Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title_full | Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title_fullStr | Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title_full_unstemmed | Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title_short | Speaker Anonymization using End-to-End Zero-Shot Voice Conversion |
title_sort | speaker anonymization using end to end zero shot voice conversion |
url | https://hdl.handle.net/1721.1/144662 |
work_keys_str_mv | AT kangwonjune speakeranonymizationusingendtoendzeroshotvoiceconversion |
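
The description above mentions fitting a Gaussian mixture model (GMM) to speaker embeddings and sampling from it to choose an anonymization target. The following is only a minimal sketch of that general idea, not the thesis's implementation: the record does not state the embedding model, the number of mixture components, or the covariance type, so the diagonal covariances, the toy 4-dimensional "embeddings", and the names `fit_diag_gmm` and `sample_pseudo_speaker` are all assumptions made for illustration.

```python
import math
import random


def fit_diag_gmm(data, k, iters=50, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM with EM (illustrative, not the thesis code).

    Returns (weights, means, variances), each indexed by component.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    means = [list(x) for x in rng.sample(data, k)]  # init means from data points
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point,
        # computed in log space for numerical stability.
        resp = []
        for x in data:
            logp = []
            for j in range(k):
                lp = math.log(weights[j])
                for d in range(dim):
                    v = variances[j][d]
                    lp -= 0.5 * (math.log(2 * math.pi * v) + (x[d] - means[j][d]) ** 2 / v)
                logp.append(lp)
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and per-dimension variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-8)
            weights[j] = nj / len(data)
            for d in range(dim):
                mu = sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x[d] - mu) ** 2 for r, x in zip(resp, data)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, var_floor)
    return weights, means, variances


def sample_pseudo_speaker(gmm, rng):
    """Draw one synthetic embedding: pick a component, then sample from it."""
    weights, means, variances = gmm
    j = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu, math.sqrt(v)) for mu, v in zip(means[j], variances[j])]


# Toy stand-in for real speaker embeddings: two clusters in 4 dimensions.
rng = random.Random(0)
toy_embeddings = (
    [[rng.gauss(-2.0, 0.5) for _ in range(4)] for _ in range(100)]
    + [[rng.gauss(2.0, 0.5) for _ in range(4)] for _ in range(100)]
)
gmm = fit_diag_gmm(toy_embeddings, k=2)
target_embedding = sample_pseudo_speaker(gmm, rng)
```

In a pipeline of the kind the description outlines, the sampled vector would presumably be passed to the voice conversion model in place of a real target speaker's embedding, which is why no explicit target speaker needs to be named.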