Speaker Anonymization using End-to-End Zero-Shot Voice Conversion

Bibliographic Details
Main Author: Kang, Wonjune
Other Authors: Roy, Deb
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/144662
_version_ 1811082096352952320
author Kang, Wonjune
author2 Roy, Deb
author_facet Roy, Deb; Kang, Wonjune
author_sort Kang, Wonjune
collection MIT
description Spoken language is a rich medium of communication that combines words with information about emotion, feeling, and excitation conveyed through modulations in tone and pitch. In discourse, this preserves a human element that is lacking in many other channels, such as writing or social media. However, a person's voice is a distinct biomarker, and there are many settings in which it may need to be anonymized to protect the speaker's identity. This thesis presents a framework for speaker anonymization using voice conversion (VC) methods. We first introduce a model for end-to-end zero-shot voice conversion, obtained by modifying the architecture of a neural vocoder. To the best of our knowledge, this is among the first end-to-end approaches proposed for zero-shot VC. Our model maintains the clarity and intelligibility of transformed speech while also achieving good voice style transfer performance, an improvement over current state-of-the-art VC models, which exhibit a trade-off between audio quality and accurate voice style transfer. Next, we present a method for extending targeted voice conversion to untargeted voice anonymization. This is done by fitting a Gaussian mixture model (GMM) to the latent space of speaker embeddings that are fed into the VC model, and then sampling from the GMM to select the target voice for anonymization. This removes the need to explicitly specify a target speaker when performing VC-based anonymization. We evaluate both our voice conversion and anonymization methods on publicly available data as well as real-world audio from conversations on the Local Voices Network (LVN) platform, demonstrating their applicability to "in-the-wild" settings. Finally, we discuss this work's potential applications and the ethical considerations of using voice conversion technologies in society.
first_indexed 2024-09-23T11:57:30Z
format Thesis
id mit-1721.1/144662
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T11:57:30Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/144662 2022-08-30T03:37:34Z Speaker Anonymization using End-to-End Zero-Shot Voice Conversion Kang, Wonjune Roy, Deb Program in Media Arts and Sciences (Massachusetts Institute of Technology) S.M. 2022-08-29T16:03:01Z 2022-08-29T16:03:01Z 2022-05 2022-06-07T17:53:35.418Z Thesis https://hdl.handle.net/1721.1/144662 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Kang, Wonjune; Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title_full Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title_fullStr Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title_full_unstemmed Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title_short Speaker Anonymization using End-to-End Zero-Shot Voice Conversion
title_sort speaker anonymization using end to end zero shot voice conversion
url https://hdl.handle.net/1721.1/144662
work_keys_str_mv AT kangwonjune speakeranonymizationusingendtoendzeroshotvoiceconversion
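
Illustrative note: the description in this record says that untargeted anonymization is done by fitting a Gaussian mixture model (GMM) to speaker embeddings and sampling from it to pick a target voice. The following is a minimal sketch of that idea only, not the thesis implementation. The diagonal covariances, the number of components, the 4-dimensional synthetic embeddings, and the function names fit_diag_gmm and sample_target_embedding are all assumptions made for illustration; the record does not specify any of them.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (a - m) ** 2 / v
        for a, m, v in zip(x, mean, var)
    )


def fit_diag_gmm(data, k, iters=50, seed=0, var_floor=1e-3):
    """Fit a k-component diagonal GMM with plain EM (hypothetical helper)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    means = [list(x) for x in rng.sample(data, k)]  # init from data points
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            logp = [
                math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                for j in range(k)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weights, means, and floored variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            for t in range(d):
                mu = sum(r[j] * x[t] for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x[t] - mu) ** 2 for r, x in zip(resp, data)) / nj
                means[j][t] = mu
                variances[j][t] = max(var, var_floor)
    return weights, means, variances


def sample_target_embedding(gmm, rng):
    """Draw one pseudo-speaker embedding: pick a component, then sample it."""
    weights, means, variances = gmm
    j = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[j], variances[j])]


# Synthetic stand-in for real speaker embeddings (assumption: 4 dimensions,
# two loose clusters). A real system would use embeddings from a speaker encoder.
rng = random.Random(0)
embeddings = [
    [rng.gauss(center, 0.1) for _ in range(4)]
    for center in (-1.0, 1.0)
    for _ in range(50)
]
gmm = fit_diag_gmm(embeddings, k=2)
target = sample_target_embedding(gmm, rng)
print(len(target))
```

In a setup like the one described, the sampled vector would stand in for the target speaker embedding passed to the voice conversion model, so no real target speaker has to be chosen. That conversion step is not shown here.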