Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder

This paper is concerned with non-parallel whisper-to-normal speaking-style conversion (W2N-SC), which converts whispered speech into normal speech without using parallel training data. Most relevant to this task is voice conversion (VC), which converts one speaker’s voice to another. However, the W2N-SC task differs from the regular VC task in three main respects. First, unlike normal speech, whispered speech contains little or no pitch information. Second, whispered speech usually has significantly less energy than normal speech and is therefore more susceptible to external noise. Third, in the actual usage scenario of W2N-SC, users may suddenly switch voice modes from whispered to normal speech, or vice versa, meaning that the speaking style of input speech cannot be assumed in advance. To clarify whether existing VC techniques can successfully handle these task-specific concerns and how they should be modified to better address them, we consider a variational autoencoder (VAE)-based VC method as a baseline and examine what modifications to this method would be effective for the current task. Specifically, we study the effects of 1) a self-supervised training scheme called filling-in-frames (FIF); 2) data augmentation (DA) using noisy speech samples; and 3) an architecture that allows for any-to-many conversions. Through experimental evaluation of the W2N-SC and speaker conversion tasks, we confirmed that, especially in the W2N-SC task, the version incorporating the above modifications works better than the baseline VC model applied as is.

Bibliographic Details
Main Authors: Shogo Seki, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10109017/
_version_ 1797805955771006976
author Shogo Seki
Hirokazu Kameoka
Takuhiro Kaneko
Kou Tanaka
author_facet Shogo Seki
Hirokazu Kameoka
Takuhiro Kaneko
Kou Tanaka
author_sort Shogo Seki
collection DOAJ
description This paper is concerned with non-parallel whisper-to-normal speaking-style conversion (W2N-SC), which converts whispered speech into normal speech without using parallel training data. Most relevant to this task is voice conversion (VC), which converts one speaker’s voice to another. However, the W2N-SC task differs from the regular VC task in three main respects. First, unlike normal speech, whispered speech contains little or no pitch information. Second, whispered speech usually has significantly less energy than normal speech and is therefore more susceptible to external noise. Third, in the actual usage scenario of W2N-SC, users may suddenly switch voice modes from whispered to normal speech, or vice versa, meaning that the speaking-style of input speech cannot be assumed in advance. To clarify whether existing VC techniques can successfully handle these task-specific concerns and how they should be modified to better address them, we consider a variational autoencoder (VAE)-based VC method as a baseline and examine what modifications to this method would be effective for the current task. Specifically, we study the effects of 1) a self-supervised training scheme called filling-in-frames (FIF); 2) data augmentation (DA) using noisy speech samples; and 3) an architecture that allows for any-to-many conversions. Through experimental evaluation of the W2N-SC and speaker conversion tasks, we confirmed that, especially in the W2N-SC task, the version incorporating the above modifications works better than the baseline VC model applied as is.
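The filling-in-frames (FIF) scheme described in the abstract is a self-supervised objective in which randomly chosen time frames of the input features are masked and the model is trained to reconstruct them from the surrounding context. The paper does not give its implementation here, so the following is only a minimal illustrative sketch; the function names, the constant mask value, and the masked-frames-only MSE loss are all assumptions for illustration, not the authors' code.

```python
import random

def mask_frames(frames, mask_ratio=0.3, mask_value=0.0, rng=None):
    """FIF-style masking: replace a random fraction of time frames in a
    feature sequence (e.g., a mel-spectrogram) with a constant mask value.
    Returns the masked sequence and the set of masked frame indices."""
    rng = rng or random.Random(0)
    n = len(frames)
    n_masked = max(1, int(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), n_masked))
    masked = [
        [mask_value] * len(f) if t in masked_idx else list(f)
        for t, f in enumerate(frames)
    ]
    return masked, masked_idx

def fif_reconstruction_loss(original, reconstructed, masked_idx):
    """Mean squared error computed only on the masked frames, so the model
    is pushed to infer the missing frames from their context."""
    errs = [
        (o - r) ** 2
        for t in masked_idx
        for o, r in zip(original[t], reconstructed[t])
    ]
    return sum(errs) / len(errs)

# Toy 5-frame, 3-dimensional "spectrogram".
spec = [[float(t + d) for d in range(3)] for t in range(5)]
masked_spec, idx = mask_frames(spec, mask_ratio=0.4)
# A perfect reconstruction incurs zero loss on the masked frames.
print(fif_reconstruction_loss(spec, spec, idx))  # 0.0
```

In an actual VAE-based VC pipeline, `masked_spec` would be fed to the encoder and the loss applied to the decoder output; only the masking-and-reconstruct idea is taken from the abstract.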
first_indexed 2024-03-13T06:00:25Z
format Article
id doaj.art-b1e162ef0003421a9cc91e7eff195414
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-13T06:00:25Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-b1e162ef0003421a9cc91e7eff195414 2023-06-12T23:01:32Z
Language: English
Publisher: IEEE
Journal: IEEE Access (ISSN 2169-3536)
Published: 2023-01-01, Vol. 11, pp. 44590–44599
DOI: 10.1109/ACCESS.2023.3270699 (IEEE document 10109017)
Title: Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
Authors:
Shogo Seki (https://orcid.org/0009-0007-3990-3740), NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Japan
Hirokazu Kameoka (https://orcid.org/0000-0003-3102-0162), NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Japan
Takuhiro Kaneko, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Japan
Kou Tanaka, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Japan
Online Access: https://ieeexplore.ieee.org/document/10109017/
Keywords: Voice conversion; whisper-to-normal speaking style conversion; variational autoencoder; self-supervised learning; data augmentation
spellingShingle Shogo Seki
Hirokazu Kameoka
Takuhiro Kaneko
Kou Tanaka
Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
IEEE Access
Voice conversion
whisper-to-normal speaking style conversion
variational autoencoder
self-supervised learning
data augmentation
title Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
title_full Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
title_fullStr Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
title_full_unstemmed Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
title_short Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder
title_sort non parallel whisper to normal speaking style conversion using auxiliary classifier variational autoencoder
topic Voice conversion
whisper-to-normal speaking style conversion
variational autoencoder
self-supervised learning
data augmentation
url https://ieeexplore.ieee.org/document/10109017/
work_keys_str_mv AT shogoseki nonparallelwhispertonormalspeakingstyleconversionusingauxiliaryclassifiervariationalautoencoder
AT hirokazukameoka nonparallelwhispertonormalspeakingstyleconversionusingauxiliaryclassifiervariationalautoencoder
AT takuhirokaneko nonparallelwhispertonormalspeakingstyleconversionusingauxiliaryclassifiervariationalautoencoder
AT koutanaka nonparallelwhispertonormalspeakingstyleconversionusingauxiliaryclassifiervariationalautoencoder