Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly find a mapp...


Bibliographic Details
Main Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
Format: Article
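Language: English
Published: MDPI AG 2021-08-01
Series: Applied Sciences
Subjects: sinusoidal model; non-parallel voice conversion; generative adversarial networks; continuous parameters
Online Access: https://www.mdpi.com/2076-3417/11/16/7489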
author Mohammed Salah Al-Radhi
Tamás Gábor Csapó
Géza Németh
collection DOAJ
description Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences. Earlier approaches mainly find a mapping between the given source–target speakers, which contain pairs of similar utterances spoken by different speakers. However, parallel data are computationally expensive and difficult to collect. Non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that allows non-parallel many-to-many voice conversion by using a generative adversarial network. To the best of the authors’ knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate converted speech signals. Our method requires only several minutes of training examples, without parallel utterances or time alignment procedures, and the source–target speakers are entirely unseen in the training dataset. Moreover, an empirical study is carried out on the publicly available CSTR VCTK corpus. Our conclusions indicate that the proposed method reached state-of-the-art results in speaker similarity to the utterance produced by the target speaker, while suggesting important structural ones to be further analyzed by experts.
first_indexed 2024-03-10T09:01:00Z
format Article
id doaj.art-0931fd748ab94e599540bb8bf560e1a8
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T09:01:00Z
publishDate 2021-08-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-0931fd748ab94e599540bb8bf560e1a8; 2023-11-22T06:42:25Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-08-01; vol. 11, no. 16, art. 7489; doi 10.3390/app11167489
Mohammed Salah Al-Radhi (0), Tamás Gábor Csapó (1), Géza Németh (2)
Affiliation (0, 1, 2): Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary
title Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning
topic sinusoidal model
non-parallel voice conversion
generative adversarial networks
continuous parameters
url https://www.mdpi.com/2076-3417/11/16/7489