MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model

This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level and hard word-level alignment approach, which explicitly extracts word-level semantic information, and...

Bibliographic Details
Main Authors: Yan Deng, Ning Wu, Chengjun Qiu, Yangyang Luo, Yan Chen
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Speech synthesis; diffusion model; mixture attention mechanism; deep learning
Online Access: https://ieeexplore.ieee.org/document/10145456/
_version_ 1797803120191864832
author Yan Deng
Ning Wu
Chengjun Qiu
Yangyang Luo
Yan Chen
author_facet Yan Deng
Ning Wu
Chengjun Qiu
Yangyang Luo
Yan Chen
author_sort Yan Deng
collection DOAJ
description This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level and hard word-level alignment approach, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to better predict the prosodic information of the audio. Specifically, a GAN replaces the Gaussian function in modeling the denoising distribution, which enlarges the denoising step size and reduces the number of denoising steps, thereby accelerating the sampling speed of the diffusion model. Using a GAN in the diffusion model significantly reduces the number of denoising steps and, to some extent, addresses the difficulty of applying diffusion models in real time. The mel-spectrogram is converted into the final audio by a HiFi-GAN vocoder. Experimental results show that MixGAN-TTS outperforms the compared models in terms of audio quality and mel-spectrogram modeling capability with only 4 denoising steps. Ablation studies demonstrate that the structure of MixGAN-TTS is effective.
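To make the sampling procedure concrete, the following is a minimal, self-contained Python (PyTorch) sketch of the kind of GAN-accelerated reverse diffusion described above: a generator predicts the clean mel-spectrogram from a noisy input, and only a few large denoising steps are taken (here T = 4, matching the paper's setting). All names (SpectrogramGenerator, posterior_sample, synthesize_mel) and the toy network are hypothetical placeholders, not the authors' implementation; the actual MixGAN-TTS generator is conditioned on linguistic, pitch, and energy features, and the resulting mel-spectrogram is passed to a HiFi-GAN vocoder.

import torch
import torch.nn as nn

T = 4                                      # few denoising steps, enabled by the GAN denoiser
betas = torch.linspace(1e-4, 0.5, T)       # coarse noise schedule (large per-step noise)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class SpectrogramGenerator(nn.Module):
    """Toy stand-in for the GAN generator: predicts the clean mel-spectrogram x0
    from the noisy input x_t, the step index t, and conditioning features."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels + 1, 256), nn.ReLU(),
                                 nn.Linear(256, n_mels))
    def forward(self, x_t, t, cond):
        t_feat = torch.full_like(x_t[..., :1], float(t) / T)   # crude step embedding
        return self.net(torch.cat([x_t + cond, t_feat], dim=-1))

def posterior_sample(x0_pred, x_t, t):
    """Sample x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x0_pred)."""
    if t == 0:
        return x0_pred
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = torch.sqrt(ab_prev) * betas[t] / (1 - ab_t)
    coef_xt = torch.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]
    mean = coef_x0 * x0_pred + coef_xt * x_t
    return mean + torch.sqrt(var) * torch.randn_like(x_t)

@torch.no_grad()
def synthesize_mel(generator, cond, n_frames=100, n_mels=80):
    """Short reverse process: start from noise and denoise in only T steps."""
    x_t = torch.randn(n_frames, n_mels)
    for t in reversed(range(T)):
        x0_pred = generator(x_t, t, cond)        # GAN output replaces the Gaussian denoiser
        x_t = posterior_sample(x0_pred, x_t, t)  # take one large step back toward x0
    return x_t                                   # mel-spectrogram, to be vocoded (e.g., by HiFi-GAN)

generator = SpectrogramGenerator()
cond = torch.zeros(100, 80)                      # placeholder for text/prosody conditioning
mel = synthesize_mel(generator, cond)
print(mel.shape)                                 # torch.Size([100, 80])

The design point mirrored here is that, because the generator models a potentially multimodal denoising distribution rather than a small Gaussian step, the schedule can use very few, very large steps, which is what makes near-real-time sampling feasible.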
first_indexed 2024-03-13T05:16:03Z
format Article
id doaj.art-51aab411976e412b8a338d0ade84d204
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-13T05:16:03Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-51aab411976e412b8a338d0ade84d204 2023-06-15T23:00:54Z eng IEEE IEEE Access 2169-3536 2023-01-01 Vol. 11, pp. 57674-57682 doi:10.1109/ACCESS.2023.3283772 document 10145456
MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
Yan Deng (https://orcid.org/0000-0002-0778-6144), School of Computer, Electronics and Information, Guangxi University, Nanning, China
Ning Wu (https://orcid.org/0000-0002-4951-6337), Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou, China
Chengjun Qiu (https://orcid.org/0009-0001-2264-8866), College of Mechanical Naval Architecture and Ocean Engineering, Beibu Gulf University, Qinzhou, China
Yangyang Luo (https://orcid.org/0009-0005-3533-3619), School of Computer, Electronics and Information, Guangxi University, Nanning, China
Yan Chen (https://orcid.org/0000-0002-9950-684X), School of Computer, Electronics and Information, Guangxi University, Nanning, China
This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level and hard word-level alignment approach, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to better predict the prosodic information of the audio. Specifically, a GAN replaces the Gaussian function in modeling the denoising distribution, which enlarges the denoising step size and reduces the number of denoising steps, thereby accelerating the sampling speed of the diffusion model. Using a GAN in the diffusion model significantly reduces the number of denoising steps and, to some extent, addresses the difficulty of applying diffusion models in real time. The mel-spectrogram is converted into the final audio by a HiFi-GAN vocoder. Experimental results show that MixGAN-TTS outperforms the compared models in terms of audio quality and mel-spectrogram modeling capability with only 4 denoising steps. Ablation studies demonstrate that the structure of MixGAN-TTS is effective.
https://ieeexplore.ieee.org/document/10145456/
Speech synthesis; diffusion model; mixture attention mechanism; deep learning
spellingShingle Yan Deng
Ning Wu
Chengjun Qiu
Yangyang Luo
Yan Chen
MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
IEEE Access
Speech synthesis
diffusion model
mixture attention mechanism
deep learning
title MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
title_full MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
title_fullStr MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
title_full_unstemmed MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
title_short MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
title_sort mixgan tts efficient and stable speech synthesis based on diffusion model
topic Speech synthesis
diffusion model
mixture attention mechanism
deep learning
url https://ieeexplore.ieee.org/document/10145456/
work_keys_str_mv AT yandeng mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel
AT ningwu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel
AT chengjunqiu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel
AT yangyangluo mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel
AT yanchen mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel