MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level alignment and hard word-level alignment approach, which explicitly extracts word-level semantic information, and...
Main Authors: | Yan Deng, Ning Wu, Chengjun Qiu, Yangyang Luo, Yan Chen |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
Online Access: | https://ieeexplore.ieee.org/document/10145456/ |
_version_ | 1797803120191864832 |
---|---|
author | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen |
author_facet | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen |
author_sort | Yan Deng |
collection | DOAJ |
description | This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level alignment and hard word-level alignment approach, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to better predict the prosodic information of the audio. Specifically, a GAN replaces the Gaussian function in modeling the denoising distribution, with the aim of enlarging the denoising step size and reducing the number of denoising steps, thereby accelerating the sampling speed of the diffusion model. Using a GAN in the diffusion model significantly reduces the number of denoising steps and, to some extent, addresses the problem that diffusion models cannot be applied in real time. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. Experimental results show that MixGAN-TTS outperforms the compared models in terms of audio quality and mel-spectrogram modeling capability with 4 denoising steps. Ablation studies demonstrate that the structure of MixGAN-TTS is effective. |
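The description above centers on replacing the Gaussian reverse transition of a diffusion model with a GAN generator so that only a handful of large denoising steps (e.g. 4) are needed before the mel-spectrogram is handed to a HiFi-GAN vocoder. The sketch below illustrates that idea only in broad strokes: it is not the paper's implementation, and every module, shape, and schedule in it (the `DenoisingGenerator` class, the 80-bin mel dimension, the linear beta schedule, the conditioning tensor) is an assumed placeholder.

```python
# Minimal sketch, assuming a generator G(x_t, t, cond) trained adversarially to
# predict the clean mel-spectrogram x_0 at each of a small number of diffusion
# steps, instead of sampling from a Gaussian reverse transition.
import torch
import torch.nn as nn


class DenoisingGenerator(nn.Module):
    """Toy stand-in for a GAN generator that maps (x_t, t, cond) -> predicted x_0."""

    def __init__(self, mel_dim: int = 80, cond_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the normalized step index as one extra feature per frame.
        t_feat = t.view(-1, 1, 1).expand(x_t.size(0), x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))


@torch.no_grad()
def sample_mel(generator, cond, num_steps: int = 4, mel_dim: int = 80):
    """Few-step reverse process: start from noise, predict x_0 with the generator,
    then re-noise to the previous step via the standard DDPM posterior."""
    B, T, _ = cond.shape
    betas = torch.linspace(1e-4, 0.5, num_steps)   # coarse schedule for few, large steps
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(B, T, mel_dim)               # x_T ~ N(0, I)
    for step in reversed(range(num_steps)):
        t_norm = torch.full((B,), step / max(num_steps - 1, 1))
        x0_hat = generator(x_t, t_norm, cond)      # generator predicts x_0 directly
        if step == 0:
            x_t = x0_hat                           # final mel-spectrogram estimate
        else:
            # Posterior q(x_{t-1} | x_t, x0_hat): mix the prediction back toward step t-1.
            ab_t, ab_prev = alpha_bars[step], alpha_bars[step - 1]
            coef_x0 = (ab_prev.sqrt() * betas[step]) / (1.0 - ab_t)
            coef_xt = (alphas[step].sqrt() * (1.0 - ab_prev)) / (1.0 - ab_t)
            var = betas[step] * (1.0 - ab_prev) / (1.0 - ab_t)
            x_t = coef_x0 * x0_hat + coef_xt * x_t + var.sqrt() * torch.randn_like(x_t)
    return x_t  # mel-spectrogram to be passed to a vocoder such as HiFi-GAN


# Usage: 4 denoising steps on dummy linguistic-encoder features.
gen = DenoisingGenerator()
dummy_cond = torch.randn(2, 120, 256)              # (batch, frames, encoder dim)
mel = sample_mel(gen, dummy_cond, num_steps=4)
print(mel.shape)                                   # torch.Size([2, 120, 80])
```

In this sketch the generator predicts the clean mel-spectrogram directly at each step, and the DDPM posterior is used only to re-noise to the previous step, which is what lets the step count stay small; the adversarial training that would make such a generator produce sharp outputs is omitted.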
first_indexed | 2024-03-13T05:16:03Z |
format | Article |
id | doaj.art-51aab411976e412b8a338d0ade84d204 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-13T05:16:03Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-51aab411976e412b8a338d0ade84d204 (2023-06-15T23:00:54Z)
Language: English; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536
Published: 2023-01-01, vol. 11, pp. 57674-57682; DOI: 10.1109/ACCESS.2023.3283772; IEEE document 10145456
Title: MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
Authors: Yan Deng (https://orcid.org/0000-0002-0778-6144), School of Computer, Electronics and Information, Guangxi University, Nanning, China; Ning Wu (https://orcid.org/0000-0002-4951-6337), Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou, China; Chengjun Qiu (https://orcid.org/0009-0001-2264-8866), College of Mechanical Naval Architecture and Ocean Engineering, Beibu Gulf University, Qinzhou, China; Yangyang Luo (https://orcid.org/0009-0005-3533-3619), School of Computer, Electronics and Information, Guangxi University, Nanning, China; Yan Chen (https://orcid.org/0000-0002-9950-684X), School of Computer, Electronics and Information, Guangxi University, Nanning, China
Online Access: https://ieeexplore.ieee.org/document/10145456/
Keywords: Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
spellingShingle | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen; MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model; IEEE Access; Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
title | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_full | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_fullStr | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_full_unstemmed | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_short | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_sort | mixgan tts efficient and stable speech synthesis based on diffusion model |
topic | Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
url | https://ieeexplore.ieee.org/document/10145456/ |
work_keys_str_mv | AT yandeng mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT ningwu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT chengjunqiu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT yangyangluo mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT yanchen mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel |