MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level alignment and hard word-level alignment approach, which explicitly extracts word-level semantic information, and...
Main Authors: | Yan Deng, Ning Wu, Chengjun Qiu, Yangyang Luo, Yan Chen |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
Online Access: | https://ieeexplore.ieee.org/document/10145456/ |
_version_ | 1797803120191864832 |
---|---|
author | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen |
author_facet | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen |
author_sort | Yan Deng |
collection | DOAJ |
description | This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis model based on a diffusion model. MixGAN-TTS uses a linguistic encoder based on a soft phoneme-level alignment and hard word-level alignment approach, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to better predict the prosodic information of the audio. Specifically, a GAN replaces the Gaussian function in modeling the denoising distribution, with the aim of enlarging the denoising step size and reducing the number of denoising steps, thereby accelerating the sampling speed of the diffusion model. Using a GAN in the diffusion model significantly reduces the number of denoising steps and, to some extent, addresses the problem that diffusion models cannot be applied in real time. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. Experimental results show that MixGAN-TTS outperforms the compared models in terms of audio quality and mel-spectrogram modeling capability with 4 denoising steps. Ablation studies demonstrate that the structure of MixGAN-TTS is effective. |
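The description above centers on replacing the Gaussian reverse transition of a diffusion model with a GAN generator so that only a handful of large denoising steps (e.g. 4) are needed before the mel-spectrogram is handed to a HiFi-GAN vocoder. The sketch below illustrates that idea only in broad strokes: it is not the paper's implementation, and every module, shape, and schedule in it (the `DenoisingGenerator` class, the 80-bin mel dimension, the linear beta schedule, the conditioning tensor) is an assumed placeholder.

```python
# Minimal sketch, assuming a generator G(x_t, t, cond) trained adversarially to
# predict the clean mel-spectrogram x_0 at each of a small number of diffusion
# steps, instead of sampling from a Gaussian reverse transition.
import torch
import torch.nn as nn


class DenoisingGenerator(nn.Module):
    """Toy stand-in for a GAN generator that maps (x_t, t, cond) -> predicted x_0."""

    def __init__(self, mel_dim: int = 80, cond_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the normalized step index as one extra feature per frame.
        t_feat = t.view(-1, 1, 1).expand(x_t.size(0), x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))


@torch.no_grad()
def sample_mel(generator, cond, num_steps: int = 4, mel_dim: int = 80):
    """Few-step reverse process: start from noise, predict x_0 with the generator,
    then re-noise to the previous step via the standard DDPM posterior."""
    B, T, _ = cond.shape
    betas = torch.linspace(1e-4, 0.5, num_steps)   # coarse schedule for few, large steps
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(B, T, mel_dim)               # x_T ~ N(0, I)
    for step in reversed(range(num_steps)):
        t_norm = torch.full((B,), step / max(num_steps - 1, 1))
        x0_hat = generator(x_t, t_norm, cond)      # generator predicts x_0 directly
        if step == 0:
            x_t = x0_hat                           # final mel-spectrogram estimate
        else:
            # Posterior q(x_{t-1} | x_t, x0_hat): mix the prediction back toward step t-1.
            ab_t, ab_prev = alpha_bars[step], alpha_bars[step - 1]
            coef_x0 = (ab_prev.sqrt() * betas[step]) / (1.0 - ab_t)
            coef_xt = (alphas[step].sqrt() * (1.0 - ab_prev)) / (1.0 - ab_t)
            var = betas[step] * (1.0 - ab_prev) / (1.0 - ab_t)
            x_t = coef_x0 * x0_hat + coef_xt * x_t + var.sqrt() * torch.randn_like(x_t)
    return x_t  # mel-spectrogram to be passed to a vocoder such as HiFi-GAN


# Usage: 4 denoising steps on dummy linguistic-encoder features.
gen = DenoisingGenerator()
dummy_cond = torch.randn(2, 120, 256)              # (batch, frames, encoder dim)
mel = sample_mel(gen, dummy_cond, num_steps=4)
print(mel.shape)                                   # torch.Size([2, 120, 80])
```

In this sketch the generator predicts the clean mel-spectrogram directly at each step, and the DDPM posterior is used only to re-noise to the previous step, which is what lets the step count stay small; the adversarial training that would make such a generator produce sharp outputs is omitted.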
first_indexed | 2024-03-13T05:16:03Z |
format | Article |
id | doaj.art-51aab411976e412b8a338d0ade84d204 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-13T05:16:03Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-51aab411976e412b8a338d0ade84d204 (2023-06-15T23:00:54Z)
Language: English; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536
Published: 2023-01-01, vol. 11, pp. 57674-57682; DOI: 10.1109/ACCESS.2023.3283772; IEEE document 10145456
Title: MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model
Authors: Yan Deng (https://orcid.org/0000-0002-0778-6144), School of Computer, Electronics and Information, Guangxi University, Nanning, China; Ning Wu (https://orcid.org/0000-0002-4951-6337), Key Laboratory of Beibu Gulf Offshore Engineering Equipment and Technology, Beibu Gulf University, Qinzhou, China; Chengjun Qiu (https://orcid.org/0009-0001-2264-8866), College of Mechanical Naval Architecture and Ocean Engineering, Beibu Gulf University, Qinzhou, China; Yangyang Luo (https://orcid.org/0009-0005-3533-3619), School of Computer, Electronics and Information, Guangxi University, Nanning, China; Yan Chen (https://orcid.org/0000-0002-9950-684X), School of Computer, Electronics and Information, Guangxi University, Nanning, China
Online Access: https://ieeexplore.ieee.org/document/10145456/
Keywords: Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
spellingShingle | Yan Deng; Ning Wu; Chengjun Qiu; Yangyang Luo; Yan Chen; MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model; IEEE Access; Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
title | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_full | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_fullStr | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_full_unstemmed | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_short | MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model |
title_sort | mixgan tts efficient and stable speech synthesis based on diffusion model |
topic | Speech synthesis; diffusion model; mixture attention mechanism; deep learning |
url | https://ieeexplore.ieee.org/document/10145456/ |
work_keys_str_mv | AT yandeng mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT ningwu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT chengjunqiu mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT yangyangluo mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel AT yanchen mixganttsefficientandstablespeechsynthesisbasedondiffusionmodel |