DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

Text-to-speech (TTS) synthesizers have been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 rely on a single soft attention mechanism to align the encoder and decoder, and this is their biggest shortcoming: words are generated incorrectly or repeatedly when dealing with long sentences. Such models may also produce run-on speech with misplaced breaks that ignore punctuation marks, which makes the synthesized waveform lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA that share the same training parameters, which tightens the token-frame correlation and speeds up the alignment process. The LSF further strengthens contextual connections by expanding the region on which the DIA concentrates. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is used for attention alignment. Working together, the DIA and the multi-RNN layers yield high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further demonstrate the advantage of the DIA mechanism for phrase-break enhancement and attention robustness.
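The abstract describes the DIA and the LSF only in prose. Purely as an illustration, the sketch below shows one way to read the two ideas in PyTorch: a Tacotron2-style location-sensitive attention step whose scores are masked to an adjustable window around the previous alignment peak (standing in for the local-sensitive factor), re-applied several times with the same parameters (standing in for the inheritance). Every module name, dimension, and the exact windowing rule here are assumptions for illustration, not the authors' implementation.

# Minimal sketch of the ideas in the abstract, NOT the authors' code.
# Assumptions: Tacotron2-style location-sensitive attention as the base;
# the "LSF" is modeled as a +/- lsf token window around the previous
# attention peak; "inheritance" is modeled as re-applying the SAME
# attention module (shared weights) for several refinement passes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSensitiveAttention(nn.Module):
    def __init__(self, enc_dim=512, query_dim=1024, attn_dim=128, lsf=10):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(enc_dim, attn_dim, bias=False)
        # Convolve the previous alignment so the score is location-aware.
        self.location_conv = nn.Conv1d(1, 32, kernel_size=31, padding=15, bias=False)
        self.location_layer = nn.Linear(32, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.lsf = lsf  # adjustable half-width of the concentration region

    def forward(self, query, memory, prev_align):
        # query: (B, query_dim); memory: (B, T, enc_dim); prev_align: (B, T)
        loc = self.location_conv(prev_align.unsqueeze(1)).transpose(1, 2)  # (B, T, 32)
        energy = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)
            + self.memory_layer(memory)
            + self.location_layer(loc)
        )).squeeze(-1)                                    # (B, T)
        # LSF: keep only scores within +/- lsf tokens of the previous peak,
        # i.e., widen or narrow the region the attention may concentrate on.
        peak = prev_align.argmax(dim=1, keepdim=True)     # (B, 1)
        pos = torch.arange(memory.size(1), device=memory.device).unsqueeze(0)
        energy = energy.masked_fill((pos - peak).abs() > self.lsf, float("-inf"))
        align = F.softmax(energy, dim=1)                  # (B, T)
        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1)  # (B, enc_dim)
        return context, align


def inherited_attention(attn, query, memory, prev_align, iterations=3):
    # Re-apply the SAME attention module (shared parameters) several times,
    # feeding each pass's alignment into the next. In the paper the query
    # would come from the decoder's multi-RNN hidden states; here a fixed
    # query stands in for simplicity.
    align = prev_align
    for _ in range(iterations):
        context, align = attn(query, memory, align)
    return context, align


if __name__ == "__main__":
    B, T = 2, 40
    attn = LocalSensitiveAttention()
    memory = torch.randn(B, T, 512)       # encoder outputs (toy values)
    query = torch.randn(B, 1024)          # decoder state (toy values)
    init_align = F.one_hot(torch.zeros(B, dtype=torch.long), T).float()
    ctx, align = inherited_attention(attn, query, memory, init_align)
    print(ctx.shape, align.shape)         # (2, 512) and (2, 40)

Under this reading, the parameter sharing is the point of the "inheritance": the extra refinement passes add alignment depth without adding new weights, which is consistent with the abstract's claim of a tighter token-frame correlation and faster alignment.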


Bibliographic Details
Main Authors: Junxiao Yu, Zhengyuan Xu, Xu He, Jian Wang, Bin Liu, Rui Feng, Songsheng Zhu, Wei Wang, Jianqing Li
Format: Article
Language: English
Published: MDPI AG 2022-12-01
Series: Entropy
Subjects: natural language processing; text-to-speech; deep learning; information theory; deep neural network; local-sensitive attention
Online Access: https://www.mdpi.com/1099-4300/25/1/41
_version_ 1797442996200800256
author Junxiao Yu
Zhengyuan Xu
Xu He
Jian Wang
Bin Liu
Rui Feng
Songsheng Zhu
Wei Wang
Jianqing Li
author_sort Junxiao Yu
collection DOAJ
description Text-to-speech (TTS) synthesizers have been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 rely on a single soft attention mechanism to align the encoder and decoder, and this is their biggest shortcoming: words are generated incorrectly or repeatedly when dealing with long sentences. Such models may also produce run-on speech with misplaced breaks that ignore punctuation marks, which makes the synthesized waveform lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA that share the same training parameters, which tightens the token-frame correlation and speeds up the alignment process. The LSF further strengthens contextual connections by expanding the region on which the DIA concentrates. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is used for attention alignment. Working together, the DIA and the multi-RNN layers yield high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further demonstrate the advantage of the DIA mechanism for phrase-break enhancement and attention robustness.
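For readers unfamiliar with the metric: a mean opinion score such as the reported 4.48 is, by definition, the arithmetic mean of listeners' ratings on a 1-5 naturalness scale, usually reported with a confidence interval. A minimal computation sketch follows; the ratings in it are made-up placeholders, not data from the study.

# MOS is the arithmetic mean of 1-5 listener ratings, typically reported
# with a 95% confidence interval. The ratings below are hypothetical.
import statistics
import math

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of the ~95% CI) for a list of 1-5 ratings."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

if __name__ == "__main__":
    fake_ratings = [5, 4, 5, 4, 4, 5, 5, 4, 4, 5]  # placeholder scores
    mean, ci = mos_with_ci(fake_ratings)
    print(f"MOS = {mean:.2f} +/- {ci:.2f}")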
first_indexed 2024-03-09T12:49:42Z
format Article
id doaj.art-239dac1e7b5a4c7cbddcc2396d52c45e
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-09T12:49:42Z
publishDate 2022-12-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-239dac1e7b5a4c7cbddcc2396d52c45e (updated 2023-11-30T22:07:24Z)
doi 10.3390/e25010041
affiliation All nine authors: Jiangsu Province Engineering Research Center of Smart Wearable and Rehabilitation Devices, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China
title DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
topic natural language processing
text-to-speech
deep learning
information theory
deep neural network
local-sensitive attention
url https://www.mdpi.com/1099-4300/25/1/41