Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet


Bibliographic Details
Main Authors: Qing-Dao-Er-Ji Ren, Lele Wang, Wenjing Zhang, Leixiao Li
Format: Article
Language: English
Published: MDPI AG, 2024-01-01
Series: Applied Sciences
Subjects: Mongolian speech synthesis; non-autoregressive; Ghost module; vocoder
Online Access: https://www.mdpi.com/2076-3417/14/2/625
collection DOAJ
description The core challenge of speech synthesis is converting text into audible speech that meets users' needs. In recent years, the quality of end-to-end speech synthesis models has improved significantly. However, owing to the characteristics of the Mongolian language and the lack of an audio corpus, research on Mongolian speech synthesis has produced few results, and problems remain with both performance and synthesis quality. First, the phoneme information of Mongolian was further refined, and a Bang-based pre-training model was constructed to reduce the word error rate of synthesized Mongolian speech. Second, a Mongolian speech synthesis model based on Ghost and ILPCnet, named the Ghost-ILPCnet model, was proposed. It improves on the Para-WaveNet acoustic model by replacing ordinary convolution blocks with stacked Ghost modules, generating Mongolian acoustic features in parallel and increasing synthesis speed. At the same time, the improved ILPCnet vocoder offers high synthesis quality and low complexity compared with other vocoders. Finally, extensive experiments were conducted to verify the effectiveness of the proposed model. The results show that the Ghost-ILPCnet model has a simple structure, fewer generation parameters, and lower hardware requirements, and it can be trained in parallel. Its synthesized speech reached a mean opinion score of 4.48 and a real-time rate of 0.0041. The model ensures the naturalness and clarity of synthesized speech, speeds up synthesis, and effectively improves the performance of Mongolian speech synthesis.
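The Ghost modules the abstract refers to come from GhostNet: a small number of ordinary convolutions produce "intrinsic" feature maps, and cheap per-map linear operations derive additional "ghost" maps from them, so fewer full-size kernels are needed for the same number of output maps. The following is only a minimal illustrative 1D sketch of that idea in plain Python; the kernel values and function names are hypothetical and are not taken from the paper's actual architecture.

```python
def conv1d(x, kernel):
    """Valid-mode 1D cross-correlation of sequence x with a kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def ghost_module_1d(x, primary_kernels, cheap_kernels):
    """Ghost-module idea: primary convolutions yield intrinsic maps;
    cheap per-map operations derive extra 'ghost' maps from them.
    Output has len(primary_kernels) * (1 + len(cheap_kernels)) maps,
    but only the primary kernels are full-size, saving parameters."""
    intrinsic = [conv1d(x, k) for k in primary_kernels]
    ghosts = [conv1d(m, ck) for m in intrinsic for ck in cheap_kernels]
    return intrinsic + ghosts

# Example: 2 primary kernels plus 1 cheap operation -> 4 output maps.
maps = ghost_module_1d([1, 2, 3, 4],
                       primary_kernels=[[1, 0], [0, 1]],
                       cheap_kernels=[[2]])
```

The parameter saving is the point: doubling the output maps here costs only one extra length-1 "cheap" kernel rather than two more full-size ones.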
id doaj.art-30ab694428834de7b9bd8c306c23b534
institution Directory Open Access Journal
issn 2076-3417
affiliation Qing-Dao-Er-Ji Ren, Lele Wang, Wenjing Zhang: School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
affiliation Leixiao Li: College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010051, China
doi 10.3390/app14020625
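ILPCnet belongs to the LPCNet family of vocoders, which are built around linear predictive coding: each output sample is predicted as a weighted sum of past samples, so the neural network only has to model the small excitation (residual) signal. A minimal all-pole LPC synthesis filter can be sketched in plain Python as below; the coefficient value is made up for illustration, and this is not the paper's ILPCnet implementation.

```python
def lpc_synthesize(excitation, coeffs):
    """All-pole LPC synthesis: s[n] = e[n] + sum_k a[k] * s[n-1-k],
    where `coeffs` holds the linear prediction coefficients a[0..p-1]."""
    s = []
    for n, e in enumerate(excitation):
        prediction = sum(a * s[n - 1 - k]
                         for k, a in enumerate(coeffs) if n - 1 - k >= 0)
        s.append(e + prediction)
    return s

# A single impulse through a one-tap predictor (a = 0.5) decays geometrically.
samples = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

Because the predictable part of the waveform is handled by this cheap linear filter, the network's job shrinks, which is one reason LPC-based vocoders reach low complexity and fast real-time rates like those reported above.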
topic Mongolian speech synthesis
non-autoregressive
Ghost module
vocoder