Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus

Bibliographic Details
Main Authors: Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao
Format: Article
Language: English
Published: MDPI AG 2023-03-01
Series: Applied Sciences
Subjects: Mongolian; text-to-speech (TTS); open-source dataset; multi-speaker
Online Access: https://www.mdpi.com/2076-3417/13/7/4237
_version_ 1797608397746470912
author Kailin Liang
Bin Liu
Yifan Hu
Rui Liu
Feilong Bao
Guanglai Gao
author_facet Kailin Liang
Bin Liu
Yifan Hu
Rui Liu
Feilong Bao
Guanglai Gao
author_sort Kailin Liang
collection DOAJ
description Low-resource text-to-speech (TTS) synthesis is a promising research direction. Mongolian is the official language of the Inner Mongolia Autonomous Region and is spoken by more than 10 million people worldwide. As a representative low-resource language, Mongolian has relatively few open-source TTS datasets. We therefore release MnTTS2, an open-source multi-speaker Mongolian TTS dataset, for researchers in the field. Three Mongolian announcers each recorded 10 h of speech covering a wide range of topics, for 30 h in total. We also built two baseline systems on state-of-the-art neural architectures: a multi-speaker FastSpeech 2 model with a HiFi-GAN vocoder, and a fully end-to-end multi-speaker VITS model. With FastSpeech 2 + HiFi-GAN, all three speakers scored 4.0 or higher on both naturalness and speaker similarity. With VITS, all three speakers scored 4.5 or higher on both measures. The experimental results show that MnTTS2 can be used to build robust multi-speaker Mongolian TTS models.
first_indexed 2024-03-11T05:42:56Z
format Article
id doaj.art-6dbec842f0cf4816818c6e26f5d8c25f
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-11T05:42:56Z
publishDate 2023-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-6dbec842f0cf4816818c6e26f5d8c25f
2023-11-17T16:17:27Z
eng
MDPI AG
Applied Sciences
2076-3417
2023-03-01
13
7
4237
10.3390/app13074237
Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
Kailin Liang
Bin Liu
Yifan Hu
Rui Liu
Feilong Bao
Guanglai Gao
College of Computer Science, Inner Mongolia University, Hohhot 010031, China (all six authors)
https://www.mdpi.com/2076-3417/13/7/4237
Mongolian
text-to-speech (TTS)
open-source dataset
multi-speaker
spellingShingle Kailin Liang
Bin Liu
Yifan Hu
Rui Liu
Feilong Bao
Guanglai Gao
Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
Applied Sciences
Mongolian
text-to-speech (TTS)
open-source dataset
multi-speaker
title Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
title_full Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
title_fullStr Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
title_full_unstemmed Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
title_short Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
title_sort comparative study for multi speaker mongolian tts with a new corpus
topic Mongolian
text-to-speech (TTS)
open-source dataset
multi-speaker
url https://www.mdpi.com/2076-3417/13/7/4237
work_keys_str_mv AT kailinliang comparativestudyformultispeakermongolianttswithanewcorpus
AT binliu comparativestudyformultispeakermongolianttswithanewcorpus
AT yifanhu comparativestudyformultispeakermongolianttswithanewcorpus
AT ruiliu comparativestudyformultispeakermongolianttswithanewcorpus
AT feilongbao comparativestudyformultispeakermongolianttswithanewcorpus
AT guanglaigao comparativestudyformultispeakermongolianttswithanewcorpus