Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
Low-resource text-to-speech synthesis is a very promising research direction. Mongolian is the official language of the Inner Mongolia Autonomous Region and is spoken by more than 10 million people worldwide. Mongolian, as a representative low-resource language, has a relative lack of open-source da...
Main Authors: | Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-03-01 |
Series: | Applied Sciences |
Subjects: | Mongolian; text-to-speech (TTS); open-source dataset; multi-speaker |
Online Access: | https://www.mdpi.com/2076-3417/13/7/4237 |
_version_ | 1797608397746470912 |
---|---|
author | Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao |
author_facet | Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao |
author_sort | Kailin Liang |
collection | DOAJ |
description | Low-resource text-to-speech (TTS) synthesis is a promising research direction. Mongolian is the official language of the Inner Mongolia Autonomous Region and is spoken by more than 10 million people worldwide. As a representative low-resource language, Mongolian has few open-source TTS datasets. We therefore release MnTTS2, an open-source multi-speaker Mongolian TTS dataset, for researchers in the field. Three Mongolian announcers were invited to record speech covering a wide range of topics. Each announcer recorded 10 h of Mongolian speech, for a total of 30 h. In addition, we built two baseline systems based on state-of-the-art neural architectures: a multi-speaker FastSpeech 2 model with a HiFi-GAN vocoder, and a fully end-to-end multi-speaker VITS model. With the FastSpeech 2 + HiFi-GAN system, all three speakers scored 4.0 or higher on both naturalness and speaker similarity. With the VITS model, all three speakers scored 4.5 or higher on both naturalness and speaker similarity. The experimental results show that the MnTTS2 dataset can be used to build robust multi-speaker Mongolian TTS models. |
first_indexed | 2024-03-11T05:42:56Z |
format | Article |
id | doaj.art-6dbec842f0cf4816818c6e26f5d8c25f |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T05:42:56Z |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-6dbec842f0cf4816818c6e26f5d8c25f; 2023-11-17T16:17:27Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-03-01; vol. 13, no. 7, art. 4237; doi:10.3390/app13074237; Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus; Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao, Guanglai Gao (all: College of Computer Science, Inner Mongolia University, Hohhot 010031, China); https://www.mdpi.com/2076-3417/13/7/4237; Mongolian; text-to-speech (TTS); open-source dataset; multi-speaker |
title | Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus |
topic | Mongolian; text-to-speech (TTS); open-source dataset; multi-speaker |
url | https://www.mdpi.com/2076-3417/13/7/4237 |