An Improved Chinese Pause Fillers Prediction Module Based on RoBERTa
The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. In recent years, neural networks, including LSTM, BERT, and XLNet, have been employed for pause fillers prediction modules. However, these methods have exhibited relatively lower accuracy in pred...
Main Authors: | Ling Yu, Xiaoqun Zhou, Fanglin Niu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-09-01 |
Series: | Applied Sciences |
Subjects: | RoBERTa; naturalness of speech; speech synthesis; Chinese pause fillers; prediction module |
Online Access: | https://www.mdpi.com/2076-3417/13/19/10652 |
_version_ | 1797576237011435520 |
---|---|
author | Ling Yu; Xiaoqun Zhou; Fanglin Niu |
author_sort | Ling Yu |
collection | DOAJ |
description | The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. In recent years, neural networks, including LSTM, BERT, and XLNet, have been employed for pause fillers prediction modules. However, these methods have exhibited relatively lower accuracy in predicting pause fillers. This paper introduces the utilization of the RoBERTa model for predicting Chinese pause fillers and presents a novel approach to training the RoBERTa model, effectively enhancing the accuracy of Chinese pause fillers prediction. Our proposed approach involves categorizing text from different speakers into four distinct style groups based on the frequency and position of Chinese pause fillers. The RoBERTa model is trained on these four groups of data, which incorporate different styles of fillers, thereby ensuring a more natural synthesis of speech. The Chinese pause fillers prediction module is evaluated on systems such as Parallel Tacotron2, FastPitch, and Deep Voice3, achieving a notable 26.7% improvement in word-level prediction accuracy compared to the BERT model, along with a 14% enhancement in position-level prediction accuracy. This substantial improvement results in a significant enhancement of the naturalness of the generated speech. |
first_indexed | 2024-03-10T21:49:26Z |
format | Article |
id | doaj.art-42a3860239f3422d8638e327922c8b3e |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T21:49:26Z |
publishDate | 2023-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-42a3860239f3422d8638e327922c8b3e; 2023-11-19T14:02:21Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-09-01; vol. 13, no. 19, article 10652; DOI 10.3390/app131910652; An Improved Chinese Pause Fillers Prediction Module Based on RoBERTa; Ling Yu (0), Xiaoqun Zhou (1), Fanglin Niu (2); (0) School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou 121001, China; (1) School of Electronics and Information Engineering, Shenyang University of Technology, Shenyang 110000, China; (2) School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou 121001, China; https://www.mdpi.com/2076-3417/13/19/10652; RoBERTa; naturalness of speech; speech synthesis; Chinese pause fillers; prediction module |
title | An Improved Chinese Pause Fillers Prediction Module Based on RoBERTa |
topic | RoBERTa; naturalness of speech; speech synthesis; Chinese pause fillers; prediction module |
url | https://www.mdpi.com/2076-3417/13/19/10652 |
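
The description field says that text from different speakers is sorted into four style groups based on the frequency and position of pause fillers, but this record does not say how the grouping is done. The following is only a minimal sketch of one way such a split might look, not the authors' method. The filler inventory `FILLERS`, the two thresholds, the group names, and the helper functions `filler_stats` and `style_group` are all assumptions made for illustration.

```python
from statistics import mean

# Illustrative filler inventory (an assumption, not the paper's list).
FILLERS = {"嗯", "呃", "那个", "就是"}


def filler_stats(utterances):
    """Return (filler rate, mean relative position) for one speaker.

    `utterances` is a list of token lists. Relative position is 0.0 at
    the start of an utterance and 1.0 at its end.
    """
    total_tokens = 0
    positions = []
    for tokens in utterances:
        total_tokens += len(tokens)
        for i, tok in enumerate(tokens):
            if tok in FILLERS:
                # max(..., 1) avoids division by zero for one-token utterances.
                positions.append(i / max(len(tokens) - 1, 1))
    rate = len(positions) / total_tokens if total_tokens else 0.0
    mean_pos = mean(positions) if positions else 0.0
    return rate, mean_pos


def style_group(utterances, rate_threshold=0.1, pos_threshold=0.5):
    """Assign a speaker to one of four hypothetical style groups.

    The two thresholds are placeholders; in practice they would be
    chosen from the corpus (for example, per-speaker medians).
    """
    rate, mean_pos = filler_stats(utterances)
    freq = "frequent" if rate >= rate_threshold else "sparse"
    where = "late" if mean_pos >= pos_threshold else "early"
    return f"{freq}-{where}"
```

Crossing two binary features (frequent or sparse, early or late) yields exactly four groups, which matches the count in the description. A separate model, or a style-conditioned model, could then be trained on each group's text.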