A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech. However, most of the TTS research focuses on using adult speech data and there has been very limited work done on child speech synthesis. This study developed and validated a t...


Bibliographic Details
Main Authors:	Rishabh Jain, Mariam Yahayah Yiwere, Dan Bigioi, Peter Corcoran, Horia Cucu
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Text-to-speech; child speech synthesis; tacotron; multi-speaker TTS; alternative WaveRNN; MOSNet
Online Access:	https://ieeexplore.ieee.org/document/9764693/

_version_	1798038379507482624
author	Rishabh Jain; Mariam Yahayah Yiwere; Dan Bigioi; Peter Corcoran; Horia Cucu
author_facet	Rishabh Jain; Mariam Yahayah Yiwere; Dan Bigioi; Peter Corcoran; Horia Cucu
author_sort	Rishabh Jain
collection	DOAJ
description	Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech. However, most of the TTS research focuses on using adult speech data and there has been very limited work done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. This approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed: a pretrained MOSNet was used for objective evaluation, and a novel subjective framework was used for mean opinion score (MOS) evaluations. Subjective evaluations achieved MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using a pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model was also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
first_indexed	2024-04-11T21:39:24Z
format	Article
id	doaj.art-68118fa1be114712a942fa2291a32238
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-04-11T21:39:24Z
publishDate	2022-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-68118fa1be114712a942fa2291a32238; 2022-12-22T04:01:39Z; eng; IEEE; IEEE Access; 2169-3536; 2022-01-01; vol. 10, pp. 47628-47642; 10.1109/ACCESS.2022.3170836; 9764693; A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis; Rishabh Jain (https://orcid.org/0000-0002-4891-494X; School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland); Mariam Yahayah Yiwere (School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland); Dan Bigioi (https://orcid.org/0000-0002-7704-2829; School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland); Peter Corcoran (https://orcid.org/0000-0003-1670-4793; School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland); Horia Cucu (https://orcid.org/0000-0002-8711-3854; Speech and Dialogue Research Laboratory, University Politehnica of Bucharest, Bucharest, Romania); https://ieeexplore.ieee.org/document/9764693/; Text-to-speech; child speech synthesis; tacotron; multi-speaker TTS; alternative WaveRNN; MOSNet
spellingShingle	Rishabh Jain; Mariam Yahayah Yiwere; Dan Bigioi; Peter Corcoran; Horia Cucu; A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis; IEEE Access; Text-to-speech; child speech synthesis; tacotron; multi-speaker TTS; alternative WaveRNN; MOSNet
title	A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
title_sort	text to speech pipeline evaluation methodology and initial fine tuning results for child speech synthesis
topic	Text-to-speech; child speech synthesis; tacotron; multi-speaker TTS; alternative WaveRNN; MOSNet
url	https://ieeexplore.ieee.org/document/9764693/
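
The description above mentions two objective checks: cosine similarity between utterance embeddings, and a word error rate (WER) comparison between real and synthetic voices. The sketch below illustrates both metrics in plain Python. It is not the authors' code: the function names are hypothetical, embeddings are assumed to be plain lists of floats produced by some speaker encoder, and WER is assumed to be word-level edit distance divided by the reference length.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In use, one would compare the embedding of a synthetic utterance against that of a real reference utterance, and compute the WER of ASR transcripts for real and synthetic audio against the same ground-truth text.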


Similar Items