A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and very little work has been done on child speech synthesis. This study developed and validated a t...

Bibliographic Details
Main Authors: Rishabh Jain, Mariam Yahayah Yiwere, Dan Bigioi, Peter Corcoran, Horia Cucu
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Text-to-speech, child speech synthesis, Tacotron, multi-speaker TTS, alternative WaveRNN, MOSNet
Online Access: https://ieeexplore.ieee.org/document/9764693/
collection DOAJ
description Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and very little work has been done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both objective and subjective evaluations were performed: a pretrained MOSNet was used for objective evaluation, and a novel subjective framework was used for mean opinion score (MOS) evaluations. Subjective evaluations achieved MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using the pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model was also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
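The description names two of the objective checks, cosine similarity between utterance embeddings and word error rate (WER), without giving formulas. The following is a minimal sketch of both quantities, assuming embeddings are plain lists of floats and transcripts are whitespace-separated strings. The function names are hypothetical, and the record does not say which speaker encoder, ASR model, or scoring tool the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In a comparison like the one described, the similarity would be computed between an embedding of a real utterance and an embedding of a synthetic one, and the WER would be computed separately for ASR transcripts of real and of synthetic speech against the same reference text.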
id doaj.art-68118fa1be114712a942fa2291a32238
institution Directory of Open Access Journals
issn 2169-3536
spelling doaj.art-68118fa1be114712a942fa2291a32238 (record timestamp 2022-12-22T04:01:39Z)
A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
IEEE Access (IEEE), ISSN 2169-3536, published 2022-01-01, vol. 10, pp. 47628-47642, article no. 9764693
DOI: 10.1109/ACCESS.2022.3170836
Rishabh Jain (https://orcid.org/0000-0002-4891-494X), School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland
Mariam Yahayah Yiwere, School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland
Dan Bigioi (https://orcid.org/0000-0002-7704-2829), School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland
Peter Corcoran (https://orcid.org/0000-0003-1670-4793), School of Electrical and Electronics Engineering, National University of Ireland Galway, Galway, Ireland
Horia Cucu (https://orcid.org/0000-0002-8711-3854), Speech and Dialogue Research Laboratory, University Politehnica of Bucharest, Bucharest, Romania
topic Text-to-speech
child speech synthesis
Tacotron
multi-speaker TTS
alternative WaveRNN
MOSNet