Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis

Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors...


Bibliographic Details
Main Authors:	Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
Format:	Article
Language:	English
Published:	MDPI AG 2019-06-01
Series:	Applied Sciences
Subjects:	continuous F0, speech synthesis, Kalman filter, time-warping, HNR
Online Access:	https://www.mdpi.com/2076-3417/9/12/2460

_version_	1818151215235596288
author	Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
author_facet	Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
author_sort	Mohammed Salah Al-Radhi
collection	DOAJ
description	Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, three adaptive techniques have been developed in this article for achieving a robust and accurate F0: (1) we weight the pitch estimates with state noise covariance using an adaptive Kalman-filter framework, (2) we iteratively apply a time-axis warping on the input frame signal, and (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. The second goal of this study is to introduce an extension of a novel continuous-based speech synthesis system (i.e., one in which all parameters are continuous). We propose adding a new excitation parameter named Harmonic-to-Noise Ratio (HNR) to the voiced and unvoiced components to indicate the degree of voicing in the excitation and to reduce the influence of buzziness caused by the vocoder. Results based on objective and perceptual tests demonstrate that the voice built with the proposed framework gives state-of-the-art speech synthesis performance while outperforming the previous baseline.
first_indexed	2024-12-11T13:35:17Z
format	Article
id	doaj.art-d270da1b6eba4e2cacbea4ea292f8dc5
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-12-11T13:35:17Z
publishDate	2019-06-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-d270da1b6eba4e2cacbea4ea292f8dc5 | 2022-12-22T01:05:04Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2019-06-01 | 9 | 12 | 2460 | 10.3390/app9122460 | app9122460 | Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis | Mohammed Salah Al-Radhi (0) | Tamás Gábor Csapó (1) | Géza Németh (2) | Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary | Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary | Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary | Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, three adaptive techniques have been developed in this article for achieving a robust and accurate F0: (1) we weight the pitch estimates with state noise covariance using an adaptive Kalman-filter framework, (2) we iteratively apply a time-axis warping on the input frame signal, and (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. The second goal of this study is to introduce an extension of a novel continuous-based speech synthesis system (i.e., one in which all parameters are continuous). We propose adding a new excitation parameter named Harmonic-to-Noise Ratio (HNR) to the voiced and unvoiced components to indicate the degree of voicing in the excitation and to reduce the influence of buzziness caused by the vocoder. Results based on objective and perceptual tests demonstrate that the voice built with the proposed framework gives state-of-the-art speech synthesis performance while outperforming the previous baseline. | https://www.mdpi.com/2076-3417/9/12/2460 | continuous F0 | speech synthesis | Kalman filter | time-warping | HNR
spellingShingle	Mohammed Salah Al-Radhi | Tamás Gábor Csapó | Géza Németh | Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis | Applied Sciences | continuous F0 | speech synthesis | Kalman filter | time-warping | HNR
title	Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
title_full	Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
title_fullStr	Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
title_full_unstemmed	Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
title_short	Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
title_sort	adaptive refinements of pitch tracking and hnr estimation within a vocoder for statistical parametric speech synthesis
topic	continuous F0, speech synthesis, Kalman filter, time-warping, HNR
url	https://www.mdpi.com/2076-3417/9/12/2460
work_keys_str_mv	AT mohammedsalahalradhi adaptiverefinementsofpitchtrackingandhnrestimationwithinavocoderforstatisticalparametricspeechsynthesis AT tamasgaborcsapo adaptiverefinementsofpitchtrackingandhnrestimationwithinavocoderforstatisticalparametricspeechsynthesis AT gezanemeth adaptiverefinementsofpitchtrackingandhnrestimationwithinavocoderforstatisticalparametricspeechsynthesis


Similar Items