Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis

Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors...
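
As a rough illustration of what "continuous" means here, the sketch below fills unvoiced frames of a frame-level F0 track by linear interpolation between the neighbouring voiced frames. It is a generic simplification, not the article's estimator: the function name `interpolate_f0`, the convention that unvoiced frames are stored as 0.0, and the choice to hold the nearest voiced value at the track edges are all assumptions made for this example.

```python
from typing import List


def interpolate_f0(f0: List[float]) -> List[float]:
    """Return a continuous F0 track from one with unvoiced gaps.

    Assumption (hypothetical convention): unvoiced frames are marked 0.0.
    Interior gaps are filled by linear interpolation; leading and trailing
    gaps are held at the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0.0]
    if not voiced:
        return list(f0)  # no voiced frame to interpolate from

    out = list(f0)

    # Hold the first voiced value over any leading unvoiced frames.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]

    # Hold the last voiced value over any trailing unvoiced frames.
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]

    # Linearly interpolate across each interior gap between voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            out[i] = f0[left] + (f0[right] - f0[left]) * (i - left) / span

    return out


if __name__ == "__main__":
    track = [0.0, 120.0, 0.0, 0.0, 150.0, 0.0]  # Hz, 0.0 = unvoiced
    print(interpolate_f0(track))
```

A track produced this way has a value in every frame, but it inherits any error in the voiced estimates on either side of a gap.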


Bibliographic Details
Main Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
Format: Article
Language: English
Published: MDPI AG 2019-06-01
Series: Applied Sciences
Subjects: continuous F0; speech synthesis; Kalman filter; time-warping; HNR
Online Access: https://www.mdpi.com/2076-3417/9/12/2460
_version_ 1818151215235596288
author Mohammed Salah Al-Radhi
Tamás Gábor Csapó
Géza Németh
collection DOAJ
description Recent studies in text-to-speech synthesis have shown the benefit of using a continuous pitch estimate, one that interpolates fundamental frequency (F0) even when voicing is not present. However, continuous F0 is still sensitive to additive noise in speech signals and suffers from short-term errors (when it changes rather quickly over time). To alleviate these issues, this article develops three adaptive techniques for achieving a robust and accurate F0: (1) we weight the pitch estimates with the state noise covariance using an adaptive Kalman-filter framework, (2) we iteratively apply time-axis warping to the input frame signal, and (3) we optimize all F0 candidates using an instantaneous-frequency-based approach. The second goal of this study is to extend a novel continuous speech synthesis system (i.e., one in which all parameters are continuous). We propose adding a new excitation parameter, the Harmonic-to-Noise Ratio (HNR), to the voiced and unvoiced components to indicate the degree of voicing in the excitation and to reduce the buzziness caused by the vocoder. Results based on objective and perceptual tests demonstrate that the voice built with the proposed framework gives state-of-the-art speech synthesis performance while outperforming the previous baseline.
first_indexed 2024-12-11T13:35:17Z
format Article
id doaj.art-d270da1b6eba4e2cacbea4ea292f8dc5
institution Directory of Open Access Journals
issn 2076-3417
language English
last_indexed 2024-12-11T13:35:17Z
publishDate 2019-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-d270da1b6eba4e2cacbea4ea292f8dc5; 2022-12-22T01:05:04Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2019-06-01; vol. 9, no. 12, article 2460; DOI 10.3390/app9122460 (app9122460)
Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh (all: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary)
title Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
topic continuous F0
speech synthesis
Kalman filter
time-warping
HNR
url https://www.mdpi.com/2076-3417/9/12/2460