Differentiable Measures for Speech Spectral Modeling

Autoregressive models for the envelope of speech power spectral densities (PSDs) are refined by the self-supervised spectral learning machine (S3LM) provided with differentiable spectral objective functions, including the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the rev...

Bibliographic Details
Main Authors: Miguel Arjona Ramirez, Wesley Beccaro, Demostenes Zegarra Rodriguez, Renata Lopes Rosa
Format: Article
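Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Autoregressive processes; machine learning algorithms; prediction methods; self-supervised learning; speech analysis; spectral analysis
Online Access: https://ieeexplore.ieee.org/document/9709279/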
author Miguel Arjona Ramirez
Wesley Beccaro
Demostenes Zegarra Rodriguez
Renata Lopes Rosa
collection DOAJ
description Autoregressive models for the envelope of speech power spectral densities (PSDs) are refined by the self-supervised spectral learning machine (S3LM) equipped with differentiable spectral objective functions, including the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the reverse KLD (RKLD) and the log spectral distortion (LSD), which display more significant results. However, in order to assess the models more perceptually, a method is proposed based upon perturbations around perfect-reconstruction analysis-synthesis configurations. In the cross-excitation analysis-synthesis assessment (CEASA) method, the residual signals generated by the analysis filters of the spectral models are injected as excitation into the synthesis filters derived from the same and other models in order to be evaluated by the perceptual evaluation of speech quality (PESQ) and the Itakura divergence (ID), which are averaged over a set of models obtained using the objective functions mentioned above. The results show superior performance when the RKLD is used as the loss function for estimating the spectral models, with the ISD ranking close behind. The focus of these divergences on the spectral peaks is argued to be the most important factor for this behavior. Specifically, using the PESQ scores obtained with CEASA, the RKLD loss is found to improve performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the KLD and the LSD models, respectively, while the corresponding improvements for the ISD loss are 0.1%, 3.0% and 18.2%, and the RKLD models outperform the ISD models by 1.0% on average. Even though the spectral measures alone cannot unequivocally distinguish the better of the two, CEASA is shown to be sensitive enough to distinguish their performances.
In summary, the learning machine S3LM fits models for the short-term spectral envelope of speech, and the CEASA assessment tool has been developed to evaluate its performance under several differentiable loss functions. In addition, CEASA may be used for other assessments connected with speech analysis and synthesis.
first_indexed 2024-12-24T13:02:00Z
format Article
id doaj.art-941a1102ba7346b591135894f3631dad
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-24T13:02:00Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-941a1102ba7346b591135894f3631dad
2022-12-21T16:54:07Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
10
17609
17618
10.1109/ACCESS.2022.3150728
9709279
Differentiable Measures for Speech Spectral Modeling
Miguel Arjona Ramirez (0) https://orcid.org/0000-0002-7107-0888
Wesley Beccaro (1) https://orcid.org/0000-0001-6599-2344
Demostenes Zegarra Rodriguez (2) https://orcid.org/0000-0001-5401-7551
Renata Lopes Rosa (3) https://orcid.org/0000-0002-7595-7187
Department of Electronic Systems Engineering, Polytechnic School of the University of São Paulo, São Paulo, Brazil
Department of Electronic Systems Engineering, Polytechnic School of the University of São Paulo, São Paulo, Brazil
Department of Computer Science, Federal University of Lavras, Lavras, Brazil
Department of Computer Science, Federal University of Lavras, Lavras, Brazil
https://ieeexplore.ieee.org/document/9709279/
Autoregressive processes
machine learning algorithms
prediction methods
self-supervised learning
speech analysis
spectral analysis
title Differentiable Measures for Speech Spectral Modeling
topic Autoregressive processes
machine learning algorithms
prediction methods
self-supervised learning
speech analysis
spectral analysis
url https://ieeexplore.ieee.org/document/9709279/