Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion

Articulatory features have proven effective in speech recognition and speech synthesis. However, acquiring articulatory features has long been a difficult research problem, so a lightweight and accurate articulatory model is of significant value. In this study, we propose a novel t...

Bibliographic Details
Main Authors: Guolun Sun, Zhihua Huang, Li Wang, Pengyuan Zhang
Format: Article
Language: English
Published: MDPI AG 2021-09-01
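Series: Applied Sciences
Subjects: acoustic-to-articulatory inversion; temporal convolution network; Mean Square Error; Pearson Correlation Coefficient
Online Access: https://www.mdpi.com/2076-3417/11/19/9056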
_version_ 1797516767871893504
author Guolun Sun
Zhihua Huang
Li Wang
Pengyuan Zhang
author_facet Guolun Sun
Zhihua Huang
Li Wang
Pengyuan Zhang
author_sort Guolun Sun
collection DOAJ
description Articulatory features have proven effective in speech recognition and speech synthesis. However, acquiring articulatory features has long been a difficult research problem, so a lightweight and accurate articulatory model is of significant value. In this study, we propose a novel temporal convolution network-based acoustic-to-articulatory inversion system. The acoustic features are converted into a high-dimensional hidden-space feature map through temporal convolution, taking frame-level feature correlations into account. We also construct a two-part objective function that combines the Root Mean Square Error (RMSE) of the predictions with the Pearson Correlation Coefficient (PCC) of the sequences, so that the inversion model is optimized jointly on both criteria. We further analyze how the weighting between the two parts affects the final performance of the inversion model. Extensive experiments show that our temporal convolution network (TCN) model outperformed the Bi-directional Long Short-Term Memory model by 1.18 mm in RMSE and 0.845 in PCC with one quarter of the model parameters when RMSE and PCC were weighted evenly.
first_indexed 2024-03-10T07:06:25Z
format Article
id doaj.art-22372b17979646a582bb7d89fad7f72a
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T07:06:25Z
publishDate 2021-09-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-22372b17979646a582bb7d89fad7f72a | 2023-11-22T15:47:06Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2021-09-01 | 11 | 19 | 9056 | 10.3390/app11199056 | Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion | Guolun Sun (0); Zhihua Huang (1); Li Wang (2); Pengyuan Zhang (3) | (0) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; (1) University of Chinese Academy of Sciences, Beijing 100049, China; (2) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; (3) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China | https://www.mdpi.com/2076-3417/11/19/9056 | acoustic-to-articulatory inversion; temporal convolution network; Mean Square Error; Pearson Correlation Coefficient
spellingShingle Guolun Sun
Zhihua Huang
Li Wang
Pengyuan Zhang
Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
Applied Sciences
acoustic-to-articulatory inversion
temporal convolution network
Mean Square Error
Pearson Correlation Coefficient
title Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
title_full Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
title_fullStr Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
title_full_unstemmed Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
title_short Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
title_sort temporal convolution network based joint optimization of acoustic to articulatory inversion
topic acoustic-to-articulatory inversion
temporal convolution network
Mean Square Error
Pearson Correlation Coefficient
url https://www.mdpi.com/2076-3417/11/19/9056
work_keys_str_mv AT guolunsun temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion
AT zhihuahuang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion
AT liwang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion
AT pengyuanzhang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion
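
The two-part target function mentioned in the description (RMSE combined with PCC, with a weight between the two parts) can be illustrated with a small sketch. This record does not give the exact formula, so the form used here, alpha * RMSE + (1 - alpha) * (1 - PCC), is an assumption, as are the names `rmse`, `pcc`, `joint_loss`, and `alpha`. It is not the authors' implementation.

```python
import math


def rmse(pred, target):
    """Root mean square error between two equal-length sequences."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def pcc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    # Correlation is undefined for a constant sequence; returning 0.0 is an
    # assumption made for this sketch.
    return cov / denom if denom > 0 else 0.0


def joint_loss(pred, target, alpha=0.5):
    """Assumed joint objective: alpha * RMSE + (1 - alpha) * (1 - PCC).

    alpha is a hypothetical weight in [0, 1]; alpha = 0.5 weights both
    parts evenly. Lower is better for both terms.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rmse(pred, target) + (1.0 - alpha) * (1.0 - pcc(pred, target))


if __name__ == "__main__":
    # Toy articulator trajectory in mm (made-up values for illustration only).
    target = [1.0, 2.0, 3.0]
    shifted = [2.0, 3.0, 4.0]  # correct shape, constant 1 mm offset
    print(joint_loss(shifted, target, alpha=0.5))
```

In this sketch a prediction with the right shape but a constant offset is penalized only through the RMSE term, while a prediction with the wrong shape is penalized through both terms. That is one way to read why the description weights the two criteria against each other.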