Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion
Articulatory features have proved effective in speech recognition and speech synthesis. However, acquiring articulatory features remains a difficult and actively studied problem, so a lightweight and accurate articulatory model is of considerable value. In this study, we propose a novel t...
Main Authors: | Guolun Sun, Zhihua Huang, Li Wang, Pengyuan Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-09-01 |
Series: | Applied Sciences |
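Subjects: | acoustic-to-articulatory inversion; temporal convolution network; Mean Square Error; Pearson Correlation Coefficient |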
Online Access: | https://www.mdpi.com/2076-3417/11/19/9056 |
_version_ | 1797516767871893504 |
---|---|
author | Guolun Sun, Zhihua Huang, Li Wang, Pengyuan Zhang |
collection | DOAJ |
description | Articulatory features have proved effective in speech recognition and speech synthesis. However, acquiring articulatory features remains a difficult and actively studied problem, so a lightweight and accurate articulatory model is of considerable value. In this study, we propose a novel temporal convolution network-based acoustic-to-articulatory inversion system. The acoustic features are converted into a high-dimensional hidden-space feature map through temporal convolution, which takes frame-level feature correlations into account. We also construct a two-part objective function that combines the Root Mean Square Error (RMSE) of the predictions with the Pearson Correlation Coefficient (PCC) of the sequences, so that the inversion model is optimized jointly on both criteria. We further analyze how the weight between the two parts affects the final performance of the inversion model. Extensive experiments show that our temporal convolution network (TCN) model outperformed the Bi-directional Long Short-Term Memory model by 1.18 mm in RMSE and 0.845 in PCC with one quarter of the model parameters when RMSE and PCC are weighted evenly. |
first_indexed | 2024-03-10T07:06:25Z |
format | Article |
id | doaj.art-22372b17979646a582bb7d89fad7f72a |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T07:06:25Z |
publishDate | 2021-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-22372b17979646a582bb7d89fad7f72a; 2023-11-22T15:47:06Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-09-01; vol. 11, no. 19, art. 9056; DOI 10.3390/app11199056; Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion; Guolun Sun (0), Zhihua Huang (1), Li Wang (2), Pengyuan Zhang (3); affiliations: (0) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; (1) University of Chinese Academy of Sciences, Beijing 100049, China; (2) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; (3) Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; https://www.mdpi.com/2076-3417/11/19/9056; keywords: acoustic-to-articulatory inversion; temporal convolution network; Mean Square Error; Pearson Correlation Coefficient |
title | Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion |
topic | acoustic-to-articulatory inversion; temporal convolution network; Mean Square Error; Pearson Correlation Coefficient |
url | https://www.mdpi.com/2076-3417/11/19/9056 |
work_keys_str_mv | AT guolunsun temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion AT zhihuahuang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion AT liwang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion AT pengyuanzhang temporalconvolutionnetworkbasedjointoptimizationofacoustictoarticulatoryinversion |
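
The description above mentions a two-part objective that combines RMSE and PCC with a weight between the two parts. The record does not give the exact formula, so the following is only a minimal sketch under stated assumptions: the loss is taken to be `alpha * RMSE + (1 - alpha) * (1 - PCC)`, PCC is averaged over articulatory channels, and the names `rmse`, `pcc`, `joint_loss`, and `alpha` are hypothetical. Setting `alpha = 0.5` corresponds to weighting both parts evenly.

```python
import numpy as np


def rmse(pred, target):
    """Root mean square error over all frames and channels."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def pcc(pred, target, eps=1e-8):
    """Pearson correlation per channel, averaged over channels.

    pred, target: arrays of shape (frames, channels).
    eps guards against division by zero for constant sequences.
    """
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + eps
    return float(np.mean(num / den))


def joint_loss(pred, target, alpha=0.5):
    """Assumed form of the joint objective (not taken from the paper).

    alpha weights the RMSE term; (1 - alpha) weights the correlation
    term, written as 1 - PCC so that lower is better for both parts.
    """
    return alpha * rmse(pred, target) + (1.0 - alpha) * (1.0 - pcc(pred, target))


if __name__ == "__main__":
    target = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]])
    shifted = target + 1.0  # same trajectory shape, constant offset
    print(joint_loss(shifted, target, alpha=0.5))
```

In this sketch a constant offset leaves the correlation term untouched and is penalized only by the RMSE term, which illustrates why the two criteria can be complementary.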