End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which pro...


Bibliographic Details
Main Authors:	Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, Lifen Jiang, Shiwen Deng, Chang Gao
Format:	Article
Language:	English
Published:	MDPI AG 2020-03-01
Series:	Sensors
Subjects:	automatic pronunciation error detection; ASR; CTC; attention-based seq2seq model; end-to-end; CAPT
Online Access:	https://www.mdpi.com/1424-8220/20/7/1809

collection	DOAJ
description	Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides us with a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. After this, the improved ASR system was used in the APED task of Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network-deep neural network (DNN-DNN) architecture, and has a stronger effect on the F-measure metrics, which are especially suitable for the requirements of the APED task.
first_indexed	2024-04-11T14:10:48Z
format	Article
id	doaj.art-e7bec2c89aef4bc3b92726c20e006c5d
institution	Directory Open Access Journal
issn	1424-8220
language	English
last_indexed	2024-04-11T14:10:48Z
publishDate	2020-03-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-e7bec2c89aef4bc3b92726c20e006c5d; 2022-12-22T04:19:44Z; eng; MDPI AG; Sensors; ISSN 1424-8220; 2020-03-01; vol. 20, no. 7, art. 1809; DOI 10.3390/s20071809 (s20071809). Author affiliations, in author order: Long Zhang, Ziping Zhao, Chunmei Ma, Huazhi Sun, Lifen Jiang: College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China. Linlin Shan: College of Fine Arts and Design, Tianjin Normal University, Tianjin 300387, China. Shiwen Deng: School of Mathematical Sciences, Harbin Normal University, Harbin 150080, China. Chang Gao: School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China.
title	End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture


Similar Items