Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition

Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Most deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalized, nonlinear mapping ability. However, these DNN models have two main disa...

Bibliographic Details
Main Authors:	Yuan Yuan, Chunlin Tian, Xiaoqiang Lu
Format:	Article
Language:	English
Published:	IEEE 2018-01-01
Series:	IEEE Access
Subjects:	Audio-visual systems; recurrent neural networks; generative adversarial networks
Online Access:	https://ieeexplore.ieee.org/document/8279447/

author	Yuan Yuan, Chunlin Tian, Xiaoqiang Lu
collection	DOAJ
description	Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Most deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalized, nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models address the AVSR problem while neglecting the fact that the frames are correlated; and 2) the features learned by these models are not credible. This is because the joint representation learned by the fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion & recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are a precondition for the core part: fusion & recognition using alm-GRU equipped with a novel loss. The alm-GRU is an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and the generative component.
first_indexed	2024-12-16T17:14:24Z
format	Article
id	doaj.art-c92e1c45f9344f86a6df96b007acb7d8
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-16T17:14:24Z
publishDate	2018-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	Record doaj.art-c92e1c45f9344f86a6df96b007acb7d8 (2022-12-21T22:23:19Z); language: eng; publisher: IEEE; journal: IEEE Access; ISSN 2169-3536; published 2018-01-01; vol. 6, pp. 5573-5583; DOI 10.1109/ACCESS.2018.2796118; article no. 8279447; title: Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition; authors: Yuan Yuan, Chunlin Tian, Xiaoqiang Lu (https://orcid.org/0000-0002-7037-5188); affiliation (all three authors): Center for Optical Imagery Analysis and Learning, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China; URL: https://ieeexplore.ieee.org/document/8279447/; keywords: Audio-visual systems; recurrent neural networks; generative adversarial networks
title	Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition
topic	Audio-visual systems; recurrent neural networks; generative adversarial networks
url	https://ieeexplore.ieee.org/document/8279447/

Similar Items