End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC

Concomitant with the recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused...


Bibliographic Details
Main Authors:	Sanghun Jeon, Mun Sang Kim
Format:	Article
Language:	English
Published:	MDPI AG 2022-05-01
Series:	Sensors
Subjects:	lipreading; visual speech recognition; multi-view VSR; deep learning; attention mechanism; spatial attention module
Online Access:	https://www.mdpi.com/1424-8220/22/9/3597
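
The full abstract in this record reports per-view improvements over the prior state of the art of 3.31% (0°), 4.79% (30°), 5.51% (45°), and 6.18% (60°), with a stated mean of 4.95%. As a quick consistency check, the short sketch below recomputes that mean from the four per-view figures. The variable names are illustrative only and do not come from the paper.

```python
# Per-view improvements as quoted in the abstract (percentage points).
improvements = {"0°": 3.31, "30°": 4.79, "45°": 5.51, "60°": 6.18}

# Arithmetic mean over the four camera views.
mean_improvement = sum(improvements.values()) / len(improvements)

# Rounded to two decimals for comparison with the reported mean.
print(round(mean_improvement, 2))
```

The unrounded mean is 4.9475, which rounds to the 4.95% stated in the abstract.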

_version_	1797502733303939072
author	Sanghun Jeon; Mun Sang Kim
author_facet	Sanghun Jeon; Mun Sang Kim
author_sort	Sanghun Jeon
collection	DOAJ
description	Concomitant with recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face images. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, and the decoder uses cascaded local self-attention connectionist temporal classification to capture local contextual information in the immediate vicinity, which results in a substantial performance boost and fast convergence. In experiments on the OuluVS2 dataset, which was divided by view into four perspectives, the proposed model improved on the existing state-of-the-art performance by 3.31% (0°), 4.79% (30°), 5.51% (45°), and 6.18% (60°), or 4.95% on average, and the average performance improved by 9.1% compared with the baseline. Thus, the proposed design enhances the performance of multi-view VSR and increases its usefulness in real-world applications.
first_indexed	2024-03-10T03:40:20Z
format	Article
id	doaj.art-b459da903c454b62a2ae0e933f392050
institution	Directory Open Access Journal
issn	1424-8220
language	English
last_indexed	2024-03-10T03:40:20Z
publishDate	2022-05-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-b459da903c454b62a2ae0e933f392050 | 2023-11-23T09:20:46Z | eng | MDPI AG | Sensors | 1424-8220 | 2022-05-01 | 22 | 9 | 3597 | 10.3390/s22093597 | End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC | Sanghun Jeon (0) | Mun Sang Kim (1) | Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea | Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea | https://www.mdpi.com/1424-8220/22/9/3597 | lipreading | visual speech recognition | multi-view VSR | deep learning | attention mechanism | spatial attention module
title	End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC
title_sort	end to end sentence level multi view lipreading architecture with spatial attention module integrated multiple cnns and cascaded local self attention ctc
topic	lipreading; visual speech recognition; multi-view VSR; deep learning; attention mechanism; spatial attention module
url	https://www.mdpi.com/1424-8220/22/9/3597
work_keys_str_mv	AT sanghunjeon endtoendsentencelevelmultiviewlipreadingarchitecturewithspatialattentionmoduleintegratedmultiplecnnsandcascadedlocalselfattentionctc AT munsangkim endtoendsentencelevelmultiviewlipreadingarchitecturewithspatialattentionmoduleintegratedmultiplecnnsandcascadedlocalselfattentionctc
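
The abstract describes a decoder built on cascaded local self-attention followed by connectionist temporal classification (CTC). The sketch below illustrates those two ideas in plain Python under strong simplifying assumptions: the attention uses identity query/key/value projections instead of learned weights, "cascaded" is interpreted as simply applying the layer twice, and decoding uses the standard greedy CTC rule. The function names, the window size, and the toy inputs are hypothetical; none of this is taken from the authors' implementation.

```python
import math

def local_self_attention(frames, window):
    """Scaled dot-product self-attention restricted to a local window.

    frames: list of T feature vectors (lists of floats of equal length).
    window: each frame i attends only to frames j with |i - j| <= window.
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    num_frames, dim = len(frames), len(frames[0])
    output = []
    for i in range(num_frames):
        neighbours = range(max(0, i - window), min(num_frames, i + window + 1))
        # Similarity of frame i to each neighbour, scaled by sqrt(dim).
        scores = [
            sum(a * b for a, b in zip(frames[i], frames[j])) / math.sqrt(dim)
            for j in neighbours
        ]
        # Numerically stable softmax over the local window only.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of neighbouring frames gives the context vector.
        output.append([
            sum(w * frames[j][k] for w, j in zip(weights, neighbours))
            for k in range(dim)
        ])
    return output

def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC rule: merge consecutive repeats, then drop blank labels."""
    decoded, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Toy usage: two stacked local attention layers over three 2-D frames.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = local_self_attention(local_self_attention(features, window=1), window=1)

# Toy per-frame label sequence (0 is the blank symbol).
labels = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
```

Restricting attention to a small window keeps each output frame focused on its immediate temporal neighbourhood, which matches the abstract's emphasis on local contextual information. A real model would add learned projections and train the whole stack with a CTC loss.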

Similar Items