Learning the Relative Dynamic Features for Word-Level Lipreading
Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we can make is finite, leading to problems with homophones when speaking. On the other hand, different...
Main Authors: | Hao Li, Nurbiya Yadikar, Yali Zhu, Mutallip Mamut, Kurban Ubul |
---|---|
Format: | Article |
Language: | English |
Published: |
MDPI AG
2022-05-01
|
Series: | Sensors |
Subjects: | Visual Speech Recognition; lipreading; spatial–temporal feature extraction |
Online Access: | https://www.mdpi.com/1424-8220/22/10/3732 |
_version_ | 1797495770341965824 |
---|---|
author | Hao Li Nurbiya Yadikar Yali Zhu Mutallip Mamut Kurban Ubul |
author_facet | Hao Li Nurbiya Yadikar Yali Zhu Mutallip Mamut Kurban Ubul |
author_sort | Hao Li |
collection | DOAJ |
description | Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we can make is finite, leading to problems with homophones when speaking. On the other hand, different speakers produce different lip movements for the same word. To address these problems, this paper focuses on spatial–temporal feature extraction in word-level lipreading and proposes an efficient two-stream model that learns the relative dynamic information of lip motion. In this model, two CNN streams with different channel capacities extract static features from single frames and dynamic information across multi-frame sequences, respectively. We explored a more effective convolution structure for each component in the front-end model, improving performance by about 8%. Then, according to the characteristics of word-level lipreading datasets, we further studied the impact of two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the fusion methods of the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved new state-of-the-art results. |
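The abstract's fast/slow two-stream idea can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the function name, the `alpha` stride parameter, and the 29-frame clip length (typical of word-level datasets such as LRW) are assumptions, and the record does not specify the paper's exact sampling strategies.

```python
def sample_pathways(frames, alpha=4):
    """Split one frame sequence into slow/fast pathway inputs.

    Hypothetical sketch: the fast pathway keeps every frame to capture
    lip-motion dynamics at full temporal resolution, while the slow
    pathway keeps every `alpha`-th frame to capture static appearance
    at a lower temporal rate (typically with higher channel capacity).
    """
    fast = list(frames)           # dense sampling: all frames
    slow = list(frames)[::alpha]  # strided sampling: 1 of every alpha
    return slow, fast

# A word-level clip is often a short fixed-length sequence,
# e.g. 29 frames; with alpha=4 the slow pathway sees 8 frames.
slow, fast = sample_pathways(range(29), alpha=4)
```

In two-stream designs of this kind, the temporally dense (fast) stream is usually given fewer channels and the temporally sparse (slow) stream more, which is one way to read the abstract's "two different channel capacity CNN streams".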
first_indexed | 2024-03-10T01:54:23Z |
format | Article |
id | doaj.art-cc49421b9b124dff8dd7d5c0a7f7cf4e |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
last_indexed | 2024-03-10T01:54:23Z |
publishDate | 2022-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-cc49421b9b124dff8dd7d5c0a7f7cf4e2023-11-23T13:00:12ZengMDPI AGSensors1424-82202022-05-012210373210.3390/s22103732Learning the Relative Dynamic Features for Word-Level LipreadingHao Li0Nurbiya Yadikar1Yali Zhu2Mutallip Mamut3Kurban Ubul4School of Information Science and Engineering, Xinjiang University, Urumqi 830046, ChinaSchool of Information Science and Engineering, Xinjiang University, Urumqi 830046, ChinaSchool of Information Science and Engineering, Xinjiang University, Urumqi 830046, ChinaTechnology Department, Library of Xinjiang University, Urumqi 830046, ChinaSchool of Information Science and Engineering, Xinjiang University, Urumqi 830046, ChinaLipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we could make is finite, leading to problems with homophones when speaking. On the other hand, different speakers will have various lip movements for the same word. For these problems, we focused on the spatial–temporal feature extraction in word-level lipreading in this paper, and an efficient two-stream model was proposed to learn the relative dynamic information of lip motion. In this model, two different channel capacity CNN streams are used to extract static features in a single frame and dynamic information between multi-frame sequences, respectively. We explored a more effective convolution structure for each component in the front-end model and improved by about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of the two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the fusion methods of the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved a new state-of-the-art.https://www.mdpi.com/1424-8220/22/10/3732Visual Speech Recognitionlipreadingspatial–temporal feature extraction |
spellingShingle | Hao Li Nurbiya Yadikar Yali Zhu Mutallip Mamut Kurban Ubul Learning the Relative Dynamic Features for Word-Level Lipreading Sensors Visual Speech Recognition lipreading spatial–temporal feature extraction |
title | Learning the Relative Dynamic Features for Word-Level Lipreading |
title_full | Learning the Relative Dynamic Features for Word-Level Lipreading |
title_fullStr | Learning the Relative Dynamic Features for Word-Level Lipreading |
title_full_unstemmed | Learning the Relative Dynamic Features for Word-Level Lipreading |
title_short | Learning the Relative Dynamic Features for Word-Level Lipreading |
title_sort | learning the relative dynamic features for word level lipreading |
topic | Visual Speech Recognition lipreading spatial–temporal feature extraction |
url | https://www.mdpi.com/1424-8220/22/10/3732 |
work_keys_str_mv | AT haoli learningtherelativedynamicfeaturesforwordlevellipreading AT nurbiyayadikar learningtherelativedynamicfeaturesforwordlevellipreading AT yalizhu learningtherelativedynamicfeaturesforwordlevellipreading AT mutallipmamut learningtherelativedynamicfeaturesforwordlevellipreading AT kurbanubul learningtherelativedynamicfeaturesforwordlevellipreading |