Efficient DNN Model for Word Lip-Reading

This paper studies various deep learning models for word-level lip-reading technology, one of the tasks in the supervised learning of video classification. Several public datasets have been published in the lip-reading research field. However, few studies have investigated lip-reading techniques usi...

Bibliographic Details
Main Authors:	Taiki Arakane, Takeshi Saitoh
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Algorithms
Subjects:	lip-reading; word recognition; deep neural network; LRW; OuluVS; CUAVE
Online Access:	https://www.mdpi.com/1999-4893/16/6/269

_version_	1827739021009223680
author	Taiki Arakane; Takeshi Saitoh
author_facet	Taiki Arakane; Takeshi Saitoh
author_sort	Taiki Arakane
collection	DOAJ
description	This paper studies deep learning models for word-level lip-reading, a supervised video classification task. Several public datasets have been published in the lip-reading research field. However, few studies have investigated lip-reading techniques using multiple datasets. This paper evaluates deep learning models on four publicly available datasets that are representative of the field: Lip Reading in the Wild (LRW), OuluVS, CUAVE, and Speech Scene by Smart Device (SSSD). LRW, released in 2016, is a large-scale public dataset covering 500 English words. The initial recognition accuracy on LRW was 66.1%, and many research groups have since worked on it. The current state of the art (SOTA) reaches 94.1% with 3D-Conv + ResNet18 + {DC-TCN, MS-TCN, BGRU} + knowledge distillation + word boundary. With reference to the SOTA model, we combine existing models such as ResNet, WideResNet, EfficientNet, MS-TCN, Transformer, ViT, and ViViT, and investigate which are effective for word lip-reading using six deep learning models with modified feature extractors and classifiers. Recognition experiments show that a similar model structure, 3D-Conv + ResNet18 for feature extraction and MS-TCN for inference, is effective across the four datasets of different scales.
first_indexed	2024-03-11T02:52:12Z
format	Article
id	doaj.art-e07e09f50ccf43f0a8fb727864cb1098
institution	Directory Open Access Journal
issn	1999-4893
language	English
last_indexed	2024-03-11T02:52:12Z
publishDate	2023-05-01
publisher	MDPI AG
record_format	Article
series	Algorithms
spelling	doaj.art-e07e09f50ccf43f0a8fb727864cb1098; 2023-11-18T08:56:34Z; eng; MDPI AG; Algorithms; ISSN 1999-4893; 2023-05-01; vol. 16, no. 6, art. 269; DOI 10.3390/a16060269; Efficient DNN Model for Word Lip-Reading; Taiki Arakane, Takeshi Saitoh (both: Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka 820-8502, Japan); https://www.mdpi.com/1999-4893/16/6/269; lip-reading; word recognition; deep neural network; LRW; OuluVS; CUAVE
title	Efficient DNN Model for Word Lip-Reading
topic	lip-reading; word recognition; deep neural network; LRW; OuluVS; CUAVE
url	https://www.mdpi.com/1999-4893/16/6/269
