Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition

Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to advances in deep learning. Most VSR studies focus only on frontal face images. However, in real scenes, a VSR system should correctly recognize...

Bibliographic Details
Main Authors: Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh, Masaki Nose
Format: Article
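Language: English
Published: MDPI AG 2021-07-01
Series: Future Internet
Subjects: visual speech recognition; multi-angle lipreading; automatic speech recognition; audio-visual speech recognition; deep learning; view classification
Online Access: https://www.mdpi.com/1999-5903/13/7/182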
author Shinnosuke Isobe
Satoshi Tamura
Satoru Hayamizu
Yuuto Gotoh
Masaki Nose
collection DOAJ
description Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to advances in deep learning. Most VSR studies focus only on frontal face images. However, in real scenes, a VSR system should correctly recognize spoken content not only from frontal faces but also from diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces taken at any angle. First, view classification is carried out to estimate the face angle. Based on the result, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the extracted features. We also developed an audio-visual speech recognition (AVSR) system that uses the proposed VSR in addition to conventional ASR. Audio results were obtained from ASR and then combined with the visual results by decision fusion. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We confirmed that our approach achieved the best performance compared with conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results were better than those of ASR or VSR alone.
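The description above says that audio and visual results are combined by decision fusion but does not give the combination rule. The following is a minimal sketch of one common form of decision fusion, a weighted sum of per-class log-probabilities from the two recognizers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `audio_weight` parameter, and the example scores are all hypothetical.

```python
import math


def log_softmax(scores):
    """Convert raw class scores to log-probabilities (numerically stable)."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def fuse_decisions(audio_scores, visual_scores, audio_weight=0.5):
    """Late (decision) fusion of two classifiers over the same phrase classes.

    audio_weight is a hypothetical stream weight in [0, 1]; the record does
    not state how the two streams are weighted, so 0.5 is only a placeholder.
    Returns the index of the winning class and the fused per-class scores.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    audio_lp = log_softmax(audio_scores)
    visual_lp = log_softmax(visual_scores)
    # Weighted log-linear combination of the two streams, class by class.
    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_lp, visual_lp)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Made-up scores for three phrase classes: audio favors class 0,
# video favors class 2.
example_audio = [2.0, 1.0, 0.1]
example_visual = [0.2, 0.1, 3.0]
best_class, fused_scores = fuse_decisions(example_audio, example_visual)
```

Setting `audio_weight` to 1.0 reduces the sketch to audio-only recognition and 0.0 to lipreading only, which makes it easy to compare the fused decision against each single-stream decision.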
first_indexed 2024-03-10T09:39:05Z
format Article
id doaj.art-3ca22c9bcce74743a96f470dd57b4839
institution Directory of Open Access Journals
issn 1999-5903
language English
last_indexed 2024-03-10T09:39:05Z
publishDate 2021-07-01
publisher MDPI AG
record_format Article
series Future Internet
spelling doaj.art-3ca22c9bcce74743a96f470dd57b4839
2023-11-22T03:50:25Z
eng
MDPI AG
Future Internet
1999-5903
2021-07-01
13(7), 182
10.3390/fi13070182
Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition
Shinnosuke Isobe (0): Graduate School of Natural Science and Technology, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Satoshi Tamura (1): Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Satoru Hayamizu (2): Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan
Yuuto Gotoh (3): Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan
Masaki Nose (4): Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan
title Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition
topic visual speech recognition
multi-angle lipreading
automatic speech recognition
audio-visual speech recognition
deep learning
view classification
url https://www.mdpi.com/1999-5903/13/7/182