Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition
Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to the development in deep learning. Most VSR research works focus only on frontal face images. However, assuming real scenes, it is obvious that a VSR system should correctly recognize...
Main Authors: | Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh, Masaki Nose |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-07-01 |
Series: | Future Internet |
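Subjects: | visual speech recognition; multi-angle lipreading; automatic speech recognition; audio-visual speech recognition; deep learning; view classification |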
Online Access: | https://www.mdpi.com/1999-5903/13/7/182 |
_version_ | 1797527091277725696 |
---|---|
author | Shinnosuke Isobe; Satoshi Tamura; Satoru Hayamizu; Yuuto Gotoh; Masaki Nose |
author_sort | Shinnosuke Isobe |
collection | DOAJ |
description | Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to the development in deep learning. Most VSR research works focus only on frontal face images. However, assuming real scenes, it is obvious that a VSR system should correctly recognize spoken contents from not only frontal but also diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces taken at any angle. Firstly, view classification is carried out to estimate face angles. Based on the results, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the features. We also developed audio-visual speech recognition (AVSR) using the VSR in addition to conventional ASR. Audio results were obtained from ASR, followed by incorporating audio and visual results in a decision fusion manner. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We then confirmed that our approach achieved the best performance among conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than ASR and VSR results. |
first_indexed | 2024-03-10T09:39:05Z |
format | Article |
id | doaj.art-3ca22c9bcce74743a96f470dd57b4839 |
institution | Directory of Open Access Journals |
issn | 1999-5903 |
language | English |
last_indexed | 2024-03-10T09:39:05Z |
publishDate | 2021-07-01 |
publisher | MDPI AG |
record_format | Article |
series | Future Internet |
spelling | doaj.art-3ca22c9bcce74743a96f470dd57b4839; 2023-11-22T03:50:25Z; eng; MDPI AG; Future Internet; ISSN 1999-5903; 2021-07-01; vol. 13, no. 7, art. 182; DOI 10.3390/fi13070182; Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition; Shinnosuke Isobe (Graduate School of Natural Science and Technology, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan); Satoshi Tamura (Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan); Satoru Hayamizu (Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu 501-1193, Japan); Yuuto Gotoh (Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan); Masaki Nose (Ricoh Company, Ltd., 2-7-1 Izumi, Ebina, Kanagawa 243-0460, Japan); https://www.mdpi.com/1999-5903/13/7/182; visual speech recognition; multi-angle lipreading; automatic speech recognition; audio-visual speech recognition; deep learning; view classification |
title | Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition |
topic | visual speech recognition; multi-angle lipreading; automatic speech recognition; audio-visual speech recognition; deep learning; view classification |
url | https://www.mdpi.com/1999-5903/13/7/182 |
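
The description says that audio and visual results are combined "in a decision fusion manner" but does not give the fusion rule in this record. As an illustration only, the sketch below shows one common form of decision-level fusion: a weighted sum of the log posteriors produced by two separate classifiers, renormalized over the phrase classes. The function names, the `audio_weight` parameter, and the example probabilities are all assumptions made for this sketch and are not taken from the article.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decision_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Fuse per-class posteriors from an audio and a visual recognizer.

    Hypothetical rule (not confirmed by the source): a weighted sum of
    log posteriors, renormalized with a softmax.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("both recognizers must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    eps = 1e-12  # guards against log(0)
    fused_scores = [
        audio_weight * math.log(a + eps)
        + (1.0 - audio_weight) * math.log(v + eps)
        for a, v in zip(audio_probs, visual_probs)
    ]
    return softmax(fused_scores)


def classify(audio_probs, visual_probs, audio_weight=0.5):
    """Return the index of the class with the highest fused posterior."""
    fused = decision_fusion(audio_probs, visual_probs, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Made-up posteriors over three phrase classes.
    audio = [0.6, 0.3, 0.1]
    visual = [0.2, 0.7, 0.1]
    print(classify(audio, visual, audio_weight=0.5))
```

In a sketch like this, `audio_weight` would typically be lowered when the audio channel is noisy so that the visual stream contributes more to the final decision; how the article itself weights the two streams is not stated here.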