An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario

Bibliographic Details
Main Authors: Bing Yin, Shutong Niu, Haitao Tang, Lei Sun, Jun Du, Zhenhua Ling, Cong Liu
Format: Article
Language: English
Published: MDPI AG 2023-03-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/13/7/4100
Description
Summary: Robust speech recognition in real-world situations remains an important problem, especially when recognition is affected by environmental interference and conversational multi-speaker interactions. Supplementing audio with other modalities, as in audio–visual speech recognition (AVSR), is a promising direction for improving speech recognition. The end-to-end (E2E) framework can learn cross-modal information well; however, such models are difficult to train, especially when the amount of training data is relatively small. In this paper, we focus on building an encoder–decoder-based end-to-end audio–visual speech recognition system for use in realistic scenarios. First, we discuss different pre-training methods, which provide various kinds of initialization for the AVSR framework. Second, we explore different model architectures and audio–visual fusion methods. Finally, we evaluate performance on the corpus from the first Multi-modal Information based Speech Processing (MISP) challenge, which was recorded in a real home television (TV) room. Through system fusion, our final system achieves a 23.98% character error rate (CER), outperforming the champion system of the first MISP challenge (CER = 25.07%).
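To make the fusion idea in the summary concrete, below is a minimal sketch (in PyTorch; the class name, feature dimensions, and layer choices are illustrative assumptions, not the authors' exact architecture) of one common audio–visual fusion strategy: concatenating frame-aligned audio and visual features and encoding the joint sequence with a Transformer encoder, the kind of front end an encoder–decoder AVSR system might use.

import torch
import torch.nn as nn

class ConcatFusionEncoder(nn.Module):
    """Illustrative sketch only: fuses frame-aligned audio and visual
    features by concatenation, then encodes the joint sequence with a
    Transformer encoder. Not the architecture from the paper."""

    def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                 nhead=4, num_layers=6):
        super().__init__()
        # Project the concatenated features into the model dimension.
        self.fusion_proj = nn.Linear(audio_dim + visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio, visual):
        # audio:  (batch, time, audio_dim)  e.g. log-Mel filterbanks
        # visual: (batch, time, visual_dim) e.g. lip-region embeddings,
        #         upsampled to match the audio frame rate
        fused = torch.cat([audio, visual], dim=-1)
        return self.encoder(self.fusion_proj(fused))

# Example: one utterance of 100 frames.
enc = ConcatFusionEncoder()
out = enc(torch.randn(1, 100, 80), torch.randn(1, 100, 512))
print(out.shape)  # torch.Size([1, 100, 256])

The character error rate quoted above follows the standard definition: character-level edit distance between hypothesis and reference, divided by reference length. A self-contained sketch of that metric (the standard algorithm, not code from the paper):

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein (edit) distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"{cer('audio visual', 'audio visal'):.2%}")  # 8.33%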
ISSN:2076-3417