Summary: | In this study, we present an end-to-end framework for active speaker detection that achieves robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations from the audio and visual signals of a single candidate. First, we propose a spatio-positional encoder that addresses the problem of false detections caused by indiscernible faces in a video frame. Second, we present an efficient multimodal approach that models the long-term temporal contextual interactions between the audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.
|