Modeling Long-Term Multimodal Representations for Active Speaker Detection With Spatio-Positional Encoder

In this study, we present an end-to-end framework for active speaker detection to achieve robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collab...

Bibliographic Details
Main Authors: Minyoung Kyoung, Hwa Jeon Song
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10287283/